Oct 31, 2017

In any business there will always be an extensive list of tasks that must be repeated, whether it is paying your bills on time, processing an Excel spreadsheet exported from some random program, or accessing an outdated system just to manually copy certain values into a spreadsheet. It can be quite a mess, so for this post I want to focus on the last situation: manual reentry of data from older systems into spreadsheets. This seems to be one of the most common tasks performed in any company to keep information organized and shared.

Does this sound familiar? There you are, staring at your nice computer, copying values from a spreadsheet-esque web page into an Excel spreadsheet so that you can relay the status of certain documents, parts, and change orders to everyone. You get better at using the program and learn certain tricks, but somehow those tricks never quite do what you want.

Break the Trend

This is where I propose that we break the trend: stop transcribing from web page to Excel by eye and instead consider an alternative: automation.

It's not too crazy: all you need to do is write a program that does what your eyes have been doing all this time. This is where most people give up, because it seems more complex than it really is.

Overview

Below you can see a general overview of what I will go over:

  1. Python Automation with Selenium
    • Environment Setup
    • Object Based Crawler
    • Prototyping Tips
  2. Smart Spreadsheets
    • Sheet Scripts
    • Google API
    • Python Tools
  3. Flask API Creation

Python Automation with Selenium

In order to automate some of these common tasks that require reading from or performing actions within a web browser, we will use a really cool tool called Selenium. Selenium is a framework that lets you drive a real browser programmatically. Its typical use case is testing websites, but in our case we will use it to automate common actions and to read information.

Setting up Selenium with Python is easy, and I would suggest starting with the Chrome WebDriver. The Chrome WebDriver is nice because it opens a real browser window and lets you watch the actions take place. Once the initial development is complete and things are working well, you can switch to PhantomJS, a headless browser that runs without a visible window.
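Swapping drivers later is a one-line change. A minimal sketch, assuming both chromedriver and phantomjs are on your PATH:

from selenium import webdriver

driver = webdriver.Chrome()       # visible browser, great while developing
# driver = webdriver.PhantomJS()  # headless, once everything works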

Environment Setup

As always, I recommend using Python 3.6 and setting up a virtual environment:

virtualenv env

I used to name my environments based on the project, but eventually I realized that makes no sense. For uniformity I like to name my environment env. Activate your environment and install Selenium:

.\env\scripts\activate #Windows command
source ./env/bin/activate #Unix command
pip install selenium

I have listed both the Windows and Unix commands for activating the virtual environment.

Object Based Crawler

My preferred method to create a crawler with Selenium, or any other tool, is to write a Crawler class that we instantiate as an object. I find that this is best because it wraps all of the methods up in a nice way and also lets you reference the webdriver as an instance attribute rather than passing it from function to function. See what I mean below:

import time
from selenium import webdriver

class AgileHelper:
    def __init__(self, user, passw):
        self.user = user
        self.passw = passw
        self.options = webdriver.ChromeOptions()
        # Set a default download directory and allow automatic downloads
        prefs = {"download.default_directory": r"C:\Crawler\Exports",
                 "profile.default_content_setting_values.automatic_downloads": 1}
        self.options.add_experimental_option("prefs", prefs)

    def login(self):
        # Load the driver from a specific location (not needed if it is on your PATH)
        driver = webdriver.Chrome(r'C:\chromedriver.exe', chrome_options=self.options)
        driver.get('http://internal.webpage.local')  # load the internal web page
        time.sleep(1)  # give the page a moment to render
        driver.find_element_by_xpath('//*[@id="j_username"]').send_keys(self.user)
        driver.find_element_by_xpath('//*[@id="j_password"]').send_keys(self.passw)
        driver.find_element_by_xpath('//*[@id="loginspan"]').click()
        window_after = driver.window_handles[1]  # find the popup window that appears
        driver.switch_to_window(window_after)  # switch to the popup window
        self.driver = driver

    def logout(self):
        self.driver.close()

This example has a bit more than strictly necessary, but it shows how to create the skeleton for a Crawler that you can then add methods to. In this case, we use a Chrome webdriver and set some preferences, such as the default download directory and allowing automatic downloads. In the login method, the username and password are sent to HTML fields on a specific internal web page (it could just as well be external), and this particular example then switches to a window that pops up after logging in. I also specify the location of the Chrome driver; if it is in your PATH, you do not need to. I have had some issues on Windows getting drivers recognized from the PATH so I usually specify the location, but on Ubuntu it typically works without it. Below you can see the same code brought down to the bare essentials:

import time
from selenium import webdriver
class AgileHelper:
    def __init__(self, user, passw):
        self.user = user
        self.passw = passw
    def login(self):
        driver = webdriver.Chrome()
        driver.get('http://website.local')
        time.sleep(1)
        driver.find_element_by_xpath('//*[@id="j_username"]').send_keys(self.user)
        driver.find_element_by_xpath('//*[@id="j_password"]').send_keys(self.passw)
        driver.find_element_by_xpath('//*[@id="loginspan"]').click()
        self.driver = driver
    def logout(self):
        self.driver.close()

Note how little we really need for a basic Crawler, but I hope the previous code is helpful if you need to set specific options for the browser. Calling this Crawler and telling it to log in is easy:

c = AgileHelper('username', 'password')
c.login()

The only downside is passwords in plain text, but that can easily be fixed by using environment variables and accessing them via os.environ.
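As a quick sketch of that fix (the variable names here are just placeholders you would set in your shell):

import os

# read credentials from the environment instead of hard-coding them
user = os.environ['CRAWLER_USER']
passw = os.environ['CRAWLER_PASS']

c = AgileHelper(user, passw)
c.login()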

Prototyping Tips

In the prototyping phase, where you are determining what works and what doesn't, I highly recommend making small methods that each do one little task and chaining them together as necessary. Rather than editing old code to add a new feature, just add a new method that can hopefully reference the old methods as well:

class Crawler:
    ...
    def readInformation(self):
        ...
    def downloadDocuments(self, docs):
        for doc in docs:
            ...
        ...
    def initiateChange(self):
        ...
The idea here is to create a separate method for each type of action you want to perform, as shown in the sketch below. login() is one example method, and what is great about using classes and objects instead of a purely functional approach is that you can store variables within the state of the object.
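Chained together, the methods read almost like a description of the manual process. A quick sketch, using the hypothetical methods from the skeleton above:

c = AgileHelper('username', 'password')
c.login()
docs = c.readInformation()   # hypothetical: scrape the list of documents
c.downloadDocuments(docs)    # hypothetical: export each one
c.logout()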

Smart Spreadsheets

Now that we have a way to access the web application via Selenium and Python, we can use our Crawler to gather information and store it in a Google Spreadsheet. You may be wondering: why a Google Spreadsheet and not Excel? My reasoning is that Google Spreadsheets are much more powerful: they offer a scripting language made specifically for spreadsheets that is much better than VBA (it is similar to JavaScript), and you can also read and write to Sheets using the Google Sheets API.

Sheet Scripts & Commands

The first thing to start experimenting with is the JavaScript-esque scripts you can attach to a sheet; for instance, you can define an onEdit(e) function that is called every time the sheet is edited, to perform calculations or other actions. Another big tip is to use the QUERY function within Google Sheets, as it enables you to query your tables as if they were a database, which is pretty amazing. See the example below.
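A small illustration of QUERY, assuming your data lives in A1:D100 with a status column in C:

=QUERY(A1:D100, "select A, B where C = 'Open' order by B desc")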

Sheets API

In order to connect to Google Sheets you need to set up credentials for the Sheets API in the Google API Console. You'll want to create a service account and download its JSON key file for this specific API. Make sure to share your spreadsheet with the client email stored in that JSON file, otherwise the API calls will not work.

Python Tools

In order to interact with Google Sheets from Python, I highly recommend using oauth2client and gspread. If you follow the gspread documentation it is pretty easy to get connected to a specific spreadsheet. Make sure you share the Sheet(s) with the client email from the JSON credential file! If you forget this step, you will have problems.
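A minimal sketch of the connection flow, roughly following the gspread docs (the key file name and spreadsheet title here are placeholders):

import gspread
from oauth2client.service_account import ServiceAccountCredentials

# authorize with the service account JSON key file
scope = ['https://spreadsheets.google.com/feeds',
         'https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name('credentials.json', scope)
client = gspread.authorize(creds)

sheet = client.open('Status Tracker').sheet1      # open by spreadsheet title
rows = sheet.get_all_values()                     # read the whole sheet
sheet.update_acell('B2', 'Released')              # write a single cell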

You should now have the tools to set up a Smart Spreadsheet: read data from the spreadsheet, initialize a Selenium crawler, gather data, and write that data back to the spreadsheet. But what if you don't want to use a spreadsheet at all, or you want to expose an out-of-date system as a service for many other users?

Flask API Creation

Flask is an amazing library that makes it really easy to put an API endpoint in front of any Python program. I like to use Flask-RESTful, as I described in a previous post where I made an API for my personal TV App. Setting up an API is easy; following my previous post, you can see this bit of code where different endpoints are created:

from flask import Flask, request, jsonify
from flask_restful import reqparse, Resource, Api
from show_schedule import airing_shows, show_id, next_prev_ep
from db import DB

app = Flask(__name__)
api = Api(app)
db = DB()

class Airing(Resource):
    def get(self):
        episodes = db.getAllShows()
        ids = [row[0] for row in episodes]  # pull the id out of each database row
        _episodes = airing_shows(ids)
        return jsonify(_episodes)

class ShowID(Resource):
    def get(self, show):
        return jsonify(db.get_show_id(show))
    def post(self, show):
        sid, name = show_id(show)
        db.addShow(sid, name)
        return jsonify(id=sid, show=name)

class AllShows(Resource):
    def get(self):
        return jsonify(db.getAllShows())

class NextPrev(Resource):
    def get(self, show):
        rows = db.get_show_id(show)
        if rows:  # only look up episodes if the show exists
            episodes = next_prev_ep(rows[0][0])
            return jsonify(episodes)

api.add_resource(Airing, '/airing/')
api.add_resource(ShowID, '/show/<show>')
api.add_resource(AllShows, '/show/')
api.add_resource(NextPrev, '/episodes/<show>')

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)

In this specific example, I imported functions from show_schedule and the db class, but it would be no different to instantiate a Crawler object and return a JSON response of the details it reads. Keep in mind that the response could be very slow during the prototyping phase, so some extra effort may be needed to ensure delays won't cause timeouts. Whenever possible, use a library like requests instead of Selenium, as it will speed your API up significantly. Sometimes even a hybrid approach can work if certain parts of the site cannot be accessed via requests.
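To make that concrete, here is a rough sketch of wrapping the crawler in an endpoint. It assumes the AgileHelper class from earlier is saved as crawler.py, and readInformation is the hypothetical method from the prototyping skeleton:

import os
from flask import Flask, jsonify
from flask_restful import Resource, Api
from crawler import AgileHelper  # assumes the class above lives in crawler.py

app = Flask(__name__)
api = Api(app)

class CrawlStatus(Resource):
    def get(self):
        # spin up a crawler per request; slow, but fine for a prototype
        c = AgileHelper(os.environ['CRAWLER_USER'], os.environ['CRAWLER_PASS'])
        c.login()
        info = c.readInformation()  # hypothetical scraping method from earlier
        c.logout()
        return jsonify(info)

api.add_resource(CrawlStatus, '/status/')

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)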