Mar 15, 2018

Lately I have been very interested in content marketing and understanding how businesses drive traffic to their websites. In the past, ranking at the top of Google required intricate keyword-stuffing, but Google has since grown up and no longer ranks you based on the number of keywords you stuff across hundreds of pages. Instead, Google favors high quality content from websites that have high domain authority and publish that kind of content often, which honestly is great.

There are countless posts online describing endless SEO techniques for bringing people to your website, but read them with caution because this landscape is constantly changing. If anything, I would recommend focusing on posting quality content that people actually want to read and genuinely helping them, rather than obsessing over exactly which keywords to use.

Link Building

While posting high quality content is one of the most important aspects of content marketing and bringing people to your website, it is also important to get referral links from other websites that have high domain authority in your particular niche. Many people use websites like Quora or Reddit to answer questions and simultaneously plug their company/website. Even more interesting, though, is the idea of “Broken Link Building”: you find popular websites in your niche and look specifically for broken links that point to content you also offer on your own site. You can quickly send the webmaster an email, respectfully let them know the link is broken, and then offer up your own link as an alternative. Pretty cool. What you essentially want is as many sites as possible linking to your pages on different topics.

How can I find broken links?

While it is a great idea in theory, can you imagine actually going to a website and clicking on every single link to check for 404 errors? Even worse, if you really want to cover all your ground, you need to check every internal page on a website and see whether any of them point to external pages that return 404 errors. Yuck, that doesn’t sound fun.

Automating the Process

In the past, I have written about Business Process Automation as well as Building an Instagram ‘Like’ Bot, which cover both emulating a browser and inserting your bot between a front end application and its backend API. Of the two options, it is typically faster and less complex to insert yourself in the middle, because you can then perform HTTP requests directly against a server that spits data out in a really convenient format. On the flip side, when there is no API, you can fetch the HTML page directly and extract the data from it without emulating a browser.

Most niche websites that serve content marketing will NOT have a backend API, because they don’t need one. Instead, they are typically static multipage websites managed via WordPress, Wix, or some other variant. What that means is we need to fetch those pages directly, extract all of the links on each page, and check whether they are broken. It may not sound so bad, but it can become a bit complex once you consider the various formats a link can take: some point to an absolute address like http://google.com, while others are relative links like index.php?get=page or fragments like #page.
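If you want a feel for how those different formats resolve against a page, Python’s standard library has urllib.parse.urljoin, which handles most of these cases (the Crawler class later in this post rolls its own simpler rebasing instead, so this is just for illustration and uses a made-up base URL):

from urllib.parse import urljoin

base = "http://example.com/blog/"
print(urljoin(base, "http://google.com"))   # absolute links are left alone
print(urljoin(base, "index.php?get=page"))  # -> http://example.com/blog/index.php?get=page
print(urljoin(base, "#page"))               # -> http://example.com/blog/#page
print(urljoin(base, "/about/"))             # root-relative -> http://example.com/about/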

Are there tools for this?

Yes, lots of them. Go ahead and search “Broken Link Building” on Google and you’ll find at least 5 tools, some free but most paid. Many of the free tools are very basic and are content marketing themselves, enticing you to buy a premium link building package. There’s nothing wrong with using these free tools or even a paid one; however, you have to understand that by doing so you are also providing that website with all of the data you are collecting. That means not only do you know about these vulnerable and valuable broken links, but potentially so does the owner of the tool you are using, who can leverage that data to further build their own authority in the domain as their data grows and they acquire more users.

A free one-off Python approach

I am not going to be able to provide you with a silver bullet for getting quality links or necessarily finding the perfect broken links, because that is simply a very complex problem to solve. Instead, I will provide you with a free, simple Python tool that you can start with as a base and build on if you choose. What this tool can do is accept a single URL, extract all of the links from that page, and check whether those links are broken. See below for a diagram I made that describes how the program works at a high level, keeping in mind that the “Stores” are really just Python lists. I have not made this tool completely recursive; instead, I give you complete control to extract URLs from a page and then pass those URLs to a status checker, all of which could be confined to your own custom method if you choose.

The Demo

Before I get into the details, a warning: crawling websites that you do not own may violate that website’s Terms of Service, so always check the TOS first or only use this tool on your own website. We are going to use Python’s Requests and BeautifulSoup to extract hrefs.
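At its core, extracting hrefs is just a GET request and a parse. Here is a minimal sketch of that idea on its own (the Crawler class used below wraps the same thing with a session, an ignore list, and error handling):

import requests
from bs4 import BeautifulSoup

response = requests.get("http://www.acostanza.com", timeout=5)
soup = BeautifulSoup(response.text, "html.parser")
# Collect the href of every anchor tag that actually has one
hrefs = [a['href'] for a in soup.find_all('a') if a.has_attr('href')]
print(hrefs)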

Let’s start by extracting all of the links from one page and printing them to the screen:

from Crawler import Crawler

ignore = ['#','mailto','tag']
c = Crawler("acostanza.com", ignore)
urls = c.getLinksFromURL("http://www.acostanza.com")
for url in urls:
    print(url)

Running python program.py (where program.py is the name of your program) will output something like below (just a few of the URLs, didn’t want to paste them all):

....
https://help.ubuntu.com/community/Screen
https://virtualenv.pypa.io/en/stable/userguide/
https://www.python.org/downloads/release/python-360/
http://phantomjs.org/
https://developers.google.com/apps-script/guides/triggers/events
https://github.com/adcostanza/express-typescript-auth-jwt
....

Now that we have done that, let’s take the links we extracted, crawl the internal URLs among them to get even more links, and add them all together:

more_urls = c.getLinksFromURLs(c.filterInternalURLs(c.rebaseURLs(urls)))

all_links = list(set(urls + more_urls))
for link in all_links:
    print(link)

Running the script again returns the following, which has some of the older URLs linked deeper in my website from when I first started posting:

....
https://flask-restful.readthedocs.io/en/latest/
http://console.particle.io/devices
https://auth0.com/
http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
https://learn.adafruit.com/photocells/using-a-photocell
https://github.com/adcostanza/Angular-4-Barebones-Authentication-with-PHP-Sessions
....

Finally, let’s check the status code of all of the links and filter based on status code using a list comprehension:

url_status_tuple = c.getStatusOfURLs(c.rebaseURLs(all_links))
errors = [x for x in url_status_tuple if not x[1] == 200]
print(errors)

Running the program again will now return the status codes of all the links crawled:

....
https://learn.sparkfun.com/tutorials/sik-experiment-guide-for-arduino---v32/experiment-3-driving-an-rgb-led 200
https://developers.google.com/sheets/api/ 200
http://acostanza.com 200
http://gk.site5.com/t/587 200
http://acostanza.com/2017/10/03/simple-sms-angular-app/ 200
https://i.imgur.com/84NGwff.mp4 200
https://support.google.com/docs/answer/3093343?hl=en 200
http://acostanza.com/json-form-generator-angular4/ 200
https://github.com/adcostanza/Angular-4-Barebones-Authentication-with-PHP-Sessions 200
http://phantomjs.org/ 200
http://chaijs.com/ 200

[]

And there you go: you have officially extracted many links from one page automatically, retrieved all of the links on the internal domain pages you found, and finally checked the status codes of hundreds of links! Not bad. The empty list [] at the end is there because every page returned a status code of 200, meaning it loaded successfully; any broken links would show up as (url, status) tuples in that list. Finally, here is all of this in one shot:

from Crawler import Crawler

ignore = ['#','mailto','tag']
c = Crawler("acostanza.com", ignore)
urls = c.getLinksFromURL("http://www.acostanza.com")
for url in urls:
    print(url)

more_urls = c.getLinksFromURLs(c.filterInternalURLs(c.rebaseURLs(urls)))

all_links = list(set(urls + more_urls))
for link in all_links:
    print(link)

url_status_tuple = c.getStatusOfURLs(c.rebaseURLs(all_links))
errors = [x for x in url_status_tuple if not x[1] == 200]
print(errors)
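And since all of the control is in your hands, nothing stops you from confining the whole flow to your own custom method, as mentioned earlier. Here is a minimal sketch of what that might look like (the find_broken_links function is my own naming, not part of the repo):

from typing import List, Tuple
from Crawler import Crawler


def find_broken_links(domain: str, start_url: str, ignore: List[str]) -> List[Tuple[str, int]]:
    # Hypothetical convenience wrapper around the same steps shown above
    c = Crawler(domain, ignore)
    urls = c.getLinksFromURL(start_url)
    more_urls = c.getLinksFromURLs(c.filterInternalURLs(c.rebaseURLs(urls)))
    all_links = list(set(urls + more_urls))
    url_status_tuple = c.getStatusOfURLs(c.rebaseURLs(all_links))
    return [x for x in url_status_tuple if not x[1] == 200]


print(find_broken_links("acostanza.com", "http://www.acostanza.com", ['#', 'mailto', 'tag']))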

Warning: This tool is experimental and some websites will block crawlers like this or persistent requests like the ones we are making, so sometimes a proxy is required and not all websites can be crawled this way (e.g. Google will ban you if you try this on a Google search).
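If you do run into blocking, one lightweight thing to try (just a sketch, and no guarantee it gets you past any particular site) is to set a browser-like User-Agent on the Crawler’s underlying requests session and pause between requests:

import time
from Crawler import Crawler

c = Crawler("acostanza.com", ['#', 'mailto', 'tag'])
# c.s is the requests session the Crawler uses for every request
c.s.headers.update({"User-Agent": "Mozilla/5.0 (compatible; BrokenLinkChecker/0.1)"})

for url in c.rebaseURLs(c.getLinksFromURL("http://www.acostanza.com")):
    c.getStatusOfURL(url)  # prints the URL and its status code
    time.sleep(1)          # be polite and avoid hammering the server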

Installing locally

To install this locally, you will want to type the following:

git clone https://github.com/adcostanza/Broken-Link-Builder
cd Broken-Link-Builder
virtualenv -p python3 env
source ./env/bin/activate
pip install -r requirements.txt

The Code

So how does this all work? Let’s see the Crawler class implementation below:

from typing import List, Tuple
import requests
from bs4 import BeautifulSoup
import time


class Crawler:
    def __init__(self, domain: str, ignoreList: List[str]):
        """
        Domain should NOT have http:// in it or a / at the end
        """
        self.s = requests.session()
        self.domain = domain
        self.ignoreList = ignoreList
        self.visitedLinks = []

    def getURL(self, url: str) -> str:
        """
        Retrieve HTML from url that belongs to self.domain
        """
        try:
            if self.domain not in url:
                raise Exception("Cannot get links for external URL", url)
            response = self.s.get(url, timeout=2)
            status_code: int = response.status_code
            self.visitedLinks.append(url)

            if not status_code == 200:
                raise Exception("Status code", status_code, url)

            html: str = response.text

            return html
        except requests.exceptions.Timeout:
            print("Timeout on URL", url)
            print("Sleeping 5 seconds")
            time.sleep(5)
            pass
        except requests.exceptions.ConnectionError:
            print("Connection error on URL", url)
            pass

    def getLinksFromHTML(self, html: str) -> List[str]:
        """
        Returns list of links from HTML retrieved from requests and ignoring ignore phrases
        """
        if html is None:
            # getURL returns None on a timeout or connection error
            return []
        try:
            soup = BeautifulSoup(html, "html.parser")
            # Skip anchors without an href attribute to avoid a KeyError
            hrefs = [a['href'] for a in soup.find_all('a') if a.has_attr('href')]
            return list(set(self.ignore(hrefs)))
        except Exception as e:
            print("Failed to parse HTML:", e)
            return []

    def getLinksFromURL(self, url: str) -> List[str]:
        """
        Returns list of links from a single URL by chaining methods
        """
        html = self.getURL(url)
        links = self.getLinksFromHTML(html)
        return links

    def getLinksFromURLs(self, urls: List[str]) -> List[str]:
        """
        Returns list of links from multiple URLs by using getLinksFromURL in list comprehension
        """
        return list(set([link for url in urls for link in self.getLinksFromURL(url)]))

    def ignore(self, links: List[str]) -> List[str]:
        """
        Ignore links that have phrases from ignore list and return new list
        """
        return [link for link in links if not any(ignorePhrase in link for ignorePhrase in self.ignoreList)]

    @staticmethod
    def filterIncompleteURLs(urls: List[str]) -> List[str]:
        """
        Return all URLs that don't start with HTTP or HTTPS
        """
        return [url for url in urls if not url.startswith("http://") and not url.startswith("https://")]

    def filterInternalURLs(self, urls: List[str]) -> List[str]:
        """
        Return all URLs that start with self.domain
        """
        urls = list(set(map(lambda x: self.domain+x if x.startswith("/") and not x.startswith("//") else x, urls)))
        return [url for url in urls if url.startswith("http://"+self.domain) or url.startswith("https://"+self.domain)]

    def filterExternalURLs(self, urls: List[str]) -> List[str]:
        """
        Return all URLs that don't start with self.domain
        """
        return [url for url in urls if self.domain not in url]

    def rebaseURLs(self, urls: List[str]) -> List[str]:
        """
        Take in a list of URLs and return them properly formatted.
        Protocol-relative links that start with // get http: prepended.
        Links that start with / or #, and relative links such as index.php?home,
        will have the internal domain added to the front.
        """
        # Handle protocol-relative URLs first so they are not rebased onto self.domain
        new_urls = map(lambda url: "http:"+url if url.startswith("//") else url, urls)
        # Root-relative paths and fragments get the domain prepended
        new_urls1 = map(lambda url: "http://"+self.domain+url if url.startswith("/") or url.startswith("#") else url, new_urls)
        # Anything else without a scheme (e.g. index.php?home) is treated as relative to the domain root
        new_urls2 = map(lambda url: "http://"+self.domain+"/"+url if not (url.startswith("http://") or url.startswith("https://")) else url, new_urls1)
        return list(set(new_urls2))

    def getStatusOfURL(self, url: str) -> int:
        """
        Given a single URL, return the status code of that URL with requests (200 is success)
        """
        try:
            response = self.s.get(url, timeout=5)
            status_code: int = response.status_code
            print(url, status_code)
            return status_code
        except requests.exceptions.Timeout:
            # No real HTTP response came back; 504 is used as a timeout sentinel
            return 504
        except requests.exceptions.ConnectionError:
            # Unreachable host (e.g. a dead domain); 503 is used as a sentinel
            print(url, "connection error")
            return 503

    def getStatusOfURLs(self, urls: List[str]) -> List[Tuple[str, int]]:
        """
        Given a list of URLs, get the status of each URL and return in a List of Tuples of statuses and urls
        """
        return [(url, self.getStatusOfURL(url)) for url in urls]

 

As you can see, there are a lot of different methods that you can use independently of each other if you so desire. I have explicitly set a 5 second timeout on status checks, which means that if a page takes longer than 5 seconds to respond, the crawler records a 504 (timeout) for it rather than hanging. For broken link building you should be hunting for 404 responses, not 504s, since a timeout may just mean the site was slow at that moment.
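In other words, if your goal is specifically dead pages, you might filter for 404s rather than every non-200 status. Continuing from the full script above (the variable names match that script), something like:

url_status_tuple = c.getStatusOfURLs(c.rebaseURLs(all_links))
broken = [x for x in url_status_tuple if x[1] == 404]     # actual dead pages
timeouts = [x for x in url_status_tuple if x[1] == 504]   # may just be slow, worth re-checking
print("Broken links:", broken)
print("Timed out:", timeouts)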

Some Tests

I also wrote some tests for this crawler to ensure it was working properly, which should give you an idea of how you might use some of the features. Note that there is not 100% test coverage.

import unittest
from Crawler import Crawler

class CrawlerTest(unittest.TestCase):
    def setUp(self):
        """Called before each test"""
        self.ignore = ["tag", "respond"]
        self.c = Crawler("acostanza.com", self.ignore)

    def test_getLinksFromURL_DoesntHaveIgnoreListItems(self):
        urls = self.c.getLinksFromURL("http://acostanza.com")
        self.assertTrue(len(urls) > 0)
        for url in urls:
            for ig in self.ignore:
                if ig in url:
                    self.fail('{} is in {} but is on ignore list'.format(ig, url))

        # Only visited the base link
        self.assertEqual(1, len(self.c.visitedLinks))

    def test_getLinksFromMultipleURLs_DoesntHaveIgnoreListItems_AndHasMoreURLsThanSingleURL(self):
        urls = self.c.getLinksFromURLs(["http://acostanza.com", "http://acostanza.com/page/2/"])
        self.assertTrue(len(urls) > 0)
        for url in urls:
            for ig in self.ignore:
                if ig in url:
                    self.fail('{} is in {} but is on ignore list'.format(ig, url))

        urls_singleInitial = self.c.getLinksFromURL("http://acostanza.com")
        self.assertTrue(len(urls) > len(urls_singleInitial))

        # Visited both base links and then the initial link one more time
        self.assertEqual(3, len(self.c.visitedLinks))

    def test_getURLsTwoLevelsDeep(self):
        urls = self.c.getLinksFromURL("http://acostanza.com")
        more_urls = self.c.getLinksFromURLs(self.c.filterInternalURLs(urls))
        self.assertTrue(len(more_urls) > len(urls))

        check_links = list(set(urls + more_urls))
        url_status_tuple = self.c.getStatusOfURLs(check_links)
        errors = [x for x in url_status_tuple if not x[1] == 200]
        print(errors)

    def test_internalLinksAreInternal(self):
        urls = ["http://acostanza.com", "http://google.com"]
        filtered = self.c.filterInternalURLs(urls)
        self.assertEqual(filtered, [urls[0]])

    def test_externalLinksAreExternal(self):
        urls = ["http://acostanza.com", "http://google.com"]
        filtered = self.c.filterExternalURLs(urls)
        self.assertEqual(filtered, [urls[1]])

    def test_incompleteLinksAreIncomplete(self):
        urls = ["http://acostanza.com", "/subreddit"]
        filtered = Crawler.filterIncompleteURLs(urls)
        self.assertEqual(filtered, [urls[1]])

    def test_getStatusOfURL200(self):
        status = self.c.getStatusOfURL("http://acostanza.com")
        self.assertEqual(200, status)

    def test_getStatusofURL404(self):
        status = self.c.getStatusOfURL("http://acostanza.com/potato")
        self.assertEqual(404, status)

    def tearDown(self):
        """Just for reference"""
        self.c.s.close()
        self.c = []


if __name__ == '__main__':
    unittest.main()

Questions?

And that’s it! If you have any questions, don’t hesitate to send me a quick email at adam@acostanza.com.