Distributed Web Crawling with Tornado and Gearman

Welcome to Technology Tuesday – where the iAcquire development team will be sharing a little piece of our world with you once a week. Over the past 3 years we’ve been quietly working behind the scenes, building the technology that makes iAcquire’s services impossible to compete with. Today we’re going to share with you a method of high-performance distributed web crawling using the Tornado asynchronous Python web framework along with Gearman – a high-performance distributed task queue.

What Exactly Is A Distributed Web Crawler?

Let’s define the terms so we know what we’re talking about here:

  • Web Crawler – a computer program that pulls down a set of URLs and processes the pages in an automated fashion. A typical web crawler runs on a single computer and churns through a list of URLs in a linear fashion one by one, or in parallel using multiple simultaneous connections from a single computer.
  • Distributed Web Crawler – a web crawler that runs on multiple computers simultaneously, substantially boosting crawler performance.
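To make the difference between the linear and parallel styles concrete, here is a minimal single-machine sketch using only the Python standard library. It is not GearNado code: the function names (fetch, crawl_serial, crawl_parallel) and the thread pool are assumptions chosen for illustration, and threads stand in for the asynchronous connections Tornado would provide.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its body as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_serial(urls, fetcher=fetch):
    """Linear crawl: fetch the URLs one by one, in order."""
    return {url: fetcher(url) for url in urls}


def crawl_parallel(urls, fetcher=fetch, connections=8):
    """Parallel crawl: keep up to `connections` fetches in flight at once."""
    with ThreadPoolExecutor(max_workers=connections) as pool:
        # pool.map preserves input order, so bodies line up with urls.
        bodies = list(pool.map(fetcher, urls))
    return dict(zip(urls, bodies))
```

Both functions return the same mapping of URL to page body; only the scheduling differs. In this sketch a failed download raises and aborts the crawl, so a real crawler would want per-URL error handling.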

[Image: tweet rank visualization]

There Are Many Ways To Build A Web Crawler But None This Simple And Powerful

Sure, you can write yourself a Java crawler and deploy it on your own Hadoop cluster or use Amazon Elastic MapReduce. You could write custom plugins for an existing solution. Nutch comes to mind; it’s a nice crawler, but it’s not something you can use to gather live data. We’ve made heavy use of Nutch and we know what it takes to implement solid solutions on top of it: time. Today we’re presenting a simple distributed crawler pattern we’ve affectionately named GearNado. It brings two easy-to-use, wonderfully designed systems together in harmony. GearNado enables you to build out real-time crawl/parse operations to test your theories and get results faster than you’ll ever need.

Demo Time – Introducing TweetHandler And TweetScout

We’ve built a simple proof-of-concept Twitter username crawler on top of GearNado. Let’s say you have a list of authoritative URLs and you want to find the Twitter accounts that appear most frequently on the pages within the set. You don’t have time to sit around and wait for a desktop crawler app to crunch through the massive list of pages you’ve got. In comes our distributed real-time crawler – capable of fetching, parsing and analyzing over 50 pages per second on a single node, and capable of being distributed to a nearly limitless number of nodes.
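The "most frequent Twitter accounts" idea can be sketched in a few lines. This is not the TweetScout implementation: the regular expression, the NOT_ACCOUNTS list, and the function names are assumptions for illustration, and a plain regex stands in for a real HTML parser.

```python
import re
from collections import Counter

# Assumed pattern: links of the form twitter.com/<name>, where a name is
# 1-15 letters, digits or underscores. Real pages may need more rules.
TWITTER_LINK = re.compile(
    r"twitter\.com/(?:#!/)?@?([A-Za-z0-9_]{1,15})(?![A-Za-z0-9_])",
    re.IGNORECASE,
)

# Path segments that are site features rather than accounts (illustrative).
NOT_ACCOUNTS = {"share", "intent", "home", "search", "statuses"}


def find_handles(html):
    """Return the set of lowercased Twitter usernames linked from one page."""
    names = {name.lower() for name in TWITTER_LINK.findall(html)}
    return names - NOT_ACCOUNTS


def rank_handles(pages):
    """Count how many pages mention each username, most frequent first."""
    counts = Counter()
    for html in pages:
        # find_handles returns a set, so each page votes once per username.
        counts.update(find_handles(html))
    return counts.most_common()
```

Counting once per page rather than once per link keeps a single page with a dozen links to the same account from dominating the ranking.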

Get Yourself A Relevant URL List In Your Target Sector And Get Ready To Rock

There are numerous ways to obtain lists of authoritative URLs in the space you are researching. To get a good sample set for this post, I used a browser toolbar to export the top 400 results for the following Google keyword searches:

  • python web crawling
  • tornado web crawling
  • gearman web crawling

This resulted in a list of 1,134 URLs after filtering out PDFs and other unwanted URLs (twitter.com, etc.). Let’s see what it looks like when we process these with 30 TweetScout workers:

OK, I Get It – This Distributed Web Crawler Kicks Ass… How Do I Use It?

First off, you’re going to need a Linux machine to get started. Linux hosts millions of websites around the world and is increasingly popular for web crawling in particular. Commercial technical support for it can be harder to find, since relatively few providers offer it. I’m going to provide instructions for getting everything set up using Ubuntu, but just about any modern Linux distribution will do.

Setup Instructions For Ubuntu:

$ sudo apt-get install git gearman libgearman-dev python-setuptools build-essential libxml2-dev libxslt-dev python-dev
$ sudo easy_install pyquery gearman tornado progressbar

Note: If you want more than 1,024 simultaneous connections on a single machine, make sure you edit /etc/security/limits.conf and increase the soft and hard nofile limits.
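To see which open-file limits a process is actually running with, Python's standard resource module can report them. The sketch below is only a convenience check; the helper names and the `reserve` allowance are assumptions, not part of GearNado.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def can_open(connections, reserve=64):
    """Rough check: is the soft limit high enough for this many sockets?

    `reserve` is an assumed allowance for stdin/stdout, log files, etc.
    """
    soft, _hard = nofile_limits()
    if soft == resource.RLIM_INFINITY:
        return True
    return soft >= connections + reserve


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print("soft nofile limit:", soft, "hard nofile limit:", hard)
```

Each simultaneous connection consumes one file descriptor, which is why the soft limit caps how many a single worker process can hold open.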

Clone the Git Repo:

$ git clone git://github.com/iAcquire/gearnado.git
$ cd gearnado

Launch 30 TweetScout workers in one terminal:

gearnado$ for i in `seq 1 30`; do ./TweetScout.py & done
[I 120501 15:32:57 TweetScout:24] TweetScout initialized and ready for work

Run the TweetHandler in another terminal – you can use any text file containing a list of URLs, one per line, as input:

gearnado$ time ./TweetHandler.py --url_file=python_crawler_urls.txt
100% |##########################################|
Found 419 Twitter users in 1098 successfully parsed pages:

real    0m26.887s
user    0m0.172s
sys     0m0.036s
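The input file is plain text with one URL per line. A cleanup pass like the one described earlier (dropping PDFs and twitter.com links) could be sketched as follows; the rules and the names keep_url and clean_url_list are assumptions for illustration, not the exact filtering used to produce the sample list.

```python
from urllib.parse import urlparse

# Assumed rules, mirroring the filtering described in the post:
# skip PDFs and anything hosted on twitter.com.
BLOCKED_HOSTS = {"twitter.com"}
BLOCKED_SUFFIXES = (".pdf",)


def keep_url(url):
    """Return True if the URL is an http(s) page we want to crawl."""
    parts = urlparse(url.strip())
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    host = parts.hostname
    if host.startswith("www."):
        host = host[4:]
    if host in BLOCKED_HOSTS:
        return False
    return not parts.path.lower().endswith(BLOCKED_SUFFIXES)


def clean_url_list(lines):
    """Strip blanks, drop unwanted URLs, remove duplicates, keep order."""
    seen = set()
    result = []
    for line in lines:
        url = line.strip()
        if url and keep_url(url) and url not in seen:
            seen.add(url)
            result.append(url)
    return result
```

Writing the cleaned list back out with "\n".join(...) gives a file in the one-URL-per-line format that the --url_file option expects.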

Clone GearNado And The TweetScout Example On GitHub: https://github.com/iAcquire/gearnado

If you made it this far, make sure you follow me: @jeffnappi

Responses to “Distributed Web Crawling with Tornado and Gearman”

  1. Jeff Nappi says:

    Something quite important I didn’t address in the post – the tests shown here are in fact only running on a single node. In order to run it in a distributed manner one would just launch TweetScout workers on additional nodes with the --jobserver=master-server-ip:4730 parameter. I will be making a follow-up post about this in the coming weeks.

  2. Jon Hines says:

    So if I’m using this on just one terminal, will the setup above suffice? And if I have a list of URLs, how exactly do I input it?

    • Jeff Nappi says:

      Hi Jon, sorry for the delayed reply. I plan on providing an easy to use Amazon Machine Image with a web based interface to this in the future, but for now yes you only need a single Ubuntu instance to run this. One way to do this is to sign up for Amazon Web Services and start by launching this AMI: https://console.aws.amazon.com/ec2/home?region=us-east-1#launchAmi=ami-a29943cb

      Once you have everything set up by following the directions in the post, you would supply a list of URLs in a text file (one on each line) and pass the name of the text file to the TweetHandler.py script with the --url_file= parameter.

      Feel free to contact me via e-mail if you have more questions – jeff _a_ iacquire.com