Welcome to Technology Tuesday – where the iAcquire development team will be sharing a little piece of our world with you once a week. Over the past 3 years we’ve been quietly behind the scenes building the technology that makes iAcquire’s services impossible to compete with. Today we’re going to share with you a method of high performance distributed web crawling using the Tornado asynchronous Python web framework along with Gearman – a high performance distributed task queue.
What Exactly Is A Distributed Web Crawler?
Let’s define the terms so we know what we’re talking about here:
- Web Crawler – a computer program that pulls down a set of URLs and processes the pages in an automated fashion. A typical web crawler runs on a single computer and churns through a list of URLs in a linear fashion one by one, or in parallel using multiple simultaneous connections from a single computer.
- Distributed Web Crawler – a web crawler that runs on multiple computers simultaneously, substantially boosting crawler performance.
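To make the single-computer case concrete, here is a minimal sketch of the two styles described above: one URL at a time, and several simultaneous connections. It uses a thread pool from the Python standard library purely for illustration (GearNado itself relies on Tornado's asynchronous fetching rather than threads), and the `fetch` argument is a hypothetical stand-in for whatever function downloads a page.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_serial(urls, fetch):
    """Fetch every URL one by one, in order."""
    return [fetch(url) for url in urls]


def crawl_parallel(urls, fetch, connections=8):
    """Fetch URLs using several simultaneous connections on one machine.

    `connections` is an illustrative default, not a tuned value.
    Results come back in the same order as `urls`.
    """
    with ThreadPoolExecutor(max_workers=connections) as pool:
        return list(pool.map(fetch, urls))
```

A distributed crawler takes the same idea one step further by spreading the URL list across worker processes on many machines instead of threads on one.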
There Are Many Ways To Build A Web Crawler But None This Simple And Powerful
Sure, you can write yourself a Java crawler and deploy it on your own Hadoop cluster or use Amazon Elastic MapReduce. You could write custom plugins for an existing solution. Nutch comes to mind; it’s a nice crawler, but it’s not something you can use to gather live data. We’ve made heavy use of Nutch and we know what it takes to implement solid solutions on top of it – time. Today we’re presenting a simple distributed crawler pattern we’ve affectionately named GearNado. It brings two easy-to-use, wonderfully designed systems together in harmony. GearNado enables you to build out real-time crawl/parse operations to test your theories and get results faster than you’ll ever need.
Demo Time – Introducing TweetHandler And TweetScout
We’ve built a simple proof-of-concept Twitter username crawler on top of GearNado. Let’s say you have a list of authoritative URLs and you want to find the Twitter accounts that appear most frequently on the pages within the set. You don’t have time to sit around and wait for a desktop crawler app to crunch through the massive list of pages you’ve got. In comes our distributed real-time crawler – capable of fetching, parsing and analyzing over 50 pages per second on a single node and capable of being distributed to a nearly limitless number of nodes.
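To give a feel for the parsing step, here is a rough sketch of how Twitter usernames could be pulled out of a page and tallied. It is not the TweetScout code: it swaps in a plain regular expression from the standard library where the real project installs pyquery, the `NOT_USERS` list of non-account paths is an illustrative guess, and counting each account once per page is an assumption about what "appear most frequently" should mean.

```python
import re
from collections import Counter

# Links such as http://twitter.com/name or https://twitter.com/#!/name.
# Twitter usernames are at most 15 letters, digits or underscores.
TWITTER_LINK = re.compile(
    r"https?://(?:www\.)?twitter\.com/(?:#!/)?@?"
    r"([A-Za-z0-9_]{1,15})(?![A-Za-z0-9_])",
    re.IGNORECASE,
)

# Paths on twitter.com that are not accounts (illustrative, not exhaustive).
NOT_USERS = {"share", "intent", "search", "home", "statuses"}


def twitter_users(html):
    """Return the set of Twitter usernames linked from one page."""
    found = TWITTER_LINK.findall(html)
    return {name for name in found if name.lower() not in NOT_USERS}


def count_users(pages):
    """Count, for each username, how many pages link to it."""
    counts = Counter()
    for html in pages:
        counts.update(twitter_users(html))  # once per page (assumption)
    return counts
```

Calling `count_users(pages).most_common(10)` on a list of HTML strings would then give the ten most frequently linked accounts.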
Get Yourself A Relevant URL List In Your Target Sector And Get Ready To Rock
There are numerous ways to obtain lists of authoritative URLs in the space you are researching. To get a good sample set for this post, I used a browser toolbar to export the top 400 results for the following Google keyword searches:
- python web crawling
- tornado web crawling
- gearman web crawling
This resulted in a list of 1,134 URLs after filtering out PDFs and other unwanted URLs (twitter.com, etc). Let’s see what it looks like when we process these with 30 TweetScout workers:
OK, I Get It – This Distributed Web Crawler Kicks Ass… How Do I Use It?
First off, you’re going to need a Linux machine to get started. Linux hosts millions of websites around the world and is gaining popularity for web crawling in particular. The one drawback is that technical support can be hard to find, since relatively few providers support Linux. I’m going to provide instructions for getting everything set up using Ubuntu, but just about any modern Linux distribution will do.
Setup Instructions For Ubuntu:
$ sudo apt-get install git gearman libgearman-dev python-setuptools build-essential libxml2-dev libxslt-dev python-dev
$ sudo easy_install pyquery gearman tornado progressbar
Note: If you want to open more than 1024 simultaneous connections on a single machine, make sure you edit /etc/security/limits.conf and increase the soft/hard nofile limits.
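If you are not sure what your current limit is, a quick check from Python can help. The sketch below uses the standard `resource` module to read the open-file limit of the current process; the `can_hold` helper and its `headroom` default are hypothetical names and values made up for this example, meant only to leave room for log files, pipes and the like.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def can_hold(connections, headroom=64):
    """Rough check: do `connections` sockets fit under the soft limit?

    `headroom` reserves descriptors for everything that is not a crawl
    connection; 64 is an arbitrary illustrative value.
    """
    soft, _hard = nofile_limits()
    if soft == resource.RLIM_INFINITY:
        return True
    return connections + headroom <= soft


if __name__ == "__main__":
    print("soft/hard nofile limits:", nofile_limits())
    print("room for 2000 connections:", can_hold(2000))
```

Entries in limits.conf take the form `<domain> <type> <item> <value>`, so a line such as `* soft nofile 65536` (the value here is only an example) raises the soft limit for all users at their next login.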
Clone the Git Repo:
$ git clone git://github.com/iAcquire/gearnado.git
$ cd gearnado
Launch 30 TweetScout workers in one terminal:
gearnado$ for i in `seq 1 30`; do ./TweetScout.py & done
[I 120501 15:32:57 TweetScout:24] TweetScout initialized and ready for work
...
Run the TweetHandler in another terminal – you can use any text file containing a list of urls one per line as input:
gearnado$ time ./TweetHandler.py --url_file=python_crawler_urls.txt
100% |##########################################|
Found 419 Twitter users in 1098 successfully parsed pages:
getafreelancer,43
freelancer,43
elance,7
freecode_com,6
odesk,6
SlideShare,6
bestopensource,6
mongodb,6
scribd,5
theanti9,4
bytes,4
databigbang,4
devcomments,4
japerk,3
brutuscat,3
apponic,3
rap4,3
github,3
brainpicker,3
wowcity,3
OLTW,3
boxmonkey,3
donanza,3
sfnet_ops,3
lanyrd,3
zomobo,3
developerIT,3
Dzone,3
bwhitman,2
xntrik,2
repos_io,2
securityninja,2
newitfarmer,2
drchuck,2
CBM,2
deviantART,2
codinghorror,2
ordonnancement,2
srirangan,2
mpmorningsun,2
TNWapps,2
tedxsomerville,2
nkpart,2
MPRNews,2
dreamincode,2
looprecursion,2
...
real 0m26.887s
user 0m0.172s
sys 0m0.036s
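The report above boils down to merging what each worker found and ranking the totals. Here is a minimal sketch of that aggregation step, written from the output format alone rather than from the TweetHandler source. The function names are hypothetical, a failed page is represented as `None` by assumption, and ties are broken alphabetically, which is a choice made for this example and not necessarily what TweetHandler does.

```python
from collections import Counter


def summarize(page_results):
    """Merge per-page username lists into (counts, pages_parsed).

    Each item is the list of usernames found on one page, or None if
    that page could not be fetched or parsed (assumed convention).
    """
    counts = Counter()
    parsed = 0
    for users in page_results:
        if users is None:
            continue  # skip pages that failed
        parsed += 1
        counts.update(set(users))  # count each user once per page
    return counts, parsed


def format_report(counts, parsed):
    """Render a summary line followed by `name,count` rows."""
    lines = [
        "Found %d Twitter users in %d successfully parsed pages:"
        % (len(counts), parsed)
    ]
    # Highest count first; alphabetical tie-break is an assumption.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0].lower()))
    lines.extend("%s,%d" % (user, n) for user, n in ranked)
    return "\n".join(lines)
```

Feeding it the results from three pages, one of which failed, would look like `print(format_report(*summarize([["github", "mongodb"], None, ["github"]])))`.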
Clone GearNado And The TweetScout Example On GitHub: https://github.com/iAcquire/gearnado
If you made it this far, make sure you follow me: @jeffnappi