How We Built Our 60-Node (Almost) Distributed Web Crawler



Web crawling is one of those tasks that is easy in theory (well, you visit some pages, extract the outgoing links, figure out which ones haven’t been visited yet, queue them up, pop the queue, visit the page, repeat) but really hard in practice, especially at scale. Launching a badly written web crawler is equivalent to unleashing a DoS attack (and with a distributed crawler, it becomes a DDoS attack!).

At Semantics3, we aggregate a large number of products from quite a few websites. Doing so requires some heavy-duty web crawling, and we have built a distributed web crawler to suit our needs. In this post I will describe the architecture of our web crawler, its implementation details and, finally, some future improvements.

tl;dr:

  1. 60-node cluster (the bulk of them spot micro-instances) running on Amazon AWS. We crawl 1-3 million pages a day at a cost of ~$3 a day (excluding storage costs).
  2. Built using Gearman for work distribution, following a map-reduce-style supervisor-worker architecture.
  3. Stack is Redis (supervisor) + Gearman (coordinator) + Perl workers on AWS. Chef and Capistrano are used to manage the server herd.
  4. Future plans are to use Bloom filters to keep track of visited URLs and to switch to a distributed database like Riak or Cassandra for storing the state of the crawl.

Evolution

Let me begin with how our crawling has evolved over the past few months. We started off with a single-threaded, single-process crawler written in 200 lines of Perl. As the number of requests increased, we rewrote it to run in a multi-threaded, multi-process manner. When even that was not enough, we went for bigger machines (more CPU power and more memory), like the high-CPU extra-large EC2 instances. Finally, even this didn’t seem adequate, and it became pretty obvious that we had to figure out a way to run our web crawler on multiple machines.

The original plan was to go with something out of the box, and there were two solutions we investigated. One was 80Legs, a web-crawler-as-a-service API. The other was running Nutch on a Hadoop cluster. The former was very restrictive, as we had no control over how the pages were crawled, while the latter required a lot of grokking of internals to customize crawling to our needs.

We came to the conclusion that building our own web crawler architecture made a lot of sense. Another reason was that we had quite a few other tasks (processing our data, resizing images, etc.) that needed to run in a distributed manner. Our web crawler is written as just one type of job in a distributed job system built to suit our needs.

Architecture

High level overview

Our architecture is based on a supervisor-worker model and follows the map-reduce pattern. The supervisor controls the state of each web crawl and determines which page to crawl next.

The workers are the ones that do the actual work (duh) – in this case, downloading the pages and processing the HTML. Once the workers have finished, they update the supervisor with details such as the HTTP status of the page download and the links discovered. They then store the processed HTML content in our database.

Failed pages are retried (5XX responses) or discarded (404s). If the number of failed URLs crosses a threshold, the crawl is immediately paused.
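
As a rough sketch of that policy (the threshold and field names below are made up for illustration and are not our actual code), the decision for each fetch result looks something like this:

    # Illustrative failure-handling policy; names and threshold are made up.
    sub handle_fetch_result {
        my ($crawl, $url, $status) = @_;
        return if $status == 200;                       # success, nothing to do
        if ($status >= 500 && $status <= 599) {         # server error: retry later
            push @{ $crawl->{retry_queue} }, $url;
        }                                               # 404s are simply dropped
        if (++$crawl->{failed_count} > $crawl->{max_failed}) {
            $crawl->{status} = 'paused';                # too many failures: pause the crawl
        }
    }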

 

[Image: (c) Mark Handley – http://bit.ly/QSX5q8]

 

Implementation Details

Perl is our weapon of choice. We do a lot of text processing, and Perl has been a godsend (though the quirky syntax sometimes annoys me). We use it for pretty much everything except our API, which is written in Node.js.

The actual distribution of tasks is done through Gearman, a work distribution library that could be called the bedrock of our distributed web crawler. It handles concurrency-related issues and failures and, most importantly, figures out which machine to farm work out to, all automagically and without any fuss. The best thing about Gearman is that it takes care of horizontal scaling for us: we can just keep spinning up more instances, and Gearman will auto-detect them and start farming work out to them.
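
To give an idea of how little ceremony this involves, here is a minimal Gearman worker and client in Perl using the Gearman::Worker and Gearman::Client CPAN modules (the host name, port and function body are placeholders, not our production code):

    # worker.pl - registers a 'crawl_page' function with the Gearman server
    use strict;
    use warnings;
    use Gearman::Worker;

    my $worker = Gearman::Worker->new;
    $worker->job_servers('gearman.example.internal:4730');   # placeholder host

    $worker->register_function(crawl_page => sub {
        my $job = shift;
        my $url = $job->arg;
        # ... download $url, process the HTML, store it, report back ...
        return "done:$url";
    });

    $worker->work while 1;   # block here, pulling jobs as Gearman hands them out

    # client.pl - submits a URL; Gearman picks an available worker for us
    use strict;
    use warnings;
    use Gearman::Client;

    my $client = Gearman::Client->new;
    $client->job_servers('gearman.example.internal:4730');
    $client->dispatch_background(crawl_page => \'http://www.example.com/');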

If Gearman is the bedrock of our distributed system, then Redis is the oracle. Redis holds the state of all web crawls: the status of each crawl (running, paused, cancelled), the URL crawl queue and a hash of all the URLs that have been visited so far.
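
A rough sketch of how that state might be laid out, using the Redis CPAN module (the key names here are illustrative, not our actual schema):

    use strict;
    use warnings;
    use Redis;

    my $redis = Redis->new(server => 'localhost:6379');
    my $id = 'crawl:example-store';   # one key namespace per crawl (illustrative)

    # Status of the crawl: running | paused | cancelled
    $redis->set("$id:status" => 'running');

    # URL crawl queue as a sorted set, scored by priority
    $redis->zadd("$id:queue", 0.8, 'http://www.example.com/products/123');

    # Visited URLs as a hash (URL => status of the last fetch)
    $redis->hset("$id:visited", 'http://www.example.com/products/123', 200);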

Our entire distributed system is built on top of Amazon AWS:

  • The ‘Supervisor’ runs on a small instance and hosts Redis along with a Perl script that figures out which URL to crawl next and sends it to the Gearman server (a sketch of this loop follows this list).
  • The ‘Gearman’ server runs on a micro instance and farms out work to all the workers that have registered with it.
  • The ‘Workers’ all run on spot micro-instances. At any given time we have about 20-50 worker instances running, depending on our workload. The ‘Worker’ script runs as a single process, in an evented manner. The workers update the ‘Supervisor’ with the status of the crawl and write the processed HTML to our ‘DB’ servers, which run MongoDB. Another design feature is that the ‘Workers’ are stateless – they can be spun up or terminated at will without affecting job state.
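
Here is the kind of loop the ‘Supervisor’ script runs – a simplified sketch only, with illustrative key names and hosts, tying the Redis queue to Gearman’s background jobs:

    use strict;
    use warnings;
    use Redis;
    use Gearman::Client;

    my $redis  = Redis->new(server => 'localhost:6379');
    my $client = Gearman::Client->new;
    $client->job_servers('gearman.example.internal:4730');   # placeholder host

    my $id = 'crawl:example-store';   # illustrative key namespace

    while (1) {
        last if ($redis->get("$id:status") // '') ne 'running';

        # Highest-priority URL first (the queue is a sorted set scored by importance).
        my ($url) = $redis->zrevrange("$id:queue", 0, 0);
        if (!defined $url) { sleep 1; next; }
        $redis->zrem("$id:queue", $url);

        # Fire-and-forget: some worker picks this up and reports back when done.
        $client->dispatch_background(crawl_page => \$url);
    }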

The actual crawling is done 10 pages at a time from each domain. We resolve and cache the DNS lookup, keep the TCP connection open and fetch the 10 pages asynchronously.
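
A minimal sketch of that evented fetching using AnyEvent::HTTP (this is not our actual worker code; the URLs are placeholders and error handling is omitted):

    use strict;
    use warnings;
    use AnyEvent;
    use AnyEvent::HTTP;

    my @batch = map { "http://www.example.com/products/$_" } 1 .. 10;

    my $cv = AnyEvent->condvar;
    for my $url (@batch) {
        $cv->begin;
        http_get $url,
            persistent => 1,          # reuse the TCP connection where possible
            sub {
                my ($body, $hdr) = @_;
                # $hdr->{Status} holds the HTTP status (5XX => retry, 404 => discard)
                # ... hand $body off for HTML processing ...
                $cv->end;
            };
    }
    $cv->recv;   # block until all 10 fetches have completed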

To manage our server farm, we use Chef to configure all servers based on their roles and Capistrano for the actual deployment from our Git repos.

In terms of costs, we are able to crawl about 1-3 million pages daily for just $3 a day (excluding storage costs).

Why ‘Almost’ Distributed?

The weak point in our system is the ‘Supervisor’, which constitutes a single point of failure (to somewhat mitigate this, we use replication and also persist the Redis state to disk), and hence our system is not purely distributed.

The primary reason is that there is no distributed version of Redis, and we are addicted to its in-memory data structures. We use pretty much all of them – sets, sorted sets, lists and hashes – extensively.

Possible alternatives which we are investigating are distributed databases such as Riak and Cassandra to replace the role of Redis.

Data Structures

We use a priority-search-based graph traversal for our web crawling. It’s basically breadth-first search, but with a priority queue instead of a plain queue to determine which URL to crawl next.

We assign an importance score to each URL we discover and then crawl them in that order. We use Redis sorted sets to store the priority associated with each URL and hashes to store the visited status of the discovered URLs. This, of course, comes with a large memory footprint.
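
In terms of Redis operations, the bookkeeping for newly discovered links looks roughly like this (a sketch with illustrative key names, not our production code):

    use strict;
    use warnings;
    use Redis;

    my $redis = Redis->new(server => 'localhost:6379');
    my $id = 'crawl:example-store';   # illustrative key namespace

    # Called for every link discovered on a freshly crawled page.
    sub enqueue_url {
        my ($url, $score) = @_;
        # HSETNX returns 1 only the first time we see this URL,
        # so already-seen URLs never re-enter the queue.
        return unless $redis->hsetnx("$id:visited", $url, 0);
        $redis->zadd("$id:queue", $score, $url);
    }

    enqueue_url('http://www.example.com/products/456', 0.9);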

A possible alternative would be to use Bloom filters. A Bloom filter is a probabilistic data structure used for answering set-membership questions (e.g. has this URL been crawled before?). Due to its probabilistic nature, it can give erroneous results in the form of false positives, but you can tweak the error rate so that only a small fraction of answers are wrong. The great benefit is the amount of memory you save (it is far more memory-efficient than Redis hashes). If we start crawling pages in the hundreds of millions, we would definitely have to switch to this data structure. As for the false positives, well, there ain’t no harm in occasionally crawling the same page twice.
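
For the curious, here is a toy Bloom filter in plain Perl (the sizes and hash count are arbitrary and untuned; a real deployment would use a proper CPAN module with carefully chosen parameters):

    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    my $m    = 8 * 1024 * 1024;       # 1 MB bit array
    my $k    = 7;                     # number of hash functions
    my $bits = "\0" x ($m / 8);

    sub _positions {
        my ($key) = @_;
        # Derive k bit positions from an MD5 digest (double-hashing trick).
        my ($h1, $h2) = unpack('N2', md5($key));
        return map { ($h1 + $_ * $h2) % $m } (0 .. $k - 1);
    }

    sub bloom_add {
        my ($key) = @_;
        vec($bits, $_, 1) = 1 for _positions($key);
    }

    sub bloom_check {
        my ($key) = @_;
        for my $pos (_positions($key)) {
            return 0 unless vec($bits, $pos, 1);   # definitely not seen
        }
        return 1;                                  # probably seen (may be a false positive)
    }

    bloom_add('http://www.example.com/products/123');
    print bloom_check('http://www.example.com/products/123') ? "seen\n" : "new\n";
    print bloom_check('http://www.example.com/products/456') ? "seen\n" : "new\n";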

Recrawling

Since one of our tasks is building price histories of products, recrawling needs to be done intelligently. We want to recrawl products whose prices change frequently (duh) more often than pages whose prices hardly ever change, so a brute-force complete recrawl doesn’t really make sense.

We use a power-law distribution for recrawling products (some pages are crawled much more frequently than others), with pages ranked based on their importance (using signals ranging from previous price-history changes to how many reviews a product has). Another challenge is page discovery (e.g. if a new product has just been launched, how quickly can we get it into our system?).
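
None of our actual constants or signals are in this post, but the shape of the idea can be sketched like this (the base interval and exponent below are purely illustrative):

    # Illustrative only: map an importance rank to a recrawl interval
    # following a power law. Constants are made up for the example.
    sub recrawl_interval_hours {
        my ($rank) = @_;               # 1 = most important product
        my $base  = 6;                 # top products roughly every 6 hours
        my $alpha = 0.5;               # controls how quickly the interval grows
        return $base * $rank ** $alpha;
    }

    printf "rank %5d -> recrawl every %.0f hours\n", $_, recrawl_interval_hours($_)
        for 1, 10, 100, 1000;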

I will be writing a post detailing our recrawl strategies sometime in the future.

Resources

Here are some academic papers for reference:

Building Blocks Of A Scalable Webcrawler

Web Crawling and Indexes

Crawling The Web

Finally, a really good post I came across on Hacker News, How To Crawl A Quarter Billion Webpages in 40 Hours, motivated me to write this post.

Conclusion

Building your own distributed web crawler is a really fun challenge, and you get to learn so much about distributed systems, concurrency, scaling, databases, economics, etc. But, once again, you may want to evaluate the pros and cons before you decide to roll your own. If what you need is plain ol’ vanilla crawling, then you may want to just use an open-source web crawler like Nutch.

Feel free to drop me an email (varun [at] semantics3.com) or leave a comment if you have any questions, suggestions or even improvements.

 

Sign-up

We just launched the closed beta of our API. We would be most glad if you could give it a try. Here is the signup link (it comes preloaded with the invitation code):

Don’t hesitate to share it with your developer friends.

Sivamani Varun

Recent San Francisco transplant from Singapore. Less recent graduate from the National University of Singapore. One-time hardware engineer. Now, a recovering perl hacker. Part-time business guy at Semantics3.


  • Sandeep Gupta

    Nice write up.

    • http://spinn3r.com/ Kevin Burton

      If you would like to save even more cash we’d love to work with you guys for social media and blog data over at Spinn3r.  

      We’re usually 1/10th the cost of doing this yourself.  

      Onward!

  • http://twitter.com/danielvf Daniel Von Fange

    Thanks!

    Why did you choose micro instances? Because your CPU load was very uneven? Or to spread your crawl over more IP addresses?

    • http://twitter.com/vinothgopi Vinoth Gopi

      Yes, micro instances are well suited for short bursts of CPU usage. In fact, compared to a small instance, a micro instance has more ECUs. Spreading crawls over more IPs is a reason too. Another factor was cost: since we are using spot micro instances, we are able to keep costs really low while allowing our system to scale based on our needs.

  • http://www.simplicidade.org/notes/ melo

    Good stuff.

    Question: why do you need Gearman? Redis seems to have all the features needed to implement job distribution:

     * BLPOP or BRPOPLPUSH
     * PubSub

    Just curious, nothing wrong with Gearman, it just looked like a simpler solution and one less piece of software to manage.

    • Varun

      Hi Melo,

      Apologies for the late reply, I completely missed your message.

      Thank you very much for the great work you have put in for the perl redis cpan module. We use it extensively!

      When I was doing my research I evaluated Zeromq, Rabbitmq, Beanstalkd and Gearman.
      Other than the simplicity and good pedigree (Danga), a main reason I picked Gearman was because it was used extensively at large loads in big companies (Craigslist, Yahoo, Instagram – they use Gearman to farm out work to 200 machines).

      I couldn’t find any case study regarding Redis worker queues in a large operational setting and went for the more conventional choice.

      But you are correct. It’s certainly one less piece of software to manage and more importantly has persistence baked in. Thanks for pointing it out!

      –sv

  • ts777

    Thanks for the post, some good points are mentioned. Web-crawling is a very tedious task indeed, it is always good to investigate the solutions…
     However, with $3 per day you can only get 7 micro instances from AWS. Each instance will have to process as many as 100 to 300 pages per minute (1-3 million per day across 7 instances). Is that correct?

    Raymond,
    Cheers

    • http://www.simplicidade.org/notes/ melo

      They are using Spot instances, so $3 can get them more instances, it just fluctuates a bit.

  • http://www.marc-seeger.de Marc Seeger

    Thanks for Linking to my thesis :)
    Glad to see that we all run into the same problems and solutions (bloom filters, redis, …)

  • Qing21th

    How do I get an invitation code?

    • Varun

      The invite code is ‘ramanujan’.

  • Rob

    Thanks for the post.
    Which AMI did you use? I am looking for the smallest one (EBS, 1 GB).

  • http://profile.yahoo.com/LGLVDPT2SEIAH4GQ22VVYLRBLE Alexander

    Great post!

    Here is how I implement distributed data extraction systems using Scala + Akka: http://chepurnoy.org/blog/2013/02/akka-based-actors-approach-to-data-extraction-system-design/

  • Andy

    I’m a UX specialist working on a startup requiring crawling. I have no experience coding, but want a developer with crawling experience to join the team. My startup is Chicago based and financially backed and would like someone to come on as a CTO. What skills and experience should I be asking about when networking or interviewing?

  • Ning Shi

    I think your description of the caveat of using a Bloom filter for crawling may be incorrect. A false positive actually means that the Bloom filter will tell you that you have already visited a URL when you actually haven’t. The consequence of this is that you will miss crawling some web pages rather than crawling the same page twice.

  • Remus

    I realise that it’s been 2 years since this article was written, but I have to say that it is quite educational, and I appreciate you taking the time to write about the solution.

    One thing I’m curious about is what kind of technologies and principles you have used to extract the relevant data from the content. I’ve read most of your blog posts but you haven’t touched on this topic. If you can discuss it, of course.

    Cheers.