IRLbot: scaling to 6 billion pages and beyond (Conference Paper)

abstract

This paper shares our experience in designing a web crawler that can download billions of pages using a single-server implementation and models its performance. We show that with the quadratically increasing complexity of verifying URL uniqueness, BFS crawl order, and fixed per-host rate-limiting, current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly-branching spam, legitimate multi-million-page blog sites, and infinite loops created by server-side scripts. We offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the web graph with 41 billion unique nodes.

name of conference

Proceedings of the 17th international conference on World Wide Web

authors

Loguinov, Dmitri

published proceedings

Proceedings of the 17th international conference on World Wide Web

author list (cited authors)

Lee, H., Leonard, D., Wang, X., & Loguinov, D.

citation count

42

complete list of authors

Lee, Hsin-Tsang||Leonard, Derek||Wang, Xiaoming||Loguinov, Dmitri

editor list (cited editors)

Huai, J., Chen, R., Hon, H., Liu, Y., Ma, W., Tomkins, A., & Zhang, X.

publication date

January 2008

publisher

Association for Computing Machinery (ACM)