Unsupervised Domain Ranking in Large-Scale Web Crawls

abstract

  • With the proliferation of web spam and infinite auto-generated web content, large-scale web crawlers require low-complexity ranking methods to budget their limited resources effectively and allocate bandwidth to reputable sites. In this work, we assume crawls whose frontiers are orders of magnitude larger than RAM, where real-time sorting of pending URLs is infeasible. Under these constraints, the main objective is to quickly compute domain budgets and decide which domains can be crawled aggressively. Domains ranked at the top of the list receive large crawling allowances, while all others are visited at a small default rate. To shed light on Internet-wide spam avoidance, we study topology-based ranking algorithms on domain-level graphs from the two largest academic crawls: a 6.3B-page IRLbot dataset and a 1B-page ClueWeb09 exploration. We first propose a new methodology for comparing the various rankings and then show that in-degree BFS-based techniques decisively outperform classic PageRank-style methods, including TrustRank. However, since BFS incurs several orders of magnitude higher overhead and is generally infeasible for real-time use, we propose a fast, accurate, and scalable estimation method called TSE that achieves much better crawl prioritization in practice and is especially beneficial in applications with limited hardware resources.
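
    To make the abstract's budgeting policy concrete, below is a minimal Python sketch of the two-tier scheme it describes: domains are ranked by in-degree counted during a BFS from trusted seeds, the top of the list gets an aggressive page budget, and everything else falls back to a small default rate. The function names (`bfs_indegree_rank`, `allocate_budgets`), the toy graph, and all parameter values are illustrative assumptions; the paper's actual TSE estimator is not specified in this record.

    ```python
    from collections import defaultdict, deque

    def bfs_indegree_rank(graph, seeds):
        """Illustrative sketch: rank domains by the in-degree accumulated
        while breadth-first searching from trusted seed domains.
        `graph` maps a domain to the set of domains it links to."""
        indegree = defaultdict(int)
        seen = set(seeds)
        queue = deque(seeds)
        while queue:
            d = queue.popleft()
            for nbr in graph.get(d, ()):
                indegree[nbr] += 1          # count each link seen during BFS
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        # highest BFS in-degree first
        return sorted(seen, key=lambda d: indegree[d], reverse=True)

    def allocate_budgets(ranking, top_k=1000, aggressive=10_000, default=10):
        """Two-tier policy from the abstract: top-ranked domains receive an
        aggressive crawling allowance; all others a small default rate.
        The numeric budgets here are placeholders, not the paper's values."""
        budgets = {d: default for d in ranking}
        for d in ranking[:top_k]:
            budgets[d] = aggressive
        return budgets

    if __name__ == "__main__":
        toy = {
            "seed.edu": {"a.com", "b.org"},
            "a.com": {"b.org", "spam.biz"},
            "b.org": {"a.com"},
        }
        ranking = bfs_indegree_rank(toy, ["seed.edu"])
        print(allocate_budgets(ranking, top_k=2, aggressive=100, default=1))
    ```

    Note that this in-memory version exists only to show the control flow; at the 6.3B-page scale discussed in the abstract, the graph cannot fit in RAM, which is precisely why the paper argues full BFS is infeasible in real time and motivates a cheaper estimator.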

published proceedings

  • ACM Transactions on the Web

author list (cited authors)

  • Cui, Y., Sparkman, C., Lee, H., & Loguinov, D.

citation count

  • 0

complete list of authors

  • Cui, Yi; Sparkman, Clint; Lee, Hsin-Tsang; Loguinov, Dmitri

publication date

  • November 2018