REU: CSR: Small: Large-Scale Web Crawling and Spam Avoidance in Search-Engine Applications
Search engines and various data-mining applications commonly rely on web crawlers to navigate the web, discover valuable content, and keep it fresh. However, the enormous volume of available information and the sophisticated spam techniques used to deceive search engines pose significant challenges to web crawling, especially in non-commercial applications such as research. The first part of this project designs efficient real-time graph-manipulation algorithms and builds a high-performance distributed web-crawler architecture that seamlessly couples components from Internet-scale networking, information retrieval, and graph theory. The second part creates probabilistic techniques for quick estimation of domain reputation and explores various ranking techniques to achieve better robustness against spam. The third part designs advanced budgeting mechanisms to control the crawl rate of different parts of the web at multiple levels of granularity. The project is expected to engage students at Texas A&M in research-intensive education in cross-disciplinary fields (such as data-intensive computing, networking, graph theory, distributed systems, parallel architectures, and modeling), broaden integration of web research into classroom teaching, attract undergraduate students to REU, extend participation of minority groups in engineering, stimulate collaboration and the sharing of ideas among students, and permit web-related research at other institutions through the project's publicly shared outcomes.