Encyclopedia of Database Systems

2018 Edition
Editors: Ling Liu, M. Tamer Özsu

Web Crawler Architecture

  • Marc Najork
Reference work entry
DOI: https://doi.org/10.1007/978-1-4614-8265-9_457

Synonyms

Robot; Spider; Web crawler

Definition

A web crawler is a program that, given one or more seed URLs, downloads the web pages associated with these URLs, extracts any hyperlinks contained in them, and recursively continues to download the web pages identified by these hyperlinks. Web crawlers are an important component of web search engines, where they are used to collect the corpus of web pages indexed by the search engine. Moreover, they are used in many other applications that process large numbers of web pages, such as web data mining, comparison shopping engines, and so on. Despite their conceptual simplicity, implementing high-performance web crawlers poses major engineering challenges due to the scale of the web. In order to crawl a substantial fraction of the “surface web” in a reasonable amount of time, web crawlers must download thousands of pages per second, and are typically distributed over tens or hundreds of computers. Their two main data structures – the...
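The download, extract, and follow loop described in the definition can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the architecture of any production crawler: the names `crawl`, `extract_links`, `to_visit`, `seen`, and `max_pages` are assumptions made for this example, and the `fetch` argument is a hypothetical stand-in for an HTTP download so that the sketch runs without network access. It uses a loop with an explicit queue rather than literal recursion, and it omits politeness delays, robots.txt handling, and distribution across machines.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of the <a> tags in one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute, fragment-free URLs for every hyperlink in html."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from the seed URLs.

    fetch(url) is assumed to return the page's HTML as a string,
    or None if the page could not be downloaded.
    """
    to_visit = deque(seeds)   # URLs discovered but not yet downloaded
    seen = set(seeds)         # URLs already queued, to avoid duplicates
    pages = {}                # url -> downloaded HTML
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return pages


# A three-page in-memory "web" (hypothetical URLs) standing in for real downloads.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b#top">B</a> <a href="/">home</a>',
    "http://example.test/b": "no links here",
}

pages = crawl(["http://example.test/"], FAKE_WEB.get)
```

Passing the download function in as a parameter keeps the traversal logic separate from network I/O. A real crawler would replace `FAKE_WEB.get` with an HTTP client, and at the scale described above it would also need to rate-limit requests per host and partition the URL space across machines.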

Recommended Reading

  1. Boldi P, Codenotti B, Santini M, Vigna S. UbiCrawler: a scalable fully distributed web crawler. Software Pract Exper. 2004;34(8):711–26.
  2. Brin S, Page L. The anatomy of a large-scale hypertextual search engine. In: Proceedings of the 7th International World Wide Web Conference; 1998. p. 107–17.
  3. Burner M. Crawling towards eternity: building an archive of the world wide web. Web Tech Mag. 1997;2(5):37–40.
  4. Cho J, Garcia-Molina H. Parallel crawlers. In: Proceedings of the 11th International World Wide Web Conference; 2002. p. 124–35.
  5. Eichmann D. The RBSE spider – balancing effective search against web load. In: Proceedings of the 3rd International World Wide Web Conference; 1994.
  6. Gray M. Internet growth and statistics: credits and background. http://www.mit.edu/people/mkgray/net/background.html
  7. Hafri Y, Djeraba C. High performance crawling system. In: Proceedings of the 6th ACM SIGMM International Workshop on Multimedia Information Retrieval; 2004. p. 299–306.
  8. Heydon A, Najork M. Mercator: a scalable, extensible web crawler. World Wide Web. 1999;2(4):219–29.
  9. Najork M, Heydon A. High-performance web crawling. Compaq SRC Research Report 173, Sept 2001.
  10. Raghavan S, Garcia-Molina H. Crawling the hidden web. In: Proceedings of the 27th International Conference on Very Large Data Bases; 2001. p. 129–38.
  11. Shkapenyuk V, Suel T. Design and implementation of a high-performance distributed web crawler. In: Proceedings of the 18th International Conference on Data Engineering; 2002. p. 357–68.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Google, Inc., Mountain View, USA

Section editors and affiliations

  • Cong Yu (1)
  1. Google Research, New York, USA