Web Crawler Architecture

  • Reference work entry

Synonyms

Robot; Spider; Web crawler

Definition

A web crawler is a program that, given one or more seed URLs, downloads the web pages associated with these URLs, extracts any hyperlinks contained in them, and recursively continues to download the web pages identified by these hyperlinks. Web crawlers are an important component of web search engines, where they are used to collect the corpus of web pages indexed by the search engine. Moreover, they are used in many other applications that process large numbers of web pages, such as web data mining, comparison shopping engines, and so on. Despite their conceptual simplicity, implementing high-performance web crawlers poses major engineering challenges due to the scale of the web. In order to crawl a substantial fraction of the “surface web” in a reasonable amount of time, web crawlers must download thousands of pages per second, and are typically distributed over tens or hundreds of computers. Their two main data structures – the...
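The download, extract, and follow loop described in the definition can be illustrated with a minimal single-machine sketch. It is not the architecture this entry goes on to describe: the names `crawl`, `extract_links`, `fetch_over_http`, and `max_pages` are hypothetical, the recursion is replaced by an explicit first-in, first-out queue, and the queue and the set of discovered URLs are this sketch's own choices. Politeness delays, robots.txt handling, and distribution across machines are all omitted.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute, fragment-free URLs for the hyperlinks in html."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def fetch_over_http(url):
    """Default downloader (hypothetical helper): fetch a page as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seeds, fetch=fetch_over_http, max_pages=100):
    """Breadth-first crawl from seeds; returns {url: html} for fetched pages."""
    frontier = deque(seeds)  # URLs waiting to be downloaded
    seen = set(seeds)        # URLs already discovered, to avoid re-queuing
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that cannot be downloaded
        pages[url] = html
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

Passing the downloader in as the `fetch` argument lets the loop be exercised against an in-memory dictionary of pages instead of the live web, for example `crawl(["http://example.test/"], fetch=lambda url: fake_web[url])`, where `fake_web` is an assumed mapping from URL to HTML text.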

Recommended Reading

  1. Boldi P, Codenotti B, Santini M, Vigna S. UbiCrawler: a scalable fully distributed web crawler. Software Pract Exper. 2004;34(8):711–26.

  2. Brin S, Page L. The anatomy of a large-scale hypertextual search engine. In: Proceedings of the 7th International World Wide Web Conference; 1998. p. 107–17.

  3. Burner M. Crawling towards eternity: building an archive of the world wide web. Web Tech Mag. 1997;2(5):37–40.

  4. Cho J, Garcia-Molina H. Parallel crawlers. In: Proceedings of the 11th International World Wide Web Conference; 2002. p. 124–35.

  5. Eichmann D. The RBSE spider – balancing effective search against web load. In: Proceedings of the 3rd International World Wide Web Conference; 1994.

  6. Gray M. Internet growth and statistics: credits and background. http://www.mit.edu/people/mkgray/net/background.html

  7. Hafri Y, Djeraba C. High performance crawling system. In: Proceedings of the 6th ACM SIGMM International Workshop on Multimedia Information Retrieval; 2004. p. 299–306.

  8. Heydon A, Najork M. Mercator: a scalable, extensible web crawler. World Wide Web. 1999;2(4):219–29.

  9. Najork M, Heydon A. High-performance web crawling. Compaq SRC Research Report 173, Sept 2001.

  10. Raghavan S, Garcia-Molina H. Crawling the hidden web. In: Proceedings of the 27th International Conference on Very Large Data Bases; 2001. p. 129–38.

  11. Shkapenyuk V, Suel T. Design and implementation of a high-performance distributed web crawler. In: Proceedings of the 18th International Conference on Data Engineering; 2002. p. 357–68.

Author information

Correspondence to Marc Najork.

Copyright information

© 2018 Springer Science+Business Media, LLC, part of Springer Nature

Cite this entry

Najork, M. (2018). Web Crawler Architecture. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_457
