Synonyms
Robot; Spider; Web crawler
Definition
A web crawler is a program that, given one or more seed URLs, downloads the web pages associated with these URLs, extracts any hyperlinks contained in them, and recursively continues to download the web pages identified by these hyperlinks. Web crawlers are an important component of web search engines, where they are used to collect the corpus of web pages indexed by the search engine. Moreover, they are used in many other applications that process large numbers of web pages, such as web data mining, comparison shopping engines, and so on. Despite their conceptual simplicity, implementing high-performance web crawlers poses major engineering challenges due to the scale of the web. In order to crawl a substantial fraction of the “surface web” in a reasonable amount of time, web crawlers must download thousands of pages per second, and are typically distributed over tens or hundreds of computers. Their two main data structures – the...
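The download, extract, and continue cycle described above can be illustrated with a minimal single-process sketch. It is not the architecture of any particular crawler. The names `crawl`, `extract_links`, `frontier`, `seen`, and `max_pages` are illustrative assumptions, and `fetch` is a hypothetical caller-supplied function, so the example runs against an in-memory site rather than the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute, fragment-free URLs for the hyperlinks in html."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from the seed URLs.

    fetch(url) is a hypothetical caller-supplied function that returns the
    page's HTML as a string, or None if the download failed.
    """
    frontier = deque(seeds)  # URLs waiting to be downloaded
    seen = set(seeds)        # URLs discovered so far, to avoid repeats
    pages = {}               # downloaded pages, keyed by URL
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


if __name__ == "__main__":
    # Toy in-memory "web" standing in for real HTTP downloads.
    site = {
        "http://example.test/a": '<a href="/b">b</a> <a href="c#top">c</a>',
        "http://example.test/b": '<a href="/a">back to a</a>',
        "http://example.test/c": "a page without links",
    }
    for url in sorted(crawl(["http://example.test/a"], site.get)):
        print(url)
```

A sketch like this leaves out everything that makes crawling at scale hard, such as parallel downloads, distribution across machines, and rate limiting per host. It only shows the control flow that such systems build on.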
© 2018 Springer Science+Business Media, LLC, part of Springer Nature
About this entry
Cite this entry
Najork, M. (2018). Web Crawler Architecture. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_457
DOI: https://doi.org/10.1007/978-1-4614-8265-9_457
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8266-6
Online ISBN: 978-1-4614-8265-9
eBook Packages: Computer Science; Reference Module Computer Science and Engineering