Encyclopedia of Database Systems

2018 Edition
| Editors: Ling Liu, M. Tamer Özsu

Incremental Crawling

  • Kevin S. McCurley
Reference work entry
DOI: https://doi.org/10.1007/978-1-4614-8265-9_196


Synonyms

Crawler; Spidering


Part of the success of the World Wide Web arises from its lack of central control, because it allows every owner of a computer to contribute to a universally shared information space. The web's size and lack of central control present a challenge for any global calculation that treats the web as a distributed database. The scalability issue is typically handled by creating a central repository of web pages that is optimized for large-scale calculations. The process of creating this repository consists of maintaining a data structure of URLs to fetch; URLs are selected from it, their content is fetched, and the repository is updated. This process is called crawling or spidering.
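The fetch loop described above can be sketched as follows. This is a minimal illustration, not any particular crawler's implementation: the names (`frontier`, `fetch`, `extract_links`, `repository`) and the breadth-first ordering are assumptions made for the example.

```python
from collections import deque

def crawl(seed_urls, fetch, extract_links, max_pages=1000):
    """Minimal crawl loop: maintain a frontier of URLs to fetch,
    fetch each one, store its content in the repository, and enqueue
    newly discovered URLs."""
    frontier = deque(seed_urls)   # URLs waiting to be fetched
    seen = set(seed_urls)         # avoid enqueueing a URL twice
    repository = {}               # url -> fetched content
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        content = fetch(url)      # e.g., an HTTP GET in a real crawler
        if content is None:
            continue              # fetch failed; skip this URL
        repository[url] = content
        for link in extract_links(content):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository
```

A real crawler adds politeness delays, robots.txt handling, URL normalization, and persistence, but the repository-building cycle is the same.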

Unfortunately, maintaining a consistent shadow repository is complicated by the dynamic and uncoordinated nature of the web. URLs are constantly being created or destroyed, and contents of URLs may change without notice. As a result, there will always be URLs for which the...
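Because pages change without notice, an incremental crawler must decide which URLs to revisit and when. The refresh-policy literature (e.g., the Cho and Garcia-Molina paper in the reading list) commonly models each page's changes as a Poisson process, estimates a per-page change rate from past revisits, and prioritizes revisiting accordingly. A minimal sketch under that assumption, with illustrative names and a deliberately crude rate estimate:

```python
import math

def estimate_change_rate(changes_seen, time_monitored):
    """Crude per-page rate estimate: observed changes per unit time."""
    return changes_seen / time_monitored if time_monitored > 0 else 0.0

def refresh_priority(change_rate, age):
    """Probability the page has changed since it was last fetched,
    under a Poisson change model: 1 - exp(-rate * age)."""
    return 1.0 - math.exp(-change_rate * age)
```

Under this model, a page that changed 10 times in 100 days and was fetched 5 days ago outranks a page that changed once in 100 days and was fetched 20 days ago, which matches the intuition that fast-changing pages deserve more frequent revisits.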


Recommended Reading

  1. Cho J, Garcia-Molina H. Effective page refresh policies for web crawlers. ACM Trans Database Syst. 2003;28(4):390–426.
  2. Coffman Jr EG, Liu Z, Weber RR. Optimal robot scheduling for web search engines. J Sched. 1998;1(1):15–29.
  3. Dikaiakos MD, Stassopoulou A, Papageorgiou L. An investigation of web crawler behavior: characterization and metrics. Comput Commun. 2005;28(8):880–97.
  4. Edwards J, McCurley KS, Tomlin J. An adaptive model for optimizing performance of an incremental web crawler. In: Proceedings of the 10th International World Wide Web Conference; 2001. p. 106–13.
  5. Fielding R, Gettys J, Mogul J, Frystyk H, Masinter L, Leach P, Berners-Lee T. Hypertext transfer protocol – HTTP/1.1. RFC 2616. http://www.w3.org/Protocols/rfc2616/rfc2616.html
  6. Podlipnig S, Böszörmenyi L. A survey of web cache replacement strategies. ACM Comput Surv. 2003;35(4):374–98.
  7. Sitemap protocol specification. http://www.sitemaps.org/protocol.php
  8. Wang J. A survey of web caching schemes for the internet. ACM SIGCOMM Comput Commun Rev. 1999;29(5):36–46.
  9. Yuan X, MacGregor MH, Harms J. An efficient scheme to remove crawler traffic from the internet. In: Proceedings of the 11th International Conference on Computer Communications and Networks; 2002. p. 90–5.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Google Research, Mountain View, USA

Section editors and affiliations

  • Cong Yu
  1. Google Research, New York, USA