Part of the success of the World Wide Web arises from its lack of central control, which allows every owner of a computer to contribute to a universally shared information space. The web's size and lack of central control present a challenge for any global computation that treats the web as a distributed database. The scalability issue is typically handled by building a central repository of web pages that is optimized for large-scale computation. The repository is built by maintaining a data structure of URLs to fetch; URLs are selected from this structure, their content is fetched, and the repository is updated. This process is called crawling or spidering.
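The loop described above can be sketched in a few lines of Python. This is only an illustrative outline, assuming a simple FIFO frontier and a single-threaded fetch loop; the names `crawl`, `LinkExtractor`, and the in-memory `repository` dictionary are hypothetical, and a production crawler would add politeness delays, robots.txt handling, and persistent storage.

```python
# Minimal sketch of a crawl loop: maintain a frontier of URLs to fetch,
# select a URL, fetch its content, update the repository, and extend the
# frontier with newly discovered links. Illustrative only.
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href attributes from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)      # data structure of URLs to fetch
    repository = {}                  # URL -> fetched content
    seen = set(seed_urls)
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()     # select the next URL
        try:
            content = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                 # unreachable or vanished URLs are skipped
        repository[url] = content    # update the repository
        extractor = LinkExtractor()
        extractor.feed(content)
        for link in extractor.links: # discovered URLs extend the frontier
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return repository
```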
Unfortunately, maintaining a consistent shadow repository is complicated by the dynamic and uncoordinated nature of the web. URLs are constantly being created or destroyed, and the contents of a URL may change without notice. As a result, there will always be URLs for which the...