Abstract
This paper describes a decentralized peer-to-peer model for building a Web crawler. Most current systems use a centralized client-server model: the crawl is performed by one or more tightly coupled machines, while the distribution of crawling jobs and the collection of crawled results are managed by a central system using a centralized URL repository. Centralized solutions are known to suffer from link congestion, a single point of failure, and expensive administration. They also require both horizontal and vertical scalability solutions to manage Network File Systems (NFS) and to load-balance DNS and HTTP requests.
In this paper, we present the architecture of a completely distributed and decentralized Peer-to-Peer (P2P) crawler called Apoidea, which is self-managing and exploits the geographical proximity of web resources to peers for a better and faster crawl. We use Distributed Hash Table (DHT) based protocols to perform the critical URL-duplicate and content-duplicate tests.
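As a rough illustration of how a DHT-style URL-duplicate test might work, the sketch below hashes each URL onto an identifier ring and asks the peer responsible for that point whether the URL has been seen. It is a minimal in-process model under stated assumptions, not Apoidea's implementation: the names `Peer`, `Ring`, `ring_id`, and `url_duplicate_test` are hypothetical, SHA-1 and successor-style placement are assumed, and network routing, peer churn, and geographical proximity are not modelled.

```python
import hashlib
from bisect import bisect_left


def ring_id(key: str) -> int:
    """Map a string to a point on a 160-bit identifier ring (SHA-1 assumed)."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")


class Peer:
    """Hypothetical crawler peer that remembers the URLs assigned to it."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.seen_urls = set()

    def is_duplicate(self, url: str) -> bool:
        """Return True if url was already recorded here; otherwise record it."""
        if url in self.seen_urls:
            return True
        self.seen_urls.add(url)
        return False


class Ring:
    """In-process stand-in for a DHT: every key has exactly one responsible peer."""

    def __init__(self, peers) -> None:
        # Sort peers by their position on the ring.
        self._entries = sorted(
            ((ring_id(p.name), p) for p in peers), key=lambda entry: entry[0]
        )
        self._ids = [entry[0] for entry in self._entries]

    def responsible_peer(self, url: str) -> Peer:
        """Pick the first peer at or after the URL's ring position, wrapping around."""
        idx = bisect_left(self._ids, ring_id(url)) % len(self._ids)
        return self._entries[idx][1]

    def url_duplicate_test(self, url: str) -> bool:
        """Delegate the duplicate test to the single peer responsible for url."""
        return self.responsible_peer(url).is_duplicate(url)


ring = Ring([Peer(f"peer-{i}") for i in range(4)])
first = ring.url_duplicate_test("http://example.org/a")   # URL not seen before
second = ring.url_duplicate_test("http://example.org/a")  # same URL, same peer
```

Because every peer computes the same mapping from URL to responsible peer, the test needs no central URL repository. A real deployment would route the lookup across the network and would likely replace the in-memory set with a more compact structure.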
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Singh, A., Srivatsa, M., Liu, L., Miller, T. (2004). Apoidea: A Decentralized Peer-to-Peer Architecture for Crawling the World Wide Web. In: Callan, J., Crestani, F., Sanderson, M. (eds) Distributed Multimedia Information Retrieval. DIR 2003. Lecture Notes in Computer Science, vol 2924. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24610-7_10
DOI: https://doi.org/10.1007/978-3-540-24610-7_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20875-4
Online ISBN: 978-3-540-24610-7
eBook Packages: Springer Book Archive