Abstract
Distributing crawling across multiple machines spreads the processing load and reduces the page-analysis work each machine must do. This paper presents the design of a distributed web crawler based on a peer-to-peer (P2P) network. The distributed crawler harnesses the excess bandwidth and computing resources of the nodes in the system to crawl the web. Each crawler is deployed on a computing node of the P2P network, where it analyzes web pages and generates indices. A control node is a separate node in charge of distributing URLs to balance the load across crawlers. The control nodes are themselves organized as a P2P network, and the crawler nodes managed by the same control node form a group. Based on its ID and the average load of its group, a crawler decides whether to transmit a URL to the control node or hold it itself. We present an implementation of the distributed crawler based on Igloo and simulate the environment to evaluate load balancing across the crawlers and crawl speed.
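The abstract does not spell out the rule a crawler uses to choose between holding a URL and passing it to its control node. The following is only a minimal sketch of one plausible reading, not the authors' algorithm. It assumes that crawler IDs are integers in the range 0 to group_size - 1, that a URL "belongs" to the crawler whose ID equals a stable hash of the URL modulo the group size, and that a crawler counts as overloaded when its queue exceeds a hypothetical `slack` multiple of the group's average load. The names `url_key`, `Crawler`, `route`, and `slack` are all illustrative.

```python
import hashlib
from dataclasses import dataclass, field


def url_key(url: str, id_space: int) -> int:
    """Map a URL to an integer in [0, id_space) using a stable hash."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % id_space


@dataclass
class Crawler:
    crawler_id: int   # assumed: integer ID within the group, 0..group_size-1
    group_size: int   # assumed: number of crawlers under one control node
    queue: list = field(default_factory=list)

    def load(self) -> int:
        """Current load, approximated here by the number of queued URLs."""
        return len(self.queue)

    def route(self, url: str, group_avg_load: float, slack: float = 1.5) -> str:
        """Return "hold" if this crawler keeps the URL, else "forward".

        Assumed rule: keep the URL only if it hashes to this crawler's ID
        and the crawler is not overloaded relative to the group average.
        """
        owns_url = url_key(url, self.group_size) == self.crawler_id
        overloaded = self.load() > slack * max(group_avg_load, 1.0)
        if owns_url and not overloaded:
            self.queue.append(url)
            return "hold"
        return "forward"  # hand the URL to the control node for redistribution
```

Under these assumptions, a lightly loaded crawler keeps the URLs that hash to its own ID, and everything else goes to the control node, which can reassign it within the group or pass it to another control node.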
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fei, L., Fan-Yuan, M., Yun-Ming, Y., Ming-Lu, L., Jia-Di, Y. (2004). Distributed High-Performance Web Crawler Based on Peer-to-Peer Network. In: Liew, KM., Shen, H., See, S., Cai, W., Fan, P., Horiguchi, S. (eds) Parallel and Distributed Computing: Applications and Technologies. PDCAT 2004. Lecture Notes in Computer Science, vol 3320. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30501-9_13
DOI: https://doi.org/10.1007/978-3-540-30501-9_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24013-6
Online ISBN: 978-3-540-30501-9
eBook Packages: Computer Science, Computer Science (R0)