Distributed High-Performance Web Crawler Based on Peer-to-Peer Network

Fei, Liu; Fan-Yuan, Ma; Yun-Ming, Ye; Ming-Lu, Li; Jia-Di, Yu

doi:10.1007/978-3-540-30501-9_13

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3320)

Included in the following conference series:

International Conference on Parallel and Distributed Computing: Applications and Technologies

Abstract

Distributing crawling across multiple machines spreads the processing load, reducing the page-analysis work each machine must do. This paper presents the design of a distributed web crawler based on a peer-to-peer (P2P) network. The distributed crawler harnesses the excess bandwidth and computing resources of the nodes in the system to crawl the web. Each crawler is deployed on a computing node of the P2P network, where it analyzes web pages and generates indices. A control node is a separate node in charge of distributing URLs to balance the load across crawlers. The control nodes are themselves organized as a P2P network, and the crawler nodes managed by the same control node form a group. Based on its own ID and the average load of its group, a crawler decides whether to forward a URL to the control node or keep it for itself. We present an implementation of the distributed crawler based on Igloo and simulate the environment to evaluate load balance across the crawlers and crawl speed.
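The abstract does not give the exact rule a crawler uses to choose between holding a URL and sending it to its control node, so the following Python fragment is only a minimal sketch under stated assumptions, not the authors' method. It assumes that a URL is mapped to a crawler ID by a stable hash, that a crawler holds a URL only when the hash selects its own ID and its queue is not much longer than the group average, and that the control node hands forwarded URLs to the least-loaded crawler in the group. The names `url_owner`, `Crawler`, `ControlNode`, `handle_url`, and the `slack` parameter are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


def url_owner(url: str, group_size: int) -> int:
    """Map a URL to a crawler ID in [0, group_size) via a stable hash (assumed scheme)."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % group_size


@dataclass
class Crawler:
    crawler_id: int
    queue: list = field(default_factory=list)  # URLs waiting to be crawled

    @property
    def load(self) -> int:
        # Load is approximated by queue length (an assumption).
        return len(self.queue)


@dataclass
class ControlNode:
    crawlers: list       # the group managed by this control node
    slack: float = 1.5   # hypothetical tolerance above the group average

    def average_load(self) -> float:
        return sum(c.load for c in self.crawlers) / len(self.crawlers)

    def assign(self, url: str) -> int:
        # Assumed policy: give the URL to the least-loaded crawler in the group.
        target = min(self.crawlers, key=lambda c: c.load)
        target.queue.append(url)
        return target.crawler_id


def handle_url(crawler: Crawler, control: ControlNode, url: str) -> str:
    """Decide from the crawler's ID and the group's average load: hold or forward."""
    owner = url_owner(url, len(control.crawlers))
    overloaded = crawler.load > control.slack * control.average_load()
    if owner == crawler.crawler_id and not overloaded:
        crawler.queue.append(url)
        return "hold"
    control.assign(url)
    return "forward"
```

In this sketch a crawler that discovers a link calls `handle_url(crawler, control, url)`. The URL stays local only when both the ID check and the load check pass. Otherwise the control node redistributes it within the group.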

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Shanghai Jiaotong University, Shanghai, 200030, P. R. China
Liu Fei, Ma Fan-Yuan, Ye Yun-Ming, Li Ming-Lu & Yu Jia-Di

Editor information

Editors and Affiliations

Nanyang Centre for Supercomputing and Visualisation, School of Mechanical and Production Engineering, Nanyang Technological University, 50 Nanyang Avenue, 639798, Singapore
Kim-Meow Liew
School of Computer Science, The University of Adelaide
Hong Shen
Asia Pacific Science and Technology Center, Sun Microsystems Inc., 50 Nanyang Avenue, N3-1c-10, 639798, Singapore
Simon See
School of Computer Engineering, Nanyang Technological University, 639798, Singapore
Wentong Cai
Southwest Jiaotong University
Pingzhi Fan
Tohoku University
Susumu Horiguchi

About this paper

Cite this paper

Fei, L., Fan-Yuan, M., Yun-Ming, Y., Ming-Lu, L., Jia-Di, Y. (2004). Distributed High-Performance Web Crawler Based on Peer-to-Peer Network. In: Liew, KM., Shen, H., See, S., Cai, W., Fan, P., Horiguchi, S. (eds) Parallel and Distributed Computing: Applications and Technologies. PDCAT 2004. Lecture Notes in Computer Science, vol 3320. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30501-9_13

DOI: https://doi.org/10.1007/978-3-540-30501-9_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24013-6
Online ISBN: 978-3-540-30501-9
eBook Packages: Computer Science, Computer Science (R0)
