Distributed High-Performance Web Crawler Based on Peer-to-Peer Network

  • Conference paper
Parallel and Distributed Computing: Applications and Technologies (PDCAT 2004)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3320)

Abstract

Distributing the crawling activity among multiple machines spreads the processing load and reduces the per-machine cost of analyzing web pages. This paper presents the design of a distributed web crawler based on a peer-to-peer (P2P) network. The distributed crawler harnesses the excess bandwidth and computing resources of the nodes in the system to crawl the web. Each crawler is deployed on a computing node of the P2P network, where it analyzes web pages and generates indices. A control node is a separate node in charge of distributing URLs to balance the load across crawlers. The control nodes are themselves organized as a P2P network, and the crawler nodes managed by the same control node form a group. Based on its own ID and the average load of its group, a crawler decides whether to transmit a URL to its control node or to hold the URL itself. We present an implementation of the distributed crawler based on Igloo and simulate the environment to evaluate the load balance across crawlers and the crawl speed.
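The abstract does not spell out the rule a crawler uses to decide between holding a URL and passing it to its control node. The following is only an illustrative sketch under stated assumptions, not the authors' method: it assumes a URL is hashed to a crawler ID within the group, and that a crawler holds a URL only if the hash matches its own ID and its queue is not much longer than the group average. The names `url_owner`, `Crawler`, `should_hold`, and the `slack` factor are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


def url_owner(url: str, group_size: int) -> int:
    """Map a URL to a crawler ID in [0, group_size) with a stable hash.

    Assumption: the paper only says the decision depends on the crawler ID;
    hashing the URL onto the ID space is one plausible way to use it.
    """
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % group_size


@dataclass
class Crawler:
    crawler_id: int
    group_size: int
    queue: list = field(default_factory=list)  # URLs this crawler will fetch

    def load(self) -> int:
        # Assumption: load is measured as the number of queued URLs.
        return len(self.queue)

    def should_hold(self, url: str, group_avg_load: float, slack: float = 1.5) -> bool:
        """Hold the URL only if it hashes to this crawler and the crawler
        is not loaded beyond `slack` times the group average (hypothetical rule)."""
        owns = url_owner(url, self.group_size) == self.crawler_id
        not_overloaded = self.load() <= slack * group_avg_load
        return owns and not_overloaded

    def handle(self, url: str, group_avg_load: float, control_outbox: list) -> str:
        """Queue the URL locally or hand it to the control node's outbox."""
        if self.should_hold(url, group_avg_load):
            self.queue.append(url)
            return "hold"
        control_outbox.append(url)
        return "forward"
```

In this sketch the control node would receive every forwarded URL through `control_outbox` and reassign it to a less loaded crawler in the group; how the real system performs that reassignment is not described in the abstract.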

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Fei, L., Fan-Yuan, M., Yun-Ming, Y., Ming-Lu, L., Jia-Di, Y. (2004). Distributed High-Performance Web Crawler Based on Peer-to-Peer Network. In: Liew, KM., Shen, H., See, S., Cai, W., Fan, P., Horiguchi, S. (eds) Parallel and Distributed Computing: Applications and Technologies. PDCAT 2004. Lecture Notes in Computer Science, vol 3320. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30501-9_13

  • DOI: https://doi.org/10.1007/978-3-540-30501-9_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-24013-6

  • Online ISBN: 978-3-540-30501-9

  • eBook Packages: Computer Science, Computer Science (R0)
