Minimizing the Network Distance in Distributed Web Crawling

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3290)

Abstract

Distributed crawling has been shown to overcome important limitations of the centralized crawling paradigm. However, the distributed nature of current crawlers is not fully utilized. The full benefits of the approach are usually limited to the sites hosting the crawler. In this work we describe IPMicra, a distributed, location-aware web crawler that uses an IP address hierarchy to crawl links in a near-optimal, location-aware manner. The crawler outperforms earlier distributed crawling approaches without significant overhead.
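The abstract does not spell out how the IP address hierarchy is used, so the following is only a minimal sketch of one plausible reading, not IPMicra's actual algorithm. It assumes each crawler is delegated one or more IP subnets and that a target address goes to the crawler owning the most specific enclosing subnet (longest-prefix match). The subnet table, the crawler names, and the function `assign_crawler` are all hypothetical. The addresses come from the reserved documentation ranges.

```python
import ipaddress

# Hypothetical delegation table: subnet -> crawler name.
# These values are illustrative assumptions, not data from the paper.
SUBNET_TO_CRAWLER = {
    ipaddress.ip_network("192.0.2.0/24"): "crawler-a",
    ipaddress.ip_network("198.51.100.0/24"): "crawler-b",
    ipaddress.ip_network("198.51.100.128/25"): "crawler-c",
}

# Fallback used when no delegated subnet contains the address.
DEFAULT_CRAWLER = "crawler-default"


def assign_crawler(ip_text):
    """Return the crawler that owns the most specific subnet containing ip_text."""
    ip = ipaddress.ip_address(ip_text)
    # Collect every delegated subnet that contains the address.
    matches = [net for net in SUBNET_TO_CRAWLER if ip in net]
    if not matches:
        return DEFAULT_CRAWLER
    # The longest prefix is the deepest node in the assumed hierarchy.
    best = max(matches, key=lambda net: net.prefixlen)
    return SUBNET_TO_CRAWLER[best]
```

Under these assumptions, `assign_crawler("198.51.100.200")` selects the crawler for the nested /25 subnet rather than the enclosing /24. A real system would still need some measure of network distance (for example, probing latency) to decide which crawler each subnet should be delegated to in the first place.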

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Papapetrou, O., Samaras, G. (2004). Minimizing the Network Distance in Distributed Web Crawling. In: Meersman, R., Tari, Z. (eds) On the Move to Meaningful Internet Systems 2004: CoopIS, DOA, and ODBASE. OTM 2004. Lecture Notes in Computer Science, vol 3290. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30468-5_36

  • DOI: https://doi.org/10.1007/978-3-540-30468-5_36

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23663-4

  • Online ISBN: 978-3-540-30468-5

  • eBook Packages: Springer Book Archive
