Crawling the Web

  • Chapter
Web Dynamics

Summary

The large size and the dynamic nature of the Web make it necessary to continually maintain Web-based information retrieval systems. Crawlers facilitate this process by following hyperlinks in Web pages to automatically download new and updated Web pages. While some systems rely on crawlers that exhaustively crawl the Web, others incorporate “focus” within their crawlers to harvest application- or topic-specific collections. In this chapter we discuss the basic issues related to developing an infrastructure for crawlers. This is followed by a review of several topical crawling algorithms and of the evaluation metrics that may be used to judge their performance. Given that many innovative applications of Web crawling are still being invented, we briefly discuss some that have already been developed.
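As a rough illustration of the loop the summary describes (follow hyperlinks, download pages, optionally add "focus"), here is a minimal sketch in Python. It is not the chapter's implementation: the names `crawl`, `fetch`, `score`, and `LinkExtractor`, the in-memory `PAGES` table, and the example.org URLs are all assumptions made for this example. The frontier is a priority queue; a constant score yields breadth-first order, while a topic-dependent score yields a simple best-first crawl.

```python
import heapq
from html.parser import HTMLParser
from itertools import count
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags found in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100, score=lambda url, html: 0.0):
    """Return the URLs in the order they were downloaded.

    fetch(url) is a caller-supplied (hypothetical) function returning the
    page's HTML as a string, or None if the page cannot be retrieved.
    score(parent_url, parent_html) sets the priority of every link found
    on that parent page; higher scores are crawled first.
    """
    order = count()  # tie-breaker: first-in, first-out among equal scores
    frontier = [(0.0, next(order), seed) for seed in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, _, url = heapq.heappop(frontier)
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        priority = -score(url, html)  # heapq is a min-heap, so negate
        for href in parser.links:
            link, _ = urldefrag(urljoin(url, href))  # absolute URL, no #fragment
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (priority, next(order), link))
    return visited


# Toy in-memory "Web" so the sketch runs without network access.
PAGES = {
    "http://example.org/": '<a href="/a">a</a> <a href="/b">b</a>',
    "http://example.org/a": 'crawler frontier <a href="/c">c</a>',
    "http://example.org/b": '<a href="/d">d</a>',
    "http://example.org/c": "",
    "http://example.org/d": "",
}

exhaustive = crawl(["http://example.org/"], PAGES.get)
focused = crawl(
    ["http://example.org/"],
    PAGES.get,
    score=lambda url, html: html.count("crawler"),
)
```

With the constant default score the pages are visited breadth-first. With the keyword-count score, the link found on the page that mentions "crawler" is followed before the remaining link from the seed page, which is the basic idea behind a focused crawl. A real crawler would also need politeness delays, robots.txt handling, and error handling, none of which are shown here.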

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Pant, G., Srinivasan, P., Menczer, F. (2004). Crawling the Web. In: Web Dynamics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-10874-1_7

  • DOI: https://doi.org/10.1007/978-3-662-10874-1_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-07377-9

  • Online ISBN: 978-3-662-10874-1

  • eBook Packages: Springer Book Archive
