Web Dynamics, pp 153-177

Crawling the Web

  • Gautam Pant
  • Padmini Srinivasan
  • Filippo Menczer


The large size and dynamic nature of the Web make it necessary to continually maintain Web-based information retrieval systems. Crawlers facilitate this process by following hyperlinks in Web pages to automatically download new and updated pages. While some systems rely on crawlers that exhaustively crawl the Web, others incorporate “focus” within their crawlers to harvest application- or topic-specific collections. In this chapter we discuss the basic issues related to developing an infrastructure for crawlers. This is followed by a review of several topical crawling algorithms and of the evaluation metrics that may be used to judge their performance. Given that many innovative applications of Web crawling are still being invented, we briefly discuss some that have already been developed.


Keywords: Priority Queue, Cosine Similarity, Relevance Score, Anchor Text, Context Graph





Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Gautam Pant (1)
  • Padmini Srinivasan (1, 2)
  • Filippo Menczer (3)

  1. Department of Management Sciences, The University of Iowa, Iowa City, USA
  2. School of Library and Information Science, The University of Iowa, Iowa City, USA
  3. School of Informatics, Indiana University, Bloomington, USA
