Summary
The large size and dynamic nature of the Web make it necessary to continually maintain Web-based information retrieval systems. Crawlers facilitate this process by following hyperlinks in Web pages to automatically download new and updated Web pages. While some systems rely on crawlers that exhaustively crawl the Web, others incorporate “focus” within their crawlers to harvest application- or topic-specific collections. In this chapter we discuss the basic issues related to developing an infrastructure for crawlers. This is followed by a review of several topical crawling algorithms and of evaluation metrics that may be used to judge their performance. Given that many innovative applications of Web crawling are still being invented, we briefly discuss some that have already been developed.
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Pant, G., Srinivasan, P., Menczer, F. (2004). Crawling the Web. In: Web Dynamics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-10874-1_7
DOI: https://doi.org/10.1007/978-3-662-10874-1_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-07377-9
Online ISBN: 978-3-662-10874-1
eBook Packages: Springer Book Archive