Crawling the Web

Pant, Gautam; Srinivasan, Padmini; Menczer, Filippo

doi:10.1007/978-3-662-10874-1_7

Summary

The large size and the dynamic nature of the Web make it necessary to continually maintain Web-based information retrieval systems. Crawlers facilitate this process by following hyperlinks in Web pages to automatically download new and updated Web pages. While some systems rely on crawlers that exhaustively crawl the Web, others incorporate “focus” within their crawlers to harvest application- or topic-specific collections. In this chapter we discuss the basic issues related to developing an infrastructure for crawlers. This is followed by a review of several topical crawling algorithms and of evaluation metrics that may be used to judge their performance. Given that many innovative applications of Web crawling are still being invented, we briefly discuss some that have already been developed.
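
The contrast between exhaustive and focused crawling can be illustrated with a minimal sketch. It is not one of the algorithms reviewed in the chapter. It assumes a hypothetical in-memory link graph (`TOY_WEB`) in place of real HTTP downloads, and a simple word-overlap score as a stand-in for a topical relevance measure. The function names and the scoring rule are illustrative assumptions only.

```python
import heapq
from collections import deque

# Hypothetical toy "Web": URL -> (page text, outgoing links). Not real pages.
TOY_WEB = {
    "a": ("web crawler basics", ["b", "c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("topical crawler evaluation", ["d", "e"]),
    "d": ("gardening tips", []),
    "e": ("crawler frontier design", ["a"]),
}


def fetch(url):
    """Stand-in for an HTTP download: returns (text, links) or None."""
    return TOY_WEB.get(url)


def breadth_first_crawl(seed, max_pages):
    """Exhaustive crawl: visit pages in the order their links were found."""
    frontier = deque([seed])   # URLs waiting to be downloaded
    seen = {seed}              # URLs already queued, to avoid repeats
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:
            continue
        order.append(url)
        for link in page[1]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


def focused_crawl(seed, topic_words, max_pages):
    """Best-first crawl: a link inherits the score of the page linking to it.

    The score (fraction of page words that are topic words) is an assumed,
    deliberately crude relevance measure.
    """
    counter = 0                        # tie-breaker keeps insertion order
    frontier = [(0.0, counter, seed)]  # min-heap on negated score
    seen = {seed}
    order = []
    while frontier and len(order) < max_pages:
        _, _, url = heapq.heappop(frontier)
        page = fetch(url)
        if page is None:
            continue
        text, links = page
        order.append(url)
        words = text.split()
        score = sum(w in topic_words for w in words) / max(len(words), 1)
        for link in links:
            if link not in seen:
                seen.add(link)
                counter += 1
                heapq.heappush(frontier, (-score, counter, link))
    return order


if __name__ == "__main__":
    print(breadth_first_crawl("a", 4))
    print(focused_crawl("a", {"crawler", "topical"}, 4))
```

With a budget of four pages, the breadth-first crawl spends a download on the off-topic page `d`, while the best-first crawl reaches the on-topic page `e` instead, because `e` was linked from a page that scored well. A real crawler would also need politeness delays, robots.txt handling, and URL canonicalization, none of which are modelled here.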

Author information

Authors and Affiliations

Department of Management Sciences, The University of Iowa, Iowa City, IA, 52242, USA
Gautam Pant & Padmini Srinivasan
School of Library and Information Science, The University of Iowa, Iowa City, IA, 52242, USA
Padmini Srinivasan
School of Informatics, Indiana University, Bloomington, IN, 47408, USA
Filippo Menczer

About this chapter

Cite this chapter

Pant, G., Srinivasan, P., Menczer, F. (2004). Crawling the Web. In: Web Dynamics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-10874-1_7

DOI: https://doi.org/10.1007/978-3-662-10874-1_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-07377-9
Online ISBN: 978-3-662-10874-1
eBook Packages: Springer Book Archive
