Web Crawling

Liu, Bing; Menczer, Filippo

doi:10.1007/978-3-642-19460-3_8

Part of the book series: Data-Centric Systems and Applications (DCSA)

Abstract

Web crawlers, also known as spiders or robots, are programs that automatically download Web pages. Information on the Web is scattered among billions of pages served by millions of servers around the globe, and users who browse the Web reach it by following hyperlinks, virtually moving from one page to the next. A crawler automates this traversal: it can visit many sites and collect information that is then analyzed and mined in a central location, either online (as it is downloaded) or off-line (after it is stored).
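The download-and-follow-links loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's own algorithm: it assumes a breadth-first visiting order, and the names `crawl`, `LinkExtractor`, `fetch`, and `TOY_WEB` are hypothetical. To keep the example self-contained, a small in-memory dictionary stands in for real HTTP downloads.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from `seeds`.

    `fetch` is a caller-supplied function (an assumption of this sketch)
    that maps a URL to its HTML text, or returns None if the page cannot
    be downloaded. Returns a dict of URL -> HTML, i.e. the central store
    that can later be analyzed off-line.
    """
    frontier = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)         # URLs already queued, to avoid revisits
    store = {}

    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        store[url] = html     # online analysis could also happen here

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments.
            link, _ = urldefrag(urljoin(url, href))
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return store


# A toy three-page "Web" standing in for real HTTP downloads.
TOY_WEB = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.org/a": '<a href="/b#top">B</a> <a href="/">home</a>',
    "http://example.org/b": '<a href="/missing">gone</a>',
}

pages = crawl(["http://example.org/"], TOY_WEB.get)
```

Here `pages` ends up holding the three reachable pages, and the dead link is skipped. A crawler run against real sites would additionally need to honor robots.txt, rate-limit its requests, and handle network errors, none of which this sketch attempts.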

Author information

Authors and Affiliations

Department of Computer Science, University of Illinois, Chicago, 851 S. Morgan St., Chicago, IL, 60607-7053, USA
Bing Liu

Corresponding author

Correspondence to Bing Liu.

About this chapter

Cite this chapter

Liu, B., Menczer, F. (2011). Web Crawling. In: Web Data Mining. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19460-3_8

DOI: https://doi.org/10.1007/978-3-642-19460-3_8
Published: 15 April 2011
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19459-7
Online ISBN: 978-3-642-19460-3
eBook Packages: Computer Science, Computer Science (R0)
