Abstract
Web crawlers, also known as spiders or robots, are programs that automatically download Web pages. Since information on the Web is scattered among billions of pages served by millions of servers around the globe, users who browse the Web can follow hyperlinks to access information, virtually moving from one page to the next. A crawler can visit many sites to collect information that can be analyzed and mined in a central location, either online (as it is downloaded) or off-line (after it is stored).
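The download-and-follow-links loop described above can be sketched in a few lines. This is a minimal illustration, not the chapter's algorithm: the breadth-first visiting order, the names `crawl`, `fetch`, `on_page` and `TOY_WEB`, and the example.org URLs are all assumptions made for the example. The `fetch` argument is a caller-supplied function, so the sketch runs against an in-memory toy Web instead of the network. The `on_page` callback stands in for online analysis, and the returned dictionary for pages stored for off-line analysis.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100, on_page=None):
    """Breadth-first crawl from the seed URLs (hypothetical helper).

    fetch(url) returns the page's HTML as a string, or None on failure.
    on_page(url, html), if given, is called as each page is downloaded.
    Returns a dict mapping each downloaded URL to its HTML.
    """
    frontier = deque(seeds)   # URLs waiting to be visited, in FIFO order
    seen = set(seeds)         # URLs already queued, to avoid revisiting
    store = {}                # downloaded pages, kept for off-line analysis
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        store[url] = html
        if on_page is not None:
            on_page(url, html)  # online analysis, as the page arrives
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            link, _fragment = urldefrag(urljoin(url, href))
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                frontier.append(link)
    return store


# A toy in-memory "Web" (made-up pages) so the sketch runs without a network.
TOY_WEB = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b#top">B</a>',
    "http://example.org/a": '<a href="/">home</a> <a href="/c">C</a>',
    "http://example.org/b": '<a href="mailto:x@example.org">mail</a>',
    "http://example.org/c": "no links here",
}

visited = []
pages = crawl(
    ["http://example.org/"],
    TOY_WEB.get,
    on_page=lambda url, html: visited.append(url),
)
print(visited)
```

Passing a real HTTP download function as `fetch` would turn the sketch into a network crawler, though a crawler used in practice would also need to honor robots.txt and limit its request rate per server.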
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Liu, B., Menczer, F. (2011). Web Crawling. In: Web Data Mining. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19460-3_8
DOI: https://doi.org/10.1007/978-3-642-19460-3_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19459-7
Online ISBN: 978-3-642-19460-3
eBook Packages: Computer Science, Computer Science (R0)