Web Crawling

Web Data Mining

Part of the book series: Data-Centric Systems and Applications (DCSA)

Abstract

Web crawlers, also known as spiders or robots, are programs that automatically download Web pages. Since information on the Web is scattered among billions of pages served by millions of servers around the globe, users who browse the Web can follow hyperlinks to access information, virtually moving from one page to the next. A crawler can visit many sites to collect information that can be analyzed and mined in a central location, either online (as it is downloaded) or off-line (after it is stored).
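
The download-and-follow loop described above can be sketched in a few lines. This is a minimal illustration, not the chapter's own algorithm: it assumes a breadth-first visiting order, and the names `crawl`, `LinkExtractor`, and `FAKE_WEB` as well as the example.org URLs are hypothetical. To keep the sketch runnable without network access, pages are served from an in-memory dictionary instead of real servers.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags found in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; fetch(url) returns HTML or None."""
    frontier = deque([seed])   # URLs waiting to be downloaded
    seen = {seed}              # URLs already queued, to avoid repeats
    pages = {}                 # central store: url -> downloaded HTML
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:       # page could not be downloaded; skip it
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


# Hypothetical three-page site held in memory (assumption for the demo).
FAKE_WEB = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.org/a": '<a href="/b">B</a>',
    "http://example.org/b": '<a href="/">home</a> <a href="/missing">x</a>',
}

pages = crawl("http://example.org/", FAKE_WEB.get)
```

Here the stored pages are analyzed off-line, after the crawl finishes; an online variant would process each page inside the loop as it is downloaded. A real crawler would also replace `FAKE_WEB.get` with an HTTP fetch and add politeness and error handling, which this sketch omits.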

Author information

Corresponding author

Correspondence to Bing Liu.

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Liu, B., Menczer, F. (2011). Web Crawling. In: Web Data Mining. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19460-3_8

  • DOI: https://doi.org/10.1007/978-3-642-19460-3_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-19459-7

  • Online ISBN: 978-3-642-19460-3

  • eBook Packages: Computer Science (R0)
