Encyclopedia of Database Systems

2018 Edition
Editors: Ling Liu, M. Tamer Özsu

Focused Web Crawling

  • Soumen Chakrabarti
Reference work entry
DOI: https://doi.org/10.1007/978-1-4614-8265-9_165

Synonyms

Topic-directed web crawling; Web resource discovery

Definition

The World Wide Web can be modeled as a very large graph whose nodes represent pages and whose edges represent hyperlinks. Because content can be generated dynamically, the Web graph is effectively infinite, and page content and hyperlinks change continually. Any centralized Web search service must first fetch a large number of Web pages over the Internet using a Web crawler, and then subject the local copies to indexing and other analysis. At any time during its execution, a Web crawler has a set of pages that have been fetched and a frontier of unexplored hyperlinks encountered on those pages. Given finite network resources, the crawler must choose carefully which subset of frontier hyperlinks to fetch next. Depending on the application and user group, it may be beneficial to preferentially acquire pages that are highly linked, pages that pertain to specific topics, pages that are likely to mention...
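The fetched-set and frontier bookkeeping described above can be illustrated with a minimal best-first sketch. Everything named here is an assumption made for illustration and does not come from the entry: the function `best_first_crawl`, the in-memory `TOY_WEB` graph standing in for real HTTP fetches, and `toy_priority`, which scores a URL by its prefix as a stand-in for whatever relevance estimate (link popularity, a topic classifier, and so on) an application might use.

```python
import heapq
from typing import Callable, Dict, List, Set, Tuple


def best_first_crawl(
    seeds: List[str],
    fetch_links: Callable[[str], List[str]],
    priority: Callable[[str], float],
    budget: int,
) -> List[str]:
    """Fetch at most `budget` pages, always taking the best frontier URL next."""
    seen: Set[str] = set()          # URLs already fetched or already queued
    order: List[str] = []           # pages in the order they were fetched
    frontier: List[Tuple[float, int, str]] = []
    counter = 0                     # tie-breaker: earlier discoveries go first

    def enqueue(url: str) -> None:
        nonlocal counter
        if url not in seen:
            seen.add(url)
            # heapq is a min-heap, so the priority is negated.
            heapq.heappush(frontier, (-priority(url), counter, url))
            counter += 1

    for url in seeds:
        enqueue(url)

    while frontier and len(order) < budget:
        _, _, url = heapq.heappop(frontier)
        order.append(url)
        for link in fetch_links(url):   # hyperlinks found on the fetched page
            enqueue(link)
    return order


# Hypothetical toy graph: page -> outgoing hyperlinks.
TOY_WEB: Dict[str, List[str]] = {
    "seed": ["sports/a", "db/a"],
    "db/a": ["db/b", "sports/b"],
    "db/b": [],
    "sports/a": ["sports/b"],
    "sports/b": [],
}


def toy_priority(url: str) -> float:
    """Assumed scoring rule: prefer pages under the 'db/' prefix."""
    return 1.0 if url.startswith("db/") else 0.0


crawl_order = best_first_crawl(
    ["seed"], lambda u: TOY_WEB.get(u, []), toy_priority, budget=3
)
print(crawl_order)
```

With a budget of three fetches, this sketch spends the fetches left after the seed on the two `db/` pages and leaves the `sports/` pages on the frontier. Swapping in a different `priority` function changes which part of the graph is acquired first without touching the crawl loop.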


Recommended Reading

  1. Babaria R, Saketha Nath J, Krishnan S, Sivaramakrishnan KR, Bhattacharyya C, Murty MN. Focused crawling with scalable ordinal regression solvers. In: Proceedings of the 24th International Conference on Machine Learning; 2007. p. 57–64.
  2. Broder A, et al. Graph structure in the web: experiments and models. In: Proceedings of the 9th International World Wide Web Conference; 2000. p. 309–20.
  3. Chakrabarti S. Mining the web: discovering knowledge from hypertext data. Morgan Kaufmann; 2002.
  4. Chakrabarti S, Dom B, Indyk P. Enhanced hypertext categorization using hyperlinks. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 1998. p. 307–18.
  5. Chakrabarti S, van den Berg M, Dom B. Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw. 1999;31(11–16):1623–40.
  6. Chakrabarti S, Joshi MM, Punera K, Pennock DM. The structure of broad topics on the web. In: Proceedings of the 11th International World Wide Web Conference; 2002. p. 251–62.
  7. Chakrabarti S, Punera K, Subramanyam M. Accelerated focused crawling through online relevance feedback. In: Proceedings of the 11th International World Wide Web Conference; 2002. p. 148–59.
  8. Cho J, Garcia-Molina H, Page L. Efficient crawling through URL ordering. In: Proceedings of the 7th International World Wide Web Conference; 1998. p. 161–72.
  9. Davison BD. Topical locality in the web. In: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 2000. p. 272–9.
  10. Diligenti M, Coetzee F, Lawrence S, Giles CL, Gori M. Focused crawling using context graphs. In: Proceedings of the 26th International Conference on Very Large Data Bases; 2000. p. 527–34.
  11. Dill S, Ravi Kumar S, McCurley KS, Rajagopalan S, Sivakumar D, Tomkins A. Self-similarity in the web. In: Proceedings of the 27th International Conference on Very Large Data Bases; 2001. p. 69–78.
  12. Hersovici M, Jacovi M, Maarek YS, Pelleg D, Shtalhaim M, Ur S. The shark-search algorithm – an application: tailored web site mapping. In: Proceedings of the 7th International World Wide Web Conference; 1998. p. 317–26.
  13. Lafferty J, McCallum A, Pereira F. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th International Conference on Machine Learning; 2001. p. 282–9.
  14. Najork M, Wiener J. Breadth-first search crawling yields high-quality pages. In: Proceedings of the 10th International World Wide Web Conference; 2001. p. 114–8.
  15. Pandey S, Olston C. User-centric web crawling. In: Proceedings of the 14th International World Wide Web Conference; 2005. p. 401–11.
  16. Page L, Brin S, Motwani R, Winograd T. The PageRank citation ranking: bringing order to the web. Manuscript, Stanford University; 1998.
  17. Rennie J, McCallum A. Using reinforcement learning to spider the web efficiently. In: Proceedings of the 16th International Conference on Machine Learning; 1999. p. 335–43.
  18. Sutton RS, Barto AG. Reinforcement learning: an introduction. Cambridge, MA: MIT Press; 1998.
  19. Vinod Vydiswaran VG, Sarawagi S. Learning to extract information from large websites using sequential models. In: Proceedings of the 11th International Conference on Management of Data; 2005. p. 3–14.
  20. Wikipedia page on Focused Crawling at http://en.wikipedia.org/wiki/Focused_crawler

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Indian Institute of Technology Bombay, Mumbai, India

Section editors and affiliations

  • Cong Yu (1)
  1. Google Research, New York, USA