Synonyms
Topic-directed web crawling; Web resource discovery
Definition
The World Wide Web can be modeled as a very large graph whose nodes represent pages and whose edges represent hyperlinks. Because content can be generated dynamically, the Web graph is effectively infinite, and page content and hyperlinks change continually. Any centralized Web search service must first fetch a large number of Web pages over the Internet using a Web crawler, and then subject the local copies to indexing and other analysis. At any time during its execution, a Web crawler has a set of pages that have been fetched, and a frontier of unexplored hyperlinks encountered on fetched pages. Given finite network resources, it is critical for the crawler to choose carefully which subset of frontier hyperlinks to fetch next. Depending on the application and user group, it may be beneficial to preferentially acquire pages that are highly linked, pages that pertain to specific topics, pages that are likely to mention...
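The fetched set and prioritized frontier described above can be illustrated with a minimal sketch. It assumes a toy in-memory link graph in place of real network fetches, and a hypothetical scoring function that stands in for whatever relevance estimate a focused crawler would use (for example, a topic classifier applied to the linking page). The names best_first_crawl, TOY_WEB, and toy_score are illustrative and do not come from the entry.

```python
import heapq
from typing import Callable, Dict, List, Set


def best_first_crawl(
    graph: Dict[str, List[str]],
    seeds: List[str],
    score: Callable[[str], float],
    budget: int,
) -> List[str]:
    """Fetch up to `budget` pages, always taking the best-scoring frontier URL."""
    fetched: List[str] = []
    seen: Set[str] = set(seeds)  # URLs already fetched or already on the frontier
    # heapq is a min-heap, so scores are negated to pop the highest score first.
    frontier = [(-score(url), url) for url in seeds]
    heapq.heapify(frontier)
    while frontier and len(fetched) < budget:
        _, url = heapq.heappop(frontier)
        fetched.append(url)  # stands in for an actual network fetch
        for link in graph.get(url, []):  # hyperlinks found on the fetched page
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return fetched


# Hypothetical toy Web: keys are pages, values are their outgoing hyperlinks.
TOY_WEB = {
    "seed": ["sports/a", "db/a"],
    "db/a": ["db/b", "news/a"],
    "db/b": ["db/c"],
    "sports/a": ["sports/b"],
}


def toy_score(url: str) -> float:
    # Assumption: a URL prefix stands in for a learned topic-relevance estimate.
    return 1.0 if url.startswith("db/") else 0.0


order = best_first_crawl(TOY_WEB, ["seed"], toy_score, budget=4)
print(order)  # ['seed', 'db/a', 'db/b', 'db/c']
```

With a budget of four fetches, the sketch spends all of them on the seed and the three on-topic pages, leaving the off-topic links on the frontier. A real crawler would estimate relevance before fetching from evidence such as anchor text or the content of the linking page, which this toy scoring function does not attempt.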
Recommended Reading
Babaria R, Saketha Nath J, Krishnan S, Sivaramakrishnan KR, Bhattacharyya C, Murty MN. Focused crawling with scalable ordinal regression solvers. In: Proceedings of the 24th International Conference on Machine Learning; 2007. p. 57–64.
Broder A et al. Graph structure in the web: experiments and models. In: Proceedings of the 9th International World Wide Web Conference; 2000. p. 309–20.
Chakrabarti S. Mining the web: discovering knowledge from hypertext data. Morgan Kaufmann; 2002.
Chakrabarti S, Dom B, Indyk P. Enhanced hypertext categorization using hyperlinks. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 1998. p. 307–18.
Chakrabarti S, van den Berg M, Dom B. Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw. 1999;31(11–16):1623–40.
Chakrabarti S, Joshi MM, Punera K, Pennock DM. The structure of broad topics on the web. In: Proceedings of the 11th International World Wide Web Conference; 2002. p. 251–62.
Chakrabarti S, Punera K, Subramanyam M. Accelerated focused crawling through online relevance feedback. In: Proceedings of the 11th International World Wide Web Conference; 2002. p. 148–59.
Cho J, Garcia-Molina H, Page L. Efficient crawling through URL ordering. In: Proceedings of the 7th International World Wide Web Conference; 1998. p. 161–72.
Davison BD. Topical locality in the web. In: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 2000. p. 272–9.
Diligenti M, Coetzee F, Lawrence S, Giles CL, Gori M. Focused crawling using context graphs. In: Proceedings of the 26th International Conference on Very Large Data Bases; 2000. p. 527–34.
Dill S, Kumar R, McCurley KS, Rajagopalan S, Sivakumar D, Tomkins A. Self-similarity in the web. In: Proceedings of the 27th International Conference on Very Large Data Bases; 2001. p. 69–78.
Hersovici M, Jacovi M, Maarek YS, Pelleg D, Shtalhaim M, Ur S. The shark-search algorithm – an application: tailored web site mapping. In: Proceedings of the 7th International World Wide Web Conference; 1998. p. 317–26.
Lafferty J, McCallum A, Pereira F. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th International Conference on Machine Learning; 2001. p. 282–9.
Najork M, Wiener J. Breadth-first search crawling yields high-quality pages. In: Proceedings of the 10th International World Wide Web Conference; 2001. p. 114–8.
Pandey S, Olston C. User-centric web crawling. In: Proceedings of the 14th International World Wide Web Conference; 2005. p. 401–11.
Page L, Brin S, Motwani R, Winograd T. The PageRank citation ranking: bringing order to the web. Manuscript, Stanford University; 1998.
Rennie J, McCallum A. Using reinforcement learning to spider the web efficiently. In: Proceedings of the 16th International Conference on Machine Learning; 1999. p. 335–43.
Sutton RS, Barto AG. Reinforcement learning: an introduction. Cambridge, MA: MIT Press; 1998.
Vinod Vydiswaran VG, Sarawagi S. Learning to extract information from large websites using sequential models. In: Proceedings of the 11th International Conference on Management of Data; 2005. p. 3–14.
Wikipedia page on Focused Crawling at http://en.wikipedia.org/wiki/Focused_crawler
© 2018 Springer Science+Business Media, LLC, part of Springer Nature
About this entry
Cite this entry
Chakrabarti, S. (2018). Focused Web Crawling. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_165
DOI: https://doi.org/10.1007/978-1-4614-8265-9_165
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8266-6
Online ISBN: 978-1-4614-8265-9
eBook Packages: Computer Science; Reference Module Computer Science and Engineering