A Novel Focused Crawler Based on Breadcrumb Navigation

Ying, Lizhi; Zhou, Xinhao; Yuan, Jian; Huang, Yongfeng

doi:10.1007/978-3-642-31020-1_31


Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7332)

Abstract

This paper proposes a novel focused crawler based on Breadcrumb Navigation (BN). The crawler uses the breadcrumb navigation embedded in webpages to reconstruct website structures and thereby address the focused crawling problem. Unlike earlier focused crawlers that rely on prediction models, the BN crawler first samples the web to construct a semantic forest for websites from their breadcrumb navigation, and then searches the forest for the sub-trees relevant to the given topic. After sampling, the BN crawler only needs to download the webpages belonging to the relevant sub-forest. As a result, the BN crawler spends less time analyzing Webpage-to-Topic (W2T) similarity while still crawling efficiently. Experimental evidence shows that the BN crawler significantly outperforms Breadth-First and Best-First crawlers in harvest ratio and is applicable to most websites.
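The sample-then-search idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Node`, `build_forest`, `relevant_subtrees`, and `in_relevant_subforest`, the `example.com` data, and the keyword-overlap relevance test are all assumptions made for illustration, since the abstract does not say how the paper scores relevance. The `harvest_ratio` helper uses the usual definition from the focused crawling literature (relevant pages divided by downloaded pages), which may differ in detail from the paper's.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One breadcrumb label in a site's tree (hypothetical structure)."""
    label: str
    children: dict = field(default_factory=dict)
    pages: list = field(default_factory=list)


def build_forest(samples):
    """Build one tree per site from (site, breadcrumb labels, url) samples."""
    forest = {}
    for site, trail, url in samples:
        node = forest.setdefault(site, Node(site))
        for label in trail:
            node = node.children.setdefault(label, Node(label))
        node.pages.append(url)
    return forest


def relevant_subtrees(forest, topic_terms):
    """Return (site, breadcrumb path) roots of sub-trees matching the topic.

    Assumption: a node is relevant if its label shares a word with the topic.
    """
    topic = {t.lower() for t in topic_terms}
    hits = []

    def visit(site, node, path):
        for label, child in node.children.items():
            child_path = path + (label,)
            if topic & set(label.lower().split()):
                hits.append((site, child_path))  # keep the whole sub-tree
            else:
                visit(site, child, child_path)

    for site, root in forest.items():
        visit(site, root, ())
    return hits


def in_relevant_subforest(site, trail, hits):
    """True if a page's breadcrumb trail falls under a relevant sub-tree."""
    trail = tuple(trail)
    return any(site == s and trail[:len(p)] == p for s, p in hits)


def harvest_ratio(relevant_downloaded, total_downloaded):
    """Fraction of downloaded pages that are relevant to the topic."""
    return relevant_downloaded / total_downloaded if total_downloaded else 0.0


# Made-up sample pages for illustration only.
samples = [
    ("example.com", ["Sports", "Football"], "http://example.com/a"),
    ("example.com", ["Sports", "Tennis"], "http://example.com/b"),
    ("example.com", ["Finance", "Stocks"], "http://example.com/c"),
]
forest = build_forest(samples)
hits = relevant_subtrees(forest, ["football"])
```

After the sampling phase, a crawler built this way would download a page only when `in_relevant_subforest` returns true for its breadcrumb trail, which is how the per-page similarity analysis is avoided in this sketch.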


Author information

Authors and Affiliations

Institute of Information Cognition and Intelligence System, Department of Electric Engineering, Tsinghua University, Haidian District, 100084, Beijing, China
Lizhi Ying, Xinhao Zhou, Jian Yuan & Yongfeng Huang

Editor information

Editors and Affiliations

Key Laboratory of Machine Perception (MOE), Peking University, Department of Machine Intelligence, School of Electronics Engineering and Computer Science, Peking University, 100871, Beijing, China
Ying Tan
Department of Electrical and Electronic Engineering, Xi’an Jiaotong-Liverpool University, Suzhou, China
Yuhui Shi
Shenzhen City Key Laboratory of Embedded System Design, College of Computer Science and Software Engineering, Shenzhen University, 518060, Shenzhen, China
Zhen Ji

About this paper

Cite this paper

Ying, L., Zhou, X., Yuan, J., Huang, Y. (2012). A Novel Focused Crawler Based on Breadcrumb Navigation. In: Tan, Y., Shi, Y., Ji, Z. (eds) Advances in Swarm Intelligence. ICSI 2012. Lecture Notes in Computer Science, vol 7332. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31020-1_31

DOI: https://doi.org/10.1007/978-3-642-31020-1_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31019-5
Online ISBN: 978-3-642-31020-1
eBook Packages: Computer Science, Computer Science (R0)
