Abstract
This paper proposes a novel focused crawler based on Breadcrumb Navigation (BN). The crawler uses the breadcrumb navigation found in webpages to reconstruct website structure and thereby address the focused crawling problem. Unlike earlier focused crawlers that rely on prediction models, the BN crawler first samples the web to construct a semantic forest for the websites from their breadcrumb navigation, and then searches the forest for the sub-trees relevant to the given topic. After sampling, the BN crawler only needs to download the webpages that belong to the relevant sub-forest. As a result, the BN crawler spends less time analyzing Webpage-to-Topic (W2T) similarity while still crawling efficiently. Experimental results show that the BN crawler significantly outperforms Breadth-First and Best-First crawlers in harvest ratio and can be applied to most websites.
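The sample-then-search idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Node`, `build_forest`, `relevant_subtrees`, `collect_urls`, and `harvest_ratio` are hypothetical, the example URLs and breadcrumb trails are invented, and plain keyword matching on breadcrumb labels stands in for the relevance measure, which the abstract does not specify. Harvest ratio is computed in its usual sense, the fraction of downloaded pages that are relevant.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One breadcrumb label in the semantic forest (hypothetical structure)."""
    label: str
    children: dict = field(default_factory=dict)  # label -> Node
    urls: list = field(default_factory=list)      # sampled pages ending here


def build_forest(samples):
    """Build a forest from (url, breadcrumb_trail) pairs.

    Each trail is a list of labels ordered from the site root to the page.
    """
    roots = {}
    for url, trail in samples:
        if not trail:
            continue  # page has no breadcrumb; skipped in this sketch
        level, node = roots, None
        for label in trail:
            node = level.setdefault(label, Node(label))
            level = node.children
        node.urls.append(url)
    return roots


def relevant_subtrees(roots, topic_terms):
    """Return the topmost nodes whose label shares a word with the topic.

    Assumption: keyword overlap is a stand-in for the paper's relevance test.
    """
    terms = {t.lower() for t in topic_terms}
    hits, stack = [], list(roots.values())
    while stack:
        node = stack.pop()
        if terms & set(node.label.lower().split()):
            hits.append(node)  # whole sub-tree is taken as relevant
        else:
            stack.extend(node.children.values())
    return hits


def collect_urls(node):
    """Gather every sampled URL in the sub-tree rooted at node."""
    urls = list(node.urls)
    for child in node.children.values():
        urls.extend(collect_urls(child))
    return urls


def harvest_ratio(downloaded, relevant):
    """Fraction of downloaded pages that are relevant."""
    downloaded = list(downloaded)
    if not downloaded:
        return 0.0
    return sum(1 for u in downloaded if u in relevant) / len(downloaded)


# Invented sample data for illustration only.
samples = [
    ("http://example.com/a", ["Home", "Sports", "Football"]),
    ("http://example.com/b", ["Home", "Sports", "Tennis"]),
    ("http://example.com/c", ["Home", "Finance", "Stocks"]),
]
forest = build_forest(samples)
hits = relevant_subtrees(forest, ["sports"])
to_download = sorted(u for n in hits for u in collect_urls(n))
```

In this toy run only the pages under "Sports" are queued for download, so the "Finance" branch is never fetched. A real crawler would also need to extract the breadcrumb trail from each page's HTML, which is omitted here.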
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ying, L., Zhou, X., Yuan, J., Huang, Y. (2012). A Novel Focused Crawler Based on Breadcrumb Navigation. In: Tan, Y., Shi, Y., Ji, Z. (eds) Advances in Swarm Intelligence. ICSI 2012. Lecture Notes in Computer Science, vol 7332. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31020-1_31
DOI: https://doi.org/10.1007/978-3-642-31020-1_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31019-5
Online ISBN: 978-3-642-31020-1
eBook Packages: Computer Science (R0)