A Novel Focused Crawler Based on Breadcrumb Navigation

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7332)

Abstract

This paper proposes a novel focused crawler based on Breadcrumb Navigation (BN). The crawler uses the breadcrumb navigation found in webpages to reconstruct website structure and thereby address the focused crawling problem. Unlike previous focused crawlers that rely on prediction models, the BN crawler first samples the web to construct a semantic forest for websites from their breadcrumb navigation, and then searches the forest for the sub-trees relevant to the given topic. After sampling, the BN crawler only needs to download the webpages that belong to the relevant sub-forest. With this method, the BN crawler spends less time analyzing Webpage-to-Topic (W2T) similarity while still crawling efficiently. Experimental results show that the BN crawler significantly outperforms Breadth-First and Best-First crawlers in harvest ratio and is applicable to most websites.
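The pipeline described in the abstract (sample pages, build a forest from their breadcrumb trails, select the topic-relevant sub-trees, then download only pages under them) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Node` class, the function names, the substring-based relevance test, and the `example.com` URLs are all assumptions made for illustration, and the abstract does not say how relevance is actually scored. The `harvest_ratio` helper uses the usual definition of the metric, relevant pages downloaded divided by all pages downloaded.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One breadcrumb label; children are keyed by their label (hypothetical structure)."""
    label: str
    children: dict = field(default_factory=dict)
    urls: list = field(default_factory=list)


def build_forest(samples):
    """Merge (url, breadcrumb_trail) samples into a forest keyed by root label."""
    forest = {}
    for url, trail in samples:
        if not trail:
            continue  # page has no breadcrumb; this sketch simply skips it
        node = forest.setdefault(trail[0], Node(trail[0]))
        for label in trail[1:]:
            node = node.children.setdefault(label, Node(label))
        node.urls.append(url)
    return forest


def relevant_subtrees(forest, topic_terms):
    """Return the topmost nodes whose label contains any topic term.

    Substring matching is an assumption standing in for a real relevance measure.
    """
    terms = [t.lower() for t in topic_terms]
    hits = []

    def visit(node):
        if any(t in node.label.lower() for t in terms):
            hits.append(node)  # treat the whole sub-tree as relevant
            return
        for child in node.children.values():
            visit(child)

    for root in forest.values():
        visit(root)
    return hits


def collect_urls(node):
    """Gather every sampled URL stored in a sub-tree."""
    urls = list(node.urls)
    for child in node.children.values():
        urls.extend(collect_urls(child))
    return urls


def harvest_ratio(relevant_downloaded, total_downloaded):
    """Fraction of downloaded pages that are relevant to the topic."""
    return relevant_downloaded / total_downloaded if total_downloaded else 0.0


# Made-up sample data: each page is paired with its breadcrumb trail.
samples = [
    ("https://example.com/p/1", ["Home", "Sports", "Football"]),
    ("https://example.com/p/2", ["Home", "Sports", "Tennis"]),
    ("https://example.com/p/3", ["Home", "Finance", "Stocks"]),
]
forest = build_forest(samples)
hits = relevant_subtrees(forest, ["sports"])
frontier = [url for node in hits for url in collect_urls(node)]
```

In this toy run, only the pages filed under the "Sports" breadcrumb end up in `frontier`, so the crawler would never fetch the "Finance" branch. A real crawler would also need to extract breadcrumb trails from HTML and handle pages without breadcrumbs, which this sketch leaves out.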


Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ying, L., Zhou, X., Yuan, J., Huang, Y. (2012). A Novel Focused Crawler Based on Breadcrumb Navigation. In: Tan, Y., Shi, Y., Ji, Z. (eds) Advances in Swarm Intelligence. ICSI 2012. Lecture Notes in Computer Science, vol 7332. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31020-1_31

  • DOI: https://doi.org/10.1007/978-3-642-31020-1_31

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-31019-5

  • Online ISBN: 978-3-642-31020-1

  • eBook Packages: Computer Science, Computer Science (R0)
