Focused Crawler Framework Based on Open Search Engine

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11065)

Abstract

When users need to analyze webpages related to specific topics, they generally use crawlers to acquire webpages and then analyze the results to extract those that match their interests. However, in the data acquisition stage, users usually have customized demands on the data to be acquired. Ordinary crawler systems are very resource-constrained, so they cannot traverse the entire internet. Search engines can satisfy these demands, but they rely on many manual interactions. The traditional solution is to constrain the crawlers to a limited domain, but this leads to a low recall rate as well as inefficiency. To solve these problems, this paper studies a focused crawler framework based on an open search engine. The framework takes advantage of the open search engine’s information gathering and retrieval capabilities, and can automatically or semi-automatically generate a topic model to interpret and complete users’ search intents, with only a few seed keywords needing to be provided initially. It then uses the open search engine’s interfaces to iteratively crawl topic-specific webpages. Compared with traditional approaches, the focused crawler based on an open search engine proposed in this paper improves recall and efficiency while maintaining accuracy.
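
The loop described in the abstract (seed keywords, a topic model, iterative queries against a search interface) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: `focused_crawl`, `relevance`, `stub_search`, and the in-memory `CORPUS` are hypothetical, the search engine is replaced by a local stub, and the topic model is reduced to a keyword-to-weight dictionary expanded by simple term frequency rather than whatever model the authors actually use.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def relevance(text, topic):
    """Average topic weight per token; 0.0 for empty text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(topic.get(t, 0.0) for t in tokens) / len(tokens)


def focused_crawl(seeds, search, rounds=3, expand=2, min_score=0.1):
    """Query `search` with topic keywords and grow the topic between rounds.

    `search(keyword)` is a hypothetical stand-in for a search engine
    interface; it must return an iterable of (url, text) pairs.
    """
    topic = {kw.lower(): 1.0 for kw in seeds}  # assumed toy topic model
    collected = {}                             # url -> text judged on-topic
    queried = set()
    for _ in range(rounds):
        pending = [kw for kw in topic if kw not in queried]
        if not pending:
            break
        for kw in pending:
            queried.add(kw)
            for url, text in search(kw):
                if url not in collected and relevance(text, topic) >= min_score:
                    collected[url] = text
        # Expand the topic with the most frequent unseen terms (assumed heuristic).
        counts = Counter(
            t
            for text in collected.values()
            for t in tokenize(text)
            if len(t) > 3 and t not in topic
        )
        ranked = sorted(counts, key=lambda t: (-counts[t], t))
        for term in ranked[:expand]:
            topic[term] = 0.5  # new terms weigh less than seed keywords
    return collected, topic


# Hypothetical in-memory corpus standing in for the web.
CORPUS = {
    "u1": "focused crawler downloads topic pages",
    "u2": "topic model guides the crawler frontier",
    "u3": "topic frontier scheduling",
    "u4": "crawler tractor for sale with heavy farm machinery "
          "and spare parts listed today",
}


def stub_search(keyword):
    """Return every corpus page whose tokens include the keyword."""
    return [(u, t) for u, t in CORPUS.items() if keyword in tokenize(t)]


pages, topic = focused_crawl(["crawler"], stub_search)
print(sorted(pages))
```

In this toy run the page `u3` never mentions the seed keyword and is reached only after the topic has been expanded, which is the recall benefit the abstract argues for, while `u4` matches the seed keyword but is filtered out by the relevance threshold. A real deployment would replace `stub_search` with calls to an actual search interface and would need rate limiting and deduplication, which this sketch omits.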

This work is supported by the National Key Research and Development Program of China (No. 2016YFB0800402) and the National Natural Science Foundation of China (No. U1536201, U1705261).

Author information

Corresponding author

Correspondence to Jiawei Liu.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Liu, J., Huang, Y. (2018). Focused Crawler Framework Based on Open Search Engine. In: Sun, X., Pan, Z., Bertino, E. (eds) Cloud Computing and Security. ICCCS 2018. Lecture Notes in Computer Science, vol 11065. Springer, Cham. https://doi.org/10.1007/978-3-030-00012-7_6

  • DOI: https://doi.org/10.1007/978-3-030-00012-7_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00011-0

  • Online ISBN: 978-3-030-00012-7

  • eBook Packages: Computer Science, Computer Science (R0)
