Theme-Based Spider for Academic Paper

  • Conference paper
Intelligent Computing, Networked Control, and Their Engineering Applications (ICSEE 2017, LSMS 2017)

Abstract

The content of the web grows every day. For a particular company or individual, however, some kinds of information have higher priority. For example, among the vast amount of information on the internet, web pages containing academic papers are far more attractive to a researcher. The problem lies in how to find that kind of data. We therefore design a spider that targets only online academic papers. While retaining the three major parts of a traditional spider, we modify the Filter and the Parser so that the spider can accomplish this task. The essential mechanism for recognizing and extracting the expected pages relies primarily on keyword matching and finite state machine theory. After crawling two web sites, the spider successfully collects the desired information. The results suggest that, with further optimization and modification, this theme-based spider may work more efficiently or even be extended to other fields of interest.
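
The abstract does not give the actual keyword list or the state machine used, so the following is only a minimal sketch of how keyword matching and a finite state machine could be combined to recognize a paper-like page. All names (`LINK_KEYWORDS`, `link_is_promising`, `looks_like_paper`, the `State` values) and the choice of section markers are assumptions made for illustration, not the authors' implementation.

```python
import re
from enum import Enum, auto

# Hypothetical keyword list for the Filter stage (an assumption, not from the paper).
LINK_KEYWORDS = ("paper", "article", "abstract", "pdf")

# Hypothetical section markers for the Parser stage (also an assumption).
ABSTRACT_RE = re.compile(r"^\s*abstract\b", re.IGNORECASE)
KEYWORDS_RE = re.compile(r"^\s*(keywords|key words|index terms)\b", re.IGNORECASE)
REFERENCES_RE = re.compile(r"^\s*(references|bibliography)\b", re.IGNORECASE)


class State(Enum):
    START = auto()
    SAW_ABSTRACT = auto()
    SAW_KEYWORDS = auto()
    ACCEPT = auto()


def link_is_promising(url: str, anchor_text: str = "") -> bool:
    """Keyword matching: keep a link if its URL or anchor text contains a keyword."""
    haystack = f"{url} {anchor_text}".lower()
    return any(keyword in haystack for keyword in LINK_KEYWORDS)


def looks_like_paper(text: str) -> bool:
    """Finite state machine over the lines of a page.

    Accepts only if an abstract heading appears before a references heading;
    a keywords heading in between is optional.
    """
    state = State.START
    for line in text.splitlines():
        if state is State.START and ABSTRACT_RE.match(line):
            state = State.SAW_ABSTRACT
        elif state is State.SAW_ABSTRACT and KEYWORDS_RE.match(line):
            state = State.SAW_KEYWORDS
        elif state in (State.SAW_ABSTRACT, State.SAW_KEYWORDS) and REFERENCES_RE.match(line):
            state = State.ACCEPT
            break
    return state is State.ACCEPT
```

In a sketch like this, the link filter would decide which URLs enter the crawl queue, and the state machine would decide which fetched pages are kept. Requiring the headings to appear in order rejects pages that merely mention the word "references" or "abstract" in isolation.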

This work is supported by the Natural Science Foundation of China (Nos. 61472381, 61472382, 61572454 and 61174144), the MOE-Microsoft Key Laboratory of Multimedia Computing and Communication Foundation, and the Anhui Province Key Laboratory of Software in Computing and Communication.

Corresponding author

Correspondence to Chenxi Shao.

Appendix: URLs of Search Engines

  1. Google: http://www.google.cn

  2. Baidu: http://www.baidu.com

  3. Science Paper Online: http://www.paper.edu.cn

  4. CNKI: http://dlib.cnki.net/kns50

  5. PaperSo: http://202.38.79.80:8080/ps/search.jsp

Copyright information

© 2017 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Yin, P., Shao, Q., Wang, X., Wang, W., Miao, F., Shao, C. (2017). Theme-Based Spider for Academic Paper. In: Yue, D., Peng, C., Du, D., Zhang, T., Zheng, M., Han, Q. (eds) Intelligent Computing, Networked Control, and Their Engineering Applications. ICSEE 2017, LSMS 2017. Communications in Computer and Information Science, vol 762. Springer, Singapore. https://doi.org/10.1007/978-981-10-6373-2_23

  • DOI: https://doi.org/10.1007/978-981-10-6373-2_23

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-6372-5

  • Online ISBN: 978-981-10-6373-2

  • eBook Packages: Computer Science, Computer Science (R0)
