Web Data Extraction from Scientific Publishers’ Website Using Hidden Markov Model

  • Jing Huang
  • Ziyu Liu
  • Beibei Wang
  • Mingyue Duan
  • Bo Yang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11061)


The amount of information on web pages keeps growing, and numerous papers are published in more than three thousand journals, especially in the field of technology. It is almost impossible for a user to look up this information item by item: finding details such as a journal's introduction, impact factor, or ISSN across thousands of journals requires clicking through a large number of links. To solve this problem, an automatic method is needed that filters the relevant information out of the deep web. The method in this paper helps people quickly obtain the information they need, classified and extracted. This paper makes the following contributions: first, a machine learning method, the Hidden Markov Model (HMM), is used to extract journal information from publishers' websites, which improves generalization ability over the heuristic method; second, a content extraction technique is applied during data processing to improve the performance of the HMM; finally, the extracted information is stored in a structured form and displayed. In the experiments, three algorithms are tested and compared on accuracy, recall, and F-measure. The results show that HMM with content extraction (C-HMM) performs best.
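As an illustration of how an HMM can label page tokens for extraction, the following is a minimal sketch of Viterbi decoding, the standard way to find the most likely hidden state sequence. The abstract does not give the model's states, features, or probabilities, so the state names (LABEL, VALUE, NOISE), the coarse token types (keyword, number, text), and all numbers below are hypothetical and chosen only for the example.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observations."""

    def log(p):
        # Log space avoids numeric underflow on long token sequences.
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: log(start_p[s]) + log(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: a field label (e.g. "ISSN") tends to be followed by its value.
states = ["LABEL", "VALUE", "NOISE"]
start_p = {"LABEL": 0.4, "VALUE": 0.1, "NOISE": 0.5}
trans_p = {
    "LABEL": {"LABEL": 0.1, "VALUE": 0.8, "NOISE": 0.1},
    "VALUE": {"LABEL": 0.3, "VALUE": 0.2, "NOISE": 0.5},
    "NOISE": {"LABEL": 0.3, "VALUE": 0.1, "NOISE": 0.6},
}
emit_p = {
    "LABEL": {"keyword": 0.8, "number": 0.05, "text": 0.15},
    "VALUE": {"keyword": 0.05, "number": 0.7, "text": 0.25},
    "NOISE": {"keyword": 0.1, "number": 0.1, "text": 0.8},
}

# Coarse token types for a page fragment such as: "Welcome", "ISSN", "1234-5678", "more"
tokens = ["text", "keyword", "number", "text"]
decoded = viterbi(tokens, states, start_p, trans_p, emit_p)
print(list(zip(tokens, decoded)))
```

In a real system the probabilities would be estimated from labeled pages rather than set by hand, and tokens decoded as VALUE would be stored as the extracted field content.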


Web information extraction · Hidden Markov model · Content extraction



This work was supported in part by National Natural Science Foundation of China under grants 61373053 and 61572226, and Jilin Province Key Scientific and Technological Research and Development project under grants 20180201044GX and 20180201067GX.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. College of Computer Science and Technology, Jilin University, Changchun, China
  2. Key Laboratory of Symbol Computation and Knowledge Engineering, Jilin University, Ministry of Education, Changchun, China
