A Chinese unknown word recognition method for micro-blog short text based on improved FP-growth

Abstract

Unknown word recognition is important for improving the precision of text segmentation and syntactic analysis. Social networks have become an important platform for sharing, disseminating, and acquiring information, and unknown word recognition in micro-blog short text has become a research hot spot. Micro-blog text, however, contains a large number of nonstandard terms and network buzzwords, which makes unknown words harder to recognize. This paper proposes a Chinese unknown word recognition method for micro-blog short text based on an improved FP-growth algorithm (POS-FP). First, the POS-FP algorithm is used to mine frequent itemsets from micro-blog text, and an N-gram model is used to select candidate unknown words from the frequent itemsets. Second, improved mutual information and left–right information entropy are used to verify the internal features of the candidate unknown words. Third, context-dependent and open-source methods are used to verify the candidates externally. Compared with traditional methods, the proposed algorithm improves the recognition rate of unknown words in micro-blog short texts.
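
The internal verification step can be illustrated with a minimal sketch. The abstract does not specify how the paper's improved mutual information is defined, so the code below uses the textbook forms only: pointwise mutual information over the binary splits of a candidate, and the Shannon entropy of the characters that appear immediately to its left and right. The function names, the toy posts, and the use of a single character total as the denominator for all probabilities are assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def ngram_counts(texts, max_n):
    """Count every character n-gram of length 1..max_n in the corpus."""
    counts = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return counts


def pmi(candidate, counts, total_chars):
    """Lowest pointwise mutual information over all binary splits.

    Simplification (assumption): every probability is a raw count divided
    by the total number of characters in the corpus.
    """
    if len(candidate) < 2 or counts[candidate] == 0:
        raise ValueError("candidate must have 2+ characters and occur in the corpus")
    p_whole = counts[candidate] / total_chars
    scores = []
    for k in range(1, len(candidate)):
        p_left = counts[candidate[:k]] / total_chars
        p_right = counts[candidate[k:]] / total_chars
        scores.append(math.log2(p_whole / (p_left * p_right)))
    return min(scores)


def branch_entropy(candidate, texts, side):
    """Shannon entropy of the characters adjacent to candidate on one side."""
    neighbours = Counter()
    for text in texts:
        start = text.find(candidate)
        while start != -1:
            pos = start - 1 if side == "left" else start + len(candidate)
            if 0 <= pos < len(text):  # skip occurrences at the text boundary
                neighbours[text[pos]] += 1
            start = text.find(candidate, start + 1)
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in neighbours.values())


# Hypothetical toy posts; a real corpus would be far larger.
posts = ["我爱给力的人", "他很给力啊", "真给力"]
counts = ngram_counts(posts, max_n=2)
total_chars = sum(len(p) for p in posts)
print(pmi("给力", counts, total_chars))
print(branch_entropy("给力", posts, "left"), branch_entropy("给力", posts, "right"))
```

A candidate whose parts co-occur far more often than chance (high mutual information) and whose neighbouring characters vary widely (high left and right entropy) is more likely to be a free-standing word. Thresholds for both scores would have to be tuned on real data.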

Acknowledgement

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61105040, 61203284), the Beijing Natural Science Foundation (Grant No. 4133085), and the General Program of the Science and Technology Development Project of the Beijing Municipal Education Commission (Grant No. KM201810005005).

Author information

Corresponding author

Correspondence to Lei Liu.

About this article

Cite this article

Jia, Y., Liu, L., Chen, H. et al. A Chinese unknown word recognition method for micro-blog short text based on improved FP-growth. Pattern Anal Applic 23, 1011–1020 (2020). https://doi.org/10.1007/s10044-019-00833-z

Keywords

  • Unknown word recognition
  • FP-growth algorithm
  • Mutual information
  • Information entropy