A Chinese unknown word recognition method for micro-blog short text based on improved FP-growth

  • Yalu Jia
  • Lei Liu
  • Hao Chen
  • Yinghong Sun
Industrial and commercial application


Unknown word recognition is important for improving the precision of text segmentation and syntactic analysis. Social networks have become an important platform for sharing, disseminating, and acquiring information, and unknown word recognition in micro-blog short text has become a research hot spot. However, micro-blog text contains a large number of nonstandard terms and network buzzwords, which makes unknown word recognition more difficult. This paper proposes a Chinese unknown word recognition method for micro-blog short text based on an improved FP-growth algorithm (POS-FP). First, the POS-FP algorithm is used to extract frequent itemsets from micro-blog text, and an N-gram model is used to filter candidate unknown words from the frequent itemsets. Second, improved mutual information and left–right information entropy are used to verify the internal features of the candidate unknown words. Then, context-dependent and open-source methods are used for external verification of the candidates. Compared with traditional methods, this algorithm improves the recognition rate of unknown words in micro-blog short texts.
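The internal-feature check mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard definitions of pointwise mutual information and left/right neighbor entropy, not the improved variants proposed in the paper, whose details are not given here. The function names (`entropy`, `neighbor_entropies`, `pmi`) and the toy input are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(counter):
    """Shannon entropy (in bits) of the distribution given by raw counts."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def neighbor_entropies(candidate, texts):
    """Left and right neighbor entropy of `candidate` over a list of strings.

    A candidate that appears next to many different characters (high entropy
    on both sides) behaves like a free-standing word.
    """
    left, right = Counter(), Counter()
    n = len(candidate)
    for text in texts:
        start = text.find(candidate)
        while start != -1:
            if start > 0:
                left[text[start - 1]] += 1
            if start + n < len(text):
                right[text[start + n]] += 1
            start = text.find(candidate, start + 1)
    return entropy(left), entropy(right)


def pmi(candidate, texts):
    """Standard pointwise mutual information of a two-character candidate.

    Assumption: probabilities are estimated from raw character unigram and
    bigram counts; this is the textbook formula, not the paper's variant.
    """
    if len(candidate) != 2:
        raise ValueError("this sketch only handles two-character candidates")
    unigrams, bigrams = Counter(), Counter()
    for text in texts:
        unigrams.update(text)
        bigrams.update(text[i:i + 2] for i in range(len(text) - 1))
    if bigrams[candidate] == 0:
        return float("-inf")
    p_xy = bigrams[candidate] / sum(bigrams.values())
    p_x = unigrams[candidate[0]] / sum(unigrams.values())
    p_y = unigrams[candidate[1]] / sum(unigrams.values())
    return math.log2(p_xy / (p_x * p_y))


if __name__ == "__main__":
    # Toy stand-in for segmented micro-blog text (hypothetical data).
    posts = ["xaby", "zabw", "qabq"]
    print(pmi("ab", posts), neighbor_entropies("ab", posts))
```

In a pipeline like the one described, a candidate would typically be kept only if both its mutual information and its neighbor entropies exceed chosen thresholds; the threshold values are not stated in this abstract.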


Unknown word recognition; FP-growth algorithm; Mutual information; Information entropy



This work is supported by the National Natural Science Foundation of China (Grant Nos. 61105040, 61203284), the Beijing Natural Science Foundation (Grant No. 4133085), and the general program of the science and technology development project of the Beijing Municipal Education Commission (Grant No. KM201810005005).



Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2019

Authors and Affiliations

  1. College of Applied Sciences, Beijing University of Technology, Beijing, China
  2. Taiji Computer Co., Ltd, Beijing, China
  3. Beijing Institute for Scientific and Engineering Computing, Beijing University of Technology, Beijing, China
