Unknown word recognition is important for improving the precision of text segmentation and syntactic analysis. Social networks have become an important platform for sharing, disseminating, and acquiring information, and unknown word recognition in micro-blog short text has become a research hot spot. However, micro-blog text contains a large number of nonstandard terms and network buzzwords, which makes unknown words harder to recognize. This paper proposes a Chinese unknown word recognition method for micro-blog short text based on an improved FP-growth algorithm (POS-FP). First, the POS-FP algorithm is used to extract frequent itemsets from micro-blog text, and an N-gram model is used to select candidate unknown words from the frequent itemsets. Second, improved mutual information and left–right information entropy are used to verify the internal features of the candidate unknown words. Then, context-dependent and open-source methods are used for external verification of the candidates. Compared with traditional methods, the algorithm improves the recognition rate of unknown words in micro-blog short texts.
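The internal verification step can be illustrated with a minimal sketch. It assumes standard pointwise mutual information and plain neighbor entropy, not the improved variants proposed in the paper, which the abstract does not specify. The function names `pmi` and `neighbor_entropy`, the pre-segmented token lists, and the toy data are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def pmi(sentences, a, b):
    """Pointwise mutual information (base 2) of the adjacent token pair (a, b).

    Assumption: standard PMI over unigram and bigram relative frequencies,
    not the paper's improved mutual information.
    """
    unigrams = Counter(tok for s in sentences for tok in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    if bigrams[(a, b)] == 0:
        return float("-inf")  # the pair never co-occurs
    p_ab = bigrams[(a, b)] / sum(bigrams.values())
    p_a = unigrams[a] / sum(unigrams.values())
    p_b = unigrams[b] / sum(unigrams.values())
    return math.log2(p_ab / (p_a * p_b))


def neighbor_entropy(sentences, a, b, side):
    """Entropy (bits) of the tokens directly left or right of the pair (a, b).

    A high value on both sides suggests the pair appears in varied contexts,
    which is the usual argument for treating it as a free-standing word.
    """
    neighbors = Counter()
    for s in sentences:
        for i in range(len(s) - 1):
            if s[i] == a and s[i + 1] == b:
                j = i - 1 if side == "left" else i + 2
                if 0 <= j < len(s):
                    neighbors[s[j]] += 1
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neighbors.values())


if __name__ == "__main__":
    # Hypothetical toy corpus: each sentence is a list of already-split tokens.
    toy = [
        ["我", "刷", "微", "博"],
        ["他", "发", "微", "博", "了"],
        ["看", "微", "博", "吧"],
    ]
    print(pmi(toy, "微", "博"))
    print(neighbor_entropy(toy, "微", "博", "left"))
    print(neighbor_entropy(toy, "微", "博", "right"))
```

In a pipeline of the kind the abstract describes, a candidate would plausibly be kept only when both its cohesion score and its left and right entropies exceed chosen thresholds; the thresholds themselves are not given in the abstract.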
This work is supported by the National Natural Science Foundation of China (Grant Nos. 61105040, 61203284), the Beijing Natural Science Foundation (Grant No. 4133085), and the general program of the science and technology development project of the Beijing Municipal Education Commission (Grant No. KM201810005005).
Cite this article
Jia, Y., Liu, L., Chen, H. et al. A Chinese unknown word recognition method for micro-blog short text based on improved FP-growth. Pattern Anal Applic 23, 1011–1020 (2020). https://doi.org/10.1007/s10044-019-00833-z
Keywords
- Unknown word recognition
- FP-growth algorithm
- Mutual information
- Information entropy