TCMEF: A TCM Entity Filter Using Less Text
When building a knowledge graph for a specific field, we often need to extract a subset of relevant entities from existing knowledge graphs or websites. In the area of Traditional Chinese Medicine (TCM), this means screening relevant entities from knowledge bases and websites. In this paper, we propose a three-phase TCM entity filter (TCMEF) that identifies TCM-related entities with high accuracy using only the text of very short entity titles, rather than analyzing long document texts. The main part of our method is a Short Text LSTM Classifier (STLC), which learns the text style of TCM terms from joint stroke and character features, without word segmentation. In addition, entities that represent person names, which are difficult for STLC to classify, are picked out by a Person Name Filter (PNF) and further analyzed by a Rich Text Filter (RTF). The filter uses BaiduBaike and HudongBaike (the two largest Chinese encyclopedia websites) as its main data sources. TCMEF achieves an F1 score of 0.9275 in classification, which outperforms general word-based short text classification algorithms and is close to a Latent Dirichlet Allocation based model (LDA-SVM) that uses rich texts.
Keywords: TCM entity filter, Short text classification, Chinese stroke
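The three-phase flow described in the abstract can be illustrated with a minimal routing sketch. This is not the authors' implementation: the phase order (person-name check first, then either the rich-text or the short-text classifier), the `Entity` type, the `make_tcm_filter` helper, and the toy stand-in functions are all assumptions made for illustration. In the actual system, the STLC is an LSTM over stroke and character features, not a character lookup.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Entity:
    title: str
    # Long page text; assumed to be needed only by the rich-text phase.
    document: Optional[str] = None


def make_tcm_filter(
    looks_like_person_name: Callable[[str], bool],  # stand-in for PNF
    classify_title: Callable[[str], bool],          # stand-in for STLC
    classify_document: Callable[[str], bool],       # stand-in for RTF
) -> Callable[[Entity], bool]:
    """Compose three hypothetical phase functions into one filter."""

    def is_tcm_entity(entity: Entity) -> bool:
        # Person names are hard to judge from the title alone,
        # so they are routed to the rich-text phase.
        if looks_like_person_name(entity.title):
            return classify_document(entity.document or "")
        # Every other entity is judged from its short title only.
        return classify_title(entity.title)

    return is_tcm_entity


# Toy stand-ins (assumptions, far simpler than the models in the paper).
TOY_SURNAMES = {"张", "李", "王"}
TOY_TCM_CHARS = set("汤丸散参芪")


def toy_pnf(title: str) -> bool:
    return 2 <= len(title) <= 3 and title[0] in TOY_SURNAMES


def toy_stlc(title: str) -> bool:
    return any(ch in TOY_TCM_CHARS for ch in title)


def toy_rtf(document: str) -> bool:
    return "中医" in document


tcm_filter = make_tcm_filter(toy_pnf, toy_stlc, toy_rtf)
```

Passing the three phases in as callables keeps the routing logic separate from the models, so each stand-in could be swapped for a trained classifier without changing the control flow.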