Abstract
When building a knowledge graph for a specific domain, we often need to extract a subset of relevant entities from existing knowledge graphs or websites. In Traditional Chinese Medicine (TCM), this means screening relevant entities from knowledge bases and websites. This paper proposes a three-phase TCM entity filter (TCMEF) that identifies TCM-related entities with high accuracy using only the text of very short entity titles, rather than analyzing long document texts. The core of the method is a Short Text LSTM Classifier (STLC), which learns the text style of TCM terms from joint stroke and character features, without word segmentation. In addition, entities that represent person names, which are difficult for STLC to classify, are picked out by a Person Name Filter (PNF) and further analyzed by a Rich Text Filter (RTF). The filter uses BaiduBaike and HudongBaike (the two largest Chinese encyclopedia websites) as its main data sources. TCMEF achieves an F1 score of 0.9275 in classification, which outperforms general word-based short text classification algorithms and is close to a Latent Dirichlet Allocation based model (LDA-SVM) that uses rich texts.
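The routing between the three components can be illustrated with a minimal sketch. It assumes, based only on the abstract, that titles flagged as person names are sent to the rich-text phase and all other titles are decided by the short-text classifier; the exact ordering of phases in the paper may differ. The names Entity, make_filter, and the three toy rules are hypothetical stand-ins for illustration, not the authors' STLC, PNF, or RTF models.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Entity:
    title: str           # short entity title, e.g. an encyclopedia page name
    rich_text: str = ""  # longer page text, consulted only by the rich-text phase


def make_filter(
    is_person_name: Callable[[str], bool],     # stand-in for PNF
    short_text_is_tcm: Callable[[str], bool],  # stand-in for STLC
    rich_text_is_tcm: Callable[[str], bool],   # stand-in for RTF
) -> Callable[[Entity], bool]:
    """Compose three phase functions into one entity filter (assumed routing)."""

    def is_tcm(entity: Entity) -> bool:
        # Person names are hard to judge from the title alone,
        # so they fall back to the longer page text.
        if is_person_name(entity.title):
            return rich_text_is_tcm(entity.rich_text)
        # Everything else is decided from the short title only.
        return short_text_is_tcm(entity.title)

    return is_tcm


# Toy rules, purely illustrative; the paper uses learned models instead.
SURNAMES = {"张", "李", "王"}


def toy_person(title: str) -> bool:
    return len(title) in (2, 3) and title[0] in SURNAMES


def toy_short(title: str) -> bool:
    # Common suffix characters of TCM formula names (decoction, powder, pill).
    return any(ch in title for ch in "汤散丸")


def toy_rich(text: str) -> bool:
    return "中医" in text


tcm_filter = make_filter(toy_person, toy_short, toy_rich)
```

Passing the phases in as callables keeps the routing logic separate from the models, so a trained classifier could replace any toy rule without changing the control flow.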
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, H., Cheng, S., Liu, L., Shi, W. (2018). TCMEF: A TCM Entity Filter Using Less Text. In: Liu, W., Giunchiglia, F., Yang, B. (eds) Knowledge Science, Engineering and Management. KSEM 2018. Lecture Notes in Computer Science, vol 11061. Springer, Cham. https://doi.org/10.1007/978-3-319-99365-2_11
DOI: https://doi.org/10.1007/978-3-319-99365-2_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99364-5
Online ISBN: 978-3-319-99365-2
eBook Packages: Computer Science, Computer Science (R0)