TCMEF: A TCM Entity Filter Using Less Text

  • Hualong Zhang
  • Shuzhi Cheng
  • Liting Liu
  • Wenxuan Shi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11061)

Abstract

When building a knowledge graph for a specific domain, we often need to extract a subset of relevant entities from existing knowledge graphs or websites. In the area of Traditional Chinese Medicine (TCM), this means screening relevant entities from knowledge bases and websites. In this paper, we propose a three-phase TCM entity filter (TCMEF) that identifies TCM-related entities with high accuracy using only the very short text of entity titles rather than analyzing long document texts. The main component of our method is a Short Text LSTM Classifier (STLC), which learns the text style of TCM terms from joint stroke and character features, without word segmentation. In addition, entities that represent person names, which are difficult for STLC to classify, are picked out by a Person Name Filter (PNF) and further analyzed by a Rich Text Filter (RTF). The filter uses BaiduBaike and HudongBaike (the two largest Chinese encyclopedia websites) as its main data sources. TCMEF achieves an F1 score of 0.9275 in classification, which outperforms general word-based short text classification algorithms and is close to that of a Latent Dirichlet Allocation based model (LDA-SVM) that uses rich texts.
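The three-phase routing described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `tcm_entity_filter`, its parameter names, and all `toy_*` stand-ins are hypothetical, and the routing order (PNF first, then RTF for person names and STLC for everything else) is an assumption drawn from the abstract's wording. The real STLC is an LSTM over stroke and character features; here it is replaced by a trivial suffix rule so the sketch stays runnable.

```python
from typing import Callable


def tcm_entity_filter(
    title: str,
    is_person_name: Callable[[str], bool],   # hypothetical PNF
    stlc_is_tcm: Callable[[str], bool],      # hypothetical STLC (title only)
    rtf_is_tcm: Callable[[str], bool],       # hypothetical RTF (rich text)
    fetch_rich_text: Callable[[str], str],   # hypothetical page-text lookup
) -> bool:
    """Return True if the entity is judged TCM-related (assumed routing)."""
    if is_person_name(title):
        # Person names are hard to judge from the title alone,
        # so fall back to the longer encyclopedia text.
        return rtf_is_tcm(fetch_rich_text(title))
    # All other entities are classified from the short title only.
    return stlc_is_tcm(title)


# Toy stand-ins, for illustration only; none of these reflect the paper's models.
TOY_PAGES = {
    "李时珍": "明代医药学家，著有《本草纲目》",
    "李白": "唐代诗人",
}


def toy_is_person_name(title: str) -> bool:
    return title in TOY_PAGES


def toy_stlc_is_tcm(title: str) -> bool:
    # Crude suffix rule standing in for the LSTM classifier.
    return title.endswith(("汤", "丸", "散"))


def toy_fetch_rich_text(title: str) -> str:
    return TOY_PAGES.get(title, "")


def toy_rtf_is_tcm(text: str) -> bool:
    # Crude keyword rule standing in for the rich text filter.
    return "医" in text or "本草" in text


def demo(title: str) -> bool:
    return tcm_entity_filter(
        title, toy_is_person_name, toy_stlc_is_tcm,
        toy_rtf_is_tcm, toy_fetch_rich_text,
    )
```

Passing the three filters in as callables keeps the routing logic separate from the models, so each phase could be swapped or evaluated on its own.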

Keywords

TCM entity filter; Short text classification; Chinese stroke

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Nankai University, Tianjin, China
