Identifying Clinical Terms in Free-Text Notes Using Ontology-Guided Machine Learning

  • Aryan Arbabi
  • David R. Adams
  • Sanja Fidler
  • Michael BrudnoEmail author
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11467)


Objective: Automatic recognition of medical concepts in unstructured text is an important component of many clinical and research applications and its accuracy has a large impact on electronic health record analysis. The mining of such terms is complicated by the broad use of synonyms and non-standard terms in medical documents. Here we presented a machine learning model for concept recognition in large unstructured text which optimizes the use of ontological structures and can identify previously unobserved synonyms for concepts in the ontology.

Materials and Methods: We present a neural dictionary model which can be used to predict if a phrase is synonymous to a concept in a reference ontology. Our model, called Neural Concept Recognizer (NCR), uses a convolutional neural network and utilizes the taxonomy structure to encode input phrases, then rank medical concepts based on the similarity in that space. It also utilizes the biomedical ontology structure to optimize the embedding of various terms and has fewer training constraints than previous methods. We train our model on two biomedical ontologies, the Human Phenotype Ontology (HPO) and SNOMED-CT.

Results: We tested our model trained on HPO on two different data sets: 288 annotated PubMed abstracts and 39 clinical reports. We also tested our model trained on the SNOMED-CT on 2000 MIMIC-III ICU discharge summaries. The results of our experiments show the high accuracy of our model, as well as the value of utilizing the taxonomy structure of the ontology in concept recognition.

Conclusion: Most popular medical concept recognizers rely on rule-based models, which cannot generalize well to unseen synonyms. Also, most machine learning methods typically require large corpora of annotated text that cover all classes of concepts, which can be extremely difficult to get for biomedical ontologies. Without relying on a large-scale labeled training data or requiring any custom training, our model can efficiently generalize to new synonyms and performs as well or better than state-of-the-art methods custom built for specific ontologies.


Concept recognition Ontologies Named entity recognition Phenotyping Human Phenotype Ontology 



We thank Michael Glueck for his valuable comments and discussions. We also thank Tudor Groza for his helpful comments and for providing us the BioLarK API used for the experiments.


  1. 1.
    Simmons, M., Singhal, A., Lu, Z.: Text mining for precision medicine: bringing structure to EHRs and biomedical literature to understand genes and health. In: Shen, B., Tang, H., Jiang, X. (eds.) Translational Biomedical Informatics. AEMB, vol. 939, pp. 139–166. Springer, Singapore (2016). Scholar
  2. 2.
    Jonnagaddala, J., Dai, H.-J., Ray, P., Liaw, S.-T.: Mining electronic health records to guide and support clinical decision support systems. In: Healthcare Ethics and Training: Concepts, Methodologies, Tools, and Applications, pp. 184–201. IGI Global (2017)Google Scholar
  3. 3.
    Luo, Y., et al.: Natural language processing for EHR-based pharmacovigilance: a structured review. Drug Saf. 40(11), 1075–1089 (2017)CrossRefGoogle Scholar
  4. 4.
    Gonzalez, G.H., Tahsin, T., Goodale, B.C., Greene, A.C., Greene, C.S.: Recent advances and emerging applications in text and data mining for biomedical discovery. Brief. Bioinform. 17(1), 33–42 (2015)CrossRefGoogle Scholar
  5. 5.
    Piñero, J., et al.: DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database 2015 (2015)Google Scholar
  6. 6.
  7. 7.
    Köhler, S., et al.: The human phenotype ontology in 2017. Nucleic Acids Res. 45(D1), D865–D876 (2017)CrossRefGoogle Scholar
  8. 8.
    Lochmüller, H., et al.: ‘IRDiRC Recognized Resources’: a new mechanism to support scientists to conduct efficient, high-quality research for rare diseases. Eur. J. Hum. Genet. 25(2), 162–165 (2017)CrossRefGoogle Scholar
  9. 9.
    Rehm, H.L., et al.: ClinGen—the clinical genome resource. N. Engl. J. Med. 372(23), 2235–2242 (2015)CrossRefGoogle Scholar
  10. 10.
    Jonquet, C., Shah, N.H., Musen, M.A.: The open biomedical annotator. Summit Transl. Bioinform. 2009, 56 (2009)Google Scholar
  11. 11.
    Taboada, M., Rodríguez, H., Martínez, D., Pardo, M., Sobrido, M.J.: Automated semantic annotation of rare disease cases: a case study. Database (Oxford) 2014 (2014)Google Scholar
  12. 12.
    Aronson, A.R.: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: Proceedings of the AMIA Symposium, p. 17 (2001)Google Scholar
  13. 13.
    Savova, G.K., et al.: Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J. Am. Med. Inform. Assoc. 17(5), 507–513 (2010)CrossRefGoogle Scholar
  14. 14.
    Groza, T., et al.: Automatic concept recognition using the Human Phenotype Ontology reference and test suite corpora. Database 2015, bav005 (2015)CrossRefGoogle Scholar
  15. 15.
    Lobo, M., Lamurias, A., Couto, F.M.: Identifying human phenotype terms by combining machine learning and validation rules. Biomed. Res. Int. 2017, Article no. 8565739 (2017)Google Scholar
  16. 16.
    Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. arXiv Preprint arXiv:1603.01360 (2016)
  17. 17.
    Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging. arXiv Preprint arXiv:1508.01991 (2015)
  18. 18.
    Ma, X., Hovy, E.: End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv Preprint arXiv:1603.01354 (2016)
  19. 19.
    Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)CrossRefGoogle Scholar
  20. 20.
    Lafferty, J., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data (2001)Google Scholar
  21. 21.
    Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, vol. 4, pp. 142–147 (2003)Google Scholar
  22. 22.
    Johnson, A.E.W., et al.: MIMIC-III, a freely accessible critical care database. Sci. Data 3 (2016)Google Scholar
  23. 23.
    Girdea, M., et al.: PhenoTips: patient phenotyping software for clinical and research use. Hum. Mutat. 34(8), 1057–1065 (2013)CrossRefGoogle Scholar
  24. 24.
    Glueck, M., et al.: PhenoLines: phenotype comparison visualizations for disease subtyping via topic models. IEEE Trans. Vis. Comput. Graph. 24(1), 371–381 (2018)CrossRefGoogle Scholar
  25. 25.
    Habibi, M., Weber, L., Neves, M., Wiegandt, D.L., Leser, U.: Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics 33(14), i37–i48 (2017)CrossRefGoogle Scholar
  26. 26.
    Vani, A., Jernite, Y., Sontag, D.: Grounded recurrent neural networks. arXiv Preprint arXiv:1705.08557 (2017)
  27. 27.
    Deng, J., et al.: Large-scale object classification using label relation graphs. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 48–64. Springer, Cham (2014). Scholar
  28. 28.
    Vendrov, I., Kiros, R., Fidler, S., Urtasun, R.: Order-embeddings of images and language. arXiv Preprint arXiv:1511.06361 (2015)
  29. 29.
    Neelakantan, A., Roth, B., McCallum, A.: Compositional vector space models for knowledge base inference. In: 2015 AAAI Spring Symposium Series (2015)Google Scholar
  30. 30.
    Nickel, M., Kiela, D.: Poincaré embeddings for learning hierarchical representations. arXiv Preprint arXiv:1705.08039 (2017)
  31. 31.
    Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. arXiv Preprint arXiv:1607.04606 (2016)
  32. 32.
    Kim, Y.: Convolutional neural networks for sentence classification. arXiv Preprint arXiv:1408.5882 (2014)
  33. 33.
    Clevert, D.-A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs). arXiv Preprint arXiv:1511.07289 (2015)
  34. 34.
    Kingma, D., Ba, J.: Adam: a method for stochastic optimization. arXiv Preprint arXiv:1412.6980 (2014)
  35. 35.
    Tifft, C.J., Adams, D.R.: The National Institutes of Health undiagnosed diseases program. Curr. Opin. Pediatr. 26(6), 626 (2014)CrossRefGoogle Scholar
  36. 36.
    Bodenreider, O.: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 32(90001), 267D–270D (2004)CrossRefGoogle Scholar
  37. 37.
    Kiros, R., et al.: Skip-thought vectors. In: Advances in Neural Information Processing Systems, pp. 3294–3302 (2015)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Aryan Arbabi
    • 1
    • 2
    • 3
  • David R. Adams
    • 4
  • Sanja Fidler
    • 1
    • 3
  • Michael Brudno
    • 1
    • 2
    • 3
    Email author
  1. 1.Department of Computer ScienceUniversity of TorontoTorontoCanada
  2. 2.Center for Computational MedicineHospital for Sick ChildrenTorontoCanada
  3. 3.Vector InstituteTorontoCanada
  4. 4.Section on Human Biochemical Genetics, National Human Genome Research InstituteNational Institutes of HealthBethesdaUSA

Personalised recommendations