Identifying Clinical Terms in Free-Text Notes Using Ontology-Guided Machine Learning
Objective: Automatic recognition of medical concepts in unstructured text is an important component of many clinical and research applications, and its accuracy has a large impact on electronic health record analysis. The mining of such terms is complicated by the broad use of synonyms and non-standard terms in medical documents. Here we present a machine learning model for concept recognition in large unstructured text that optimizes the use of ontological structures and can identify previously unobserved synonyms for concepts in the ontology.
Materials and Methods: We present a neural dictionary model that predicts whether a phrase is synonymous with a concept in a reference ontology. Our model, called Neural Concept Recognizer (NCR), uses a convolutional neural network together with the taxonomy structure to encode input phrases, then ranks medical concepts by their similarity in the resulting embedding space. It also uses the biomedical ontology structure to optimize the embedding of various terms and has fewer training constraints than previous methods. We train our model on two biomedical ontologies, the Human Phenotype Ontology (HPO) and SNOMED-CT.
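The encode-then-rank idea described above can be illustrated with a minimal sketch. It is not the NCR implementation: the toy taxonomy, the hand-picked vectors, and the function names are all hypothetical, the element-wise max over word vectors is only a stand-in for a trained convolutional encoder, and building a concept embedding by summing the raw vectors of a concept and its ancestors is an assumption about one way taxonomy structure could be shared between related concepts.

```python
import math

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "abnormality": None,
    "heart_abnormality": "abnormality",
    "arrhythmia": "heart_abnormality",
    "kidney_abnormality": "abnormality",
}

# Hypothetical raw concept vectors; a trained model would learn these.
RAW = {
    "abnormality": [0.1, 0.1, 0.1],
    "heart_abnormality": [1.0, 0.0, 0.0],
    "arrhythmia": [0.0, 1.0, 0.0],
    "kidney_abnormality": [0.0, 0.0, 1.0],
}

# Hypothetical word vectors standing in for pretrained word embeddings.
WORDS = {
    "irregular": [0.2, 0.9, 0.0],
    "heartbeat": [0.8, 0.6, 0.0],
    "renal": [0.0, 0.0, 0.9],
    "defect": [0.1, 0.1, 0.1],
}

DIMS = 3


def ancestors(concept):
    """Return the concept followed by all of its ancestors up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def concept_embedding(concept):
    """Assumed scheme: sum the raw vectors of a concept and its ancestors,
    so that concepts sharing ancestors also share part of their embedding."""
    chain = ancestors(concept)
    return [sum(RAW[c][i] for c in chain) for i in range(DIMS)]


def encode_phrase(phrase):
    """Stand-in encoder: element-wise max over known word vectors.
    A real model would use a trained convolutional network here."""
    vectors = [WORDS[w] for w in phrase.lower().split() if w in WORDS]
    if not vectors:
        return [0.0] * DIMS
    return [max(column) for column in zip(*vectors)]


def rank_concepts(phrase):
    """Score every concept by dot product with the encoded phrase, then
    apply a softmax and return (concept, probability) pairs, best first."""
    query = encode_phrase(phrase)
    scores = {
        c: sum(q * e for q, e in zip(query, concept_embedding(c)))
        for c in RAW
    }
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    ranked = [(c, v / total) for c, v in exps.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for concept, prob in rank_concepts("irregular heartbeat"):
        print(f"{concept}: {prob:.3f}")
```

In this toy setup, the phrase "irregular heartbeat" is never listed as a synonym of any concept, yet it is ranked closest to the hypothetical `arrhythmia` concept because its encoded vector aligns with that concept's embedding and with the embedding of its parent.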
Results: We tested our model trained on HPO on two different data sets: 288 annotated PubMed abstracts and 39 clinical reports. We also tested our model trained on SNOMED-CT on 2000 MIMIC-III ICU discharge summaries. The results of our experiments show the high accuracy of our model, as well as the value of utilizing the taxonomy structure of the ontology in concept recognition.
Conclusion: Most popular medical concept recognizers rely on rule-based models, which cannot generalize well to unseen synonyms. In addition, most machine learning methods require large corpora of annotated text that cover all classes of concepts, which can be extremely difficult to obtain for biomedical ontologies. Without relying on large-scale labeled training data or requiring any custom training, our model can efficiently generalize to new synonyms and performs as well as or better than state-of-the-art methods custom built for specific ontologies.
Keywords: Concept recognition, Ontologies, Named entity recognition, Phenotyping, Human Phenotype Ontology
We thank Michael Glueck for his valuable comments and discussions. We also thank Tudor Groza for his helpful comments and for providing us the BioLarK API used for the experiments.