Abstract
Objective: Automatic recognition of medical concepts in unstructured text is an important component of many clinical and research applications, and its accuracy has a large impact on electronic health record analysis. The mining of such terms is complicated by the broad use of synonyms and non-standard terms in medical documents. Here we present a machine learning model for concept recognition in large unstructured text that optimizes the use of ontological structures and can identify previously unobserved synonyms for concepts in the ontology.
Materials and Methods: We present a neural dictionary model that can be used to predict whether a phrase is synonymous with a concept in a reference ontology. Our model, called Neural Concept Recognizer (NCR), uses a convolutional neural network and the taxonomy structure to encode input phrases, then ranks medical concepts based on similarity in that space. It also uses the biomedical ontology structure to optimize the embedding of various terms and has fewer training constraints than previous methods. We train our model on two biomedical ontologies, the Human Phenotype Ontology (HPO) and SNOMED-CT.
Results: We tested our model trained on HPO on two different data sets: 288 annotated PubMed abstracts and 39 clinical reports. We also tested our model trained on SNOMED-CT on 2000 MIMIC-III ICU discharge summaries. The results of our experiments show the high accuracy of our model, as well as the value of utilizing the taxonomy structure of the ontology in concept recognition.
Conclusion: Most popular medical concept recognizers rely on rule-based models, which cannot generalize well to unseen synonyms. In addition, most machine learning methods require large corpora of annotated text that cover all classes of concepts, which can be extremely difficult to obtain for biomedical ontologies. Without relying on large-scale labeled training data or requiring any custom training, our model can efficiently generalize to new synonyms and performs as well as or better than state-of-the-art methods custom-built for specific ontologies.
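The ranking step described under Materials and Methods can be illustrated with a minimal sketch. It is not the NCR implementation: the taxonomy, the concept names, and the vectors below are hypothetical toy values, the phrase vector stands in for the output of the convolutional encoder, and building each concept embedding by summing the vectors of the concept and its ancestors is an assumption about one way the taxonomy structure could be used.

```python
import math

# Hypothetical toy taxonomy: concept -> list of direct parents.
# These are illustrative names, not real HPO or SNOMED-CT identifiers.
PARENTS = {
    "root": [],
    "heart_abnormality": ["root"],
    "arrhythmia": ["heart_abnormality"],
    "kidney_abnormality": ["root"],
}

# Hypothetical raw concept vectors; a trained model would learn these.
RAW = {
    "root": [0.1, 0.1, 0.1],
    "heart_abnormality": [1.0, 0.0, 0.0],
    "arrhythmia": [0.0, 1.0, 0.0],
    "kidney_abnormality": [0.0, 0.0, 1.0],
}


def ancestors(concept):
    """Return the concept itself plus all of its ancestors in the taxonomy."""
    seen = set()
    stack = [concept]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS[node])
    return seen


def concept_embedding(concept):
    """Assumed aggregation: sum the raw vectors of a concept and its ancestors."""
    total = [0.0] * len(RAW[concept])
    for node in ancestors(concept):
        for i, value in enumerate(RAW[node]):
            total[i] += value
    return total


def rank_concepts(phrase_vector):
    """Score concepts by dot product, softmax-normalise, and sort descending."""
    scores = {
        c: sum(p * e for p, e in zip(phrase_vector, concept_embedding(c)))
        for c in RAW
    }
    peak = max(scores.values())  # subtract the maximum for numerical stability
    weights = {c: math.exp(s - peak) for c, s in scores.items()}
    norm = sum(weights.values())
    ranked = [(c, w / norm) for c, w in weights.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Stand-in for the encoder output of an input phrase (hypothetical values).
phrase_vector = [0.9, 1.1, 0.0]
ranking = rank_concepts(phrase_vector)
for concept, probability in ranking:
    print(f"{concept}: {probability:.3f}")
```

Because a child concept inherits the vectors of its ancestors in this sketch, a phrase that matches a specific concept also scores its parent above unrelated concepts, which is one way a taxonomy could shape the ranking.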
Acknowledgements
We thank Michael Glueck for his valuable comments and discussions. We also thank Tudor Groza for his helpful comments and for providing us with the BioLarK API used for the experiments.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Arbabi, A., Adams, D.R., Fidler, S., Brudno, M. (2019). Identifying Clinical Terms in Free-Text Notes Using Ontology-Guided Machine Learning. In: Cowen, L. (ed.) Research in Computational Molecular Biology. RECOMB 2019. Lecture Notes in Computer Science, vol. 11467. Springer, Cham. https://doi.org/10.1007/978-3-030-17083-7_2
DOI: https://doi.org/10.1007/978-3-030-17083-7_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-17082-0
Online ISBN: 978-3-030-17083-7
eBook Packages: Computer Science, Computer Science (R0)