Comparing High Dimensional Word Embeddings Trained on Medical Text to Bag-of-Words for Predicting Medical Codes

  • Vithya Yogarajan
  • Henry Gouk
  • Tony Smith
  • Michael Mayo
  • Bernhard Pfahringer
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12033)


Word embeddings are a useful tool for extracting knowledge from the free-form text contained in electronic health records, but it has become commonplace to train such word embeddings on data that do not accurately reflect how language is used in a healthcare context. We use prediction of medical codes as an example application to compare the accuracy of word embeddings trained on health corpora to those trained on more general collections of text. We show that both an increase in embedding dimensionality and an increase in the volume of health-related training data improve prediction accuracy. We also present a comparison to the traditional bag-of-words feature representation, demonstrating that in many cases this conceptually simple method for representing text yields higher accuracy than word embeddings.
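To make the two feature representations concrete, the following is a minimal sketch of how a single clinical note might be turned into a bag-of-words vector and into an averaged word-embedding vector. It is not the authors' pipeline: the toy note, the four-word vocabulary, the 3-dimensional vectors, and the helper names `tokenize`, `bag_of_words`, and `mean_embedding` are all assumptions made for illustration. Averaging word vectors is one common way to get a document-level feature from embeddings, and may differ from what the paper does.

```python
from collections import Counter

# Hypothetical toy data: this vocabulary and these 3-dimensional vectors
# are made up for illustration and do not come from the paper.
TOY_EMBEDDINGS = {
    "chest": [0.9, 0.1, 0.0],
    "pain": [0.8, 0.2, 0.1],
    "fever": [0.1, 0.9, 0.2],
    "cough": [0.2, 0.8, 0.3],
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bag_of_words(text, vocabulary):
    """Return one count per vocabulary word; length equals vocabulary size."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


def mean_embedding(text, embeddings, dim):
    """Average the vectors of known words; length equals embedding dimension."""
    vectors = [embeddings[t] for t in tokenize(text) if t in embeddings]
    if not vectors:
        return [0.0] * dim  # no known words: fall back to a zero vector
    return [sum(column) / len(vectors) for column in zip(*vectors)]


note = "chest pain and chest tightness"  # hypothetical note text
vocabulary = sorted(TOY_EMBEDDINGS)      # ['chest', 'cough', 'fever', 'pain']

bow_features = bag_of_words(note, vocabulary)
emb_features = mean_embedding(note, TOY_EMBEDDINGS, dim=3)

print(bow_features)  # sparse counts, one entry per vocabulary word
print(emb_features)  # dense vector, one entry per embedding dimension
```

Either vector could then be passed to a binary classifier for a given medical code. The bag-of-words vector grows with the vocabulary, while the embedding vector has a fixed size set by the embedding dimensionality.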


Word embeddings · Binary classification · Machine learning for health



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Department of Computer Science, University of Waikato, Hamilton, New Zealand
  2. School of Informatics, University of Edinburgh, Edinburgh, Scotland
