Automated Coding of Medical Diagnostics from Free-Text: The Role of Parameters Optimization and Imbalanced Classes

  • Luiz Virginio
  • Julio Cesar dos Reis
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11371)


The extraction of codes from Electronic Health Record (EHR) data is an important task because the extracted codes serve several purposes, such as billing and reimbursement, quality control, epidemiological studies, and cohort identification for clinical trials. The codes are based on standardized vocabularies. Diagnoses, for example, are frequently coded using the International Classification of Diseases (ICD), a taxonomy of diagnosis codes organized in a hierarchical structure. Extracting codes from free-text medical notes in EHRs, such as discharge summaries, requires reviewing patient data in search of information that can be coded in a standardized manner. Manual code assignment is a complex and time-consuming process. The use of machine learning and natural language processing approaches to automate ICD coding has been receiving increasing attention. In this article, we investigate the use of Support Vector Machines (SVM) and the binary relevance method for multi-label classification in the task of automatic ICD coding from free-text discharge summaries. In particular, we explore the role of SVM parameter optimization and class weighting in addressing imbalanced classes. Experiments conducted with the Medical Information Mart for Intensive Care III (MIMIC III) database reached a macro-averaged F1 score of 49.86% for the 100 most frequent diagnoses. Our findings indicate that optimizing the SVM parameters and using class weighting can improve the effectiveness of the classifier.
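To make the evaluation setting concrete, the following is a minimal sketch, in plain Python, of two ideas named in the abstract: class weighting for an imbalanced binary problem and the macro-averaged F1 score under binary relevance (one binary problem per code). The abstract does not state which weighting formula was used; the inverse-frequency heuristic below, the function names, and the toy label matrices are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Assumed heuristic: weight = n_samples / (n_classes * class_count).

    Rare classes receive larger weights, so their errors cost more.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def f1_binary(y_true, y_pred):
    """F1 for a single code treated as one binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(Y_true, Y_pred):
    """Unweighted mean of per-code F1 scores.

    Y_true and Y_pred are lists of rows: one row per document,
    one 0/1 column per code (the binary relevance decomposition).
    """
    n_labels = len(Y_true[0])
    scores = []
    for j in range(n_labels):
        col_true = [row[j] for row in Y_true]
        col_pred = [row[j] for row in Y_pred]
        scores.append(f1_binary(col_true, col_pred))
    return sum(scores) / n_labels


# Toy data (hypothetical): 4 documents, 2 codes; the second code is rare.
Y_true = [[1, 0], [1, 0], [0, 1], [1, 0]]
Y_pred = [[1, 0], [1, 0], [0, 0], [0, 0]]

rare_code_column = [row[1] for row in Y_true]
weights = balanced_class_weights(rare_code_column)
score = macro_f1(Y_true, Y_pred)
print(weights, score)
```

Because the macro average gives every code the same weight regardless of its frequency, a classifier that never predicts the rare code is penalized heavily on this toy data, which is why class weighting matters for this metric.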


Automated ICD coding · Multi-label classification · Imbalanced classes



This work is supported by the São Paulo Research Foundation (FAPESP) (Grant #2017/02325-5).



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. University of Campinas, Campinas, São Paulo, Brazil
