Machine Learning for Structured Clinical Data

  • Brett Beaulieu-Jones
Part of the Intelligent Systems Reference Library book series (ISRL, volume 137)


Research is a tertiary priority in the electronic health record (EHR), where the primary priorities are patient care and billing. Because of this, the data is not standardized or formatted in a manner easily adapted to machine learning approaches. Data may be missing for a wide variety of reasons, ranging from individual input styles to differences in clinical decision making (for example, which lab tests to order). Few patients are annotated at research quality, which limits sample size and presents a moving gold standard. Patient progression over time is key to understanding many diseases, but many machine learning algorithms require a snapshot at a single time point to create a usable vector form. Furthermore, algorithms that produce black-box results do not provide the interpretability required for clinical adoption. This chapter discusses these challenges and others in applying machine learning techniques to the structured EHR (i.e., patient demographics, family history, medication information, vital signs, laboratory tests, and genetic testing). It does not cover feature extraction from additional sources such as imaging data or free-text patient notes, but the approaches discussed can include features extracted from these sources.


Missing data, Semi-supervised machine learning, Longitudinal modeling, Machine learning interpretability



Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. Institute of Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, USA
