Beyond Volume: The Impact of Complex Healthcare Data on the Machine Learning Pipeline

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10344)

Abstract

From medical charts to national censuses, healthcare has traditionally operated under a paper-based paradigm. However, the past decade has marked a long and arduous transformation bringing healthcare into the digital age. From electronic health records, to digitized imaging and laboratory reports, to public health datasets, healthcare now generates an enormous amount of digital information. Such a wealth of data presents an exciting opportunity for integrated machine learning solutions to address problems across multiple facets of healthcare practice and administration. Unfortunately, deriving accurate and informative insights requires more than the ability to execute machine learning models. Rather, a deeper understanding of the data on which the models are run is imperative for their success. While significant effort has gone into developing models able to process the volume of data produced by millions of digitized patient records, volume represents only one aspect of the data. Because it is drawn from an increasingly diverse set of sources, healthcare data presents a complex set of attributes that must be accounted for throughout the machine learning pipeline. This chapter highlights these challenges and is organized into three components, each representing a phase of the pipeline. We begin with attributes of the data accounted for during preprocessing, then move to considerations during model building, and end with challenges to the interpretation of model output. For each component, we discuss the data as it relates to the healthcare domain and offer insight into the challenges each may impose on the efficiency of machine learning techniques.

Author information

Corresponding author

Correspondence to Nitesh V. Chawla.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Feldman, K., Faust, L., Wu, X., Huang, C., Chawla, N.V. (2017). Beyond Volume: The Impact of Complex Healthcare Data on the Machine Learning Pipeline. In: Holzinger, A., Goebel, R., Ferri, M., Palade, V. (eds) Towards Integrative Machine Learning and Knowledge Extraction. Lecture Notes in Computer Science, vol 10344. Springer, Cham. https://doi.org/10.1007/978-3-319-69775-8_9

  • DOI: https://doi.org/10.1007/978-3-319-69775-8_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-69774-1

  • Online ISBN: 978-3-319-69775-8

  • eBook Packages: Computer Science, Computer Science (R0)