Beyond Volume: The Impact of Complex Healthcare Data on the Machine Learning Pipeline

Feldman, Keith; Faust, Louis; Wu, Xian; Huang, Chao; Chawla, Nitesh V.

doi:10.1007/978-3-319-69775-8_9

Conference paper
First Online: 29 October 2017

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10344)

Abstract

From medical charts to national censuses, healthcare has traditionally operated under a paper-based paradigm. However, the past decade has marked a long and arduous transformation bringing healthcare into the digital age. From electronic health records, to digitized imaging and laboratory reports, to public health datasets, healthcare now generates an incredible amount of digital information. Such a wealth of data presents an exciting opportunity for integrated machine learning solutions to address problems across multiple facets of healthcare practice and administration. Unfortunately, deriving accurate and informative insights requires more than the ability to execute machine learning models. Rather, a deeper understanding of the data on which the models are run is imperative for their success. While significant effort has gone into developing models able to process the volume of data obtained during the analysis of millions of digitized patient records, it is important to remember that volume represents only one aspect of the data. In fact, because it is drawn from an increasingly diverse set of sources, healthcare data presents an incredibly complex set of attributes that must be accounted for throughout the machine learning pipeline. This chapter highlights such challenges and is broken down into three distinct components, each representing a phase of the pipeline. We begin with attributes of the data accounted for during preprocessing, then move to considerations during model building, and end with challenges to the interpretation of model output. For each component, we discuss the data as it relates to the healthcare domain and offer insight into the challenges each may impose on the efficiency of machine learning techniques.

Author information

Authors and Affiliations

University of Notre Dame, Notre Dame, IN, USA
Keith Feldman, Louis Faust, Xian Wu, Chao Huang & Nitesh V. Chawla
Indiana Biosciences Research Institute, Indianapolis, IN, 46202, USA
Nitesh V. Chawla

Corresponding author

Correspondence to Nitesh V. Chawla.

Editor information

Editors and Affiliations

Medical University Graz, Graz, Austria
Andreas Holzinger
University of Alberta, Edmonton, Alberta, Canada
Randy Goebel
Bologna University, Bologna, Italy
Massimo Ferri
Coventry University, Coventry, United Kingdom
Vasile Palade

Cite this paper

Feldman, K., Faust, L., Wu, X., Huang, C., Chawla, N.V. (2017). Beyond Volume: The Impact of Complex Healthcare Data on the Machine Learning Pipeline. In: Holzinger, A., Goebel, R., Ferri, M., Palade, V. (eds) Towards Integrative Machine Learning and Knowledge Extraction. Lecture Notes in Computer Science, vol 10344. Springer, Cham. https://doi.org/10.1007/978-3-319-69775-8_9

DOI: https://doi.org/10.1007/978-3-319-69775-8_9
Published: 29 October 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69774-1
Online ISBN: 978-3-319-69775-8
eBook Packages: Computer Science, Computer Science (R0)
