Big Health Data Mining

  • Chao Zhang
  • Shunfu Xu
  • Dong Xu
Part of the Health Information Science book series (HIS)


As infrastructure and techniques improve, “Big Data” offers great opportunities for health informatics while posing unprecedented challenges for data scientists. Health informatics is an interdisciplinary field, and health data are no longer limited to electronic health records (EHRs): molecular-level data are increasingly used for disease diagnosis and prognosis in the clinic. Integrating and mining these data effectively is key to translating theoretical models into clinical applications in precision medicine. In this chapter, we briefly describe the different levels of data involved in health informatics. We then introduce some general data mining approaches applied to each level of health data. Finally, we present a case study that applies computational methods to mining long-term EHR data in epidemiological studies.



Copyright information

© Springer International Publishing Switzerland 2017

Authors and Affiliations

  1. Institute for Computational Biomedicine, Weill Cornell Medicine, New York, USA
  2. Division of Hematology and Medical Oncology, Department of Medicine, Weill Cornell Medicine, New York, USA
  3. Department of Gastroenterology, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China
  4. Department of Computer Science and C.S. Bond Life Sciences Center, University of Missouri, Columbia, USA
