Big Data Analytics in Healthcare: A Cloud-Based Framework for Generating Insights

Cloud Computing

Abstract

With exabytes of data being generated by genome sequencing, a whole new science of genomics big data has emerged. As technology improves, the cost of sequencing a human genome has fallen considerably, increasing the number of genomes being sequenced. Existing frameworks and techniques cannot handle such huge volumes of genomics data together with a wide variety of clinical data. The data must be stored efficiently in a warehouse, and several things have to be taken into account. Firstly, the genome data must be integrated effectively and correctly with clinical data. The other data sources and their formats have to be identified; the required data is then extracted from these sources (such as clinical data sets) and integrated with the genome data. The main challenge here is handling the integration complexity, as a large number of data sets are being integrated with huge amounts of genome data. Secondly, since the data is captured individually at disparate locations by clinicians and scientists, data consistency becomes a challenge. Checks have to be put in place to make sure the data remains consistent from start to finish as it passes through the warehouse. Thirdly, to carry this out effectively, the data infrastructure has to be organized appropriately. How frequently the data is accessed plays a crucial role here: data in frequent use is handled differently from data that is rarely used. Lastly, efficient browsing mechanisms have to be put in place so that the data can be retrieved quickly. The data is then iteratively analyzed to obtain meaningful insights. The challenge here is to perform the analysis very quickly. Cloud computing plays an important role, as it provides the required scalability.
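The first three concerns in the abstract (integration, consistency checks, and access-frequency-based handling) can be illustrated with a minimal sketch. This is not the chapter's framework. The record layout, the `patient_id` join key, the function names, the sample values, and the `hot_threshold` cutoff are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

import hashlib
import json
from collections import Counter


def checksum(record: dict) -> str:
    """Return a SHA-256 digest of a record, independent of key order.

    Comparing digests before and after a pipeline stage is one simple way
    to detect that a record was altered in transit (assumed approach).
    """
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def integrate(genomic: list[dict], clinical: list[dict]) -> list[dict]:
    """Join genomic and clinical records on a shared (hypothetical) patient_id."""
    clinical_by_id = {row["patient_id"]: row for row in clinical}
    merged = []
    for g in genomic:
        c = clinical_by_id.get(g["patient_id"])
        if c is None:
            continue  # no matching clinical record; skipped in this sketch
        merged.append({**c, **g})
    return merged


def tier(access_log: list[str], hot_threshold: int = 2) -> dict[str, str]:
    """Label each accessed patient_id 'hot' or 'cold' by access count."""
    counts = Counter(access_log)
    return {pid: ("hot" if n >= hot_threshold else "cold")
            for pid, n in counts.items()}


# Made-up sample data, for illustration only.
genomic = [{"patient_id": "p1", "variant": "v-001"},
           {"patient_id": "p3", "variant": "v-002"}]
clinical = [{"patient_id": "p1", "diagnosis": "example-dx"},
            {"patient_id": "p2", "diagnosis": "other-dx"}]

merged = integrate(genomic, clinical)
tiers = tier(["p1", "p1", "p2"])
```

A real warehouse would need far more than this (format identification, schema mapping, distributed storage), but the sketch shows where a join key, a digest comparison, and an access counter would each fit.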

Corresponding author

Correspondence to Ashiq Anjum.

Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Anjum, A. et al. (2017). Big Data Analytics in Healthcare: A Cloud-Based Framework for Generating Insights. In: Antonopoulos, N., Gillam, L. (eds) Cloud Computing. Computer Communications and Networks. Springer, Cham. https://doi.org/10.1007/978-3-319-54645-2_6

  • DOI: https://doi.org/10.1007/978-3-319-54645-2_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-54644-5

  • Online ISBN: 978-3-319-54645-2

  • eBook Packages: Computer Science, Computer Science (R0)
