Big Data Analytics in Healthcare: A Cloud-Based Framework for Generating Insights

Anjum, Ashiq; Aizad, Sanna; Arshad, Bilal; Subhani, Moeez; Davies-Tagg, Dominic; Abdullah, Tariq; Antonopoulos, Nikolaos

doi:10.1007/978-3-319-54645-2_6

Part of the book series: Computer Communications and Networks (CCN)

Abstract

With exabytes of data being generated by genome sequencing, a new science of genomics big data has emerged. As technology improves, the cost of sequencing a human genome has fallen considerably, increasing the number of genomes being sequenced. Existing frameworks and techniques cannot handle such huge volumes of genomics data together with a wide variety of clinical data. The data must be stored efficiently in a warehouse, and several things have to be taken into account. Firstly, the genome data must be integrated effectively and correctly with clinical data. The other data sources and their formats have to be identified; the required data is then extracted from these sources (such as clinical data sets) and integrated with the genome data. The main challenge here is handling the integration complexity, since a large number of data sets are being integrated with huge amounts of genome data. Secondly, because the data is captured individually by clinicians and scientists at disparate locations, data consistency becomes a challenge. Checks have to be put in place to ensure that the data remains consistent from start to finish as it passes through the warehouse. Thirdly, to carry this out effectively, the data infrastructure has to be organized appropriately. How frequently the data is accessed plays a crucial role here: frequently used data is handled differently from data that is rarely used. Lastly, efficient browsing mechanisms have to be put in place so that the data can be retrieved quickly. The data is then analyzed iteratively to obtain meaningful insights. The challenge here is to perform the analysis very quickly. Cloud computing plays an important role because it provides scalability.
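The first three concerns in the abstract (integration, consistency checks, and handling data by access frequency) can be illustrated with a toy sketch. This is not the chapter's framework. It is a minimal Python illustration under stated assumptions: the field names (`patient_id`, `variant`, `diagnosis`), the sample values, the SHA-256 digest as the consistency check, and the `hot_threshold` cut-off are all hypothetical choices made for this example.

```python
import hashlib
import json
from collections import Counter


def checksum(record: dict) -> str:
    """Return a stable SHA-256 digest of a record (assumed consistency check)."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def integrate(genome_rows, clinical_rows, key="patient_id"):
    """Join genome rows to clinical rows on a shared key (hypothetical field)."""
    clinical_by_key = {row[key]: row for row in clinical_rows}
    merged = []
    for genome_row in genome_rows:
        clinical_row = clinical_by_key.get(genome_row[key])
        if clinical_row is not None:  # keep only records present in both sources
            merged.append({**clinical_row, **genome_row})
    return merged


def assign_tier(access_log, hot_threshold=3):
    """Label each key 'hot' or 'cold' by how often it appears in the access log."""
    counts = Counter(access_log)
    return {k: ("hot" if n >= hot_threshold else "cold") for k, n in counts.items()}


# Invented sample data, for illustration only.
genome = [
    {"patient_id": "p1", "variant": "rs123"},
    {"patient_id": "p2", "variant": "rs456"},
]
clinical = [{"patient_id": "p1", "diagnosis": "T2D"}]

records = integrate(genome, clinical)

# Digests taken at ingestion can be recomputed at any later stage of the
# warehouse; a mismatch signals that a record changed along the way.
digests_at_ingestion = [checksum(r) for r in records]
digests_downstream = [checksum(r) for r in records]
consistent = digests_at_ingestion == digests_downstream

tiers = assign_tier(["p1", "p1", "p1", "p2"])
```

In a real warehouse the join would run on a distributed store and the tier labels would drive placement on faster or slower storage; the sketch only shows the shape of each step.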

Author information

Authors and Affiliations

College of Engineering and Technology, University of Derby, Kedleston Road, DE22 1GB, Derby, UK
Ashiq Anjum, Sanna Aizad, Bilal Arshad, Moeez Subhani, Dominic Davies-Tagg, Tariq Abdullah & Nikolaos Antonopoulos

Corresponding author

Correspondence to Ashiq Anjum.

Editor information

Editors and Affiliations

University of Derby, Derby, Derbyshire, United Kingdom
Nick Antonopoulos
University of Surrey, Guildford, Surrey, United Kingdom
Lee Gillam

About this chapter

Cite this chapter

Anjum, A. et al. (2017). Big Data Analytics in Healthcare: A Cloud-Based Framework for Generating Insights. In: Antonopoulos, N., Gillam, L. (eds) Cloud Computing. Computer Communications and Networks. Springer, Cham. https://doi.org/10.1007/978-3-319-54645-2_6

DOI: https://doi.org/10.1007/978-3-319-54645-2_6
Published: 03 June 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-54644-5
Online ISBN: 978-3-319-54645-2
eBook Packages: Computer Science, Computer Science (R0)
