Abstract
With exabytes of data being generated by genome sequencing, a new science of genomics big data has emerged. As technology improves, the cost of sequencing a human genome has fallen considerably, increasing the number of genomes being sequenced. The resulting volume of genomic data, together with a wide variety of clinical data, cannot be handled using existing frameworks and techniques. It has to be stored efficiently in a warehouse, which requires a number of things to be taken into account. Firstly, the genome data has to be integrated effectively and correctly with clinical data. The other data sources, along with their formats, have to be identified. The required data then has to be extracted from these sources (such as clinical data sets) and integrated with the genome data. The main challenge here is handling the integration complexity, as a large number of data sets are being integrated with huge amounts of genome data. Secondly, since the data is captured individually at disparate locations by clinicians and scientists, data consistency becomes a challenge. Checks have to be put in place to make sure the data remains consistent from start to finish as it passes through the warehouse. Thirdly, to carry this out effectively, the data infrastructure has to be set up correctly. How frequently the data is accessed plays a crucial role here: data in frequent use will be handled differently from data that is rarely accessed. Lastly, efficient browsing mechanisms have to be put in place to allow the data to be retrieved quickly. The data is then analyzed iteratively to obtain meaningful insights. The challenge here is to perform the analysis very quickly. Cloud computing plays an important role, as it is used to provide scalability.
Copyright information
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Anjum, A. et al. (2017). Big Data Analytics in Healthcare: A Cloud-Based Framework for Generating Insights. In: Antonopoulos, N., Gillam, L. (eds) Cloud Computing. Computer Communications and Networks. Springer, Cham. https://doi.org/10.1007/978-3-319-54645-2_6
DOI: https://doi.org/10.1007/978-3-319-54645-2_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-54644-5
Online ISBN: 978-3-319-54645-2
eBook Packages: Computer Science, Computer Science (R0)