Abstract
All fields of science are now inundated with massive amounts of data, which have the potential to answer fundamental questions. Genomics is one particular example, exploring questions like: How does the human genome work? What genome variants make us more prone to diseases? To find answers to these questions, it is crucial to develop statistical and machine learning methods that can scale up, particularly through efficient data storage and communication. Equally crucial, but less emphasized, is the possession of data wisdom—a rebranding of the best elements of applied statistics in a recent note at ODBMS.org (http://www.odbms.org/2015/04/data-wisdom-for-data-science/). The note at ODBMS.org contains ten sets of questions a practitioner can ask to cultivate data wisdom. Although there has been much recent excitement about big data, having enough data relevant to the problem is the key to gaining meaningful answers in genomics. Data wisdom gives us the insight into how these data would look, how much information a dataset really contains, and how to extract it. In this paper, we expand on the ten sets of questions and illustrate where and how data wisdom can be integrated into computational genomics research.
Notes
Sulfur-rich amino acid-containing compounds that become active in response to tissue damage and are believed to offer a protective function
Compounds of diverse biological activities such as anti-oxidants, functioning in UV protection, in defense, in auxin transport inhibition, and in flower coloring
Acknowledgements
The authors thank Christine Ho, Elizabeth Purdom, Terry Speed, Karl Kumbier, Rachel Wang and Courtney Schiffman for their constructive comments and help on revising the manuscript. The work is partly supported by NIH U01-HG007031 and NSF DMS-1160319. This work is also supported in part by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-0939370.
Cite this article
Huang, H., Yu, B. Data Wisdom in Computational Genomics Research. Stat Biosci 9, 646–661 (2017). https://doi.org/10.1007/s12561-016-9173-9
DOI: https://doi.org/10.1007/s12561-016-9173-9