Data Wisdom in Computational Genomics Research

Huang, Haiyan; Yu, Bin

doi:10.1007/s12561-016-9173-9

Published: 21 August 2017

Volume 9, pages 646–661, (2017)

Statistics in Biosciences

Haiyan Huang¹ & Bin Yu¹,²

Abstract

All fields of science are now inundated with massive amounts of data, which have the potential to answer fundamental questions. Genomics is one particular example, exploring questions like: How does the human genome work? What genome variants make us more prone to diseases? To find answers to these questions, it is crucial to develop statistical and machine learning methods that can scale up, particularly through efficient data storage and communication. Equally crucial, but less emphasized, is the possession of data wisdom—a rebranding of the best elements of applied statistics in a recent note at ODBMS.org (http://www.odbms.org/2015/04/data-wisdom-for-data-science/). The note at ODBMS.org contains ten sets of questions a practitioner can ask to cultivate data wisdom. Although there has been much recent excitement about big data, having enough data relevant to the problem is the key to gaining meaningful answers in genomics. Data wisdom gives us the insight into how these data would look, how much information a dataset really contains, and how to extract it. In this paper, we expand on the ten sets of questions and illustrate where and how data wisdom can be integrated into computational genomics research.

Notes

http://www.odbms.org/2015/04/data-wisdom-for-data-science/
www.arabidopsis.org/portals/expression/microarray/ATGenExpress.jsp.
Sulfur-rich amino acid-containing compounds that become active in response to tissue damage and are believed to offer a protective function.
Compounds with diverse biological activities, such as acting as antioxidants and functioning in UV protection, defense, auxin transport inhibition, and flower coloring.

Acknowledgements

The authors thank Christine Ho, Elizabeth Purdom, Terry Speed, Karl Kumbier, Rachel Wang and Courtney Schiffman for their constructive comments and help on revising the manuscript. The work is partly supported by NIH U01-HG007031 and NSF DMS-1160319. This work is also supported in part by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-0939370.

Author information

Authors and Affiliations

Department of Statistics, University of California, Berkeley, USA
Haiyan Huang & Bin Yu
Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA
Bin Yu

Corresponding authors

Correspondence to Haiyan Huang or Bin Yu.

About this article

Cite this article

Huang, H., Yu, B. Data Wisdom in Computational Genomics Research. Stat Biosci 9, 646–661 (2017). https://doi.org/10.1007/s12561-016-9173-9

Received: 13 September 2016
Accepted: 24 September 2016
Published: 21 August 2017
Issue Date: December 2017
DOI: https://doi.org/10.1007/s12561-016-9173-9
