Data Wisdom in Computational Genomics Research

Statistics in Biosciences

Abstract

All fields of science are now inundated with massive amounts of data, which have the potential to answer fundamental questions. Genomics is one particular example, exploring questions like: How does the human genome work? What genome variants make us more prone to diseases? To find answers to these questions, it is crucial to develop statistical and machine learning methods that can scale up, particularly through efficient data storage and communication. Equally crucial, but less emphasized, is the possession of data wisdom—a rebranding of the best elements of applied statistics in a recent note at ODBMS.org (http://www.odbms.org/2015/04/data-wisdom-for-data-science/). The note at ODBMS.org contains ten sets of questions a practitioner can ask to cultivate data wisdom. Although there has been much recent excitement about big data, having enough data relevant to the problem is the key to gaining meaningful answers in genomics. Data wisdom gives us the insight into how these data would look, how much information a dataset really contains, and how to extract it. In this paper, we expand on the ten sets of questions and illustrate where and how data wisdom can be integrated into computational genomics research.

Notes

  1. http://www.odbms.org/2015/04/data-wisdom-for-data-science/

  2. www.arabidopsis.org/portals/expression/microarray/ATGenExpress.jsp

  3. Sulfur-rich, amino acid-containing compounds that become active in response to tissue damage and are believed to offer a protective function.

  4. Compounds with diverse biological activities, such as acting as antioxidants and functioning in UV protection, defense, auxin transport inhibition, and flower coloring.

Acknowledgements

The authors thank Christine Ho, Elizabeth Purdom, Terry Speed, Karl Kumbier, Rachel Wang and Courtney Schiffman for their constructive comments and help on revising the manuscript. The work is partly supported by NIH U01-HG007031 and NSF DMS-1160319. This work is also supported in part by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-0939370.

Author information

Corresponding authors

Correspondence to Haiyan Huang or Bin Yu.

Cite this article

Huang, H., Yu, B. Data Wisdom in Computational Genomics Research. Stat Biosci 9, 646–661 (2017). https://doi.org/10.1007/s12561-016-9173-9

  • DOI: https://doi.org/10.1007/s12561-016-9173-9
