Guidelines for Bioinformatics and the Statistical Analysis of Omic Data

  • Surajit Bhattacharya
  • Heather Gordish-Dressman
Part of the Methods in Physiology book series (METHPHYS)


This chapter is a resource for those designing omics experiments and those analyzing the data from such experiments. It is organized into two parts, one with a focus on bioinformatics tools and techniques, and the other with a focus on statistical analyses. It is intended to be a high-level instructional chapter for those who are interested in performing their own analyses, not a comprehensive discussion of either area. The first section discusses the bioinformatics tools and algorithms used in genomics and transcriptomics. It describes typical workflows and the tools available for performing an omic experiment and underscores the importance of both the tools being used and a clear understanding of the underlying algorithm. The second section describes general study design principles that should be taken into account before an experiment is begun. It describes some basic principles of statistical analysis and commonly used methods. It is not a comprehensive discussion of statistical theory nor does it describe more complex statistical models. The guidance of a statistician is advised for complex study designs, hypotheses, or statistical models.


  1. 1.
    Hood, L., & Galas, D. (2003). The digital code of DNA. Nature, 421(6921), 444–448.PubMedCrossRefPubMedCentralGoogle Scholar
  2. 2.
    Dahm, R. (2008). Discovering DNA: Friedrich Miescher and the early years of nucleic acid research. Human Genetics, 122(6), 565–581.PubMedCrossRefPubMedCentralGoogle Scholar
  3. 3.
    Levy, S. E., & Myers, R. M. (2016). Advancements in next-generation sequencing. Annual Review of Genomics and Human Genetics, 17(1), 95–115.PubMedCrossRefPubMedCentralGoogle Scholar
  4. 4.
    Reis-Filho, J. S. (2009). Next-generation sequencing. Breast Cancer Research, 11(S3), S12.PubMedCrossRefPubMedCentralGoogle Scholar
  5. 5.
    Cock, P. J. A., Fields, C. J., Goto, N., Heuer, M. L., & Rice, P. M. (2010). The sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research, 38(6), 1767–1771.PubMedCrossRefPubMedCentralGoogle Scholar
  6. 6.
    Ewing, B., Hillier, L., Wendl, M. C., & Green, P. (1998). Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Research, 8(3), 175–185.PubMedCrossRefPubMedCentralGoogle Scholar
  7. 7.
    Ewing, B., & Green, P. (1998). Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research, 8(3), 186–194.PubMedCrossRefPubMedCentralGoogle Scholar
  8. 8.
    Andrews, S. (2010). FastQC a quality control tool for high throughput sequence data. Retrieved November 25, 2018 from
  9. 9.
    Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet Journal, 17(1), 10.CrossRefGoogle Scholar
  10. 10.
    Joshi, N. A., & Fass, J. N. (2011). Sickle: A sliding-window, adaptive, quality-based trimming tool for FastQ files.Google Scholar
  11. 11.
    Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14), 1754–1760.PubMedPubMedCentralCrossRefGoogle Scholar
  12. 12.
    Adjeroh, D., Bell, T., & Mukherjee, A. (2008). The Burrows-Wheeler transform: Data compression, suffix arrays, and pattern matching. New York: Springer.CrossRefGoogle Scholar
  13. 13.
    Lam, T. W., Sung, W. K., Tam, S. L., Wong, C. K., & Yiu, S. M. (2008). Compressed indexing and local alignment of DNA. Bioinformatics, 24(6), 791–797.PubMedCrossRefPubMedCentralGoogle Scholar
  14. 14.
    Li, H., et al. (2009). The sequence alignment/map format and SAMtools. Bioinformatics, 25(16), 2078–2079.PubMedPubMedCentralCrossRefGoogle Scholar
  15. 15.
    McKenna, A., et al. (2010). The genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9), 1297–1303.PubMedPubMedCentralCrossRefGoogle Scholar
  16. 16.
    Garrison, E., & Marth, G. (2016). Haplotype-based variant detection from short-read sequencing.Google Scholar
  17. 17.
    Kobayashi, M., et al. (2017). Heap: A highly sensitive and accurate SNP detection tool for low-coverage high-throughput sequencing data. DNA Research, 24(4), 397–405.PubMedPubMedCentralCrossRefGoogle Scholar
  18. 18.
    Tattini, L., D’Aurizio, R., & Magi, A. (2015). Detection of genomic structural variants from next-generation sequencing data. Frontiers in Bioengineering and Biotechnology, 3, 92.PubMedPubMedCentralCrossRefGoogle Scholar
  19. 19.
    Chen, K., et al. (2009). BreakDancer: An algorithm for high-resolution mapping of genomic structural variation. Nature Methods, 6(9), 677–681.PubMedPubMedCentralCrossRefGoogle Scholar
  20. 20.
    Korbel, J. O., et al. (2009). PEMer: A computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data. Genome Biology, 10(2), R23.PubMedPubMedCentralCrossRefGoogle Scholar
  21. 21.
    Lee, S., Hormozdiari, F., Alkan, C., & Brudno, M. (2009). MoDIL: Detecting small indels from clone-end sequencing with mixtures of distributions. Nature Methods, 6(7), 473–474.PubMedCrossRefPubMedCentralGoogle Scholar
  22. 22.
    Magi, A., Tattini, L., Pippucci, T., Torricelli, F., & Benelli, M. (2012). Read count approach for DNA copy number variants detection. Bioinformatics, 28(4), 470–478.PubMedCrossRefPubMedCentralGoogle Scholar
  23. 23.
    Magi, A., et al. (2013). EXCAVATOR: Detecting copy number variants from whole-exome sequencing data. Genome Biology, 14(10), R120.PubMedPubMedCentralCrossRefGoogle Scholar
  24. 24.
    Abyzov, A., Urban, A. E., Snyder, M., & Gerstein, M. (2011). CNVnator: An approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Research, 21(6), 974–984.PubMedPubMedCentralCrossRefGoogle Scholar
  25. 25.
    Schröder, J., et al. (2014). Socrates: Identification of genomic rearrangements in tumour genomes by re-aligning soft clipped reads. Bioinformatics, 30(8), 1064–1072.PubMedPubMedCentralCrossRefGoogle Scholar
  26. 26.
    Karakoc, E., et al. (2012). Detection of structural variants and indels within exome data. Nature Methods, 9(2), 176–178.CrossRefGoogle Scholar
  27. 27.
    Earl, D., et al. (2011). Assemblathon 1: A competitive assessment of de novo short read assembly methods. Genome Research, 21(12), 2224–2241.PubMedPubMedCentralCrossRefGoogle Scholar
  28. 28.
    Iqbal, Z., Caccamo, M., Turner, I., Flicek, P., & McVean, G. (2012). De novo assembly and genotyping of variants using colored de Bruijn graphs. Nature Genetics, 44(2), 226–232.PubMedPubMedCentralCrossRefGoogle Scholar
  29. 29.
    Nijkamp, J. F., van den Broek, M. A., Geertman, J.-M. A., Reinders, M. J. T., Daran, J.-M. G., & de Ridder, D. (2012). De novo detection of copy number variation by co-assembly. Bioinformatics, 28(24), 3195–3202.PubMedCrossRefPubMedCentralGoogle Scholar
  30. 30.
    Rausch, T., Zichner, T., Schlattl, A., Stutz, A. M., Benes, V., & Korbel, J. O. (2012). DELLY: Structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics, 28(18), i333–i339.PubMedPubMedCentralCrossRefGoogle Scholar
  31. 31.
    Layer, R. M., Chiang, C., Quinlan, A. R., & Hall, I. M. (2014). LUMPY: a probabilistic framework for structural variant discovery. Genome Biology, 15(6), R84.PubMedPubMedCentralCrossRefGoogle Scholar
  32. 32.
    Wong, K., Keane, T. M., Stalker, J., & Adams, D. J. (2010). Enhanced structural variant and breakpoint detection using SVMerge by integration of multiple detection methods and local assembly. Genome Biology, 11(12), R128.PubMedPubMedCentralCrossRefGoogle Scholar
  33. 33.
    Jeffares, D. C., et al. (2017). Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast. Nature Communications, 8, 14061.PubMedPubMedCentralCrossRefGoogle Scholar
  34. 34.
    English, A. C., et al. (2015). Assessing structural variation in a personal genome—towards a human reference diploid genome. BMC Genomics, 16(1), 286.PubMedPubMedCentralCrossRefGoogle Scholar
  35. 35.
    Wang, K., Li, M., & Hakonarson, H. (2010). ANNOVAR: Functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Research, 38(16), e164.PubMedPubMedCentralCrossRefGoogle Scholar
  36. 36.
    Sherry, S. T., et al. (2001). dbSNP: The NCBI database of genetic variation. Nucleic Acids Research, 29(1), 308–311.PubMedPubMedCentralCrossRefGoogle Scholar
  37. 37.
    MacDonald, J. R., Ziman, R., Yuen, R. K. C., Feuk, L., & Scherer, S. W. (2014). The database of genomic variants: A curated collection of structural variation in the human genome. Nucleic Acids Research, 42(Database issue), D986–D992.PubMedCrossRefPubMedCentralGoogle Scholar
  38. 38.
    Landrum, M. J., et al. (2018). ClinVar: Improving access to variant interpretations and supporting evidence. Nucleic Acids Research, 46(D1), D1062–D1067.PubMedCrossRefPubMedCentralGoogle Scholar
  39. 39.
    Rentzsch, P., Witten, D., Cooper, G. M., Shendure, J., & Kircher, M. (2018). CADD: Predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Research, 47(D1), D886–D894.PubMedCentralCrossRefGoogle Scholar
  40. 40.
    Kircher, M., Witten, D. M., Jain, P., O’Roak, B. J., Cooper, G. M., & Shendure, J. (2014). A general framework for estimating the relative pathogenicity of human genetic variants. Nature Genetics, 46(3), 310–315.PubMedPubMedCentralCrossRefGoogle Scholar
  41. 41.
    Cingolani, P., et al. (2012). A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff. Fly, 6(2), 80–92.PubMedPubMedCentralCrossRefGoogle Scholar
  42. 42.
    Cingolani, P., et al. (2012). Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift. Frontiers in Genetics, 3, 35.PubMedPubMedCentralCrossRefGoogle Scholar
  43. 43.
    Geoffroy, V., et al. (2018). AnnotSV: An integrated tool for structural variations annotation. Bioinformatics, 34(20), 3572–3574.PubMedCrossRefPubMedCentralGoogle Scholar
  44. 44.
    Freeman, W. M., Walker, S. J., & Vrana, K. E. (1999). Quantitative RT-PCR: Pitfalls and potential. BioTechniques, 26(1), 112–125.PubMedCrossRefPubMedCentralGoogle Scholar
  45. 45.
    Bumgarner, R. (2013). Overview of DNA microarrays: Types, applications, and their future. Current Protocols in Molecular Biology, 101(1), 22–21.Google Scholar
  46. 46.
    Solomon, M. J., Larsen, P. L., & Varshavsky, A. (1988). Mapping protein-DNA interactions in vivo with formaldehyde: Evidence that histone H4 is retained on a highly transcribed gene. Cell, 53(6), 937–947.PubMedCrossRefPubMedCentralGoogle Scholar
  47. 47.
    Van Gelder, R. N., von Zastrow, M. E., Yool, A., Dement, W. C., Barchas, J. D., & Eberwine, J. H. (1990). Amplified RNA synthesized from limited quantities of heterogeneous cDNA. Proceedings of the National Academy of Sciences of the United States of America, 87(5), 1663–1667.PubMedPubMedCentralCrossRefGoogle Scholar
  48. 48.
    Shalon, D., Smith, S. J., & Brown, P. O. (1996). A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization. Genome Research, 6(7), 639–645.PubMedCrossRefPubMedCentralGoogle Scholar
  49. 49.
    Ritchie, M. E., et al. (2015). Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7), e47.PubMedPubMedCentralCrossRefGoogle Scholar
  50. 50.
    Gautier, L., Cope, L., Bolstad, B. M., & Irizarry, R. A. (2004). Affy--analysis of Affymetrix GeneChip data at the probe level. Bioinformatics, 20(3), 307–315.PubMedCrossRefPubMedCentralGoogle Scholar
  51. 51.
    Dunning, M. J., Smith, M. L., Ritchie, M. E., & Tavare, S. (2007). Beadarray: R classes and methods for Illumina bead-based data. Bioinformatics, 23(16), 2183–2184.PubMedCrossRefPubMedCentralGoogle Scholar
  52. 52.
    Bolstad, B. M., Irizarry, R., Astrand, M., & Speed, T. P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2), 185–193.CrossRefGoogle Scholar
  53. 53.
    Carvalho, B. S., & Irizarry, R. A. (2010). A framework for oligonucleotide microarray preprocessing. Bioinformatics, 26(19), 2363–2367.PubMedPubMedCentralCrossRefGoogle Scholar
  54. 54.
    Warnes, G. R., Bolker, B., Bonebakker, L., Gentleman, R., Huber, W., & Liaw, A. (2009). gplots: Various R programming tools for plotting data. R Packag. version 2.Google Scholar
  55. 55.
    Student. (1908). The probable error of a mean. Biometrika. Retreived May 07, 2016, from
  56. 56.
    Fisher, R. A. (1919). XV.—The correlation between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society of Edinburgh, 52(02), 399–433.CrossRefGoogle Scholar
  57. 57.
    Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57(1), 289–300.Google Scholar
  58. 58.
    Bonferroni, C. (1936). Teoria statistica delle classi e calcolo delle probabilita. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commericiali di Firenze, 8, 3–62.Google Scholar
  59. 59.
    Schadt, E. E., Turner, S., & Kasarskis, A. (2010). A window into third-generation sequencing. Human Molecular Genetics, 19(R2), R227–R240.PubMedCrossRefPubMedCentralGoogle Scholar
  60. 60.
    Mikheyev, A. S., & Tin, M. M. Y. (2014). A first look at the Oxford Nanopore MinION sequencer. Molecular Ecology Resources, 14(6), 1097–1102.PubMedCrossRefPubMedCentralGoogle Scholar
  61. 61.
    Eisenstein, M. (2012). Oxford Nanopore announcement sets sequencing sector abuzz. Nature Biotechnology, 30(4), 295–296.PubMedCrossRefPubMedCentralGoogle Scholar
  62. 62.
    Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., & Salzberg, S. L. (2013). TopHat2: Accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology, 14(4), R36.PubMedPubMedCentralCrossRefGoogle Scholar
  63. 63.
    Trapnell, C., et al. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols, 7(3), 562–578.PubMedPubMedCentralCrossRefGoogle Scholar
  64. 64.
    Trapnell, C., et al. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology, 28(5), 511–515.PubMedPubMedCentralCrossRefGoogle Scholar
  65. 65.
    Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), 357–359.PubMedPubMedCentralCrossRefGoogle Scholar
  66. 66.
    Ferragina, P., & Manzini, G. (2001). An experimental study of a compressed index. Information Sciences, 135(1–2), 13–28.CrossRefGoogle Scholar
  67. 67.
    Dobin, A., et al. (2013). STAR: Ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), 15–21.PubMedPubMedCentralCrossRefGoogle Scholar
  68. 68.
    Kim, D., Langmead, B., & Salzberg, S. L. (2015). HISAT: A fast spliced aligner with low memory requirements. Nature Methods, 12(4), 357–360.PubMedPubMedCentralCrossRefGoogle Scholar
  69. 69.
    Pertea, M., Kim, D., Pertea, G. M., Leek, J. T., & Salzberg, S. L. (2016). Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nature Protocols, 11(9), 1650–1667.PubMedPubMedCentralCrossRefGoogle Scholar
  70. 70.
    Pertea, M., Pertea, G. M., Antonescu, C. M., Chang, T.-C., Mendell, J. T., & Salzberg, S. L. (2015). StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nature Biotechnology, 33(3), 290–295.PubMedPubMedCentralCrossRefGoogle Scholar
  71. 71.
    Frazee, A. C., Pertea, G., Jaffe, A. E., Langmead, B., Salzberg, S. L., & Leek, J. T. (2015). Ballgown bridges the gap between transcriptome assembly and expression analysis. Nature Biotechnology, 33(3), 243–246.PubMedPubMedCentralCrossRefGoogle Scholar
  72. 72.
    Wang, L., Wang, S., & Li, W. (2012). RSeQC: Quality control of RNA-seq experiments. Bioinformatics, 28(16), 2184–2185.PubMedCrossRefPubMedCentralGoogle Scholar
  73. 73.
    Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., & Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods, 5(7), 621–628.PubMedCrossRefPubMedCentralGoogle Scholar
  74. 74.
    Li, B., Ruotti, V., Stewart, R. M., Thomson, J. A., & Dewey, C. N. (2010). RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics, 26(4), 493–500.PubMedCrossRefPubMedCentralGoogle Scholar
  75. 75.
    Li, B., & Dewey, C. N. (2011). RSEM: Accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 12(1), 323.PubMedPubMedCentralCrossRefGoogle Scholar
  76. 76.
    Dempster, A. P., Laird, N. M., & Rubin, D. B. (1976). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1–22.Google Scholar
  77. 77.
    Anders, S., Pyl, P. T., & Huber, W. (2015). HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics, 31(2), 166–169.PubMedPubMedCentralCrossRefGoogle Scholar
  78. 78.
    Liao, Y., Smyth, G. K., & Shi, W. (2014). featureCounts: An efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics, 30(7), 923–930.PubMedPubMedCentralCrossRefGoogle Scholar
  79. 79.
    Lawrence, M., et al. (2013). Software for computing and annotating genomic ranges. PLoS Computational Biology, 9(8), e1003118.PubMedPubMedCentralCrossRefGoogle Scholar
  80. 80.
    Soneson, C., Love, M. I., & Robinson, M. D. (2015). Differential analyses for RNA-seq: Transcript-level estimates improve gene-level inferences. F1000Research, 4, 1521.PubMedCrossRefPubMedCentralGoogle Scholar
  81. 81.
    Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: A bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139–140.PubMedPubMedCentralCrossRefGoogle Scholar
  82. 82.
    Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550.PubMedPubMedCentralCrossRefGoogle Scholar
  83. 83.
    Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society, Series A, 135(3), 370.CrossRefGoogle Scholar
  84. 84.
    Wald, A. (1945). Sequential tests of statistical hypotheses. Annals of Mathematical Statistics, 16(2), 117–186.CrossRefGoogle Scholar
  85. 85.
    Feng, J., Meyer, C. A., Wang, Q., Liu, J. S., Shirley Liu, X., & Zhang, Y. (2012). GFOLD: A generalized fold change for ranking differentially expressed genes from RNA-seq data. Bioinformatics, 28(21), 2782–2788.PubMedCrossRefPubMedCentralGoogle Scholar
  86. 86.
    Tarazona, S., et al. (2015). Data quality aware analysis of differential expression in RNA-seq with NOISeq R/Bioc package. Nucleic Acids Research, 43(21), e140.PubMedPubMedCentralGoogle Scholar
  87. 87.
    Toedling, J., & Huber, W. (2008). Analyzing ChIP-chip data using bioconductor. PLoS Computational Biology, 4(11), e1000227.PubMedPubMedCentralCrossRefGoogle Scholar
  88. 88.
    Toedling, J., Sklyar, O., & Huber, W. (2007). Ringo – an R/bioconductor package for analyzing ChIP-chip readouts. BMC Bioinformatics, 8(1), 221.PubMedPubMedCentralCrossRefGoogle Scholar
  89. 89.
    Durinck, S., et al. (2005). BioMart and bioconductor: A powerful link between biological databases and microarray data analysis. Bioinformatics, 21(16), 3439–3440.PubMedCrossRefPubMedCentralGoogle Scholar
  90. 90.
    Alexa, A., Rahnenfuhrer, J., & Lengauer, T. (2006). Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics, 22(13), 1600–1607.PubMedCrossRefPubMedCentralGoogle Scholar
  91. 91.
    Zhang, Y., et al. (2008). Model-based Analysis of ChIP-Seq (MACS). Genome Biology, 9(9), R137.PubMedPubMedCentralCrossRefGoogle Scholar
  92. 92.
    Xu, S., Grullon, S., Ge, K., & Peng, W. (2014). Spatial clustering for identification of ChIP-enriched regions (SICER) to map regions of histone methylation patterns in embryonic stem cells. Methods in Molecular Biology, 1150, 97.PubMedCrossRefPubMedCentralGoogle Scholar
  93. 93.
    Hayatsu, H. (2008). Discovery of bisulfite-mediated cytosine conversion to uracil, the key reaction for DNA methylation analysis – a personal account. Proceedings of the Japan Academy. Series B, Physical and Biological Sciences, 84(8), 321–330.PubMedPubMedCentralCrossRefGoogle Scholar
  94. 94.
    Morris, T. J., et al. (2014). ChAMP: 450k chip analysis methylation pipeline. Bioinformatics, 30(3), 428–430.PubMedCrossRefPubMedCentralGoogle Scholar
  95. 95.
    Tian, Y., et al. (2017). ChAMP: Updated methylation analysis pipeline for illumina BeadChips. Bioinformatics, 33(24), 3982–3984.PubMedPubMedCentralCrossRefGoogle Scholar
  96. 96.
    Aryee, M. J., et al. (2014). Minfi: A flexible and comprehensive bioconductor package for the analysis of infinium DNA methylation microarrays. Bioinformatics, 30(10), 1363–1369.PubMedPubMedCentralCrossRefGoogle Scholar
  97. 97.
    Leek, J. T., Johnson, W. E., Parker, H. S., Jaffe, A. E., & Storey, J. D. (2012). The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics, 28(6), 882–883.PubMedPubMedCentralCrossRefGoogle Scholar
  98. 98.
    Carson Sievert, P. T. I., Parmer, C., Hocking, T., Chamberlain, S., Ram, K., Corvellec, M., & Despouy, P. (2018). Create interactive web graphics via ‘plotly.js’ [R package plotly version 4.8.0]. Comprehensive R Archive Network (CRAN).Google Scholar
  99. 99.
    Krueger, F., & Andrews, S. R. (2011). Bismark: A flexible aligner and methylation caller for bisulfite-seq applications. Bioinformatics, 27(11), 1571–1572.PubMedPubMedCentralCrossRefGoogle Scholar
  100. 100.
    Chen, P.-Y., Cokus, S. J., & Pellegrini, M. (2010). BS Seeker: Precise mapping for bisulfite sequencing. BMC Bioinformatics, 11(1), 203.PubMedPubMedCentralCrossRefGoogle Scholar
  101. 101.
    Kreck, B., Marnellos, G., Richter, J., Krueger, F., Siebert, R., & Franke, A. (2012). B-SOLANA: An approach for the analysis of two-base encoding bisulfite sequencing data. Bioinformatics, 28(3), 428–429.PubMedCrossRefPubMedCentralGoogle Scholar
  102. 102.
    Frith, M. C., Mori, R., & Asai, K. (2012). A mostly traditional approach improves alignment of bisulfite-converted DNA. Nucleic Acids Research, 40(13), e100.PubMedPubMedCentralCrossRefGoogle Scholar
  103. 103.
    Saito, Y., Tsuji, J., & Mituyama, T. (2014). Bisulfighter: Accurate detection of methylated cytosines and differentially methylated regions. Nucleic Acids Research, 42(6), e45.PubMedPubMedCentralCrossRefGoogle Scholar
  104. 104.
    Xi, Y., & Li, W. (2009). BSMAP: Whole genome bisulfite sequence MAPping program. BMC Bioinformatics, 10(1), 232.PubMedPubMedCentralCrossRefGoogle Scholar
  105. 105.
    Assenov, Y., Müller, F., Lutsik, P., Walter, J., Lengauer, T., & Bock, C. (2014). Comprehensive analysis of DNA methylation data with RnBeads. Nature Methods, 11(11), 1138–1140.PubMedPubMedCentralCrossRefGoogle Scholar
  106. 106.
    Saito, Y., & Mituyama, T. (2015). Detection of differentially methylated regions from bisulfite-seq data by hidden Markov models incorporating genome-wide methylation level distributions. BMC Genomics, 16(Suppl 12), S3.PubMedPubMedCentralCrossRefGoogle Scholar
  107. 107.
    Song, Q., Decato, B., Hong, E. E., Zhou, M., & Fang, F. (2013). A reference methylome database and analysis pipeline to facilitate integrative and comparative epigenomics. PLoS One, 8(12), 81148.CrossRefGoogle Scholar
  108. 108.
    Hansen, K. D., Langmead, B., & Irizarry, R. A. (2012). BSmooth: From whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biology, 13(10), R83.PubMedPubMedCentralCrossRefGoogle Scholar
  109. 109.
    Hebestreit, K., Dugas, M., & Klein, H.-U. (2013). Detection of significantly differentially methylated regions in targeted bisulfite sequencing data. Bioinformatics, 29(13), 1647–1653.PubMedCrossRefPubMedCentralGoogle Scholar
  110. 110.
    Wreczycka, K., Gosdschan, A., Yusuf, D., Grüning, B., Assenov, Y., & Akalin, A. (2017). Strategies for analyzing bisulfite sequencing data. Journal of Biotechnology, 261, 105–115.PubMedCrossRefPubMedCentralGoogle Scholar
  111. 111.
    Tsuji, J., & Weng, Z. (2015). Evaluation of preprocessing, mapping and postprocessing algorithms for analyzing whole genome bisulfite sequencing data. Briefings in Bioinformatics, 17(6), bbv103.CrossRefGoogle Scholar
  112. 112.
    Eberwine, J., et al. (1992). Analysis of gene expression in single live neurons. Proceedings of the National Academy of Sciences of the United States of America, 89(7), 3010–3014.PubMedPubMedCentralCrossRefGoogle Scholar
  113. 113.
    Hwang, B., Lee, J. H., & Bang, D. (2018). Single-cell RNA sequencing technologies and bioinformatics pipelines. Experimental and Molecular Medicine, 50(8), 96.PubMedCrossRefPubMedCentralGoogle Scholar
  114. 114.
    Van Der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.Google Scholar
  115. 115.
    Butler, A., Hoffman, P., Smibert, P., Papalexi, E., & Satija, R. (2018). Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature Biotechnology, 36(5), 411–420.PubMedPubMedCentralCrossRefGoogle Scholar
  116. 116.
    Afgan, E., et al. (2018). The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Research, 46(W1), W537–W544.PubMedPubMedCentralCrossRefGoogle Scholar
  117. 117.
    Ashburner, M., et al. (2000). Gene ontology: Tool for the unification of biology. Nature Genetics, 25(1), 25–29.PubMedPubMedCentralCrossRefGoogle Scholar
  118. 118.
    Huang, D. W., Sherman, B. T., & Lempicki, R. A. (2009). Bioinformatics enrichment tools: Paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Research, 37(1), 1–13.CrossRefGoogle Scholar
  119. 119.
    Fisher, R. A. (1922). On the interpretation of χ2 from contingency tables, and the calculation of P. Journal of the Royal Statistical Society, 85(1), 87.CrossRefGoogle Scholar
  120. 120.
    Ludbrook, J. (2008). Analysis of 2 × 2 tables of frequencies: Matching test to experimental design. International Journal of Epidemiology, 37(6), 1430–1435.PubMedCrossRefPubMedCentralGoogle Scholar
  121. 121.
    Huang, D. W., Sherman, B. T., & Lempicki, R. A. (2009). Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protocols, 4(1), 44–57.CrossRefGoogle Scholar
  122. 122.
    Falcon, S., & Gentleman, R. (2007). Using GOstats to test gene lists for GO term association. Bioinformatics, 23(2), 257–258.PubMedCrossRefPubMedCentralGoogle Scholar
  123. 123.
    Maere, S., Heymans, K., & Kuiper, M. (2005). BiNGO: A cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics, 21(16), 3448–3449.PubMedCrossRefPubMedCentralGoogle Scholar
  124. 124.
    Subramanian, A., et al. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43), 15545–15550.CrossRefGoogle Scholar
  125. 125.
    Lee, H. K., Braynen, W., Keshav, K., & Pavlidis, P. (2005). ErmineJ: Tool for functional analysis of gene expression data sets. BMC Bioinformatics, 6(1), 269.PubMedPubMedCentralCrossRefGoogle Scholar
  126. 126.
    Al-Shahrour, F., et al. (2007). From genes to functional classes in the study of biological systems. BMC Bioinformatics, 8, 114.PubMedPubMedCentralCrossRefGoogle Scholar
  127. 127.
    Nam, D., Kim, S.-B., Kim, S.-K., Yang, S., Kim, S.-Y., & Chu, I.-S. (2006). ADGO: Analysis of differentially expressed gene sets using composite GO annotation. Bioinformatics, 22(18), 2249–2253.PubMedCrossRefPubMedCentralGoogle Scholar
  128. 128.
    Nogales-Cadenas, R., et al. (2009). GeneCodis: Interpreting gene lists through enrichment analysis and integration of diverse biological information. Nucleic Acids Research, 37(Web Server issue), W317–W322.PubMedPubMedCentralCrossRefGoogle Scholar
  129. 129.
    Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1), 27–30.PubMedPubMedCentralCrossRefGoogle Scholar
  130. 130.
    Finn, R. D., et al. (2014). Pfam: The protein families database. Nucleic Acids Research, 42(Database issue), D222–D230.PubMedCrossRefPubMedCentralGoogle Scholar
  131. 131.
    Matys, V., et al. (2003). TRANSFAC: Transcriptional regulation, from patterns to profiles. Nucleic Acids Research, 31(1), 374–378.PubMedPubMedCentralCrossRefGoogle Scholar
  132. 132.
    Warde-Farley, D., et al. (2010). The GeneMANIA prediction server: Biological network integration for gene prioritization and predicting gene function. Nucleic Acids Research, 38(Web Server issue), W214–W220.PubMedPubMedCentralCrossRefGoogle Scholar
  133. 133.
    Stark, C., Breitkreutz, B.-J., Reguly, T., Boucher, L., Breitkreutz, A., & Tyers, M. (2006). BioGRID: A general repository for interaction datasets. Nucleic Acids Research, 34(Database issue), D535–D539.PubMedCrossRefPubMedCentralGoogle Scholar
  134. 134.
    Zhang, B., & Horvath, S. (2005). A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology, 4(1), Article17.PubMedCrossRefPubMedCentralGoogle Scholar
  135. 135.
    Langfelder, P., & Horvath, S. (2008). WGCNA: An R package for weighted correlation network analysis. BMC Bioinformatics, 9(1), 559.PubMedPubMedCentralCrossRefGoogle Scholar
  136. 136.
    Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), 3047–3048.PubMedPubMedCentralCrossRefGoogle Scholar
  137. 137.
    Gregory, R., Warnes, R., Bolker, B., Bonebakker, L., Gentleman, M., Liaw, W. H. A., Lumley, T., Maechler, B., Magnusson, A., Moeller, S., Schwartz, M., & Venables, B. (2016). Various R programming tools for plotting data. R Package Version, 2(4), 1.Google Scholar
  138. 138.
    Walter, W., Sánchez-Cabo, F., & Ricote, M. (2015). GOplot: An R package for visually combining expression data with functional analysis. Bioinformatics, 31(17), 2912–2914.PubMedCrossRefPubMedCentralGoogle Scholar
  139. 139.
    Ghosh, D., & Poisson, L. M. (2009). “Omics” data and levels of evidence for biomarker discovery. Genomics, 93, 13–16.PubMedCrossRefPubMedCentralGoogle Scholar
  140. 140.
    Wheelock, A. M., & Wheelock, C. E. (2013). Trials and tribulations of ‘omics data analysis: Assessing quality of SIMCA-based multivariate models using examples from pulmonary medicine. Molecular BioSystems, 9, 2589.PubMedCrossRefPubMedCentralGoogle Scholar
  141. 141.
    Kraus, L. (2015). Editorial: Would you like a hypothesis with those data? Omics and the age of discovery science. Molecular Endocrinology, 29(11), 1531–1534.PubMedPubMedCentralCrossRefGoogle Scholar
  142. 142.
    Vaux, D. L., Fidler, F., & Cumming, G. (2012). Replicates and repeats—What is the difference and is it significant? A brief discussion of statistics and experimental design. EMBO Reports, 13(4), 291.PubMedPubMedCentralCrossRefGoogle Scholar
  143.
    Bell, G. (2016). Comment: Replicates and repeats. BMC Biology, 14, 28.
  144.
    Whitley, E., & Ball, J. (2002). Statistics review 4: Sample size calculations. Critical Care, 6(4), 335.
  145.
    Billoir, E., Navratil, V., & Blaise, B. J. (2015). Sample size calculation in metabolic phenotyping studies. Briefings in Bioinformatics, 16(5), 813–819.
  146.
    Urdan, T. C. (2010). Statistics in plain English (3rd ed.). New York: Routledge.
  147.
    Pett, M. A. (1997). Nonparametric statistics for health care research: Statistics for small samples and unusual distributions. Thousand Oaks, CA: Sage.
  148.
    Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society. Series B, 64(Part 3), 479–498.
  149.
    Feise, R. J. (2002). Do multiple outcome measures require p-value adjustment? BMC Medical Research Methodology, 2, 8.
  150.
    Chen, S. Y., Feng, Z., & Yi, X. (2017). A general introduction to adjustment for multiple comparisons. Journal of Thoracic Disease, 9(6), 1725–1729.
  151.
    Forshed, J. (2017). Experimental design in clinical ‘omics biomarker discovery. Journal of Proteome Research, 16, 3954–3960.
  152.
    Guyatt, G., Jaeschke, R., Heddle, N., Cook, D., Shannon, H., & Walter, S. (1995). Basic statistics for clinicians: 1. Hypothesis testing. CMAJ, 152(1), 27–32.
  153.
    Guyatt, G., Jaeschke, R., Heddle, N., Cook, D., Shannon, H., & Walter, S. (1995). Basic statistics for clinicians: 2. Interpreting study results: Confidence intervals. CMAJ, 152(2), 169–173.
  154.
    Guyatt, G., Walter, S., Shannon, H., Cook, D., Jaeschke, R., & Heddle, N. (1995). Basic statistics for clinicians: 4. Correlation and regression. CMAJ, 152(4), 497–504.
  155.
    Hanley, J. A., & Moodie, E. E. M. (2011). Sample size, precision and power calculations: A unified approach. Journal of Biometrics and Biostatistics, 2, 5.
  156.
    Ioannidis, J. P. A., Tarone, R., & McLaughlin, J. K. (2011). The false-positive to false-negative ratio in epidemiologic studies. Epidemiology, 22(4), 450–456.
  157.
    Jaeschke, R., Guyatt, G., Shannon, H., Walter, S., Cook, D., & Heddle, N. (1995). Basic statistics for clinicians: 3. Assessing the effects of treatment: Measures of association. CMAJ, 152(3), 351–357.
  158.
    Mazzocchi, F. (2015). Could big data be the end of theory in science? A few remarks on the epistemology of data-driven science. EMBO Reports, 16(10), 1250–1255.
  159.
    Rajasundaram, D., & Selbig, J. (2016). More effort — More results: Recent advances in integrative ‘omics’ data analysis. Current Opinion in Plant Biology, 30, 57–61.
  160.
    Senn, S., & Bretz, F. (2007). Power and sample size when multiple endpoints are considered. Pharmaceutical Statistics, 6, 161–170.
  161.
    Altmäe, S., Esteban, F. J., Stavreus-Evers, A., Simon, C., Giudice, L., Lessey, B. A., Horcajadas, J. A., Macklon, N. S., D’Hooghe, T., Campoy, C., Fauser, B. C., Salamonsen, L. A., & Salumets, A. (2014). Guidelines for the design, analysis and interpretation of ‘omics’ data: Focus on human endometrium. Human Reproduction Update, 20(1), 12–28.

Copyright information

© The American Physiological Society 2019

Authors and Affiliations

  • Surajit Bhattacharya (1)
  • Heather Gordish-Dressman (2, 3; corresponding author)
  1. Center for Genetic Medicine Research, Children’s National Medical Center, Washington, DC, USA
  2. Center for Translational Research, Children’s National Medical Center, Washington, DC, USA
  3. Department of Pediatrics, The George Washington University School of Medicine and Health Sciences, Washington, DC, USA
