A Guide for Designing and Analyzing RNA-Seq Data

  • Aniruddha Chatterjee
  • Antonio Ahn
  • Euan J. Rodger
  • Peter A. Stockwell
  • Michael R. Eccles
Part of the Methods in Molecular Biology book series (MIMB, volume 1783)


The identity of a cell or an organism is defined, at least in part, by its gene expression, and analyzing gene expression therefore remains one of the most frequently performed experimental techniques in molecular biology. The development of RNA sequencing (RNA-Seq) offers an unprecedented opportunity to analyze the expression of protein-coding and noncoding RNA, and to perform de novo transcript assembly for a new species or organism. However, the planning and design of an RNA-Seq experiment have important implications for addressing the desired biological question and for maximizing the value of the data obtained. In addition, RNA-Seq generates a huge volume of data, and accurate analysis of these data involves several distinct steps and choices of tools. This can be challenging and overwhelming, especially for bench scientists. In this chapter, we describe an entire workflow for performing RNA-Seq experiments. We describe critical aspects of the wet lab work, such as RNA isolation and library preparation, as well as the initial design of an experiment. Further, we provide a step-by-step description of the bioinformatics workflow for RNA-Seq data analysis. This includes power calculations, setting up a computational environment, acquisition and processing of publicly available data if desired, quality control measures, preprocessing steps for the raw data, differential expression analysis, and data visualization. For each step, we highlight important considerations to provide a guide for designing and analyzing RNA-Seq data.

Key words

RNA-Seq, Genome, Gene expression, Differential expression, Transcript, Sequencing, Sequenced read



A.C. and M.R.E. are grateful to the New Zealand Institute for Cancer Research Trust for supporting their respective positions.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Aniruddha Chatterjee (1, 2)
  • Antonio Ahn (1)
  • Euan J. Rodger (1, 2)
  • Peter A. Stockwell (3)
  • Michael R. Eccles (1, 2)

  1. Department of Pathology, Dunedin School of Medicine, University of Otago, Dunedin, New Zealand
  2. Maurice Wilkins Centre for Molecular Biodiscovery, Auckland, New Zealand
  3. Department of Biochemistry, University of Otago, Dunedin, New Zealand
