Abstract
RNA-Seq is arising as a powerful method for transcriptome analyses that will eventually make microarrays obsolete for gene expression analyses. Improvements in high-throughput sequencing and efficient sample barcoding are now enabling tens of samples to be run in a cost-effective manner, competing with microarrays in price, excelling in performance. Still, most studies use microarrays, partly due to the ease of data analyses using programs and modules that quickly turn raw microarray data into spreadsheets of gene expression values and significant differentially expressed genes. Instead RNA-Seq data analyses are still in its infancy and the researchers are facing new challenges and have to combine different tools to carry out an analysis. In this chapter, we provide a tutorial on RNA-Seq data analysis to enable researchers to quantify gene expression, identify splice junctions, and find novel transcripts using publicly available software. We focus on the analyses performed in organisms where a reference genome is available and discuss issues with current methodology that have to be solved before RNA-Seq data can utilize its full potential.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10:57–63
Wang ET, Sandberg R, Luo S et al (2008) Alternative isoform regulation in human tissue transcriptomes. Nature 456:470–476
Pan Q, Shai O, Lee L et al (2008) Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat Genet 40:1413–1415
Yoder-Himes DR, Chain PSG, Zhu Y et al (2009) Mapping the Burkholderia cenocepacia niche response via high-throughput sequencing. Proc Natl Acad Sci USA 106:3976–3981
Armour CD, Castle JC, Chen R et al (2009) Digital transcriptome profiling using selective hexamer priming for cDNA synthesis. Nat Methods 6:647–649
Core LJ, Waterfall JJ and Lis JT (2008) Nascent RNA sequencing reveals widespread pausing and divergent initiation at human promoters. Science 322:1845–1848
Ingolia NT, Ghaemmaghami S, Newman JRS et al (2009) Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324:218–223
Metzker ML (2010) Sequencing technologies – the next generation. Nat Rev Genet 11:31–46
Mortazavi A, Williams BA, McCue K et al (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5:621–628
Guttman M, Garber M, Levin JZ et al (2010) Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat Biotechnol 28:503–510
Trapnell C, Williams BA, Pertea G et al (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 28:511–515
Sequence Read Archive. http://www.ncbi.nlm.nih.gov/sra.
Gene Expression Omnibus. http://www.ncbi.nlm.nih.gov/geo.
Ewing B, Hillier L, Wendl MC et al (1998) Base-calling of automated sequencer traces using phred I accuracy assessment. Genome Res 8:175–185
Cock PJA, Fields CJ, Goto N et al (2010) The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 38:1767–1771
Giardine B, Riemer C, Hardison RC et al (2005) Galaxy: a platform for interactive large-scale genome analysis. Genome Res 15:1451–1455
Stajich JE, Block D, Boulez K et al (2002) The Bioperl toolkit: Perl modules for the life sciences. Genome Res 12:1611–1618
Cock PJA, Antao T, Chang JT et al (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25:1422–1423
NCBI (2010) Sequence Read Archive Submission Guidelines. http://www.ncbi.nlm.nih.gov/Traces/sra/static/SRA_Submission_Guidelines.pdf. Accessed 2 Nov 2010
SOLiD Sequence Read Format package. http://solidsoftwaretools.com/gf/project/srf/
Staden IO module. http://staden.sourceforge.net/
Sequenceread package http://sourceforge.net/projects/sequenceread/
Pepke S, Wold B, Mortazavi A (2009) Computation for ChIP-seq and RNA-seq studies. Nat Methods 6:S22-S32
Dohm JC, Lottaz C, Borodina T et al (2008) Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res 36:e105
Langmead B, Trapnell C, Pop M et al (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25
Trapnell C, Pachter L and Salzberg SL (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25:1105–1111
Chen Y, Souaiaia T and Chen T (2009) PerM: efficient mapping of short sequencing reads with periodic full sensitive spaced seeds. Bioinformatics 25:2514–2521
Galaxy. http://g2.bx.psu.edu
Galaxy Experimental Features. http://test.g2.bx.psu.edu
Novoalign. http://www.novocraft.com
Homer N, Merriman B, Nelson SF (2009) BFAST: an alignment tool for large scale genome resequencing. PLoS ONE 4:e7767
Ozsolak F, Platt AR, Jones DR et al (2009) Direct RNA sequencing. Nature 461:814–818
UCSC Genome Browser FAQ File Formats. http://genome.ucsc.edu/FAQ/FAQformathtml#format1
RNA-Seq files at sandberg lab homepage. http://sandberg.cmb.ki.se/rnaseq/
Python. http://www.python.org
Li H, Handsaker B, Wysoker A et al (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078–2079
UCSC Genome Browser Downloads. http://hgdownload.cse.ucsc.edu/downloads.html
van Bakel H, Nislow C, Blencowe BJ et al (2010) Most “dark matter” transcripts are associated with known genes. PLoS Biol 8:e1000371
Integrative Genome Browser. http://www.broadinstitute.org/igv
Sandberg R, Neilson JR, Sarma A et al (2008) Proliferating cells express mRNAs with shortened 3′ untranslated regions and fewer microRNA target sites. Science 320:1643–7
Neilson JR and Sandberg R (2010) Heterogeneity in mammalian RNA 3′ end formation. Exp Cell Res 316:1357–1364
Ramsköld D, Wang ET, Burge CB et al (2009) An abundance of ubiquitously expressed genes revealed by tissue transcriptome sequence data. PLoS Comput Biol 5:e1000598
Montgomery SB, Sammeth M, Gutierrez-Arcelus M et al (2010) Transcriptome genetics using second generation sequencing in a Caucasian population. Nature 464:773–777
NumPy. http://numpy.scipy.org
Kent WJ, Zweig AS, Barber G et al (2010) BigWig and BigBed: enabling browsing of large distributed datasets. Bioinformatics 26:2204–2207
UCSC stand-alone bioinformatic programs. http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/
UCSC Mappability Data. http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/
Marioni JC, Mason CE, Mane SM et al (2008) RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res 18:1509–1517
Allison DB, Cui X, Page GP et al (2006) Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet 7:55–65
Robinson MD, McCarthy DJ and Smyth GK (2010) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26:139–140
Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biol 11:R106
Bioconductor, http://www.bioconductor.org/
Li H, Ruan J, Durbin R (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 18:1851–1858
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2012 Springer Science+Business Media, LLC
About this protocol
Cite this protocol
Ramsköld, D., Kavak, E., Sandberg, R. (2012). How to Analyze Gene Expression Using RNA-Sequencing Data. In: Wang, J., Tan, A., Tian, T. (eds) Next Generation Microarray Bioinformatics. Methods in Molecular Biology, vol 802. Humana Press. https://doi.org/10.1007/978-1-61779-400-1_17
Download citation
DOI: https://doi.org/10.1007/978-1-61779-400-1_17
Published:
Publisher Name: Humana Press
Print ISBN: 978-1-61779-399-8
Online ISBN: 978-1-61779-400-1
eBook Packages: Springer Protocols