How to Analyze Gene Expression Using RNA-Sequencing Data

Ramsköld, Daniel; Kavak, Ersen; Sandberg, Rickard

doi:10.1007/978-1-61779-400-1_17

Part of the book series: Methods in Molecular Biology (MIMB, volume 802)

Abstract

RNA-Seq is emerging as a powerful method for transcriptome analysis that will eventually make microarrays obsolete for gene expression analysis. Improvements in high-throughput sequencing and efficient sample barcoding now enable tens of samples to be run cost-effectively, competing with microarrays on price and surpassing them in performance. Still, most studies use microarrays, partly because of the ease of data analysis: programs and modules quickly turn raw microarray data into spreadsheets of gene expression values and significantly differentially expressed genes. RNA-Seq data analysis, in contrast, is still in its infancy, and researchers face new challenges and must combine different tools to carry out an analysis. In this chapter, we provide a tutorial on RNA-Seq data analysis that enables researchers to quantify gene expression, identify splice junctions, and find novel transcripts using publicly available software. We focus on analyses performed in organisms for which a reference genome is available and discuss issues with current methodology that must be solved before RNA-Seq can reach its full potential.

Author information

Authors and Affiliations

Department of Cell and Molecular Biology, Karolinska Institutet and Ludwig Institute for Cancer Research, Stockholm, Sweden
Daniel Ramsköld, Ersen Kavak & Rickard Sandberg

Corresponding author

Correspondence to Rickard Sandberg.

Editor information

Editors and Affiliations

Norwegian Radium Hospital, Oslo University Hospital, Montebello, Oslo, 0310, Norway
Junbai Wang
School of Medicine, University of Colorado Denver, E. 17th Avenue 12801, Aurora, 80010, Colorado, USA
Aik Choon Tan
School of Mathematics and Statistics, University of Glasgow, Glasgow, 3800, United Kingdom
Tianhai Tian

About this protocol

Cite this protocol

Ramsköld, D., Kavak, E., Sandberg, R. (2012). How to Analyze Gene Expression Using RNA-Sequencing Data. In: Wang, J., Tan, A., Tian, T. (eds) Next Generation Microarray Bioinformatics. Methods in Molecular Biology, vol 802. Humana Press. https://doi.org/10.1007/978-1-61779-400-1_17

DOI: https://doi.org/10.1007/978-1-61779-400-1_17
Published: 18 November 2011
Publisher Name: Humana Press
Print ISBN: 978-1-61779-399-8
Online ISBN: 978-1-61779-400-1
eBook Packages: Springer Protocols
