How to Analyze Gene Expression Using RNA-Sequencing Data

Next Generation Microarray Bioinformatics

Part of the book series: Methods in Molecular Biology (MIMB, volume 802)

Abstract

RNA-Seq is emerging as a powerful method for transcriptome analysis that will eventually make microarrays obsolete for gene expression studies. Improvements in high-throughput sequencing and efficient sample barcoding now allow tens of samples to be run cost-effectively, competing with microarrays on price and surpassing them in performance. Still, most studies use microarrays, partly because of the ease of data analysis: programs and modules quickly turn raw microarray data into spreadsheets of gene expression values and significantly differentially expressed genes. In contrast, RNA-Seq data analysis is still in its infancy, and researchers face new challenges and must combine different tools to carry out an analysis. In this chapter, we provide a tutorial on RNA-Seq data analysis that enables researchers to quantify gene expression, identify splice junctions, and find novel transcripts using publicly available software. We focus on analyses performed in organisms for which a reference genome is available, and we discuss issues with current methodology that must be solved before RNA-Seq can reach its full potential.



Author information

Correspondence to Rickard Sandberg.

Copyright information

© 2012 Springer Science+Business Media, LLC

About this protocol

Cite this protocol

Ramsköld, D., Kavak, E., Sandberg, R. (2012). How to Analyze Gene Expression Using RNA-Sequencing Data. In: Wang, J., Tan, A., Tian, T. (eds) Next Generation Microarray Bioinformatics. Methods in Molecular Biology, vol 802. Humana Press. https://doi.org/10.1007/978-1-61779-400-1_17

  • DOI: https://doi.org/10.1007/978-1-61779-400-1_17

  • Publisher Name: Humana Press

  • Print ISBN: 978-1-61779-399-8

  • Online ISBN: 978-1-61779-400-1

  • eBook Packages: Springer Protocols
