Computational Methods for Quality Check, Preprocessing and Normalization of RNA-Seq Data for Systems Biology and Analysis

Mazzoni, Gianluca; Kadarmideen, Haja N.

doi:10.1007/978-3-319-43332-5_3

Computational Methods for Quality Check, Preprocessing and Normalization of RNA-Seq Data for Systems Biology and Analysis

Gianluca Mazzoni² &
Haja N. Kadarmideen²

Chapter
First Online: 29 October 2016

799 Accesses
2 Citations

Abstract

The use of RNA sequencing (RNA-Seq) technologies is increasing mainly due to the development of new next-generation sequencing machines that have reduced the costs and the time needed for data generation.

Nevertheless, microarrays are still the more common choice and one of the reasons is the complexity of the RNA-Seq data analysis. Furthermore, numerous biases can arise from RNA-Seq technology, and these biases have to be identified and removed properly in order to obtain accurate results.

Nowadays, many tools have been developed which allow to perform each step without high-level programming skills. However, each step of the pipeline needs to be performed carefully and requires a good knowledge of both the technology and the algorithms.

In this comprehensive review, we describe the fundamental steps of the pipeline for RNA-Seq analysis to identify differentially expressed genes: raw data quality control, trimming and filtering procedures, alignment, postmapping quality control, counting, normalization and differential expression test.

For each step, we present the most common tools and we give a complete description of their main characteristics and advantages focusing on the statistics that they perform and the assumptions that they make about the data.

The choice of the right tool can have a big impact on the final results. Until now, no gold standard has been established for this type of analysis.

In conclusion, this review can be useful for both educational purposes as well as for less experienced practitioners of animal genomic research. In the absence of a commonly accepted standard procedure, the general overview presented in this review can help to make the best choices during the implementation of an RNA-Seq pipeline.

This is a preview of subscription content, log in via an institution.

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 129.00; Price excludes VAT (USA)

Softcover Book: USD 169.99; Price excludes VAT (USA)

Hardcover Book: USD 169.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

References

Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biol 11(10):R106
Article CAS PubMed PubMed Central Google Scholar
Anders S, Pyl PT, Huber W (2014) HTSeq–A Python framework to work with high-throughput sequencing data. Bioinformatics btu638, 31(2):166–9
Google Scholar
Andrews S (2010) FastQC: a quality control tool for high throughput sequence data., Reference Source
Google Scholar
Benjamini Y, Speed TP (2012) Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res gks001, 40(10):e72
Google Scholar
Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC (2001) Minimum information about a microarray experiment (MIAME)—toward standards for microarray data. Nat Genet 29(4):365–371
Article CAS PubMed Google Scholar
Bullard JH, Purdom E, Hansen KD, Dudoit S (2010) Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinform 11(1):94
Article Google Scholar
Cochrane GR, Galperin MY (2010) The 2010 nucleic acids research database issue and online database collection: a community of data resources. Nucleic Acids Res 38(suppl 1):D1–D4
Article CAS PubMed Google Scholar
DeLuca DS, Levin JZ, Sivachenko A, Fennell T, Nazaire M-D, Williams C, Reich M, Winckler W, Getz G (2012) RNA-SeQC: RNA-seq metrics for quality control and process optimization. Bioinformatics 28(11):1530–1532
Article CAS PubMed PubMed Central Google Scholar
Dillies M-A, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J (2013) A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinform 14(6):671–683
Article CAS PubMed Google Scholar
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29(1):15–21
Article CAS PubMed Google Scholar
Engström PG, Steijger T, Sipos B, Grant GR, Kahles A, Rätsch G, Goldman N, Hubbard TJ, Harrow J, Guigó R (2013) Systematic evaluation of spliced alignment programs for RNA-seq data. Nat Methods 10(12):1185–1191
Article PubMed PubMed Central Google Scholar
FAANG (Functional Annotation of Animal Genomes). http://www.faang.org/
Fang Z, Martin J, Wang Z (2012) Statistical methods for identifying differentially expressed genes in RNA-Seq experiments. Cell Biosci 2(1):26
Article CAS PubMed PubMed Central Google Scholar
Garber M, Grabherr MG, Guttman M, Trapnell C (2011) Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods 8(6):469–477
Article CAS PubMed Google Scholar
García-Alcalde F, Okonechnikov K, Carbonell J, Cruz LM, Götz S, Tarazona S, Dopazo J, Meyer TF, Conesa A (2012) Qualimap: evaluating next-generation sequencing alignment data. Bioinformatics 28(20):2678–2679
Article PubMed Google Scholar
Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29(7):644–652
Article CAS PubMed PubMed Central Google Scholar
Hansen KD, Irizarry RA, Zhijin W (2012) Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics 13(2):204–216
Article PubMed PubMed Central Google Scholar
Hardcastle TJ, Kelly KA (2010) baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics 11(1):422
Google Scholar
Kim D, Salzberg SL (2011) TopHat-Fusion: an algorithm for discovery of novel fusion transcripts. Genome Biol 12(8):R72
Article CAS PubMed PubMed Central Google Scholar
Kroll KW, Mokaram NE, Pelletier AR, Frankhouser DE, Westphal MS, Stump PA, Stump CL, Bundschuh R, Blachly JS, Yan P (2014) Quality control for RNA-seq (QuaCRS): an integrated quality control pipeline. Cancer Inform 13(Suppl 3):7
PubMed PubMed Central Google Scholar
Kvam VM, Liu P, Si Y (2012) A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data. Am J Bot 99(2):248–256
Article PubMed Google Scholar
Langfelder P, Horvath S (2008) WGCNA: an R package for weighted correlation network analysis. BMC Bioinform 9(1):559
Article Google Scholar
Lassmann T, Hayashizaki Y, Daub CO (2011) SAMStat: monitoring biases in next generation sequencing data. Bioinformatics 27(1):130–131
Article CAS PubMed Google Scholar
Lin SM, Du P, Huber W, Kibbe WA (2008) Model-based variance-stabilizing transformation for Illumina microarray data. Nucleic Acids Res 36(2):e11–e11
Article PubMed PubMed Central Google Scholar
Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15(12):1–21
Article Google Scholar
Mazzoni G, Kogelman L, Suravajhala P, Kadarmideen H (2015) Systems genetics of complex diseases using RNA-sequencing methods. Int J Biosci Biochem Bioinform 5(4):264
Google Scholar
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5(7):621–628
Article CAS PubMed Google Scholar
Mutz K-O, Heilkenbrinker A, Lönne M, Walter J-G, Stahl F (2013) Transcriptome analysis using next-generation sequencing. Curr Opin Biotechnol 24(1):22–30
Article CAS PubMed Google Scholar
Oshlack A, Wakefield MJ (2009) Transcript length bias in RNA-seq data confounds systems biology. Biol Direct 4(1):14
Article PubMed PubMed Central Google Scholar
Oshlack A, Robinson MD, Young MD (2010) From RNA-seq reads to differential expression results. Genome Biol 11(12):220
Article CAS PubMed PubMed Central Google Scholar
Risso D, Schwartz K, Sherlock G, Dudoit S (2011) GC-content normalization for RNA-Seq data. BMC Bioinform 12(1):480
Article CAS Google Scholar
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015) limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res gkv007, 43(7):e47
Google Scholar
Robinson MD, Oshlack A (2010) A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 11(3):R25
Article PubMed PubMed Central Google Scholar
Robinson MD, McCarthy DJ, Smyth GK (2010) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26(1):139–140
Article CAS PubMed Google Scholar
Seyednasrollah F, Laiho A, Elo LL (2015) Comparison of software packages for detecting differential expression in RNA-seq studies. Brief Bioinform 16(1):59–70
Article PubMed Google Scholar
Soneson C, Delorenzi M (2013) A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinform 14(1):91
Article Google Scholar
Tarazona S, García F, Ferrer A, Dopazo J, Conesa A (2012) NOIseq: a RNA-seq differential expression method robust for sequencing depth biases. EMBnet J 17(B):18–19
Article Google Scholar
Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L (2013) Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol 31(1):46–53
Article CAS PubMed Google Scholar
Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10(1):57–63
Article CAS PubMed PubMed Central Google Scholar
Wang L, Wang S, Li W (2012) RSeQC: quality control of RNA-seq experiments. Bioinformatics 28(16):2184–2185
Article CAS PubMed Google Scholar
Williams AG, Thomas S, Wyman SK, Holloway AK (2014) RNA¯seq data: challenges in and recommendations for experimental design and analysis. Curr Protoc Hum Genet 11.13. 11–11.13. 20
Google Scholar
Williams CR, Baccarella A, Parrish JZ, Kim CC (2016) Trimming of sequence reads alters RNA-Seq gene expression estimates. BMC Bioinform 17(1):1
Article Google Scholar
Wysoker A, Tibbetts K, Fennell T (2012) Picard. http://picard.sourceforge.net.
Zhang ZH, Jhaveri DJ, Marshall VM, Bauer DC, Edson J, Narayanan RK, Robinson GJ, Lundberg AE, Bartlett PF, Wray NR (2014) A comparative study of techniques for differential expression analysis on RNA-Seq data 9(8):e103207
Google Scholar
Zhao S, Fung-Leung W-P, Bittner A, Ngo K, Liu X (2014) Comparison of RNA-Seq and microarray in transcriptome profiling of activated T cells. PLoS One 9(1)
Google Scholar
Zheng W, Chung LM, Zhao H (2011) Bias detection and correction in RNA-Sequencing data. Bmc Bioinform 12(1):290
Article CAS Google Scholar

Download references

Acknowledgments

We thank Programme Commission on Health, Food and Welfare of the Danish Council for Strategic Research (Innovationsfonden) for financial support within the GIFT project.

Author information

Authors and Affiliations

Department of Large Animal Sciences, University of Copenhagen, Groennegaardsvej 7, Frederiksberg C, 1870, Denmark
Gianluca Mazzoni & Haja N. Kadarmideen

Authors

Gianluca Mazzoni
View author publications
You can also search for this author in PubMed Google Scholar
Haja N. Kadarmideen
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Gianluca Mazzoni .

Editor information

Editors and Affiliations

Faculty of Health and Medical Sciences, University of Copenhagen , Frederiksberg C, Denmark
Haja N. Kadarmideen

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Mazzoni, G., Kadarmideen, H.N. (2016). Computational Methods for Quality Check, Preprocessing and Normalization of RNA-Seq Data for Systems Biology and Analysis. In: Kadarmideen, H. (eds) Systems Biology in Animal Production and Health, Vol. 2. Springer, Cham. https://doi.org/10.1007/978-3-319-43332-5_3

Download citation

DOI: https://doi.org/10.1007/978-3-319-43332-5_3
Published: 29 October 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43330-1
Online ISBN: 978-3-319-43332-5
eBook Packages: Biomedical and Life SciencesBiomedical and Life Sciences (R0)

Publish with us

Policies and ethics