The Role of Spike-In Standards in the Normalization of RNA-seq
Normalization of RNA-seq data is essential to ensure accurate inference of expression levels, by adjusting for sequencing depth and other more complex nuisance effects, both within and between samples. Recently, the External RNA Control Consortium (ERCC) developed a set of 92 synthetic spike-in standards that are commercially available and relatively easy to add to a typical library preparation. In this chapter, we compare the performance of several state-of-the-art normalization methods, including adaptations that directly use spike-in sequences as controls. We show that although the ERCC spike-ins could in principle be valuable for assessing accuracy in RNA-seq experiments, their read counts are not stable enough to be used for normalization purposes. We propose a novel approach to normalization that can successfully make use of control sequences to remove unwanted effects and lead to accurate estimation of expression fold-changes and tests of differential expression.
Keywords: Read Count; Empirical Cumulative Distribution Function; Unwanted Variation; Loess Normalization; Negative Control Sequence
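To make the two uses of control sequences mentioned in the abstract concrete, the following is a minimal sketch, not the chapter's implementation. It assumes a features-by-samples count matrix and a boolean mask marking the control (e.g., spike-in) rows. The function names `scaling_factors` and `remove_control_factors`, the log(count + 1) transform, and the choice of `k` are all hypothetical choices made for illustration. The first function computes per-sample scaling factors from column totals or a quantile, optionally restricted to the controls. The second is a simplified factor-analysis step in the spirit of removing unwanted variation: it estimates nuisance factors from the controls alone and regresses them out of every feature.

```python
import numpy as np


def scaling_factors(counts, rows=None, quantile=None):
    """Per-sample scaling factors, rescaled to have geometric mean 1.

    counts:   (features x samples) array of read counts.
    rows:     optional boolean mask selecting the features to use
              (e.g. only the spike-in controls).
    quantile: None uses column totals; otherwise the given quantile of the
              features with at least one read (0.75 = upper-quartile scaling).
    """
    sub = counts if rows is None else counts[rows]
    if quantile is None:
        raw = sub.sum(axis=0).astype(float)
    else:
        expressed = sub[sub.sum(axis=1) > 0]  # drop all-zero features
        raw = np.quantile(expressed, quantile, axis=0)
    # Assumes every sample has a positive raw factor.
    return raw / np.exp(np.mean(np.log(raw)))


def remove_control_factors(counts, is_control, k=1):
    """Regress out k nuisance factors estimated from control features.

    Assumption (hypothetical sketch): controls carry only unwanted
    variation, so the leading singular vectors of their centered log
    counts approximate the nuisance factors shared by all features.
    Returns corrected values on the log(count + 1) scale.
    """
    log_y = np.log(counts + 1.0).T               # samples x features
    centered = log_y - log_y.mean(axis=0)        # center each feature
    u, s, _ = np.linalg.svd(centered[:, is_control], full_matrices=False)
    w = u[:, :k] * s[:k]                         # samples x k factors
    alpha, *_ = np.linalg.lstsq(w, centered, rcond=None)  # k x features
    return (log_y - w @ alpha).T                 # features x samples


if __name__ == "__main__":
    # Toy data: 4 genes and 2 controls, three samples differing only in depth.
    base = np.array([10.0, 20.0, 40.0, 80.0, 5.0, 50.0])
    depth = np.array([1.0, 2.0, 4.0])
    toy = np.outer(base, depth)
    controls = np.array([False, False, False, False, True, True])

    print("total-count factors:   ", scaling_factors(toy))
    print("upper-quartile factors:", scaling_factors(toy, quantile=0.75))
    print("control-only factors:  ", scaling_factors(toy, rows=controls))
    print("corrected log counts:\n", remove_control_factors(toy, controls, k=1))
```

Dividing each column of the count matrix by its factor gives scaled counts. Control-only scaling is appropriate only when the control read counts are stable across libraries, which is the property the abstract reports the ERCC spike-ins lack. The factor-based function does not rely on a single per-sample scale, but it inherits the assumption that the controls are unaffected by the biology of interest.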
We thank Leming Shi for providing the SEQC pilot data and Laurent Jacob for his help with the software implementation of the RUV method.