
Quality Control Metrics for Extraction-Free Targeted RNA-Seq Under a Compositional Framework

  • Dominic LaRoche (Email author)
  • Dean Billheimer
  • Kurt Michels
  • Bonnie LaFleur
Conference paper
Part of the Springer Proceedings in Mathematics & Statistics book series (PROMS, volume 218)

Abstract

The rapid rise in the use of RNA sequencing technology (RNA-Seq) for scientific discovery has led to its consideration as a clinical diagnostic tool. However, because RNA-Seq is a new technology, its analytical accuracy and reproducibility must be established before it can realize its full clinical utility (SEQC/MAQC-III Consortium, 2014; VanKeuren-Jensen et al. 2014). We respond to the need for reliable diagnostics, quality control metrics and improved reproducibility of RNA-Seq data by recognizing and capitalizing on the relative frequency nature of RNA-Seq data.

Problems with sample quality, library preparation, or sequencing may result in a low number of reads allocated to a given sample within a sequencing run. We propose a method, based on outlier detection of Centered Log-Ratio (CLR) transformed counts, for objectively identifying problematic samples based on the total number of reads allocated to the sample.

Normalization and standardization methods for RNA-Seq generally assume that the total number of reads assigned to a sample does not affect the observed relative frequencies of probes within an assay. This assumption, known as Compositional Invariance, is an important property of RNA-Seq data that enables the comparison of samples with differing read depths. Violations of the invariance property can lead to spurious differential expression results, even after normalization. We develop a diagnostic method to identify violations of the Compositional Invariance property.

Batch effects arising from differing laboratory conditions or operator differences have been identified as a problem in high-throughput measurement systems (Leek et al. in Nat Rev Genet 11, 733–739 [14]; Chen et al. in PLoS One 6 [10]). Batch effects are typically identified with a hierarchical clustering (HC) method or principal components analysis (PCA). For both methods, the multivariate distance between the samples is visualized, either in a biplot for PCA or a dendrogram for HC, to check for the existence of clusters of samples related to batch. We show that CLR-transformed RNA-Seq data are appropriate for evaluation in a PCA biplot and improve batch effect detection over current methods.

As RNA-Seq makes the transition from the research laboratory to the clinic, there is a need for robust quality control metrics. The realization that RNA-Seq data are compositional opens the door to the existing body of theory and methods developed by Aitchison (The statistical analysis of compositional data, Chapman & Hall Ltd., 1986) and others. We show that the properties of compositional data can be leveraged to develop new metrics and improve existing methods.
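
The ideas summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the per-sample read totals within one run are treated as a composition, that Tukey's boxplot fence (with the conventional multiplier 1.5) serves as the outlier rule, and that a fixed pseudocount of 0.5 stands in for a proper treatment of zero counts. The function names, the pseudocount, and the example data are all hypothetical.

```python
import numpy as np

def clr(x, axis=-1):
    """Centered log-ratio: log of each part minus the mean log along `axis`."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean(axis=axis, keepdims=True)

def flag_low_read_samples(total_reads, k=1.5):
    """Flag samples whose CLR-transformed read total lies below Tukey's
    lower fence, Q1 - k * IQR (assumed outlier rule, not the paper's)."""
    z = clr(total_reads)
    q1, q3 = np.percentile(z, [25, 75])
    return z < q1 - k * (q3 - q1)

def clr_pca_scores(counts, pseudocount=0.5, n_components=2):
    """PCA scores of CLR-transformed counts (rows = samples, columns = probes).
    The pseudocount is a simple stand-in for zero replacement."""
    z = clr(np.asarray(counts, dtype=float) + pseudocount, axis=1)
    z = z - z.mean(axis=0, keepdims=True)      # center each probe
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Hypothetical read totals for eight samples in one sequencing run;
# the last sample received far fewer reads than the others.
total_reads = np.array([1.2e6, 1.1e6, 0.9e6, 1.3e6,
                        1.0e6, 1.15e6, 0.95e6, 2.0e4])
flags = flag_low_read_samples(total_reads)
print(flags)  # only the last sample is flagged

# Hypothetical count matrix: 8 samples by 20 probes.
rng = np.random.default_rng(0)
counts = rng.poisson(50, size=(8, 20))
scores = clr_pca_scores(counts)  # plot scores[:, 0] against scores[:, 1], colored by batch
```

Because the CLR subtracts the mean log from every part, multiplying all counts in a sample by a constant leaves the transformed values unchanged. That is why samples with different read depths can be compared on this scale whenever Compositional Invariance holds.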

Keywords

RNA-Seq, Next generation sequencing, Composition, Quality control, Relative abundance, Normalization

References

  1. Aitchison, J.: The Statistical Analysis of Compositional Data. Chapman & Hall, Ltd. (1986). http://dl.acm.org/citation.cfm?id=17272
  2. Aitchison, J.: On criteria for measures of compositional difference. Math Geol 24(4), 365–379 (1992). https://doi.org/10.1007/BF00891269. http://link.springer.com/10.1007/BF00891269
  3. Aitchison, J., Greenacre, M.: Biplots of compositional data. J R Stat Soc Series C (Appl Stat) 51(4), 375–392 (2002). https://doi.org/10.1111/1467-9876.00275. http://doi.wiley.com/10.1111/1467-9876.00275
  4. Aitchison, J., Barceló-Vidal, C., Martín-Fernández, J.A., Pawlowsky-Glahn, V.: Logratio analysis and compositional distance. Math Geol 32(3), 271–275 (2000). https://doi.org/10.1023/A:1007529726302
  5. Aitchison, J., Shen, S.: Logistic-normal distributions: some properties and uses. Biometrika 67(2), 261–272 (1980). https://doi.org/10.1093/biomet/67.2.261. https://www.researchgate.net/publication/229099731_Logistic-Normal_Distributions_Some_Properties_and_Uses
  6. Anders, S., Huber, W.: Differential expression analysis for sequence count data. Genome Biol 11(10), R106 (2010). https://doi.org/10.1186/gb-2010-11-10-r106. http://www.biomedcentral.com/content/pdf/gb-2010-11-10-r106.pdf
  7. Ben-Gal, I.: Outlier detection. In: Data Mining and Knowledge Discovery Handbook, pp. 117–130. Springer, US (2009). https://doi.org/10.1007/978-0-387-09823-4_7. http://link.springer.com/10.1007/978-0-387-09823-4_7
  8. Billheimer, D., Guttorp, P., Fagan, W.F.: Statistical interpretation of species composition. J Am Stat Assoc 96(456), 1205–1214 (2001). https://doi.org/10.1198/016214501753381850. http://www.jstor.org/stable/3085883
  9. Bolstad, B.M., Irizarry, R.A., Astrand, M., Speed, T.P.: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics (Oxford, England) 19(2), 185–193 (2003). http://www.ncbi.nlm.nih.gov/pubmed/12538238
  10. Chen, C., Grennan, K., Badner, J., Zhang, D., Gershon, E., Jin, L., Liu, C.: Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods. PLoS One 6(2) (2011). https://doi.org/10.1371/journal.pone.0017238
  11. Dillies, M.A., Rau, A., Aubert, J., Hennequet-Antier, C., Jeanmougin, M., Servant, N., Keime, C., Marot, N.S., Castel, D., Estelle, J., Guernec, G., Jagla, B., Jouneau, L., Laloe, D., Le Gall, C., Schaeffer, B., Le Crom, S., Guedj, M., Jaffrezic, F.: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings Bioinform 14(6), 671–683 (2013). https://doi.org/10.1093/bib/bbs046
  12. Hawkins, D.M.: Identification of Outliers. Springer Netherlands, Dordrecht (1980). https://doi.org/10.1007/978-94-015-3994-4. http://link.springer.com/10.1007/978-94-015-3994-4
  13. Law, C.W., Chen, Y., Shi, W., Smyth, G.K.: voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol 15(2), R29 (2014). https://doi.org/10.1186/gb-2014-15-2-r29. http://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29
  14. Leek, J.T., Scharpf, R.B., Bravo, H.C., Simcha, D., Langmead, B., Johnson, W.E., Geman, D., Baggerly, K., Irizarry, R.A.: Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 11(10), 733–739 (2010). https://doi.org/10.1038/nrg2825. http://dx.doi.org/10.1038/nrg2825
  15. Lovell, D., Müller, W., Taylor, J., Zwart, A., Helliwell, C.: Proportions, percentages, PPM: do the molecular biosciences treat compositional data right? In: Compositional Data Analysis: Theory and Applications, pp. 191–207. Wiley (2011). https://doi.org/10.1002/9781119976462.ch14. http://dx.doi.org/10.1002/9781119976462.ch14
  16. Lovell, D., Pawlowsky-Glahn, V., Egozcue, J.J., Marguerat, S., Bähler, J.: Proportionality: a valid alternative to correlation for relative data. PLoS Comput Biol 11(3), e1004075 (2015). https://doi.org/10.1371/journal.pcbi.1004075. http://www.ncbi.nlm.nih.gov/pubmed/25775355
  17. Luo, J., Schumacher, M., Scherer, A., Sanoudou, D., Megherbi, D., Davison, T., Shi, T., Tong, W., Shi, L., Hong, H., Zhao, C., Elloumi, F., Shi, W., Thomas, R., Lin, S., Tillinghast, G., Liu, G., Zhou, Y., Herman, D., Li, Y., Deng, Y., Fang, H., Bushel, P., Woods, M., Zhang, J.: A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data. Pharmacogenomics J 10(4), 278–291 (2010). https://doi.org/10.1038/tpj.2010.57. http://www.ncbi.nlm.nih.gov/pubmed/20676067. www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC2920074
  18. Martín-Fernández, J.A., Barceló-Vidal, C., Pawlowsky-Glahn, V.: Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Math Geol 35(3), 253–278 (2003). https://doi.org/10.1023/A:1023866030544. http://link.springer.com/article/10.1023/A%3A1023866030544
  19. Martín-Fernández, J.A., Barceló-Vidal, C., Pawlowsky-Glahn, V., Buccianti, A., Nardi, G., Potenza, R.: Measures of difference for compositional data and hierarchical clustering methods. In: Proceedings of IAMG, vol. 98, no. 1, pp. 526–531 (1998)
  20. Martín-Fernández, J.A., Hron, K., Templ, M., Filzmoser, P., Palarea-Albaladejo, J.: Bayesian multiplicative treatment of count zeros in compositional data sets. Stat Model 15(2), 134–158 (2015). http://ezproxy.library.arizona.edu/login?url=https://search-proquest-com.ezproxy1.library.arizona.edu/docview/1673859465?accountid=8360
  21. Pearson, K.: Mathematical contributions to the theory of evolution. On a form of spurious correlation which may arise when indices are used in the measurement of organs. Proc R Soc Lond 60, 489–498 (1896). https://archive.org/details/philtrans00847732
  22. Robinson, M.D., Oshlack, A.: A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 11(3), R25 (2010). https://doi.org/10.1186/gb-2010-11-3-r25
  23. Robinson, M.D., Smyth, G.K.: Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 9(2), 321–332 (2007). https://doi.org/10.1093/biostatistics/kxm030. http://biostatistics.oxfordjournals.org/cgi/doi/10.1093/biostatistics/kxm030
  24. Sanford, R.F., Pierson, C.T., Crovelli, R.A.: An objective replacement method for censored geochemical data. Math Geol 25(1), 59–80 (1993). https://doi.org/10.1007/BF00890676. http://link.springer.com/10.1007/BF00890676
  25. Sims, D., Sudbery, I., Ilott, N.E., Heger, A., Ponting, C.P.: Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet 15(2), 121–132 (2014). https://doi.org/10.1038/nrg3642. http://www.nature.com/doifinder/10.1038/nrg3642
  26. Tarazona, S., García-Alcalde, F., Dopazo, J., Ferrer, A., Conesa, A.: Differential expression in RNA-seq: a matter of depth. Genome Res 21(12), 2213–2223 (2011). https://doi.org/10.1101/gr.124321.111. http://www.ncbi.nlm.nih.gov/pubmed/21903743. www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC3227109
  27. Tukey, J.W.: Exploratory Data Analysis. Addison-Wesley Publishing Co. (1977)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Dominic LaRoche (1), Email author
  • Dean Billheimer (2)
  • Kurt Michels (1)
  • Bonnie LaFleur (1)
  1. HTG Molecular Diagnostics, Inc., Tucson, USA
  2. Department of Biostatistics, Mel and Enid Zuckerman College of Public Health, University of Arizona, Tucson, USA
