A censored-Poisson model based approach to the analysis of RNA-seq data

Chen, Xing; Lai, Yinglei

doi:10.1007/s40484-020-0208-3

Research Article
Published: 15 June 2020

Volume 8, pages 155–171, (2020)
Quantitative Biology

Abstract

Background

With recent advances in sequencing technology, the collection of RNA expression (RNA-seq) data has been growing rapidly. Statistically, RNA-seq data are count-type measurements, and the Poisson distribution is a basic probability distribution for modeling such data. With Poisson regression models, various experimental factors, GC content, and alternative splicing isoforms can be flexibly taken into account in RNA-seq data analysis. However, biases in RNA-seq data arising from the biochemical and technical limitations of sequencing technology have been recognized.

Methods

In this study, we propose an artificial censoring approach for an isoform-specific Poisson regression model for analyzing RNA-seq data. Low expression values can be grouped (censored) into one probability category, and high expression values can be grouped (censored) into another. We have implemented the related Newton-Raphson numerical procedure to obtain the maximum likelihood estimates for our censored-Poisson regression model, and we have derived the related mathematical simplifications to keep the numerical computation stable and convenient.
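The censoring idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the isoform-specific design is replaced by a generic design matrix `X`, the threshold names `lower` and `upper` are hypothetical, and a general-purpose quasi-Newton optimizer (BFGS from SciPy) stands in for the Newton-Raphson procedure described above. Counts at or below `lower` contribute the probability P(Y <= lower), counts at or above `upper` contribute P(Y >= upper), and all other counts contribute the ordinary Poisson probability.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson


def censored_poisson_negloglik(beta, X, y, lower, upper):
    """Negative log-likelihood of a doubly censored Poisson regression.

    Assumption for illustration: log link, mu = exp(X @ beta), lower < upper.
    """
    mu = np.exp(X @ beta)
    low = y <= lower    # grouped into the single event {Y <= lower}
    high = y >= upper   # grouped into the single event {Y >= upper}
    loglik = np.where(
        low,
        poisson.logcdf(lower, mu),
        np.where(
            high,
            poisson.logsf(upper - 1, mu),  # log P(Y >= upper)
            poisson.logpmf(y, mu),         # uncensored counts
        ),
    )
    return -loglik.sum()


def fit_censored_poisson(X, y, lower, upper):
    """Maximize the censored likelihood with BFGS (not Newton-Raphson).

    Assumes the first column of X is an intercept, used for the start value.
    """
    beta0 = np.zeros(X.shape[1])
    beta0[0] = np.log(np.mean(y) + 0.5)
    result = minimize(
        censored_poisson_negloglik,
        beta0,
        args=(X, y, lower, upper),
        method="BFGS",
    )
    return result.x


if __name__ == "__main__":
    # Simulated data only; the coefficients and thresholds are arbitrary.
    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, size=2000)
    X = np.column_stack([np.ones_like(x), x])
    y = rng.poisson(np.exp(X @ np.array([1.0, 0.5])))
    print(fit_censored_poisson(X, y, lower=1, upper=8))
```

Because the censored categories depend on the data only through whether a count falls in a tail, extreme values no longer influence the fit through their exact magnitude, which is the motivation given for the approach.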

Results

The advantages of our artificial censoring approach have been demonstrated in both simulation studies and an application to experimental data.

Conclusions

Our proposed artificial censoring approach allows us to focus on the majority of the data. Because the extreme values (tails) of the data are artificially censored, more efficient analysis results can be obtained, even from relatively simple Poisson regression models. The approach can also be considered for other well-developed models or methods for RNA-seq data analysis.

Author information

Authors and Affiliations

Department of Statistics, The George Washington University, Washington, DC, 20052, USA
Xing Chen & Yinglei Lai

Corresponding author

Correspondence to Yinglei Lai.

Additional information

Author summary: RNA sequencing (RNA-seq) expression data are increasingly collected for various biomedical studies. Biases in RNA-seq data arising from the biochemical and technical limitations of sequencing technology have been recognized. We have developed an artificial censoring approach to the analysis of isoform-specific RNA-seq expression data. Low and high expression values can be grouped (censored) into the related probability categories. This approach allows us to focus on the majority of the data and to obtain more efficient analysis results. Our proposed artificial censoring approach can also be considered in other RNA-seq data analysis scenarios.

About this article

Cite this article

Chen, X., Lai, Y. A censored-Poisson model based approach to the analysis of RNA-seq data. Quant Biol 8, 155–171 (2020). https://doi.org/10.1007/s40484-020-0208-3

Received: 06 January 2020
Revised: 31 March 2020
Accepted: 04 May 2020
Published: 15 June 2020
Issue Date: June 2020
DOI: https://doi.org/10.1007/s40484-020-0208-3
