NBLDA: negative binomial linear discriminant analysis for RNASeq data
 1.4k Downloads
 4 Citations
Abstract
Background
RNAsequencing (RNASeq) has become a powerful technology to characterize gene expression profiles because it is more accurate and comprehensive than microarrays. Although statistical methods that have been developed for microarray data can be applied to RNASeq data, they are not ideal due to the discrete nature of RNASeq data. The Poisson distribution and negative binomial distribution are commonly used to model count data. Recently, Witten (Annals Appl Stat 5:2493–2518, 2011) proposed a Poisson linear discriminant analysis for RNASeq data. The Poisson assumption may not be as appropriate as the negative binomial distribution when biological replicates are available and in the presence of overdispersion (i.e., when the variance is larger than or equal to the mean). However, it is more complicated to model negative binomial variables because they involve a dispersion parameter that needs to be estimated.
Results
In this paper, we propose a negative binomial linear discriminant analysis for RNASeq data. By Bayes’ rule, we construct the classifier by fitting a negative binomial model, and propose some plugin rules to estimate the unknown parameters in the classifier. The relationship between the negative binomial classifier and the Poisson classifier is explored, with a numerical investigation of the impact of dispersion on the discriminant score. Simulation results show the superiority of our proposed method. We also analyze two real RNASeq data sets to demonstrate the advantages of our method in realworld applications.
Conclusions
We have developed a new classifier using the negative binomial model for RNAseq data classification. Our simulation results show that our proposed classifier has a better performance than existing works. The proposed classifier can serve as an effective tool for classifying RNAseq data. Based on the comparison results, we have provided some guidelines for scientists to decide which method should be used in the discriminant analysis of RNASeq data. R code is available at http://www.comp.hkbu.edu.hk/~xwan/NBLDA.Ror https://github.com/yangchadam/NBLDA
Keywords
RNASeq Negative binomial distribution Linear discriminant analysisBackground
RNAsequencing (RNASeq) is a revolutionary technology that uses the capabilities of nextgeneration sequencing to infer gene expression levels [1, 2, 3]. Compared to microarray technology, RNASeq has many advantages including the detection of novel transcripts, low background signal, and the increased specificity and sensitivity. Due to reduced sequencing cost, RNASeq has been widely used in biomedical research in recent years [4]. In general, three major steps are involved in RNAseq: (1) RNA is isolated from biopsy or serum sample and segmented to an average length of 200 nucleotides; (2) The RNA segments are converted into cDNA; and (3) The cDNA is sequenced. RNAseq usually produces millions of short reads, between 25 and 300 basepairs in length. The reads are then mapped to genomic or transcriptomic regions of interest.
RNAseq is different to the microarray technology that measures the level of gene expression on a continuous scale. It counts the number of reads that are mapped to one gene and measures the level of gene expression with nonnegative integers. As a result, popular tools that assume a Gaussian distribution in microrray data analysis, such as linear discriminant analysis, may not perform as well as those methods that adopt appropriate discrete distributions for RNASeq data. Law et al. [5] recently proposed applying normalbased microarraylike statistical methods to the data transformed from RNAseq read counts. Simulation studies show that this approach performs as well or better than countbased RNAseq methods particularly when the number of replicates is large. However, conducting transformation may remove the count nature of the data [6, 7]. McCarthy et al. [8] pointed out that the transformation is not fully tuned to the characteristics of read count data, and provided more detailed reasons why the transformation is inappropriate. First, they stated that for very small counts, they are far from normally distributed after transformation. Second, the strong meanvariance relationship which the count data shows is ignored and this will lead to inefficient inference.
For RNASeq data, the Poisson distribution and negative binomial distribution are two common distributions considered in the expression detection and classification. Many methods have been proposed to detect differentially expressed genes, including edgeR [9, 10], DESeq2 [11], baySeq [12], BBSeq [13], SAMseq [14], DSS [15], AMAP [16], sSeq [17], and LFCseq [18]. However, there is less progress in classification using RNASeq data until recently. Witten [19] proposed a Poisson linear discriminant analysis (PLDA) which assumes that RNASeq data follow the Poisson distribution. Tan et al. [20] further discussed many methods, such as logistic regression and partial least squares, and showed that PLDA is a comparable method. The Poisson distribution is suitable for modeling RNASeq data when biological replicates are not available. However, if biological replicates are available, the Poisson distribution may not be a proper choice owing to the overdispersion issue, where the variances of such data are likely to exceed their means [11, 16]. The overdispersion issue can have a significant effect on classification accuracies. In realworld applications, biological replicates can provide more convincing results than technical replicates. Therefore, it is necessary to look for some solutions to take the overdispersion issue into consideration.
We note that Witten [19] has considered this problem and pointed out that the classification accuracy can be further improved for overdispersed data by extending the Poisson model to the negative binomial model. However, to construct an appropriate negative binomial classifier for practical use, two major issues remain to be solved. The first issue is that the probability density function (pdf) of the negative binomial distribution is more complicated than that of the Poisson distribution, which gives rise to a more complicated classifier. The second issue is that the negative binomial distribution contains a dispersion parameter, which controls how much its variance exceeds its mean. To construct the classifier using the negative binomial model, we need to estimate the dispersion parameter. To avoid fitting the complicated negative binomial model, Witten [19] proposed a transformation method for the overdispersed data and found that this method works well if the overdispersion is mild.
In light of the importance of the dispersion in modelling RNASeq data with the negative binomial distribution, some dispersion estimation methods have been proposed recently in the literature. For example, Wu et al. [15] proposed a dispersion estimator using the empirical Bayes method and applied it to find differentially expressed genes. Yu et al. [17] proposed a shrinkage estimator of dispersion which shrinks the estimates obtained by the method of moments towards a target value, and also applied it to detect differentially expressed genes. These new methods for estimating the dispersion parameter make it possible to construct a negative binomial classifier to achieve better classification accuracy on RNASeq data.
 1.
We extend Witten’s method [19] by building a new classifier based on the negative binomial model. Under the assumption of independent genes, we define the discriminant score by Bayes’ rule and propose some plugin rules to estimate the unknown parameters in the classifier.
 2.
We further explore the relationship between NBLDA and PLDA. A numerical comparison is conducted to explore how the dispersion changes the discriminant score. The comparison results will provide some guidelines for scientists to decide which method should be used in the discriminant analysis of RNASeq data.
To demonstrate the performance of our proposed method, we conduct several simulation studies under different numbers of genes, sample sizes, and proportions of differentially expressed genes. Simulation results show that the proposed NBLDA outperforms existing methods in many settings. Three real RNASeq data sets are also analyzed to demonstrate the advantages of NBLDA. Specifically, we propose the negative binomial classifier, explore the relationship between NBLDA and PLDA, and present the parameter estimation in Section “Methods”. Simulation studies and real data analysis are conducted in Sections “Results” and “Discussion”, respectively. We conclude the paper with some discussions in Section “Conclusions”.
Methods
Let X _{ ig } denote the number of reads mapped to gene g in sample i, i=1,…,n and g=1,…,G. To identify which class a new observation belongs to, Witten [19] proposed a PLDA for classifying RNASeq data. In this section, we propose a new discriminant analysis for RNASeq data by assuming that the data follow the negative binomial distribution.
Negative binomial linear discriminant analysis
where s _{ i } is the size factor which is used to scale gene counts for the ith sample due to different sequencing depth, λ _{ g } is the total number of reads per gene, and ϕ _{ g }≥0 is the dispersion parameter. We have E(X _{ ig })=μ _{ ig } and \(\text {Var}(X_{ig})=\mu _{ig}+\mu _{ig}^{2}\phi _{g}\). Note that the variance is larger than the mean for the negative binomial distribution. Noting that X _{ ig } ∼ Poisson(μ _{ ig }) in [19].
where d _{ kg } are gene and classspecific parameters that allow for differential expression among the K classes, and y _{ i }=k∈{1,…,K} represents the label of sample i. We also follow the independence assumption in PLDA [19] that all genes are independent of each other. Note that the independence assumption is frequently assumed in microarray data analysis. In realworld applications, the gene expression profile of an individual can be used to test whether this individual has a disease and/or a specific type of disease, which is essentially a classification problem.
where C is a constant independent of k. We then assign the new observation x ^{∗} to class k that maximizes the quantity (5). Throughout the paper, we estimate the prior probability π _{ k } by n _{ k }/n, where n _{ k } is the sample size in class k. For balanced data, the prior probability is simplified as π _{ k }=1/K for all k=1,…,K. For gene g, the total number of reads is \(\lambda _{g}=\sum _{i=1}^{n} X_{ig}\), and the class difference d _{ kg } can be estimated by \((\sum _{i\in C_{k}}X_{ig} +1)/(\sum _{i\in C_{k}}\hat {s}_{i}\hat {\lambda }_{g} +1)\), which is the same posterior mean for \(\hat {d}_{kg}\) in [19] assuming a Gamma prior distribution for this parameter. Estimation of the unknown parameters including s _{ i } and ϕ _{ g } will be discussed in Section “Parameter estimation”.
To explore the relationship between the proposed NBLDA and the PLDA, we assume that s ^{∗} λ _{ g } d _{ kg } are bounded. When ϕ _{ g }→0, we have log(1+s ^{∗} λ _{ g } d _{ kg } ϕ _{ g })→0 and \(\phi _{g}^{1} \log (1+s^{*}\lambda _{g}d_{kg}\phi _{g}) = \log (1+s^{*}\lambda _{g}d_{kg}\phi _{g})^{\phi _{g}^{1}} \to s^{*}\lambda _{g}d_{kg}\).
where the right hand of (6) is the discriminant score of PLDA. That is, the NBLDA classifier reduces to the PLDA classifier when there is little dispersion in the data. From this point of view, the proposed NBLDA can be treated as a generalized version of PLDA.
To investigate how the dispersion changes their discriminant scores, We conduct a numerical comparison between NBLDA and PLDA. Two cases are considered, where the first one assumes a common dispersion for all genes, and the second one assumes not. Note that the classifiers (5) and (6) have two same terms: logπ _{ k } and C. Without loss of generality, we compute the discriminant scores only using the first two terms in (5) and (6), respectively. In the comparison study, we fix \(X^{*}_{g}=10\), d _{ kg }=1.5, s ^{∗}=1, λ _{ g }=10 and G=500. For the case of common dispersion, we set the dispersion ranging from 0 to 20. For the case of different dispersions, we let ϕ _{ g } be independent and identically distributed (i.i.d.) random variables from a chisquared distribution with the degrees of freedom ranging from 0.1 to 5.
Parameter estimation
Note that the discriminant score in (5) involves two unknown parameters, size factor s ^{∗} and dispersion parameter ϕ _{ g }.
Size factor estimation
 Total count: PLDA divided the total read counts of sample i by the total read counts of all samples to estimate the size factor of sample i. That is,$$\hat{s}^{*}=\frac{\sum_{g=1}^{G} X^{*}_{g}}{\sum_{i=1}^{n}\sum_{g=1}^{G} X_{ig}}~~~\text{and}~~~\hat{s}_{i}=\frac{\sum_{g=1}^{G} X_{ig}}{\sum_{i=1}^{n}\sum_{g=1}^{G} X_{ig}}. $$
 DESeq2: Love et al. [11] first divided the read counts of sample i by the geometric mean of all samples’ read counts, and then estimated the size factor by computing the median of those G values. Specifically, the size factors are estimated by$$\begin{aligned} m^{*}&=\text{median}_{g}\frac{X_{g}^{*}}{\left(\prod_{i=1}^{n} X_{ig}\right)^{1/n}}~~~\text{and}\\&~~~m_{i}=\text{median}_{g}\frac{X_{ig}}{\left(\prod_{i=1}^{n} X_{ig}\right)^{1/n}}, \end{aligned} $$$$\hat{s}^{*}=m^{*}/\sum\limits^{n}_{i=1}m_{i}~~~\text{and}~~~\hat{s}_{i}=m_{i}/\sum\limits^{n}_{i=1}m_{i}. $$
 Upper quartile: Bullard et al. [21] proposed a robust method that uses the upper quartile of the read counts to estimate the size factors. Specifically, the size factors are estimated bywhere q ^{∗} and q _{ i } are the upper quartiles for the test data and sample i in the training data, respectively.$$\hat{s}^{*}=\frac{q^{*}}{\sum_{i=1}^{n} q_{i}}~~~\text{and}~~~\hat{s}_{i}=\frac{q_{i}}{\sum_{i=1}^{n} q_{i}}, $$
In our simulation studies, we find that there is little difference in the performance of classification among the three methods. Hence, for brevity, we only report the simulation results based on the total count method in the reminder of the paper.
Dispersion parameter estimation
\(\tilde {\phi }_{g}\) are the initial dispersion estimates obtained by the method of moments, and ξ is the target value calculated by minimizing the average squared difference between \(\tilde {\phi }_{g}\) and \(\hat {\phi }_{g}\). Throughout the paper, we use the estimator (7) to estimate the dispersion parameter.
Results

NBLDA,

PLDA,

Support vector machines (SVM),

Knearest neighbors (KNN).
For PLDA, we use the R package “PoiClaClu". For SVM, we use the R package “e1071" and choose the radial basis kernel in our simulation studies. For KNN, we choose k=1, 3 and 5.
Simulation design
We run 1,000 simulations, compute its mean, and then obtain the mean misclassification rate. It is worth noting that Witten [19] discussed the problem of overdispersion and proposed to transform the data to fit a Poisson model. In our experiment, we applied the data transformation proposed by Witten [19] when testing PLDA.
Simulation results
Figure 2 illustrates the effect of the proportion of differentially expressed genes on the mean misclassification rate. In general, with an increasing number of differentially expressed genes, both methods have decreased mean classification rates. NBLDA always outperforms the other three methods. In particular, when the sample size is small (n=8), NBLDA has a significant improvement over the other approaches.
Figure 3 shows the impact of the number of genes on the mean misclassification rate. We consider G=20, 30, 50, and 100 for this investigation. From Fig. 3, we observe that an increasing number of genes will lead to a lower misclassification rate. NBLDA shows its superiority over the other three methods, and the improvement is more significant when the sample size and the number of genes are smaller.
Figure 4 shows the effect of overdispersion on the mean misclassification rate. We consider ϕ=1, 5, 10, 20 and 30 for this investigation. Figure 4 shows that a larger dispersion will result in a higher mean misclassification rate. Both NBLDA and PLDA perform better than SVM and KNN. When the overdispersion is not very high, NBLDA and PLDA have similar performance, with NBLDA slightly better than PLDA. When the overdispersion is high, however, the performance of NBLDA is much better than PLDA.
Real data analysis

Cervical cancer data [24]. Two groups of samples are contained in this data set. One is the nontumor group which includes 29 samples, and the other one is the tumor group which includes 29 samples. There are 714 microRNAs in this data set. This data set is available in Gene Expression Omnibus (GEO) Datasets with access number GSE20592.

HapMap data [25, 26]. A total number of 52,580 probes are included in this data set, and this data set includes two classes, CEU and YRI, where the sample sizes are 60 and 69, respectively.
The Cervical cancer data was also used in [19]. It is worth mentioning that Witten [19] used four data sets to illustrate the performance of the proposed method. We found that for the other two data sets, i.e., Liver and kidney data and Yeast data, the mean misclassification rates of PLDA and NBLDA discussed in this paper are all zeros. The other data set, i.e., Transcription factor binding data, is the ChIPSeq data. Hence, we do not discuss these three data sets in this manuscript.
Gene selection
For real biomedical research in which RNASeq technology is used, it is common that thousands or tens of thousands of genes are measured simultaneously. We perform a gene selection procedure to screen the informative genes before applying a classification rule to RNASeq data. By doing gene selection, we rule out the noise as much as possible so that the variance of the discriminant score is reduced, and consequently we have an increased interpretability.
The BSS/WSS method [27] is a common gene selection method and has been widely used in the literature [28, 29, 30]. This method computes the ratio of the sum of squares between groups to the sum of squares within groups for each gene, and selects genes whose ratios are in the top. However, this method assumes the data to be normally distributed so that it may not be suitable for RNASeq data.
Witten [19] proposed a screening method to select genes for RNASeq data by using softthresholding to shrink the estimate of d _{ kg } towards 1. However, this method can not be applied to our discriminant analysis because the dispersion is involved in our discriminant rule. For the negative binomial distribution, edgeR [9, 10] has been proposed to detect differentially expressed genes in RNASeq data. This method first estimates the genewise dispersions by maximizing the combination of genespecific conditional likelihood and common conditional likelihood, and then replaces the hypergeometric distribution in Fisher’s exact test by the negative binomial distribution to construct an exact test. In this paper, we use edgeR (version 3.3) to perform the gene selection procedure, which is available in Bioconductor (www.bioconductor.org).
Real data analysis results
We first conduct the gene selection procedure using edgeR (version 3.3) and obtain G genes for further analysis. We then randomly split the sample into two sets: the training set and the test set. The training set is used to construct the classifier and the test set is used to compute the misclassification rate. We repeat the whole procedure 1,000 times and compute the mean misclassification rate for the four methods, NBLDA, PLDA, SVM, and KNN, respectively.
The medians of their dispersions for Cervical cancer data and HapMap data, where "G" represents the number of top genes selected by edgeR (version 3.3)
Data sets  G=20  G=50  G=100  G=500 

Cervical cancer  21.2  23.3  18.2  11.0 
HapMap  36.4  40.1  38.2  20.1 
Discussion
In this paper, we have proposed an NBLDA classifier using the negative binomial model. Our simulation results show that our proposed NBLDA has a better performance than PLDA in the presence of moderate or high dispersions. When there is little dispersion in the data, NBLDA is also comparable to PLDA. We have further explored the relationship between NBLDA and PLDA, and investigated the impact of dispersion on the discriminant score of NBLDA by conducting a numerical comparison. It is worth noting that even for a small dispersion, the two discriminant scores can be rather different. This suggests that for real RNASeq data with moderate or high dispersion, NBLDA may be a more appropriate method than PLDA. Note that the true dispersions are unlikely to be known in practice. Therefore, we propose to first estimate the average dispersion using some novel estimation methods in the recent literature. Second, if the estimated average dispersion is small, we use PLDA; and otherwise we use NBLDA.
We note that the independence assumption in both Witten’s method and our method is very restrictive. For real gene expression data sets, it may not be realistic to assume that all genes are independent of each other. In our future study, we would like to incorporate the network information of pathways or gene sets to further improve the performance of classification. The clustering of sequencing data is also an important issue in biomedical research. Hence, another possible future work is to extend the proposed clustering method [19] to follow the negative binomial model.
Conclusions
Next generation sequencing technology has been widely applied in biomedical research and RNASeq begins to replace the microarray technology gradually in recent years. Since RNASeq data are nonnegative integers, differing from that of microarray data, it is necessary to develop methods that are well suited for RNASeq data. Two discrete distributions, the Poisson distribution and negative binomial distribution, are commonly used in the literature to model RNASeq data. Compared to the Poisson distribution, the negative binomial distribution allows its variance to exceed its mean and is more suitable for the situations when biological replicates are available. Nevertheless, the negative binomial model is more complicated than the Poisson model as the additional dispersion parameter also needs to be estimated. In this paper, we have developed a new classifier using the negative binomial model for RNAseq data classification. Our simulation results show that our proposed classifier has a better performance than existing works. To conclude, our proposed classifier can serve as an effective tool for classifying RNAseq data.
Notes
Acknowledgements
The authors would like to thank the anonymous reviewers for their valuable comments and suggestions to improve the quality of the paper.
Funding
Hongyu Zhao’s research was supported by the National Institutes of Health grant R01 GM59507. Xiang Wan’s research was supported by the Hong Kong RGC grant HKBU12202114, the Hong Kong Baptist University grant FRG2/1415/077, and Hong Kong Baptist University Strategic Development Fund. Tiejun Tong’s research was supported in part by Hong Kong Baptist University FRG grants FRG1/1415/084, FRG2/1516/019 and FRG2/1516/038, and the National Natural Science Foundation of China grant (No. 11671338).
Availability of supporting data
The data sets supporting the results of this article are included within the article and the references.
Authors’ contributions
KD developed NBLDA for RNASeq data, conducted the simulation studies and real data analysis, and wrote the draft of the manuscript. HZ revised the manuscript. TT and XW provided the guidance on methodology and finalized the manuscript. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
References
 1.Mardis ER. Nextgeneration DNA sequencing methods. Annu Rev Genomics Hum Genet. 2008; 9:387–402.CrossRefPubMedGoogle Scholar
 2.Wang Z, Gerstein M, Snyder M. RNASeq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009; 10:57–63.CrossRefPubMedPubMedCentralGoogle Scholar
 3.Morozova O, Hirst M, Marra MA. Applications of new sequencing technologies for transcriptome analysis. Annu Rev Genomics Hum Genet. 2009; 10:135–51.CrossRefPubMedGoogle Scholar
 4.Lorenz DJ, Gill RS, Mitra R, Datta S. Using RNAseq data to detect differentially expressed genes. In: Statistical Analysis of Next Generation Sequencing Data. New York: Springer: 2014. p. 25–49.Google Scholar
 5.Law CW, Chen Y, Shi W, Smyth GK. Voom: precision weights unlock linear model analysis tools for rnaseq read counts. Genome Biol. 2014; 15(2):29.CrossRefGoogle Scholar
 6.Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNAseq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008; 18:1509–1517.CrossRefPubMedPubMedCentralGoogle Scholar
 7.Oshlack A, Robinson MD, Young MD, et al.From rnaseq reads to differential expression results. Genome biol. 2010; 11(12):220.CrossRefPubMedPubMedCentralGoogle Scholar
 8.McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor rnaseq experiments with respect to biological variation. Nucleic Acids Res. 2012; 40(10):4288–97.CrossRefPubMedPubMedCentralGoogle Scholar
 9.Robinson MD, Smyth GK. Smallsample estimation of negative binomial dispersion with applications to SAGE data. Biostatistics. 2008; 9:321–32.CrossRefPubMedGoogle Scholar
 10.Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26:139–40.CrossRefPubMedGoogle Scholar
 11.Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for rnaseq data with deseq2. Genome Biol. 2014; 15(12):1–21.CrossRefGoogle Scholar
 12.Hardcastle TJ, Kelly KA. baySeq: Empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinforma. 2010; 11:422.CrossRefGoogle Scholar
 13.Zhou Y, Xia K, Wright FA. A powerful and flexible approach to the analysis of RNA sequence count data. Bioinformatics. 2011; 27:2672–678.CrossRefPubMedPubMedCentralGoogle Scholar
 14.Li J, Tibshirani R. Finding consistent patterns: A nonparametric approach for identifying differential expression in RNASeq data. Stat Methods Med Res. 2013; 22:519–36.CrossRefPubMedGoogle Scholar
 15.Wu H, Wang C, Wu Z. A new shrinkage estimator for dispersion improves differential expression detection in RNASeq data. Biostatistics. 2013; 14:232–43.CrossRefPubMedGoogle Scholar
 16.Si Y, Liu P. An optimal test with maximum average power while controlling FDR with application to RNASeq data. Biometrics. 2013; 69:594–605.CrossRefPubMedGoogle Scholar
 17.Yu D, Huber W, Vitek O. Shrinkage estimation of dispersion in Negative Binomial models for RNASeq experiments with small sample size. Bioinformatics. 2013; 29:1275–1282.CrossRefPubMedPubMedCentralGoogle Scholar
 18.Lin B, Zhang L, Chen X. LFCseq: a nonparametric approach for differential expression analysis of RNAseq data. BMC Genomics. 2014; 15(Suppl 10):7.CrossRefGoogle Scholar
 19.Witten DM. Classification and clustering of sequencing data using a Poisson model. Annals Appl Stat. 2011; 5:2493–518.CrossRefGoogle Scholar
 20.Tan KM, Petersen A, Witten D. Classification of RNAseq data. In: Statistical Analysis of Next Generation Sequencing Data. New York: Springer: 2014. p. 219–46.Google Scholar
 21.Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNASeq experiments. BMC Bioinforma. 2010; 11:94.CrossRefGoogle Scholar
 22.Dillies MA, Rau A, Aubert J, HennequetAntier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J, et al. A comprehensive evaluation of normalization methods for Illumina highthroughput RNA sequencing data analysis. Brief Bioinform. 2013; 14:671–83.CrossRefPubMedGoogle Scholar
 23.Landau WM, Liu P. Dispersion estimation and its effect on test performance in RNASeq data analysis: A simulationbased comparison of methods. PLOS ONE. 2013; 8:81415.CrossRefGoogle Scholar
 24.Witten D, Tibshirani R, Gu SG, Fire A, Lui W. Ultrahigh throughput sequencingbased small rna discovery and discrete statistical biomarker analysis in a collection of cervical tumours and matched controls. BMC Biol. 2010; 8:58.CrossRefPubMedPubMedCentralGoogle Scholar
 25.Montgomery SB, Sammeth M, GutierrezArcelus M, Lach RP, Ingle C, Nisbett J, Guigo R, Dermitzakis ET. Transcriptome genetics using second generation sequencing in a Caucasian population. Nature. 2010; 464:773–7.CrossRefPubMedGoogle Scholar
 26.Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y, Pritchard JK. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010; 464:768–72.CrossRefPubMedPubMedCentralGoogle Scholar
 27.Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc. 2002; 97:77–87.CrossRefGoogle Scholar
 28.Lee JW, Lee JB, Park M, Song SH. An extensive comparison of recent classification tools applied to microarray data. Comput Stat Data Anal. 2005; 48:869–85.CrossRefGoogle Scholar
 29.Pang H, Tong T, Zhao H. Shrinkagebased diagonal discriminant analysis and its applications in highdimensional data. Biometrics. 2009; 65:1021–1029.CrossRefPubMedPubMedCentralGoogle Scholar
 30.Huang S, Tong T, Zhao H. Biascorrected diagonal discriminant rules for highdimensional classification. Biometrics. 2010; 66:1096–1106.CrossRefPubMedPubMedCentralGoogle Scholar
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.