Stratified Test Alleviates Batch Effects in Single-Cell Data

Liang, Shaoheng; Liang, Qingnan; Chen, Rui; Chen, Ken

doi:10.1007/978-3-030-42266-0_13

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 12099))

Included in the following conference series:

International Conference on Algorithms for Computational Biology

1703 Accesses
1 Citations

Abstract

Analyzing single-cell sequencing data across batches is challenging. We find that the Van Elteren test, a stratified version of Wilcoxon rank-sum test, elegantly mitigates the problem. We also modified the common language effect size to supplement this test, further improving its utility. On both simulated and real patient data we show the ability of Van Elteren test to control for false positives and false negatives. The effect size also estimates the differences between cell types more accurately.

Supported by grant number 2018-182735 to KC, Human Breast Cell Atlas Seed Network Grant (HCA3-0000000147) to NN and KC from the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation, grant RP180248 to KC from Cancer Prevention & Research Institute of Texas, and the Cancer Center Support Grant P30 CA016672 to PP from the National Cancer Institute.

You have full access to this open access chapter, Download conference paper PDF

Benchmarking integration of single-cell differential expression

Article Open access 21 March 2023

Hai C. T. Nguyen, Bukyung Baik, … Dougu Nam

A practical solution to pseudoreplication bias in single-cell studies

Article Open access 02 February 2021

Kip D. Zimmerman, Mark A. Espeland & Carl D. Langefeld

SCeQTL: an R package for identifying eQTL from single-cell parallel sequencing data

Article Open access 11 May 2020

Yue Hu, Xi Xi, … Xuegong Zhang

Keywords

1 Introduction

Large-scale studies such as the Human Cell Atlas [15] involve hundreds of laboratories, thousands of patients, and millions of cells, bringing about opportunities and challenges in analyses. When comparing cell types or groups, discrepancies across experiments and differences among participants lead to omissions and false discoveries in differentially expressed genes. Even the trend (upregulated or downregulated) can be reversed in a phenomenon called Simpson’s paradox [1]. Although multiple methods have been proposed to tackle such batch effects, no such option for the widely used Wilcoxon rank-sum test [11, 18] has been applied to single-cell studies, to the best of our knowledge. Here, we show that the stratified rank-sum test (known as Van Elteren test [17]) and our modified common language effect size may benefit single-cell studies.

We briefly review and conceptually compare related works on correcting batch effect in Sect. 1.1. Then, in Sect. 2, we revisit Wilcoxon rank-sum test, and introduce the Van Elteren test supplemented by our direct extension of the common language effect size [5, 12]. Simulation studies and applications to real data in Sect. 3 show that the test controls for the batch effects and leads to more accurate biological discovers, compared with Wilcoxon rank-sum test. More discussions and explanations, are shown in Sect. 4.

1.1 Related Works

Mainstream methods to mitigate batch effect fall into two categories, batch correction methods and batch-aware statistical tests. The former includes methods reducing batch effect in the data to facilitate downstream analysis, while the latter includes analyses that control for the batch effect.

Batch Correction Methods. Batch correction methods eliminate the discrepancy among batches to create an integrated dataset. The most conspicuous manifestation of batch effect is splitting one cell type into multiple clusters. To solve this problem, many methods match and combine clusters across samples based on similarities. A commonly adopted one, Mutual nearest neighbor (MNN) [3], uses similar cells across datasets as anchors, and based on them correct the gene expression of other cells. Scanorama [4] and Seurat [16] are both based on the MNN methodology. Another method, Harmony [7], iteratively corrects the data by clustering the cells and moving neighboring clusters toward each other. These methods typically produce a unified data matrix, which can be conveniently used in visualization and downstream analysis. However, these empirical corrections usually lack negative control and raise uncertainty in the discovery [14].

Batch-Aware Statistical Tests. Instead of manipulating the data directly, statistical methods may handle batch effect by considering it as a covariate in the model. This is possible in Student’s t-test, Poisson test, negative binomial test, etc. Notably, all these tests are parametric, meaning that a distribution must be given in advance. However, the debate of the true distribution of single-cell gene expression has never ceased, which is a reason why the nonparametric Wilcoxon rank-sum test is widely used. To allow modeling covariates, one may use a generalized version of rank-sum test, the proportional odds model [2]. However, modeling batches by using a covariate also makes unnecessary assumptions upon them. Stratification, which only combines statistics from batches, is the “as simple as possible, but no simpler” way to handle batches. The Van Elteren test we use, is the stratified version of Wilcoxon rank-sum test.

It is worth noting that methods like scVI [9] have combined statistical modeling with batch effect correction. However, the effect of batches is modeled by a black-box neural network, making the interpretation elusive.

2 Methods

2.1 Wilcoxon Rank-Sum Test

We briefly revisit the Wilcoxon rank-sum test (also known as Mann–Whitney U test) [11, 18]. The test statistics U is defined as

$$\begin{aligned} U=\sum _{i=1}^{n_1}\sum _{j=1}^{n_2}\left( \mathbbm {1}_{a_i > b_j} + \frac{1}{2}\mathbbm {1}_{a_i = b_j}\right) , \end{aligned}$$

(1)

where $A=\{a_i\}_{i=1}^{n_1}$ and $B=\{b_j\}_{j=1}^{n_2}$ are the two samples to be compared (e.g., two cell types in one experiment), with sample sizes $n_1$ and $n_2$, respectively. Function “$\mathbbm {1}$” takes value 1 when its condition holds true, and 0 otherwise. When $n_1$ and $n_2$ are both at least 10, which is common in single-cell studies, the distribution of U approximately follows a normal distribution $\mathcal {N}(\mu , \sigma ^2)$ where

$$\begin{aligned} \mu = \frac{n_1n_2}{2} \end{aligned}$$

(2)

and

$$\begin{aligned} \sigma ^2 = \frac{n_1n_2}{12}\left( (n+1)-\sum _{i=1}^k \frac{t_i^3-t_i}{n(n-1)}\right) , \end{aligned}$$

(3)

in which $t_i$ stands for ties (corresponding to the second term in Eq. 1).

2.2 Van Elteren Test

The Van Elteren test [17] is the stratified version of Wilcoxon rank-sum test. For example, if there are m patients, they maybe treated as strata. In that case, a U statistic may be obtained from each patient $g\in \{1,\cdots ,m\}$, denoted as $U_g\sim \mathcal {N}(\mu _g, \sigma _g^2)$. A new statistics V is constructed by

$$\begin{aligned} V = \frac{\left[ \sum _{g=1}^m w_g(U_g - \mu _g)\right] ^2}{\sum _{g=1}^m w_g^2\sigma _g^2} \sim \chi _1^2, \end{aligned}$$

(4)

where $w_g$ is a weight for each sample to be discussed later. When $m = 1$, the formula degenerates to $V = {(U_g - \mu _g)^2}/{\sigma _g^2} \sim \chi _1^2$, which is consistent with the rank-sum test.

Weights. As discussed by Van Elteren [17], the weights $w_g$ can be assigned in different ways. It should be noted that the $U_g$ for a batch g ranges from 0 to $n_{g1}n_{g2}$, the product of two sample sizes in the batch. Should the weights all be equal, a patient with more cells available will dominate the test results. It is proven in [17] that weight

$$\begin{aligned} w_g=\frac{1}{n_{g1}n_{g2}} \end{aligned}$$

(5)

eliminates such effect, and a test utilizing such weight is thus named as “design-free test”. However, given that a batch with more instances available (e.g., a patient with more cells sequenced) may be more convincing, another weight

$$\begin{aligned} w_g=\frac{1}{n_{g1} +n_{g2} + 1} \end{aligned}$$

(6)

is introduced, which gives more power to larger batches. It also effectively assigns larger weights to batches whose samples are more balanced, when the batch sizes are the same. It is shown in [17] that this choice yields largest statistical power against randomized alternatives, and is thus named as “locally-best test”. The comparison of two weights are shown in Fig. 1.

Effect Sizes. For Wilcoxon rank-sum test, a simple definition of effect size is

$$\begin{aligned} f = \frac{U}{n_1n_2}, \end{aligned}$$

(7)

which is centered at $50\%$, meaning the probability $P(a>b)$ when a and b are randomly drawn from sample A and sample B, respectively. An effect size greater than $50\%$ generally means that A is higher, and vice versa. It may be easily extended for Van Elteren test by taking average using desired weights. For the design-free test, the effect size is

$$\begin{aligned} f = \frac{1}{m}\sum _{g=1}^m\frac{U_g}{n_{g1}n_{g2}}, \end{aligned}$$

(8)

as all batches are treated equally regardless of the sample sizes. It may be interpreted as the probability of $P(a_g>b_g)$ for $a_g$ and $b_g$ randomly drawn from $A_g$ and $B_g$, after randomly choosing a batch g. For the locally-best test, the effect size becomes

$$\begin{aligned} f= & {} \frac{1}{\sum _{g=1}^m\frac{n_{g1} n_{g2}}{n_{g1} + n_{g2} + 1}} \sum _{g=1}^m\frac{n_{g1} n_{g2}}{n_{g1} + n_{g2} + 1}\frac{U_g}{n_{g1} n_{g2}}\end{aligned}$$

(9)

$$\begin{aligned}= & {} \frac{\sum _{g=1}^m \frac{U_g}{n_{g1} + n_{g2} + 1}}{\sum _{g=1}^m \frac{n_{g1} n_{g2}}{n_{g1} + n_{g2} + 1}}, \end{aligned}$$

(10)

which changes the probability of choosing a group g to be in proportion to $\frac{n_{g1}n_{g2}}{n_{g1} +n_{g2} + 1}$, giving higher weights to batches with higher and balanced sample sizes (Fig. 1). Generally, any $w_g$ may be used to define f, as in

$$\begin{aligned} f = \frac{\sum _{g=1}^m U_g w_g}{\sum _{g=1}^m n_{g1}n_{g2}w_g}, \end{aligned}$$

(11)

the two previous options being its special cases.

3 Results

We implemented the Van Elteren test with the effect size in R, available at our GitHub repository (https://github.com/KChen-lab/stratified-tests-for-seurat), based on Seurat 3.0 by utilizing its differential expression analysis part (but irrelevant to the data integration) [16]. When there are two groups of cells, A and B, denoted as type, and patient identity, denoted as batch, the Van Elteren Test can be called as follows.

The genre may be set to either locally-best or design-free, as introduced in Sect. 2, based on which p-values and effect sizes are calculated. Typical results are shown in Table 2. An effect size of larger than 0.5 indicates a higher expression in cell type A, and vice versa. The avg_logFC, average logarithmic fold changes, are calculated automatically by Seurat, where a positive value indicates a higher expression. It may show different trends compared with the effect sizes. Generally, the effect sizes are more indicative after controlling for the batch effect.

3.1 Simulation Study

We simulated datasets to illustrate the key utilities of Van Elteren test. The parameters are specified in Table 1. Poisson distribution is used to model sequencing depth. Visualization is available in Fig. 2 for illustration. We assume that the library size of each sample is equalized by other genes beyond the simulated ones. The testing result of Van Elteren test and Wilcoxon rank-sum test is shown in Table 2. Trend (A over B) are indicated by arrows and insignificant p-values are grayed out. For Van Elteren test, the locally-best version and the design-free version return very similar results.

Table 1. Simulated datasets

Full size table

Suppressing False Negatives. Batch effect may introduce false negatives, where a significantly differentially expressed gene is overshadowed. For gene 1, on which cell type B always have higher expression on both patients, the Wilcoxon rank-sum test did not pass the threshold of 0.05 to reject the null hypothesis, while Van Elteren test yields a significant p-value. The effect size, smaller than 0.5, also correctly suggests that the expression of gene 1 in cell type B is higher than that in cell type A, compared with the average logarithmic fold change, which wrongfully indicates otherwise.

Suppressing Reversed Conclusions. Batch effect may also lead to reversed conclusion (i.e., which cell type has higher expression). For gene 2, on which cell type A always have higher expression value, both tests reject the null hypothesis. However, the effect size of Van Elteren test, larger than 0.5, correctly identifies that the expression of gene 2 in cell type A is higher than that in cell type B, while the average logarithmic fold change wrongfully indicates otherwise.

Suppressing False Positives. False discoveries are also possible outcome of batch effect. As is shown in gene 3, the distribution of both cell types are exactly the same in each patient. Nevertheless, Wilcoxon test yields a very significant p-value. The average logarithmic fold change also has a large magnitude. Van Elteren test returns a p-value of 0.7831, together with a effect size close to 0.5, suggesting that the difference is neither significant nor large.

Consistency. As a negative control, when the three issues above are not present, p-value from Van Elteren test is consistent with Wilcoxon rank-sum test, as is shown by gene 4. The effect size and the log fold change also both show that the cell type B has higher expression in gene 4.

3.2 Retina Data

We have tested the Van Elteren test on real retina single-cell gene expression data gathered from three patients [8]. Two regions, macula (i.e., the center area) and peripheral, are labeled in the data. We question which genes differentially express for the same cell type between two regions. We run Wilcoxon rank-sum test and Van Elteren test on 2,295 rod cells and 203 cone cells. We compare the results in Fig. 3, where genes with large differences in p-values between two tests are labeled with gene names.

Table 2. Results of the two tests in simulation studies

Full size table

Rod Cells. The results of two tests are largely comparable, showing a diagonal pattern. Meanwhile, some exceptions are present (see Table 3 for the p-values and effect sizes), among which we observed that p-values for gene GNGT1 and SYNE2 change the most.

For GNGT1, the reversed conclusion effect is also observed, as the Van Elteren f effect size suggests that the peripheral region has a higher expression, while the log fold change indicates otherwise. We further inspected the distributions to validate and interpret the differences. In Fig. 4A, the left panel does show generally higher expression of GNGT1 in each individual patient, while the aggregated distribution on the right shows a reversed effect, which is an instance of the aforementioned Simpson’s paradox. For SYNE2 (Fig. 4B), conspicuous discrepancy among batches is also shown, which leads to a less precise rank-sum test result. Indeed, these two genes were found playing roles in macular degeneration diseases [6, 10].

Cone Cells. For cone cells, we observed similar results. Some genes show changes while most genes are consistent across the tests. The p-values and effect sizes are shown in Table 4.

Table 3. Gene with large p-value change in rod cells

Full size table

Table 4. Gene with large p-value change in cone cells

Full size table

For gene PCBP4, Van Elteren test shows more significant p-value, and an effect size indicating smaller expression in macula, which is different from the log fold change. Decrease in PCBP4 has also been linked with age-related macular degeneration [13]. Figure 5 shows that batch effect in distribution of PCBP4 may have misled the rank-sum test and the logarithmic fold change.

4 Discussion

The results have clearly shown that Van Elteren test benefits biological studies in precisely identifying differentially expressed genes. Although the results we show do not include multiple comparison correction, Seurat 3.0 will automatically give corrected p-value based on the raw p-value using Bonferroni correction. Generally, any correction based on p-values will also apply.

The result also indicates that stratified test is a neat way to handle batch effect. Although covariate has the ability to control for explanatory variables, it is generally more suitable for continuous variables. It also casts more assumptions when modeling covariate. Stratified test, on the other hand, does not infer the influence of the discrete batches. Rather, it directly aggregates the statistical power of multiple samples.

Admittedly, for the rod cells, although changes in p-values are observed, the significance threshold was well passed by both. However, it should be noted that the retina data are collected from relatively healthy tissues and are considered clean, while Van Elteren test is expected to make a more meaningful difference on noisy pathological and tumor data. In addition, rod cell is the most populous cell in retina. For rare cell types that take smaller proportions, like the cone cells, the difference Van Elteren test makes can be crucial.

The caveat of stratified test is that for it to work the strata shall not overlap with the variable of interest. For instance, it may not find the difference, meanwhile also control for the batch effect, between two patients. Nonetheless, neither is covariate applicable to such cases. As the batch effect and biological effect are convolved, more prior knowledge is generally needed to distinguish them.

5 Summary

We have adopted Van Elteren test, an underappreciated statistical test, and our weighted common language effect size to single-cell sequencing data. When batch effect is severe, the test control for false positives. Otherwise, it is consistent with Wilcoxon rank-sum test. The modified common language effect size also faithfully depicts the trends. This work may increase the precision of differential expression analysis to help identify genes of interests.

References

Blyth, C.R.: On Simpson’s paradox and the sure-thing principle. J. Am. Stat. Assoc. 67(338), 364–366 (1972)
Article MathSciNet MATH Google Scholar
Everitt, B., Skrondal, A.: The Cambridge Dictionary of Statistics. BusinessPro collection, 4th edn. Cambridge University Press, Cambridge (2010)
Book MATH Google Scholar
Haghverdi, L., Lun, A.T., Morgan, M.D., Marioni, J.C.: Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36(5), 421 (2018)
Article Google Scholar
Hie, B., Bryson, B., Berger, B.: Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat. Biotechnol. 37(6), 685 (2019)
Article Google Scholar
Kerby, D.S.: The simple difference formula: an approach to teaching nonparametric correlation. Compr. Psychol. 3, 1–10 (2014). 11-T
Article Google Scholar
Kolesnikov, A.V., et al.: G-protein $\beta \gamma $-complex is crucial for efficient signal amplification in vision. J. Neurosci. 31(22), 8067–8077 (2011)
Article Google Scholar
Korsunsky, I., et al.: Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1–8 (2019)
Article Google Scholar
Liang, Q., et al.: Single-nuclei RNA-seq on human retinal tissue provides improved transcriptome profiling. Nat. Commun. 10(1), 1–12 (2019)
Article Google Scholar
Lopez, R., Regier, J., Cole, M.B., Jordan, M.I., Yosef, N.: Deep generative modeling for single-cell transcriptomics. Nat. Methods 15(12), 1053 (2018)
Article Google Scholar
Maddox, D.M., et al.: A mutation in Syne2 causes early retinal defects in photoreceptors, secondary neurons, and Müller glia. Invest. Ophthalmol. Vis. Sci. 56(6), 3776–3787 (2015)
Article Google Scholar
Mann, H.B., Whitney, D.R.: On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 18, 50–60 (1947)
Article MathSciNet MATH Google Scholar
McGraw, K.O., Wong, S.: A common language effect size statistic. Psychol. Bull. 111(2), 361 (1992)
Article Google Scholar
Meyer, J.G., Garcia, T.Y., Schilling, B., Gibson, B.W., Lamba, D.A.: Proteome and secretome dynamics of human retinal pigment epithelium in response to reactive oxygen species. Sci. Rep. 9(1), 1–12 (2019)
Article Google Scholar
Nygaard, V., Rødland, E.A., Hovig, E.: Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses. Biostatistics 17(1), 29–39 (2016)
Article MathSciNet Google Scholar
Regev, A., et al.: Science forum: the human cell atlas. Elife 6, e27041 (2017)
Article Google Scholar
Stuart, T., et al.: Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019)
Article Google Scholar
Van Elteren, P.: On the combination of independent two-sample tests of Wilcoxon. Bull. Inst. Int. Staist. 37, 351–361 (1960)
MathSciNet MATH Google Scholar
Wilcoxon, F.: Individual comparisons by ranking methods. In: Kotz, S., Johnson, N.L. (eds.) Breakthroughs in Statistics. Springer Series in Statistics (Perspectives in Statistics), pp. 196–202. Springer, New York (1992). https://doi.org/10.1007/978-1-4612-4380-9_16
Chapter Google Scholar

Download references

Author information

Authors and Affiliations

Rice University, Houston, TX, 77005, USA
Shaoheng Liang
The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
Shaoheng Liang & Ken Chen
Baylor College of Medicine, Houston, TX, 77030, USA
Qingnan Liang & Rui Chen

Authors

Shaoheng Liang
View author publications
You can also search for this author in PubMed Google Scholar
Qingnan Liang
View author publications
You can also search for this author in PubMed Google Scholar
Rui Chen
View author publications
You can also search for this author in PubMed Google Scholar
Ken Chen
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ken Chen .

Editor information

Editors and Affiliations

Rovira i Virgili University, Tarragona, Spain
Carlos Martín-Vide
University of Extremadura, Cáceres, Spain
Miguel A. Vega-Rodríguez
University of Montana, Missoula, MT, USA
Travis Wheeler

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Liang, S., Liang, Q., Chen, R., Chen, K. (2020). Stratified Test Alleviates Batch Effects in Single-Cell Data. In: Martín-Vide, C., Vega-Rodríguez, M., Wheeler, T. (eds) Algorithms for Computational Biology. AlCoB 2020. Lecture Notes in Computer Science(), vol 12099. Springer, Cham. https://doi.org/10.1007/978-3-030-42266-0_13

Download citation

DOI: https://doi.org/10.1007/978-3-030-42266-0_13
Published: 03 April 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-42265-3
Online ISBN: 978-3-030-42266-0
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Stratified Test Alleviates Batch Effects in Single-Cell Data

Abstract

Similar content being viewed by others

Benchmarking integration of single-cell differential expression

A practical solution to pseudoreplication bias in single-cell studies

SCeQTL: an R package for identifying eQTL from single-cell parallel sequencing data

Keywords