
1 Introduction

Single-cell RNA sequencing (scRNA-seq) has revolutionized the study of gene expression programs [16, 27]. The ability to profile gene expression at the single-cell level has revealed novel specific interactions and pathways within cells [43], differences in the proportions of cell types between samples [17, 44], and the identity and characterization of new cell types [39]. Many biological tissues, systems, and processes have recently been studied using this technology [17, 43, 44].

While studies using scRNA-seq provide many insights, they also raise new computational challenges. One of the major challenges is the ability to integrate and compare results from multiple scRNA-seq studies. There are several commercial platforms for performing such experiments, each with its own biases. Furthermore, like other high throughput genomic assays, scRNA-seq suffers from batch effects, which can make cells profiled in one lab look very different from the same cells profiled in another lab [37, 38]. Moreover, other types of high throughput transcriptomics profiling, including microscopy-based techniques, also generate single cell expression datasets [8, 40]. The goal of fully utilizing these spatial datasets motivates the development of methods that can combine them with scRNA-seq when studying specific biological tissues and processes.

A number of recent methods attempt to address this challenge by aligning scRNA-seq data from multiple studies of the same biological system. Many of these methods rely on identifying nearest neighbors between the different datasets and using them as anchors; methods that take this approach include Mutual Nearest Neighbors (MNN) [13] and Seurat [36]. Others, including scVI and scAlign, first embed all datasets into a common lower-dimensional space: scVI encodes the scRNA-seq data with a deep generative model that conditions on the batch identifiers [24], while scAlign regularizes the representation between two datasets by minimizing the difference in random walk probabilities between the original and embedding spaces. While these methods are successful for some datasets, here we show that they are not always able to correctly match all cell types. A key problem is that these methods are unsupervised and rely on the assumption that the cell types profiled by the different studies overlap. This works for some datasets, but it may fail for studies in which cells do not fully overlap or for those containing rare cell types. Unsupervised methods tend to group rare types with larger ones, making them hard to identify in a joint space.

Recent machine learning work has focused on a related problem termed “domain adaptation/generalization”. Methods developed for this problem attempt to learn representations of diverse data that are invariant to technical confounders [5, 25, 42]. These methods have been applied in settings such as machine translation for domain-specific corpora [4] and face detection [28]. Several methods proposed for domain adaptation rely on adversarial training [5, 10, 21, 41], which has proven effective at aligning latent distributions. In addition to the original task (e.g., classification), these methods apply a domain classifier to the learned representations. The encoder network is trained both to support accurate label classification and to reduce the impact of the domain (by “fooling” the domain classifier). This is achieved by learning encoder weights that simultaneously perform gradient descent on the label classification task and gradient ascent on the domain classification task.

Here we extend these approaches, coupling them with Siamese network learning [20], to overcome batch effects in scRNA-seq analysis. We define a “domain” in this paper as a standalone dataset profiled at a single lab using a single platform, and a “label” as the cell type of each cell in a dataset. Because different scRNA-seq studies profile specific, partially overlapping sets of cell types, we propose a conditional pair sampling strategy that constrains input pair selection when training the adversarial network. We discuss how to formulate a domain adaptation network for scRNA-seq data, how to learn the parameters of the network, and how to train it using available data.

We tested our method on several datasets ranging in size from 10 to 39 cell types and from 4 to 155 batches. As we show, for all of the datasets our domain adversarial method improves on previous methods, in some cases significantly. Visualization of the learned representation from several different methods helps highlight the advantages of the domain adversarial framework. As we show, the framework is able to accurately mitigate the batch effects while maintaining the grouping of cells from the same type across different batches. Biological analysis of the resulting model identifies key genes that can correctly distinguish between cell types across different experiments. Such batch invariant genes are promising candidates for a cell type specific signature that can be used across different studies to annotate cells.

Fig. 1.

Architecture of scDGN. The network includes three modules: scRNA encoder \(f_e\) (blue), label classifier \(f_l\) (orange) and domain discriminator \(f_d\) (red). Note that the red and orange networks use the same encoding as input. Solid lines represent the forward direction of the neural network while the dashed lines represent the backpropagation direction with the corresponding gradient it passes. Gradient Reversal Layers (GRL) have no effect in forward propagation, but flip the sign of the gradients that flow through them during backpropagation. This allows the combined network to simultaneously optimize label classification and attempt to “fool” the domain discriminator. Thus, the encoder leads to representations that are invariant to the different domains while still distinguishing cell types.

2 Methods

2.1 Problem Formulation

To formulate the problem we first define some notation. We assume that the single cell RNA-seq data are drawn from the input space \(\mathbf {X} \in \mathbb {R}^p\), where each sample (a cell) \(\mathbf {x}\) has p features corresponding to gene expression values. Each cell is also associated with a label \(y \in \mathbf {Y} = \{1, 2, ..., K\}\) representing its cell type, and with a specific domain/batch \(d \in \mathcal {D}\), i.e., a standalone dataset profiled at a single lab using a single platform. Note that we use domain and batch interchangeably in this paper for convenience. The data are divided into a training set and a test set drawn from multiple studies. The domains used to collect training data are not used for the test set, so batch effects can vary between the training and test data. In practice, each domain contains only a small subset of the cell types, which means that the distribution of cell types is correlated with the distribution of domains. Thus, methods that naively learn cell types based on expression profiles [3, 18, 22] may instead fit domain information and fail to generalize to unobserved studies.

2.2 Domain Adversarial Training with Siamese Network

To overcome this problem and remove the domain impact when learning a cell type representation, we propose a neural network framework with three modules, as shown in Fig. 1: an scRNA encoder, a label classifier, and a domain discriminator. The encoder module \(f_e(\mathbf {x};\mathbf {\theta }_e)\) reduces the dimensionality of the data and consists of fully connected layers that produce the hidden features, where \(\mathbf {\theta }_e\) denotes the parameters of these layers. The label classifier \(f_l(f_e;\mathbf {\theta }_l)\) attempts to predict the label of input \(\mathbf {x_1}\), whereas the goal of the domain discriminator \(f_d(f_e;\mathbf {\theta }_d)\) is to determine whether a pair of inputs \(\mathbf {x_1}\) and \(\mathbf {x_2}\) are from the same domain. Past work on classifying scRNA-seq data only minimized the loss function of the label classifier, \(\mathcal {L}_l(f_l(f_e;\mathbf {\theta }_l))\) [3, 23]. Here, we extend these methods by adding a regularization term based on the adversarial loss of the domain discriminator, \(\mathcal {L}_d(f_d(f_e;\mathbf {\theta }_d))\), which we elaborate on below. The overall loss E on a pair of samples \(\mathbf {x_1}\) and \(\mathbf {x_2}\) is:

$$\begin{aligned} E(\mathbf {\theta }_e, \mathbf {\theta }_l, \mathbf {\theta }_d) = \mathcal {L}_l\big (f_l(f_e(\mathbf {x_1};\mathbf {\theta }_e);\mathbf {\theta }_l)\big ) - \lambda \mathcal {L}_d\big (f_d(f_e(\mathbf {x_1};\mathbf {\theta }_e);\mathbf {\theta }_d),\, f_d(f_e(\mathbf {x_2};\mathbf {\theta }_e);\mathbf {\theta }_d)\big ), \end{aligned}$$

where \(\lambda \) controls the trade-off between the goals of domain invariance and higher classification accuracy. For convenience, we use \(\mathbf {z_1}\) and \(\mathbf {z_2}\) to denote the hidden representations of \(\mathbf {x_1}\) and \(\mathbf {x_2}\) computed by \(f_e(\mathbf {x};\mathbf {\theta }_e)\). Inspired by Siamese networks [20], we implement our domain discriminator using a contrastive loss [12]:

$$\begin{aligned} \mathcal {L}_d\big (f_d(\mathbf {z_1};\mathbf {\theta }_d),\, f_d(\mathbf {z_2};\mathbf {\theta }_d)\big )&= U\frac{1}{2}D\big (f_d(\mathbf {z_1}), f_d(\mathbf {z_2})\big )^2 \\&+\,(1-U)\frac{1}{2}\big (\max \{0, m-D(f_d(\mathbf {z_1}), f_d(\mathbf {z_2}))\}\big )^2, \end{aligned}$$
Fig. 2.

Conditional domain generalization strategy: Shapes represent different labels and colors (or patterns) represent different domains. For negative pairs from different domains, we only select those samples with the same label. For positive pairs from the same domain, we only select the samples with different labels.

where \(U=1\) indicates that the two samples are from the same domain and \(U=0\) indicates that they are not, \(D(\cdot )\) is the Euclidean distance, and m is the margin that defines the prediction boundary. The domain discriminator parameters \(\theta _d\) are updated using backpropagation to maximize the total loss E, while the encoder and classifier parameters \(\theta _e\) and \(\theta _l\) are updated to minimize E. To allow all three modules to be updated together end-to-end, we use a Gradient Reversal Layer (GRL; Fig. 1) [10, 29]. A GRL has no effect in forward propagation but flips the sign of the gradients that flow through it during backpropagation. The overall optimization problems solved for the network parameters are:

$$\begin{aligned} (\widehat{\mathbf {\theta }_e},\widehat{\mathbf {\theta }_l})&= \mathop {\mathrm {arg}\,\mathrm {min}}\limits _{\mathbf {\theta }_e, \mathbf {\theta }_l} E(\mathbf {\theta }_e, \mathbf {\theta }_l, \widehat{\mathbf {\theta }_d})\\ (\widehat{\mathbf {\theta }_d})&= \mathop {\mathrm {arg}\,\mathrm {max}}\limits _{\mathbf {\theta }_d} E(\widehat{\mathbf {\theta }_e},\widehat{\mathbf {\theta }_l}, \mathbf {\theta }_d) \end{aligned}$$

In other words, the goal of the domain discriminator is to tell whether two samples are drawn from the same or different batches. By optimizing the scRNA encoder adversarially against the domain discriminator, we ensure that the learned representation cannot be used to recover the domain. During training, the maximization and minimization tasks compete with each other: the representations are adjusted to improve the accuracy of the label classifier while simultaneously fooling the domain discriminator.
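For illustration, the GRL and the contrastive domain loss can be sketched in a few lines of PyTorch. The names below are illustrative and do not necessarily match our released implementation; `lamb` plays the role of \(\lambda \) and `margin` the role of m.

```python
import torch
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -lambda backward."""

    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the sign of the gradient flowing back into the encoder.
        return grad_output.neg() * ctx.lamb, None


def contrastive_domain_loss(d1, d2, U, margin):
    """L_d for a pair of discriminator outputs d1, d2 (batch x dim).

    U = 1: same-domain pair -> penalize large distances.
    U = 0: cross-domain pair -> penalize distances below the margin m.
    """
    dist = F.pairwise_distance(d1, d2)
    same = 0.5 * dist.pow(2)
    diff = 0.5 * torch.clamp(margin - dist, min=0).pow(2)
    return (U * same + (1 - U) * diff).mean()
```

In a training step, the encoder output feeding the discriminator is first passed through `GradReverse.apply(z, lamb)`, so a single backward pass updates \(\theta _d\) to decrease \(\mathcal {L}_d\) while the reversed gradient pushes \(\theta _e\) to increase it.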

2.3 Conditional Domain Generalization Strategy

Most prior domain adaptation or generalization methods focus on cases where the distribution of labels is independent of the domains [5, 25]. In contrast, as we show in Results, scRNA-seq studies tend to focus on certain cell types [17, 43, 44]. Consequently, it is not reasonable to completely merge scRNA-seq data from different batches: aligning data from two batches with different sets of cell types would sacrifice biological signal and prevent the cell type classifier from predicting effectively. To overcome this issue, instead of arbitrarily choosing positive pairs (samples from the same domain) and negative pairs (samples from different domains), we constrain the selection as follows: (1) for positive pairs, only samples from the same domain with different labels are selected; (2) for negative pairs, only samples from different domains with the same label are selected. Figure 2 provides a visual interpretation of this strategy. Formally, letting \(y_i\) and \(d_i\) denote the i-th sample’s cell type label and domain label respectively, the value of U for a sample pair is:

$$ U = {\left\{ \begin{array}{ll} 0, &{} d_1 \ne d_2 \text { and } y_1 = y_2 \\ 1, &{} d_1 = d_2 \text { and } y_1 \ne y_2 \end{array}\right. } $$

This strategy prevents the domain adversarial training from aligning samples with different labels or separating samples with the same label. For example, to fool the discriminator with a positive pair, the encoder must implicitly increase the distance between two samples of different cell types. Combining this strategy with domain adversarial training therefore allows the network to learn cell type specific, focused representations. We term our model the Single Cell Domain Generalization Network (scDGN).
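The conditional strategy amounts to a simple rejection rule when sampling training pairs. The sketch below (function and variable names are our own, not from the released code) shows one way to implement it:

```python
import random


def sample_pair(labels, domains):
    """Sample a pair (i, j) and its U value under the conditional strategy.

    U = 1 only for same-domain pairs with different labels; U = 0 only for
    cross-domain pairs with the same label; all other pairs are rejected.
    """
    n = len(labels)
    while True:
        i, j = random.randrange(n), random.randrange(n)
        if domains[i] == domains[j] and labels[i] != labels[j]:
            return i, j, 1  # positive pair: same domain, different cell types
        if domains[i] != domains[j] and labels[i] == labels[j]:
            return i, j, 0  # negative pair: different domains, same cell type
```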

3 Results

3.1 Experiment Setups

Datasets. To test our method and to compare it to previous methods for aligning and classifying scRNA-seq data, we used several recent datasets. These datasets contain between 6,000 and 45,000 cells, and all include cells profiled in multiple experiments by different labs and on different platforms.

scQuery: We use a subset of the dataset provided by scQuery, which includes uniformly processed data from hundreds of studies [3]. The subset we use contains 44,490 samples from 155 different experiments. scQuery assigns cells to 39 types spanning diverse categories, ranging from immune cells to neurons to organ-specific cells. We use 99 of the 155 batches for training, 26 for validation, and 30 for testing, and provide a list of the studies used for each set in Appendix A.1 [11]. Statistics for the different datasets are shown in Table 1. RPKM normalization is applied to the 20,499 genes in each sample. Note that while there are 39 cell types in the training set, only 19 and 23 of them are included in the validation and test sets, respectively. This mimics the application of the methods to future studies that may not profile all cell types.

PBMC: The Peripheral Blood Mononuclear Cells (PBMC) dataset contains 28,969 cells assigned to 10 blood cell types. The data are profiled in 9 batches (from 8 different sequencing technologies) [6]. We use the data from the 10xChromiumv2A platform as test data and the rest as training data. Following the provided tutorial [2], we use the top 3000 variable genes for the analysis.

Seurat Pancreas: The Seurat pancreas dataset is designed for evaluating single cell alignment algorithms and contains 6,321 scRNA-seq samples of human pancreatic islet cells produced by four studies. We use the smallest study as the test data and the other three for training, as shown in Table 1. Thirteen canonical pancreatic islet cell type labels are assigned to the cells in each study. As with the PBMC dataset, we only use the 3000 most variable genes. To further simulate the correlation between cell types and domains, we randomly remove the data for 6 of the 13 cell types from each training domain. Using this strategy we construct 6 synthetic datasets to evaluate the alignment performance of the different methods under a high label-domain correlation setting. The specific cell type information for each dataset is listed in Appendix A.3 [11].

Table 1. Basic statistics for the scQuery, Seurat pancreas, and PBMC datasets

Model Configurations. We use the network of Lin et al. [23] as the basis for the encoder and the label classifier in our model. The encoder contains two hidden layers with 1136 and 100 units. The label classifier is directly connected to the 100-unit layer and makes predictions based on its values. The domain discriminator contains an additional hidden layer with 64 units and is also connected to the 100-unit layer of the encoder (Fig. 1). Each layer uses tanh() as its non-linear activation function. We tested several other configurations but did not observe improved performance. As is commonly done, we use a validation set to tune the learning hyperparameters, including learning rate, decay, momentum, and the adversarial weight and margin parameters \(\lambda \) and m. Generally, our analysis indicates that for larger datasets a lower weight \(\lambda \) and a larger margin m are preferred for the adversarial training, and vice versa. More details about the hyperparameters and training are provided in Appendix A.2 [11].
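A minimal PyTorch sketch of this configuration is shown below, reusing the `GradReverse` layer from the Sect. 2.2 sketch. The layer sizes follow the text; the class name and the discriminator output dimension `disc_out` are our own choices, since the paper only constrains the discriminator output through the pairwise distances in \(\mathcal {L}_d\).

```python
import torch.nn as nn

# GradReverse: see the sketch in Sect. 2.2.


class SCDGN(nn.Module):
    """Encoder (p -> 1136 -> 100), label classifier on the 100-unit layer,
    and a domain discriminator with one 64-unit hidden layer, all tanh."""

    def __init__(self, n_genes, n_cell_types, disc_out=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_genes, 1136), nn.Tanh(),
            nn.Linear(1136, 100), nn.Tanh(),
        )
        self.classifier = nn.Linear(100, n_cell_types)
        self.discriminator = nn.Sequential(
            nn.Linear(100, 64), nn.Tanh(),
            nn.Linear(64, disc_out),  # compared via Euclidean distance in L_d
        )

    def forward(self, x1, x2, lamb):
        z1, z2 = self.encoder(x1), self.encoder(x2)
        logits = self.classifier(z1)  # cell type prediction for x1
        # Gradient reversal couples the two adversarial objectives.
        d1 = self.discriminator(GradReverse.apply(z1, lamb))
        d2 = self.discriminator(GradReverse.apply(z2, lamb))
        return logits, d1, d2
```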

Table 2. Overall performances of different methods. MI represents the mutual information between batch and cell type in the corresponding dataset. The highest test accuracy for each dataset is bolded.

Baselines. We compared scDGN to several prior methods for classifying and aligning scRNA-seq data: the neural network (NN) model of Lin et al. [23], which was developed for classifying scRNA-seq data; CaSTLe [22], which performs cell type classification via transfer learning; and several state-of-the-art alignment methods. For alignment, we compared to MNN [13], which uses mutual nearest neighbors to align data from different batches; scVI [24], which trains a deep generative model on the scRNA-seq data and uses an explicit batch identifier to retain the conditional independence of the representation; and Seurat [36], which first identifies anchors among different batches and then projects the datasets using a correction vector, in an order defined by hierarchical clustering on pairwise distances. Our comparisons include both visual inspection of the learned alignments (Figs. 4 and 5) and quantitative analysis of the accuracy of the predicted test cell types (Table 2). For the latter, to enable comparison of the supervised and unsupervised methods, we used the aligned data produced by the unsupervised methods to train a neural network with the same configuration as Lin et al. [23]; for scVI, which produces a much lower-dimensional representation, we used a smaller input vector and a smaller hidden layer. Note that these alignment methods use the scRNA-seq test data to determine the final dimensionality reduction function, while our method does not use the test data for any model decision or parameter learning. To effectively apply Seurat to scQuery, we removed batches with fewer than 100 samples. Also, for datasets in which the assumption of overlapping cell types is not guaranteed, such as scQuery, we find that the performance of MNN depends strongly on the order of alignment; for MNN on the scQuery dataset we therefore use 10 random permutations of batch orders and report the average accuracy.

3.2 Overall Performance

As mentioned above, we use the validation set to select the best model for the scQuery dataset. For the smaller datasets, we use the model obtained after 250 epochs (all models converged by this point). Test accuracy for the different methods is presented in Table 2; we show the mean and standard deviation of accuracy over 10 randomly initialized runs, and report per-cell-type performance in Appendix B [11]. Table 2 also presents the Mutual Information (MI) between labels and domains, which reflects the difficulty of each dataset: a larger MI indicates that models that do not account for the domain are likely to fit domain information rather than cell type. For the scQuery dataset, accuracy is low for all methods, indicating that this dataset is relatively difficult; this is corroborated by its large MI value. For such data we see a clear advantage for scDGN, which improves by over 10% on all other methods (\(p = 5.069\times 10^{-5}\) based on Student’s t-test against the NN baseline, which is tied for second best). The improvements over the other single cell alignment methods are even more significant. scDGN also achieves the best performance on the second largest dataset, PBMC; however, given the very low MI of this dataset, the other methods, including the baseline NN, perform almost as well. The third dataset, Seurat pancreas, is the smallest and thus has the fewest training samples. Still, of the 6 settings we tested (which differ in the subset of cell types excluded from training), scDGN is the top performer in 4, comparable to the top performer in 1 more, and only in one setting (Pancreas 3, with the highest MI) is it significantly outperformed by Seurat. Note that even for Pancreas 3 the domain adversarial training helps: with it, scDGN improves by more than 20% over the baseline NN used for the label classifier.

Fig. 3.

Visualization of the representations learned by NN and scDGN using PCA and t-SNE. Rows: the three datasets we tested the methods on. Columns: methods and cell types. For each row, data from different batches are distinguished using different colors.

3.3 Visualization of the Representation Learned by Alignment and Classification Methods

To further explore the effectiveness of the batch effect removal provided by our domain adversarial training with the conditional domain generalization strategy, we visualize the 100-dimensional hidden representations learned by NN and scDGN. Figure 3 presents both PCA and t-SNE plots for several different cell types across the three datasets, with points colored by batch ID to expose batch effects. As can be seen, scDGN mixes cells from the different batches much better than the baseline NN model. The impact is larger for the pancreas datasets, which have larger MI than the PBMC dataset, which helps explain the larger performance gains on these datasets.

Fig. 4.

PCA visualizations of the representations learned by different models on the full Pancreas2 dataset. Colors for different cell types and domains are shown in the legend at the top.

Fig. 5.

PCA visualizations of the representations of certain cell types and batches by different models for the scQuery dataset. Top two rows: Cell types. Colors represent different batches. HSC = hematopoietic stem cell. Bottom two rows: Batches. Colors represent different cell types.

We next extended this comparison and visualized the learned (aligned) representations of all methods using data from the Pancreas2 and scQuery datasets (Figs. 4 and 5). For Pancreas2 we visualize the entire dataset; for scQuery, given the large number of cell types and domains, we present PCA visualizations of a subset of cell types and domains. As can be seen, in addition to scDGN, Seurat is also able to successfully mix data from different batches. However, as the results in Table 2 indicate, this may come at the expense of not correctly separating cell types. MNN and scVI are not always effective at removing batch effects. In contrast, scDGN achieves both domain mixing and cell type separation, leading to its better overall performance. For example, for the acinar and alpha cell types in the pancreas dataset (Fig. 4), only scDGN, MNN, and Seurat are able to align the data from different domains; however, MNN and Seurat over-correct the representation by aligning different cell types from different domains, mixing acinar and gamma cells. Additional visualizations for other cell types and domains can be found in Appendix C [11], where the same advantages of scDGN over the other methods can be consistently observed.

3.4 Analysis of Key Genes

While NNs are often treated as black boxes, recent methods provide useful directions for making them more interpretable [31]. Here we use activation maximization, which uses the gradient of the correct category’s logit with respect to the input vector to select the key inputs for each model [9, 33, 34]. Formally, given a particular cell type i and a trained neural network \(\phi \), activation maximization looks for important input genes \(x'\) by solving the following optimization problem:

$$ x' = \mathop {\mathrm {arg}\,\mathrm {max}}\limits _x \big (\phi (x)\cdot e_i\big ), $$

where \(e_i\) is the natural basis vector associated with the i-th category. This can be solved through backpropagation: the gradient of \(\phi (x)\) with respect to x, which can be viewed as the weights of a first-order Taylor expansion of the neural network, is calculated and used to iteratively update the input. Following a previous method [33], we initialize the optimization with a zero vector and run it for 100 iterations with the learning rate set to 1. The important genes are selected as the inputs with the largest changes relative to their initialization values. To compare scDGN and NN for a given cell type, we select the top k genes with the largest changes and perform GO analysis on them.
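A minimal sketch of this procedure follows (assuming `model` maps an expression vector to cell type logits; all names are illustrative):

```python
import torch


def activation_maximization(model, n_genes, cell_type, n_iter=100, lr=1.0):
    """Gradient ascent on the logit of `cell_type`, starting from a zero
    vector; genes whose inputs change the most are taken as key genes."""
    x = torch.zeros(1, n_genes, requires_grad=True)
    for _ in range(n_iter):
        logit = model(x)[0, cell_type]          # phi(x) . e_i
        grad, = torch.autograd.grad(logit, x)   # first-order Taylor weights
        with torch.no_grad():
            x += lr * grad                      # ascend the logit
    # Rank genes by the magnitude of their change from initialization (zero).
    return x.detach().abs().squeeze(0).argsort(descending=True)
```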

Table 3. GO analysis results for top 100 scQuery liver genes in the NN method.
Table 4. GO analysis results for top 100 scQuery liver genes in the scDGN method.

As an example, consider the genes identified for the liver cell type using the scQuery dataset. We select the top 100 genes for this cell type from NN and from scDGN and present the GO Biological Process categories enriched at an adjusted p-value \({<}1.0\times 10^{-4}\) in Tables 3 and 4; the genes themselves are listed in order in Appendix A.3 [11]. As can be seen, while a number of significant GO categories are identified for the top 100 NN genes, these are generic and not liver specific: they include general terms related to interactions between organs and immune response categories that are active in multiple organs and cell types. In sharp contrast, the categories identified for scDGN are much more specific and highlight key pathways that are mainly utilized in the liver. For example, the top category for the scDGN genes, “chylomicron remodeling”, refers to the main physiological purpose of chylomicron remnants: to facilitate the return of bile lipoproteins and cholesterol to the liver [30]. Specifically, in this pathway chylomicrons (lipoproteins) are broken down (remodeled via hydrolysis) and converted to a form called “chylomicron remnant” that is taken up by specific receptors that exist primarily on the surface of liver cells [14]. The second term, “pos. regulation of cholesterol esterification”, refers to cholesterol esterification, a critical step in reverse cholesterol transport, the process in which excess cholesterol is sent to the liver to be removed from the body [1, 26]. Furthermore, Cholesteryl Ester Transfer Protein (CETP), a key enzyme in this process, is highly expressed in liver cells, and variants of CETP are associated with increased risk of atherosclerosis [1, 32]. The fifth most significant term, “lipoprotein remodeling”, is part of the two aforementioned processes. The top 100 genes identified by scDGN include apoa1 (the main protein component of high-density lipoprotein cholesterol), apoa2, and apoc1, all of which encode lipoproteins that are primarily expressed in the liver [7, 19]; none of these were among the top 100 NN genes. We present GO analysis comparisons for several additional cell types in Appendix D.2 [11].

4 Discussion

Single cell computational methods that do not account for batch effects are likely to fit the noise introduced by the batches. Several recent methods have been proposed for aligning scRNA-seq data from multiple studies of the same tissues or processes. Most of these methods are unsupervised and assume that the cell types of the different batches overlap. However, we show that these methods fail on studies in which cell types do not fully overlap, which is often the case when dealing with multiple datasets. To overcome this problem we extend a supervised NN-based scRNA-seq cell type assignment method and regularize its predictions to be invariant to batch effects.

Our method is based on the ideas of domain adversarial training. In such training, two competing tasks are used to optimize the representation of scRNA-seq data: the first focuses on the traditional goal of cell type identification, while the second attempts to construct representations that are not affected by batch-specific or experimental artifacts. This is accomplished by jointly minimizing a loss function that accounts for both goals, balancing them via the gradient reversal layer weight \(\lambda \). We also proposed a conditional pair sampling strategy to avoid over-correction. We presented efficient learning methods for this setting and tested them on three large scale scRNA-seq datasets containing experiments from several different platforms with partially overlapping cell types.

As we show, our scDGN method is able to correctly identify cell types in the test datasets. For the largest dataset we tested, which contains close to 40 different cell types, scDGN significantly outperformed all prior methods. It also ranked first for the second largest dataset and for all but one of the six tests on the third dataset. Importantly, it always outperformed the supervised learning based method, indicating that batch effects should be addressed when designing methods for cell type assignment. In addition to accurately assigning cell types, further analysis of significant genes indicates that by overcoming batch effects scDGN is better able to focus on relevant sets of genes than prior supervised methods, explaining its improvement in accuracy.

While scDGN performed best on the data we analyzed, there are a number of possible issues with this approach. First, it learns a large number of parameters, which requires large input datasets. However, as we showed, scDGN performs well even for datasets with a few thousand cells, which matches the current size of scRNA-seq datasets. Second, scDGN is based on NNs, which are often seen as a black box, making it hard to interpret the resulting model and its biological relevance. Recent work provides a number of directions that can be used to overcome this issue: as we showed, using activation maximization we were able to identify several relevant cell type specific genes in the learned network. Future work will apply additional NN interpretation methods, including LIME [31] or ROAR and KAR [15], to further identify the genes that play the largest role in the network’s decisions. Third, as shown in Appendix C.13 [11], scDGN sometimes does not fully mix the representations from different batches for all cell types. Considering the visualization results for NN in Appendix C.18 [11] together with its competitive performance in Table 2, this may indicate that removing batch effects is not always necessary for achieving high test accuracy; it is therefore worth further study to determine when alignment is imperative. Finally, unlike prior scRNA-seq alignment methods, scDGN is supervised. While this is an advantage when it comes to accuracy, as we have shown, it may be a problem for data whose cell types lack labeled training examples. We believe that as more scRNA-seq and other high throughput single cell data accumulate, labeled data will be available for most cell types, enabling the training of scDGN for even more cell types. As we have shown with the scQuery dataset, for which scDGN significantly outperformed all other methods, when such data exist scDGN is able to correctly align experiments and platforms not seen in the training set.

scDGN is implemented in Python with the PyTorch API [35] and users can obtain the code and sampled data from https://github.com/SongweiGe/scDGN.