
1 Introduction

Single-cell RNA sequencing (scRNA-seq) has revolutionized the study of gene expression programs [16, 27]. The ability to profile gene expression at the single-cell level has revealed novel specific interactions and pathways within cells [43], differences in the proportions of cell types between samples [17, 44], and the identity and characterization of new cell types [39]. Many biological tissues, systems, and processes have recently been studied using this technology [17, 43, 44].

While studies using scRNA-seq provide many insights, they also raise new computational challenges. One of the major challenges is the ability to integrate and compare results from multiple scRNA-seq studies. There are several commercial platforms for performing such experiments, each with its own biases. Furthermore, like other high throughput genomic assays, scRNA-seq suffers from batch effects, which can make cells profiled in one lab look very different from the same cells profiled in another lab [37, 38]. Moreover, other types of high throughput transcriptomics profiling, including microscopy-based techniques, also generate single cell expression datasets [8, 40]. The goal of fully utilizing these spatial datasets motivates the development of methods that can combine them with scRNA-seq when studying specific biological tissues and processes.

A number of recent methods attempt to address this challenge by aligning scRNA-seq data from multiple studies of the same biological system. Many of these methods rely on identifying nearest neighbors between the different datasets and using them as anchors; methods that take this approach include Mutual Nearest Neighbors (MNN) [13] and Seurat [36]. Others, including scVI and scAlign, first embed all datasets into a common lower-dimensional space: scVI encodes the scRNA-seq data with a deep generative model that conditions on the batch identifiers [24], while scAlign regularizes the representation between two datasets by minimizing the difference in random walk probabilities between the original and embedding spaces. While these methods are successful for some datasets, here we show that they are not always able to correctly match all cell types. A key problem is that these methods are unsupervised and rely on the assumption that the cell types profiled by the different studies overlap. This works for some datasets, but it may fail for studies in which cells do not fully overlap or for those containing rare cell types. Unsupervised methods tend to group rare types with larger ones, making them hard to identify in a joint space.

Recent machine learning work has focused on a related problem termed “domain adaptation/generalization”. Methods developed for this problem attempt to learn representations of diverse data that are invariant to technical confounders [5, 25, 42]. These methods have been applied in settings such as machine translation for domain-specific corpora [4] and face detection [28]. Several methods proposed for domain adaptation rely on adversarial training [5, 10, 21, 41], which has proven effective at aligning latent distributions. In addition to the original task (e.g., classification), these methods apply a domain classifier to the learned representations. The encoder network is trained both to support accurate label classification and to reduce the impact of the domain (by “fooling” the domain classifier). This is achieved by learning encoder weights that simultaneously perform gradient descent on the label classification task and gradient ascent on the domain classification task.

Here we extend these approaches, coupling them with Siamese network learning [20], to overcome batch effects in scRNA-seq analysis. We define a “domain” in this paper as a standalone dataset profiled at a single lab using a single platform, and a “label” as the cell type of each cell in a dataset. Because different scRNA-seq studies profile specific, partially overlapping sets of cell types, we propose a conditional pair sampling strategy that constrains input pair selection when training the adversarial network. We discuss how to formulate a domain adaptation network for scRNA-seq data, how to learn the parameters of the network, and how to train it using available data.

We tested our method on several datasets ranging in size from 10 to 39 cell types and from 4 to 155 batches. As we show, for all of the datasets our domain adversarial method improves on previous methods, in some cases significantly. Visualization of the learned representation from several different methods helps highlight the advantages of the domain adversarial framework. As we show, the framework is able to accurately mitigate the batch effects while maintaining the grouping of cells from the same type across different batches. Biological analysis of the resulting model identifies key genes that can correctly distinguish between cell types across different experiments. Such batch invariant genes are promising candidates for a cell type specific signature that can be used across different studies to annotate cells.

Fig. 1.

Architecture of scDGN. The network includes three modules: scRNA encoder \(f_e\) (blue), label classifier \(f_l\) (orange) and domain discriminator \(f_d\) (red). Note that the red and orange networks use the same encoding as input. Solid lines represent the forward direction of the neural network while the dashed lines represent the backpropagation direction with the corresponding gradient it passes. Gradient Reversal Layers (GRL) have no effect in forward propagation, but flip the sign of the gradients that flow through them during backpropagation. This allows the combined network to simultaneously optimize label classification and attempt to “fool” the domain discriminator. Thus, the encoder leads to representations that are invariant to the different domains while still distinguishing cell types.

2 Methods

2.1 Problem Formulation

To formulate the problem we first define some notation. We assume that the single cell RNA-seq data are drawn from the input space \(\mathbf {X} \in \mathbb {R}^p\), where each sample (a cell) \(\mathbf {x}\) has p features corresponding to gene expression values. Each cell is also associated with a label \(y \in \mathbf {Y} = \{1, 2, ..., K\}\) representing its cell type, and with a specific domain/batch \(d \in \mathcal {D}\), i.e., a standalone dataset profiled at a single lab using a single platform. Note that we use domain and batch interchangeably in this paper for convenience. The data are divided into a training set and a test set drawn from multiple studies. The domains used to collect training data are not used for the test set, so batch effects can vary between the training and test data. In practice, each domain contains only a small subset of the cell types, which means that the distribution of cell types is correlated with the distribution of domains. Thus, methods that naively learn cell types based on expression profiles [3, 18, 22] may instead fit domain information and fail to generalize to unobserved studies.

2.2 Domain Adversarial Training with Siamese Network

To overcome this problem and remove the domain impact when learning a cell type representation, we propose a neural network framework with three modules, as shown in Fig. 1: an scRNA encoder, a label classifier, and a domain discriminator. The encoder module \(f_e(\mathbf {x};\mathbf {\theta }_e)\) reduces the dimensionality of the data and consists of fully connected layers that produce the hidden features, where \(\mathbf {\theta }_e\) denotes the parameters of these layers. The label classifier \(f_l(f_e;\mathbf {\theta }_l)\) attempts to predict the label of input \(\mathbf {x_1}\), whereas the goal of the domain discriminator \(f_d(f_e;\mathbf {\theta }_d)\) is to determine whether a pair of inputs \(\mathbf {x_1}\) and \(\mathbf {x_2}\) are from the same domain. Past work on classifying scRNA-seq data only minimized the loss function of the label classifier, \(\mathcal {L}_l(f_l(f_e;\mathbf {\theta }_l))\) [3, 23]. Here, we extend these methods by adding a regularization term based on the adversarial loss of the domain discriminator, \(\mathcal {L}_d(f_d(f_e;\mathbf {\theta }_d))\), which we elaborate on below. The overall loss E on a pair of samples \(\mathbf {x_1}\) and \(\mathbf {x_2}\) is:

$$\begin{aligned} E(\mathbf {\theta }_e, \mathbf {\theta }_l, \mathbf {\theta }_d) = \mathcal {L}_l\big (f_l(f_e(\mathbf {x_1};\mathbf {\theta }_e);\mathbf {\theta }_l)\big ) - \lambda \mathcal {L}_d\big (f_d(f_e(\mathbf {x_1};\mathbf {\theta }_e);\mathbf {\theta }_d),\, f_d(f_e(\mathbf {x_2};\mathbf {\theta }_e);\mathbf {\theta }_d)\big ), \end{aligned}$$

where \(\lambda \) controls the trade-off between the goals of domain invariance and higher classification accuracy. For convenience, we use \(\mathbf {z_1}\) and \(\mathbf {z_2}\) to denote the hidden representations of \(\mathbf {x_1}\) and \(\mathbf {x_2}\) computed by \(f_e(\mathbf {x};\mathbf {\theta }_e)\). Inspired by Siamese networks [20], we implement our domain discriminator using a contrastive loss [12]:

$$\begin{aligned} \mathcal {L}_d\big (f_d(\mathbf {z_1};\mathbf {\theta }_d),\, f_d(\mathbf {z_2};\mathbf {\theta }_d)\big )&= U\frac{1}{2}D\big (f_d(\mathbf {z_1}), f_d(\mathbf {z_2})\big )^2 \\&+\,(1-U)\frac{1}{2}\big (\max \{0, m-D(f_d(\mathbf {z_1}), f_d(\mathbf {z_2}))\}\big )^2, \end{aligned}$$
Fig. 2.

Conditional domain generalization strategy: Shapes represent different labels and colors (or patterns) represent different domains. For negative pairs from different domains, we only select those samples with the same label. For positive pairs from the same domain, we only select the samples with different labels.

where \(U=1\) indicates that the two samples are from the same domain and \(U=0\) indicates that they are not, \(D(\cdot )\) is the Euclidean distance, and m is the margin that defines the prediction boundary. The domain discriminator parameters \(\theta _d\) are updated using backpropagation to maximize the total loss E, while the encoder and classifier parameters \(\theta _e\) and \(\theta _l\) are updated to minimize E. To allow all three modules to be updated together end-to-end, we use a Gradient Reversal Layer (GRL; Fig. 1) [10, 29]. A GRL has no effect in forward propagation but flips the sign of the gradients that flow through it during backpropagation. The overall optimization problems solved for the network parameters are:

$$\begin{aligned} (\widehat{\mathbf {\theta }_e},\widehat{\mathbf {\theta }_l})&= \mathop {\mathrm {arg}\,\mathrm {min}}\limits _{\mathbf {\theta }_e, \mathbf {\theta }_l} E(\mathbf {\theta }_e, \mathbf {\theta }_l, \widehat{\mathbf {\theta }_d})\\ (\widehat{\mathbf {\theta }_d})&= \mathop {\mathrm {arg}\,\mathrm {max}}\limits _{\mathbf {\theta }_d} E(\widehat{\mathbf {\theta }_e},\widehat{\mathbf {\theta }_l}, \mathbf {\theta }_d) \end{aligned}$$

In other words, the goal of the domain discriminator is to tell whether two samples are drawn from the same or different batches. By optimizing the scRNA encoder adversarially against the domain discriminator, we ensure that the learned representation cannot be used to recover the domain. During training, the maximization and minimization tasks compete with each other: the representations are adjusted to improve the accuracy of the label classifier while simultaneously fooling the domain discriminator.
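For illustration, the GRL and the contrastive domain loss can be sketched in a few lines of PyTorch. The names below are illustrative and do not necessarily match our released implementation; `lamb` plays the role of \(\lambda \) and `margin` the role of m.

```python
import torch
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -lambda backward."""

    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the sign of the gradient flowing back into the encoder.
        return grad_output.neg() * ctx.lamb, None


def contrastive_domain_loss(d1, d2, U, margin):
    """L_d for a pair of discriminator outputs d1, d2 (batch x dim).

    U = 1: same-domain pair -> penalize large distances.
    U = 0: cross-domain pair -> penalize distances below the margin m.
    """
    dist = F.pairwise_distance(d1, d2)
    same = 0.5 * dist.pow(2)
    diff = 0.5 * torch.clamp(margin - dist, min=0).pow(2)
    return (U * same + (1 - U) * diff).mean()
```

In a training step, the encoder output feeding the discriminator is first passed through `GradReverse.apply(z, lamb)`, so a single backward pass updates \(\theta _d\) to decrease \(\mathcal {L}_d\) while the reversed gradient pushes \(\theta _e\) to increase it.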

2.3 Conditional Domain Generalization Strategy

Most prior domain adaptation or generalization methods focus on cases where the distribution of labels is independent of the domains [5, 25]. In contrast, as we show in Results, scRNA-seq studies tend to focus on certain cell types [17, 43, 44]. Consequently, it is not reasonable to completely merge scRNA-seq data from different batches: aligning data from two batches with different sets of cell types would sacrifice biological signal and prevent the cell type classifier from predicting effectively. To overcome this issue, instead of arbitrarily choosing positive pairs (samples from the same domain) and negative pairs (samples from different domains), we constrain the selection as follows: (1) for positive pairs, only samples from the same domain with different labels are selected; (2) for negative pairs, only samples from different domains with the same label are selected. Figure 2 provides a visual interpretation of this strategy. Formally, letting \(y_i\) and \(d_i\) denote the i-th sample’s cell type label and domain label respectively, the value of U for a sample pair is:

$$ U = {\left\{ \begin{array}{ll} 0, &{} d_1 \ne d_2 \text { and } y_1 = y_2 \\ 1, &{} d_1 = d_2 \text { and } y_1 \ne y_2 \end{array}\right. } $$

This strategy prevents the domain adversarial training from aligning samples with different labels or separating samples with the same label. For example, to fool the discriminator with a positive pair, the encoder must implicitly increase the distance between two samples of different cell types. Combining this strategy with domain adversarial training therefore allows the network to learn cell type specific, focused representations. We term our model the Single Cell Domain Generalization Network (scDGN).
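The conditional strategy amounts to a simple rejection rule when sampling training pairs. The sketch below (function and variable names are our own, not from the released code) shows one way to implement it:

```python
import random


def sample_pair(labels, domains):
    """Sample a pair (i, j) and its U value under the conditional strategy.

    U = 1 only for same-domain pairs with different labels; U = 0 only for
    cross-domain pairs with the same label; all other pairs are rejected.
    """
    n = len(labels)
    while True:
        i, j = random.randrange(n), random.randrange(n)
        if domains[i] == domains[j] and labels[i] != labels[j]:
            return i, j, 1  # positive pair: same domain, different cell types
        if domains[i] != domains[j] and labels[i] == labels[j]:
            return i, j, 0  # negative pair: different domains, same cell type
```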

3 Results

3.1 Experiment Setups

Datasets. To test our method and to compare it to previous methods for aligning and classifying scRNA-seq data, we used several recent datasets. These datasets contain between 6,000 and 45,000 cells, and all include cells profiled in multiple experiments by different labs and on different platforms.

scQuery: We use a subset of the dataset provided by scQuery, which includes uniformly processed data from hundreds of studies [3]. The subset we use contains 44,490 samples from 155 different experiments. scQuery assigns cells to 39 types spanning diverse categories, ranging from immune cells to neurons to organ-specific cells. We use 99 of the 155 batches for training, 26 for validation, and 30 for testing, and provide a list of the studies used for each set in Appendix A.1 [11]. Statistics for the different datasets are shown in Table 1. RPKM normalization is applied to the 20,499 genes in each sample. Note that while there are 39 cell types in the training set, only 19 and 23 of them are included in the validation and test sets, respectively. This mimics the application of the methods to future studies that may not profile all cell types.

PBMC: The Peripheral Blood Mononuclear Cells (PBMC) dataset contains 28,969 cells assigned to 10 blood cell types. The data are profiled in 9 batches (from 8 different sequencing technologies) [6]. We use the data from the 10xChromiumv2A platform as test data and the rest as training data. Following the provided tutorial [2], we use the top 3000 variable genes for the analysis.

Seurat Pancreas: The Seurat pancreas dataset is designed for evaluating single cell alignment algorithms and contains 6,321 scRNA-seq samples of human pancreatic islet cells produced by four studies. We use the smallest study as the test data and the other three for training, as shown in Table 1. Thirteen canonical pancreatic islet cell type labels are assigned to the cells in each study. As with the PBMC dataset, we only use the 3000 most variable genes. To further simulate the correlation between cell types and domains, we randomly remove the data for 6 of the 13 cell types from each training domain. Using this strategy we construct 6 synthetic datasets to evaluate the alignment performance of the different methods under a high label-domain correlation setting. The specific cell type information for each dataset is listed in Appendix A.3 [11].

Table 1. Basic statistics for the scQuery, Seurat pancreas, and PBMC datasets

Model Configurations. We use the network of Lin et al. [23] as the basis for the encoder and the label classifier in our model. The encoder contains two hidden layers with 1136 and 100 units. The label classifier is directly connected to the 100-unit layer and makes predictions based on its values. The domain discriminator contains an additional hidden layer with 64 units and is also connected to the 100-unit layer of the encoder (Fig. 1). Each layer uses tanh() as its non-linear activation function. We tested several other configurations but did not observe improved performance. As is commonly done, we use a validation set to tune the learning hyperparameters, including learning rate, decay, momentum, and the adversarial weight and margin parameters \(\lambda \) and m. Generally, our analysis indicates that for larger datasets a lower weight \(\lambda \) and a larger margin m are preferred for the adversarial training, and vice versa. More details about the hyperparameters and training are provided in Appendix A.2 [11].
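A minimal PyTorch sketch of this configuration is shown below, reusing the `GradReverse` layer from the Sect. 2.2 sketch. The layer sizes follow the text; the class name and the discriminator output dimension `disc_out` are our own choices, since the paper only constrains the discriminator output through the pairwise distances in \(\mathcal {L}_d\).

```python
import torch.nn as nn

# GradReverse: see the sketch in Sect. 2.2.


class SCDGN(nn.Module):
    """Encoder (p -> 1136 -> 100), label classifier on the 100-unit layer,
    and a domain discriminator with one 64-unit hidden layer, all tanh."""

    def __init__(self, n_genes, n_cell_types, disc_out=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_genes, 1136), nn.Tanh(),
            nn.Linear(1136, 100), nn.Tanh(),
        )
        self.classifier = nn.Linear(100, n_cell_types)
        self.discriminator = nn.Sequential(
            nn.Linear(100, 64), nn.Tanh(),
            nn.Linear(64, disc_out),  # compared via Euclidean distance in L_d
        )

    def forward(self, x1, x2, lamb):
        z1, z2 = self.encoder(x1), self.encoder(x2)
        logits = self.classifier(z1)  # cell type prediction for x1
        # Gradient reversal couples the two adversarial objectives.
        d1 = self.discriminator(GradReverse.apply(z1, lamb))
        d2 = self.discriminator(GradReverse.apply(z2, lamb))
        return logits, d1, d2
```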

Table 2. Overall performances of different methods. MI represents the mutual information between batch and cell type in the corresponding dataset. The highest test accuracy for each dataset is bolded.

Baselines. We compared scDGN to several prior methods for classifying and aligning scRNA-seq data: the neural network (NN) model of Lin et al. [23], which was developed for classifying scRNA-seq data; CaSTLe [22], which performs cell type classification via transfer learning; and several state-of-the-art alignment methods. For alignment, we compared to MNN [13], which uses mutual nearest neighbors to align data from different batches; scVI [24], which trains a deep generative model on the scRNA-seq data and uses an explicit batch identifier to retain the conditional independence of the representation; and Seurat [36], which first identifies anchors among different batches and then projects the datasets using a correction vector, in an order defined by hierarchical clustering on pairwise distances. Our comparisons include both visual inspection of the learned alignments (Figs. 4 and 5) and quantitative analysis of the accuracy of the predicted test cell types (Table 2). For the latter, to enable comparison of the supervised and unsupervised methods, we used the aligned data produced by the unsupervised methods to train a neural network with the same configuration as Lin et al. [23]; for scVI, which produces a much lower-dimensional representation, we used a smaller input vector and a smaller hidden layer. Note that these alignment methods use the scRNA-seq test data to determine the final dimensionality reduction function, while our method does not use the test data for any model decision or parameter learning. To effectively apply Seurat to scQuery, we removed batches with fewer than 100 samples. Also, for datasets in which the assumption of overlapping cell types is not guaranteed, such as scQuery, we find that the performance of MNN depends strongly on the order of alignment; for MNN on the scQuery dataset we therefore use 10 random permutations of batch orders and report the average accuracy.

3.2 Overall Performance

As mentioned above, we use the validation set to select the best model for the scQuery dataset. For the smaller datasets, we use the model obtained after 250 epochs (all models converged by this point). Test accuracy for the different methods is presented in Table 2; we show the mean and standard deviation of accuracy over 10 randomly initialized runs, and report per-cell-type performance in Appendix B [11]. Table 2 also presents the Mutual Information (MI) between labels and domains, which reflects the difficulty of each dataset: a larger MI indicates that models that do not account for the domain are likely to fit domain information rather than cell type. For the scQuery dataset, accuracy is low for all methods, indicating that this dataset is relatively difficult; this is corroborated by its large MI value. For such data we see a clear advantage for scDGN, which improves by over 10% on all other methods (\(p = 5.069\times 10^{-5}\) based on Student’s t-test against the NN baseline, which is tied for second best). The improvements over the other single cell alignment methods are even more significant. scDGN also achieves the best performance on the second largest dataset, PBMC; however, given the very low MI of this dataset, the other methods, including the baseline NN, perform almost as well. The third dataset, Seurat pancreas, is the smallest and thus has the fewest training samples. Still, of the 6 settings we tested (which differ in the subset of cell types excluded from training), scDGN is the top performer in 4, comparable to the top performer in 1 more, and only in one setting (Pancreas 3, with the highest MI) is it significantly outperformed by Seurat. Note that even for Pancreas 3 the domain adversarial training helps: with it, scDGN improves by more than 20% over the baseline NN used for the label classifier.

Fig. 3.

Visualization of the representations learned by NN and scDGN using PCA and t-SNE. Rows: the three datasets we tested the methods on. Columns: methods and cell types. For each row, data from different batches are distinguished using different colors.

3.3 Visualization of the Representation Learned by Alignment and Classification Methods

To further explore the effectiveness of the batch effect removal provided by our domain adversarial training with the conditional domain generalization strategy, we visualize the 100-dimensional hidden representations learned by NN and scDGN. Figure 3 presents both PCA and t-SNE plots for several different cell types across the three datasets, with points colored by batch ID to expose batch effects. As can be seen, scDGN mixes cells from the different batches much better than the baseline NN model. The impact is larger for the pancreas datasets, which have larger MI than the PBMC dataset, which helps explain the larger performance gains on these datasets.

Fig. 4.

PCA visualizations of the representations learned by different models on the full Pancreas2 dataset. Colors for different cell types and domains are shown in the legend at the top.

Fig. 5.

PCA visualizations of the representations of certain cell types and batches by different models for the scQuery dataset. Top two rows: Cell types. Colors represent different batches. HSC = hematopoietic stem cell. Bottom two rows: Batches. Colors represent different cell types.

We next extended this comparison and visualized the learned (aligned) representations of all methods using data from the Pancreas2 and scQuery datasets (Figs. 4 and 5). For Pancreas2 we visualize the entire dataset; for scQuery, given the large number of cell types and domains, we present PCA visualizations of a subset of cell types and domains. As can be seen, in addition to scDGN, Seurat is also able to successfully mix data from different batches. However, as the results in Table 2 indicate, this may come at the expense of not correctly separating cell types. MNN and scVI are not always effective at removing batch effects. In contrast, scDGN achieves both domain mixing and cell type separation, leading to its better overall performance. For example, for the acinar and alpha cell types in the pancreas dataset (Fig. 4), only scDGN, MNN, and Seurat are able to align the data from different domains; however, MNN and Seurat over-correct the representation by aligning different cell types from different domains, mixing acinar and gamma cells. Additional visualizations for other cell types and domains can be found in Appendix C [11], where the same advantages of scDGN over the other methods can be consistently observed.

3.4 Analysis of Key Genes

While NNs are often treated as black boxes, recent methods provide useful directions for making them more interpretable [31]. Here we use activation maximization, which uses the gradient of the correct category’s logit with respect to the input vector to select the key inputs for each model [9, 33, 34]. Formally, given a particular cell type i and a trained neural network \(\phi \), activation maximization looks for important input genes \(x'\) by solving the following optimization problem:

$$ x' = \mathop {\mathrm {arg}\,\mathrm {max}}\limits _x \big (\phi (x)\cdot e_i\big ), $$

where \(e_i\) is the natural basis vector associated with the i-th category. This can be solved through backpropagation: the gradient of \(\phi (x)\) with respect to x, which can be viewed as the weights of a first-order Taylor expansion of the neural network, is calculated and used to iteratively update the input. Following a previous method [33], we initialize the optimization with a zero vector and run it for 100 iterations with the learning rate set to 1. The important genes are selected as the inputs with the largest changes relative to their initialization values. To compare scDGN and NN for a given cell type, we select the top k genes with the largest changes and perform GO analysis on them.
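A minimal sketch of this procedure follows (assuming `model` maps an expression vector to cell type logits; all names are illustrative):

```python
import torch


def activation_maximization(model, n_genes, cell_type, n_iter=100, lr=1.0):
    """Gradient ascent on the logit of `cell_type`, starting from a zero
    vector; genes whose inputs change the most are taken as key genes."""
    x = torch.zeros(1, n_genes, requires_grad=True)
    for _ in range(n_iter):
        logit = model(x)[0, cell_type]          # phi(x) . e_i
        grad, = torch.autograd.grad(logit, x)   # first-order Taylor weights
        with torch.no_grad():
            x += lr * grad                      # ascend the logit
    # Rank genes by the magnitude of their change from initialization (zero).
    return x.detach().abs().squeeze(0).argsort(descending=True)
```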

Table 3. GO analysis results for top 100 scQuery liver genes in the NN method.
Table 4. GO analysis results for top 100 scQuery liver genes in the scDGN method.

As an example, consider the genes identified for the liver cell type using the scQuery dataset. We select the top 100 genes for this cell type from NN and from scDGN and present the GO Biological Process categories enriched at an adjusted p-value \({<}1.0\times 10^{-4}\) in Tables 3 and 4; the genes themselves are listed in order in Appendix A.3 [11]. As can be seen, while a number of significant GO categories are identified for the top 100 NN genes, these are generic and not liver specific: they include general terms related to interactions between organs and immune response categories that are active in multiple organs and cell types. In sharp contrast, the categories identified for scDGN are much more specific and highlight key pathways that are mainly utilized in the liver. For example, the top category for the scDGN genes, “chylomicron remodeling”, refers to the main physiological purpose of chylomicron remnants: to facilitate the return of bile lipoproteins and cholesterol to the liver [30]. Specifically, in this pathway chylomicrons (lipoproteins) are broken down (remodeled via hydrolysis) and converted to a form called “chylomicron remnant” that is taken up by specific receptors that exist primarily on the surface of liver cells [14]. The second term, “pos. regulation of cholesterol esterification”, refers to cholesterol esterification, a critical step in reverse cholesterol transport, the process in which excess cholesterol is sent to the liver to be removed from the body [1, 26]. Furthermore, Cholesteryl Ester Transfer Protein (CETP), a key enzyme in this process, is highly expressed in liver cells, and variants of CETP are associated with increased risk of atherosclerosis [1, 32]. The fifth most significant term, “lipoprotein remodeling”, is part of the two aforementioned processes. The top 100 genes identified by scDGN include apoa1 (the main protein component of high-density lipoprotein cholesterol), apoa2, and apoc1, all of which encode lipoproteins that are primarily expressed in the liver [7, 19]; none of these were among the top 100 NN genes. We present GO analysis comparisons for several additional cell types in Appendix D.2 [11].

4 Discussion

Single cell computational methods that do not account for batch effects are likely to fit the noise introduced by the batches. Several recent methods have been proposed for aligning scRNA-seq data from multiple studies of the same tissues or processes. Most of these methods are unsupervised and assume that the cell types of the different batches overlap. However, we show that these methods fail on studies in which cell types do not fully overlap, which is often the case when dealing with multiple datasets. To overcome this problem we extend a supervised NN-based scRNA-seq cell type assignment method and regularize its predictions to be invariant to batch effects.

Our method is based on the ideas of domain adversarial training. In such training, two competing tasks are used to optimize the representation of scRNA-seq data: the first focuses on the traditional goal of cell type identification, while the second attempts to construct representations that are not affected by batch-specific or experimental artifacts. This is accomplished by jointly minimizing a loss function that accounts for both goals, balancing them via the gradient reversal layer weight \(\lambda \). We also proposed a conditional pair sampling strategy to avoid over-correction. We presented efficient learning methods for this setting and tested them on three large scale scRNA-seq datasets containing experiments from several different platforms with partially overlapping cell types.

As we show, our scDGN method is able to correctly identify cell types in the test datasets. For the largest dataset we tested, which contains close to 40 different cell types, scDGN significantly outperformed all prior methods. It also ranked first for the second largest dataset and for all but one of the six tests on the third dataset. Importantly, it always outperformed the supervised learning based method, indicating that batch effects should be addressed when designing methods for cell type assignment. In addition to accurately assigning cell types, further analysis of significant genes indicates that by overcoming batch effects scDGN is better able to focus on relevant sets of genes than prior supervised methods, explaining its improvement in accuracy.

While scDGN performed best on the data we analyzed, there are a number of possible issues with this approach. First, it learns a large number of parameters, which requires large input datasets. However, as we showed, scDGN performs well even for datasets with a few thousand cells, which matches the current size of scRNA-seq datasets. Second, scDGN is based on NNs, which are often seen as a black box, making it hard to interpret the resulting model and its biological relevance. Recent work provides a number of directions that can be used to overcome this issue: as we showed, using activation maximization we were able to identify several relevant cell type specific genes in the learned network. Future work will apply additional NN interpretation methods, including LIME [31] or ROAR and KAR [15], to further identify the genes that play the largest role in the network’s decisions. Third, as shown in Appendix C.13 [11], scDGN sometimes does not fully mix the representations from different batches for all cell types. Considering the visualization results for NN in Appendix C.18 [11] together with its competitive performance in Table 2, this may indicate that removing batch effects is not always necessary for achieving high test accuracy; it is therefore worth further study to determine when alignment is imperative. Finally, unlike prior scRNA-seq alignment methods, scDGN is supervised. While this is an advantage when it comes to accuracy, as we have shown, it may be a problem for data whose cell types lack labeled training examples. We believe that as more scRNA-seq and other high throughput single cell data accumulate, labeled data will be available for most cell types, enabling the training of scDGN for even more cell types. As we have shown with the scQuery dataset, for which scDGN significantly outperformed all other methods, when such data exist scDGN is able to correctly align experiments and platforms not seen in the training set.

scDGN is implemented in Python with the PyTorch API [35] and users can obtain the code and sampled data from https://github.com/SongweiGe/scDGN.