1 Introduction

Domain adaptation aims to adapt classifiers trained on source domains to novel unlabeled target domains, where a domain shift, namely a difference in distribution statistics, is expected [8, 34]. Typically, the domain shift is considered within the context of “closed”, static domains, implicitly assuming datasets available in their entirety at adaptation time [8, 36]. However, in realistic applications data collection is neither static nor closed but “open”, giving rise to non-stationary domain shifts [11].

Consider e.g. social media feeds, or urban imagery taken from inside a car. These images often arrive in “bundles” with different distribution statistics, due to, for instance, being collected in different cities or under different weather conditions (see Fig. 1). If we were to consider these bundles as isolated domains, we would not be exploiting the available unlabeled data entirely. In addition, if the distribution changes gradually over time in a streaming-like fashion, being able to adapt over bundles sequentially may benefit real-time predictions on future incoming data bundles. In streaming data this distribution shift over time is called concept drift. The incoming stream is usually too large to be held in memory, so it is processed in data bundles that are later discarded.

Fig. 1. Distribution shift across stream batches in GTA5

As an adaptation method, we look into associative learning proposed by Haeusser et al. [15, 16], which uses association of embeddings in latent space and has been shown to work well for domain adaptation and semi-supervised learning. However, associative domain adaptation makes the implicit assumption that the class probability distributions of the source and the target domains are similar at adaptation time. This assumption cannot be guaranteed to hold when the target dataset is not well known in advance, such as in “open” datasets, where the class probability statistics may change dynamically, or in tasks where class statistics may vary widely across domains. An example of such a task is semantic segmentation. To this end, the associations between source and target embeddings need to be performed while taking into account the non-stationary changes of the class probability statistics.

This work makes three contributions. First, we argue that domain adaptation is important beyond static domain datasets, including continuously collected datasets whose statistics are non-stationary. For dynamic datasets, domain adaptation should be able to adapt to the evolving statistics. Second, starting from associative domain adaptation [15] we show that the similar class distribution assumption between domains hurts adaptation. We therefore reformulate the approach to make the adaptation loss invariant to the inevitable non-stationary changes in the class distribution statistics. Third, we present two applications of our proposed approach, one on adapting streaming image classification, where the streaming data distribution changes over time, and one on domain adaptation for semantic segmentation (see Fig. 2), where the source and target datasets have inherently different class statistics.

Fig. 2. Adaptation results for semantic segmentation on Cityscapes

2 Related Work

Domain Adaptation. A handful of domain adaptation methods revolve around discrepancy-based adaptation [23, 24, 31]; for instance, [22] use a multi-kernel maximum mean discrepancy (MMD) minimization approach. Other methods are data reconstruction-based and often use reconstruction, e.g. with autoencoders, as an auxiliary task to learn invariant features [5, 12, 13].

Another category is adversarial approaches. Adversarial discriminative methods use a classifier to discriminate between domains during training and ensure feature invariance for source and target [10, 35]. Adversarial generative methods use a generative adversarial network (GAN) [14] to learn a mapping between source and target images by interleaving the task loss, mapping generator and discriminator loss [3, 18].

Domain adaptation for semantic segmentation was recently pioneered by [19] with an adversarial discriminative approach. Similarly, [6] use discriminators for feature invariance, but for different parts of an image grid. [28] use a standard GAN approach to have a generator network learn the mapping while a discriminator network distinguishes between real and fake images. [38] split the original segmentation network into three output branches, where the first two generate pseudo-labels for the third branch. [39] adopt a curriculum learning approach, solving easy to difficult tasks to achieve adaptation.

Associative Learning. Introduced by [16], learning by association was initially applied to semi-supervised learning. [15] use associations between source and target to close the domain gap and achieve competitive results across different domain adaptation benchmarks for classification. The advantage of this method compared to discrepancy-based approaches is that it does not require a choice of kernel or the extra hyper-parameters that come with it.

Streaming Data Classification. Approaches that deal with streaming data are either passive approaches that use a single classifier or an ensemble [30, 37], or active ones where an extra decision is made on whether to update the classifier. Most often classification algorithms such as Decision Trees, Rule-Based and Nearest Neighbor are used, whereas adjustments in neural network architectures to account for streaming have been proposed [1]. [2] use a complex sampling and filtering mechanism for active training and a random forest based classifier. [33] use a micro-cluster nearest neighbor which makes use of statistical summaries for data streams. Not many works look into exploiting unsupervised data for improving data stream classifiers. [32] use semi-supervised feature learning to adjust k-nearest neighbor weights. To the best of our knowledge, we are the first to explore this direction for image classification with modern deep architectures.

Fig. 3. Associative domain adaptation for unequal class distributions. Crosses represent the source domain and circles represent the target. Arrows represent source to target probabilities. (a) Uniformly distributed visit loss. (b) Intuition of correcting wrong associations by balancing the visit loss according to class distributions. (c) Cluster estimates to approximate class distribution in target.

3 Method

3.1 Associative Domain Adaptation

We start from two datasets, source and target. The source dataset, \(D^{S}=\{x^S_i, y^S_i\}, i=1,..., N^S\), comprises \(N^S\) image samples with embeddings \(x^S_i\), annotated by one-hot vectors \(y^S_i=[y^S_{ic}], c=1, ..., C\), where \(y^S_{ic}\) equals 1 if the image \(x^S_i\) belongs to class c, and 0 otherwise. The target dataset, \(D^{T}=\{x^T_j\}, j=1,...,N^T\), comprises only image embeddings, which belong to the same set of classes, \(c=1,...,C\); however, no class annotations are available for retraining or fine-tuning. Between the source and target datasets there is a domain shift in the distribution of their respective embeddings, thus \(p(x^S)\ne p(x^T)\). The goal, therefore, is to adapt a classifier trained on the source dataset to work well for the target.

Associative domain adaptation [15] adapts by considering an additional adaptation loss during training on top of the standard task-specific loss, \(\mathcal {L} = \mathcal {L}_{task} + \mathcal {L}_{assoc}\). Specifically, the associative domain adaptation is decomposed into a walker and a visit loss,

$$\begin{aligned} \mathcal {L}_{assoc} = \mathcal {L}_{walker} + \beta \mathcal {L}_{visit}, \end{aligned}$$
(1)

where \(\beta \) is a weighting coefficient. Central to the associative domain adaptation is the affinity matrix \(A \in \mathbb {R}^{N_S \times N_T}\), which contains elements \(a_{ij}\) proportional to how likely the i-th embedding in the source domain, \(x^S_i\), is to be associated with the j-th embedding in the target domain, namely \(a_{ij} \propto p(x^T_j|x^S_i)\). Learning embeddings that yield an affinity matrix that minimizes the loss in Eq. (1) is the goal of associative domain adaptation.

The walker and the visit losses have complementary objectives. The objective of the walker loss is to encourage the source embeddings to lie close, after adaptation, to target embeddings of the same class. As no class labels are available in the target dataset, however, this objective is reformulated. Specifically, after a double transition from the source to the target and back to the source, the starting and finishing source class labels should minimize the cross-entropy loss with respect to a normalized equality matrix \(E=\{e_{ik}\}\), namely

$$\begin{aligned} \mathcal {L}_{walker} = \sum _{i, k} e_{ik} \log { \big [ p(x^S_k | x^T_j)\cdot p(x^T_j | x^S_i) \big ]} , \end{aligned}$$
(2)

where \(x^T_j\) is the closest embedding in the target set and \(e_{ik}= \frac{y^S_i \cdot y^S_k}{N_S}\).
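The round trip behind Eq. (2) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation. It assumes dot-product affinities, sums over all target embeddings instead of picking only the closest one, and adds the leading minus sign of a standard cross-entropy. The function names `softmax` and `walker_loss` are hypothetical.

```python
import numpy as np

def softmax(z, axis):
    # Subtract the max along the axis for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def walker_loss(xs, ys, xt):
    """Source -> target -> source cross-entropy (assumed round-trip form).

    xs: (N_S, D) source embeddings, ys: (N_S,) integer class labels,
    xt: (N_T, D) target embeddings.
    """
    a = xs @ xt.T                   # affinity matrix A, shape (N_S, N_T)
    p_st = softmax(a, axis=1)       # p(x^T_j | x^S_i), each row sums to 1
    p_ts = softmax(a.T, axis=1)     # p(x^S_k | x^T_j), each row sums to 1
    p_sts = p_st @ p_ts             # round-trip probabilities, (N_S, N_S)
    # e_ik = (y_i . y_k) / N_S: 1/N_S for same-class pairs, 0 otherwise.
    e = (ys[:, None] == ys[None, :]).astype(float) / len(ys)
    # Assumption: negative sign so that lower values mean better round trips.
    return -np.sum(e * np.log(p_sts + 1e-8))
```

With two well-separated classes, round trips that end in the starting class give a small loss, while inconsistent labels inflate it sharply.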

The walker loss alone, however, can lead to degenerate solutions, where the transition probabilities are learned to associate source embeddings only with a few relevant yet “easy” target embeddings. To mitigate this, the visit loss encourages all target embeddings to be visited equally. This is achieved by minimizing a cross-entropy objective

$$\begin{aligned} \mathcal {L}_{visit} = \sum _{j} v_j \log {p(x^T_j | x^S_i)}, \end{aligned}$$
(3)

where \(v_j=1/N_T\).
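As a rough illustration of Eq. (3), the sketch below computes a visit loss from a matrix of transition probabilities. It is not the authors' code. It assumes the visit probability of a target embedding is the average of \(p(x^T_j|x^S_i)\) over all source embeddings, and it includes the minus sign of a standard cross-entropy. The name `visit_loss` and the optional argument `v` are hypothetical.

```python
import numpy as np

def visit_loss(p_st, v=None):
    """Cross-entropy between a visit target v and the mean visit probability.

    p_st: (N_S, N_T) transition probabilities p(x^T_j | x^S_i), rows sum to 1.
    v:    (N_T,) desired visit distribution; uniform 1/N_T when omitted.
    """
    n_t = p_st.shape[1]
    if v is None:
        v = np.full(n_t, 1.0 / n_t)
    # Assumption: average over source embeddings to get one value per target.
    visit = p_st.mean(axis=0)
    return -np.sum(v * np.log(visit + 1e-8))
```

Passing a non-uniform `v` is what makes the same function reusable once per-sample weights are introduced.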

3.2 Associative Domain Adaptation for Unequal Class Distributions

Associative domain adaptation implicitly assumes that the source and target distributions are similar on batch level during adaptation. The reason is that for the visit loss to be minimized in Eq. (3) it is assumed that the ideal target is the average over the size of the target dataset, \(v_j=1/N_T\). [15] consider a smaller \(\beta \) for the visit loss, if the class distributions between the source and target datasets are unequal. However, this solution implicitly expects access to the adaptation set in order to tune \(\beta \). In addition, simply receiving a weaker signal from the visit loss does not exploit the full adaptation capacity and might enforce wrong associations, as we illustrate in Fig. 3(a).

As we want target embeddings to be visited by the same-class source embeddings, intuitively they should be visited at a rate proportional to the difference between the source and target class distributions, as shown in Fig. 3(b). We can formalize the intuition by adding a weighting coefficient in front of \(v_j\) and reformulating Eq. (3) as:

$$\begin{aligned} \mathcal {L}_{visit} = \sum _{j} \gamma _j v_j \log {p(x^T_j | x^S_i)}, \gamma _j=\frac{p(Y^S =y^T_j)}{p(Y^T=y^T_j)} \end{aligned}$$
(4)

namely weighted by the ratio of class probabilities at the source and target for the correct class of the target embedding. Clearly, we cannot directly compute the ratio \(p(Y^S =y^T_j)/p(Y^T=y^T_j)\), as we would need to know the true class of the target embedding \(y^T_j\). However, we propose a way to estimate it.

Although we have no control over the target dataset, we do have control over the source dataset, for which the labels are available. Thus, when constructing the mini-batch on which we perform the adaptation, we can first sample the source uniformly such that all class probabilities are equal in the sampled source batch, i.e. \(p(Y^S =y^T_j)=const\). Consequently, from a probabilistic perspective it is not important which particular class the j-th target embedding belongs to, alleviating the necessity to make a soft prediction for the class label of the j-th embedding.
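A class-balanced source batch of this kind could be drawn as in the sketch below. It is only an illustration under simple assumptions: labels are given as a flat list of class ids, and classes with too few examples are sampled with replacement. The function name `balanced_source_batch` and its parameters are hypothetical.

```python
import random
from collections import defaultdict

def balanced_source_batch(labels, per_class, rng=None):
    """Return sample indices with exactly `per_class` entries for every class."""
    rng = rng or random.Random(0)
    by_class = defaultdict(list)
    for idx, c in enumerate(labels):
        by_class[c].append(idx)
    batch = []
    for c in sorted(by_class):
        pool = by_class[c]
        if len(pool) >= per_class:
            batch.extend(rng.sample(pool, per_class))     # without replacement
        else:
            batch.extend(rng.choices(pool, k=per_class))  # too few: with replacement
    rng.shuffle(batch)
    return batch
```

Every class then has probability \(1/C\) in the batch, regardless of how skewed the full source set is.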

What remains to compute the weighting coefficient \(\gamma _j\) is computing the class probability \(p(Y^T =y^T_j)\) for the j-th embedding. It is logical to expect that same-class embeddings cluster together for a modern classifier to be able to discriminate between classes. We can retrieve the class cluster around an embedding sample in an unsupervised manner and compute the probability based on cluster size. We rely on unsupervised clustering to estimate class probabilities in the batch. The approximation holds true under the assumption that the clusters are well aligned to the means of the respective, optimal classifiers. In practice, we consider hierarchical agglomerative clustering, which experimentally appears to allow for good alignment between the obtained clusters and works well when clusters have very different sizes. We illustrate the process in Fig. 3(c).
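With a uniformly sampled source batch, \(p(Y^S=y^T_j)=1/C\), so \(\gamma_j = 1/(C \cdot p(Y^T=y^T_j))\), and the target probability is approximated by the relative size of the cluster containing \(x^T_j\). The sketch below illustrates this estimate. It is not the authors' implementation. It assumes average linkage, Euclidean distances, and as many clusters as classes, and it uses a naive cubic-time clustering suitable only for small batches. All function names are hypothetical.

```python
import numpy as np

def agglomerative_labels(x, n_clusters):
    """Naive average-linkage agglomerative clustering for small batches."""
    clusters = [[i] for i in range(len(x))]
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    while len(clusters) > n_clusters:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    labels = np.empty(len(x), dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels

def visit_weights(xt, n_classes):
    """Return gamma_j * v_j per target embedding, assuming p(Y^S = c) = 1/C."""
    labels = agglomerative_labels(xt, n_classes)
    sizes = np.bincount(labels, minlength=n_classes)
    p_target = sizes[labels] / len(xt)        # estimate of p(Y^T = y^T_j)
    gamma = (1.0 / n_classes) / p_target
    return gamma / len(xt)                    # multiply by v_j = 1/N_T
```

Under these assumptions each cluster receives a total visit mass of \(1/C\), split evenly among its members, so embeddings in small clusters are weighted more heavily than those in large ones.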

Fig. 4. The streaming setup uses a pre-trained model from a stationary supervised set. The model is then sequentially adapted to the incoming stream batches.

3.3 Dynamic Domain Shift in Streaming Data

Let us consider a pre-collected annotated set \(D^{S}=\{x^S_i, y^S_i\}, i=1,..., N^S\) with embeddings \(x^S_i\), with one-hot labels \(y^S_i\) and an incoming stream of image data that needs to be classified. At every time step \(\tau =1,...,K\), the stream is accumulated in a streaming data batch \(D^T_\tau \). Due to the concept drift, i.e. distribution shift over time, a classifier \(f_0(\theta )\) trained to minimize \(\mathcal {L}_{task}(\theta , D^S)\) will perform worse on the streaming batches. Being able to produce accurate predictions as soon as a stream batch comes in is crucial. Changing the models over time aims to account for the concept drift. A second problem to account for in streaming is the size of the incoming data. Usually only a small part of this data can be stored in memory. One way to deal with this is to have a mechanism in place that selects the data to be stored; another way is to be able to use and then discard all the data coming into the stream.

We simulate a streaming scenario where the stationary training set \(D^S\) is pre-collected and first used for off-line training of a predictive model \(f_0(\theta )\). Incoming stream data batches \(D^{T}_\tau \) are small compared to the stationary set \(D^{S}\), but the whole stream cannot be stored in memory, so at a time step \(\tau =k\) only a set of \(D^T_{k}, D^T_{k+1}, ...D^T_{k+w}\) is available, where w is a storage window size. A classifier \(f_{k-1}(\theta )\) trained on the stationary set and adapted to \(D^T_1, ... D^T_{k-1}\) sequentially is available. We adapt to \(D^T_k\) by minimizing the objective

$$\begin{aligned} \arg \min _{\theta } \mathcal {L}_{task}(f_{k-1}(\theta ), y^S) + \mathcal {L}_{walker}(\theta , D^S, D^T_k) + \beta \mathcal {L}_{visit}(\theta , D^S, D^T_k) \end{aligned}$$

The benefits of this approach are twofold. First, adapting to \(D^T_k\) improves prediction results on \(D^T_k\) itself in an unsupervised manner without extra annotation. Second, the predictions improve for \(D^T_{k+1}, D^T_{k+2}, ... \) and so on in a cascade fashion, since the distribution in incoming sets is more likely to be similar to that of nearby previous stream sets than to the stationary source, especially if we use a sliding window over incoming sets. For simplicity we take a window size of 1. An illustration is provided in Fig. 4. In our setting, we extract patches from the GTA5 dataset and do patch-wise classification in order to demonstrate the working of our setup with a simpler task. We expect a similar behavior for more complex tasks such as semantic segmentation and object detection.
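The sequential protocol with a window size of 1 can be summarized by the following sketch. It only illustrates the control flow. The callables `adapt` and `evaluate` are hypothetical placeholders for one round of minimizing the objective above and for scoring a model on a batch, and `adapt` is assumed to return a new model rather than modify its input.

```python
from collections import deque

def adapt_over_stream(model, source, stream, adapt, evaluate, lag=0):
    """Adapt to each stream batch in turn, then discard it.

    adapt(model, source, batch) -> new model   (hypothetical callable)
    evaluate(model, batch)      -> score       (hypothetical callable)
    lag: score the model adapted on batch k against batch k + lag.
    """
    scores = []
    recent = deque(maxlen=lag + 1)   # models adapted on the last lag+1 batches
    for batch in stream:
        model = adapt(model, source, batch)   # f_{k-1} -> f_k using D^S and D^T_k
        recent.append(model)
        if len(recent) == lag + 1:
            # recent[0] was adapted `lag` batches ago; test it on the new batch.
            scores.append(evaluate(recent[0], batch))
        # The batch is not kept after this iteration (window size 1).
    return model, scores
```

Only models are buffered for the lagged evaluation, never past stream batches, which matches the constraint that the stream cannot be stored.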

3.4 Dynamic Domain Shift in Semantic Segmentation

Having relaxed the distribution assumption, associative domain adaptation can be applied to tasks where source and target class distributions in a batch are not uniform or uniformity cannot be approximated, such as semantic segmentation. Consider a source dataset \(D^{S}=\{x^S_i, y^S_{i, H \times W}\}, i=1,..., N^S\), annotated at pixel level, where \(H \times W\) are the image dimensions. The target images \(D^{T}=\{x^T_j\}, j=1,...,N^T\) are available without annotations. Using modern segmentation architectures, we can consider embeddings extracted from a mid-network layer which contains downsampled data. Using a DeepLab-V2 [4] architecture, we extract embeddings \(x^S_{i'}\) at pixel level from the decoder layers before bilinear upsampling, which are 8 times downsampled in each spatial dimension. We downsample the label annotations accordingly and use \(y^S_{i', U \times V}\), where \(U=H/8\) and \(V=W/8\), together with the downsampled embeddings for adaptation.
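To make the pairing of downsampled labels and pixel embeddings concrete, here is a small NumPy sketch. The text does not say how the labels are downsampled, so nearest-neighbour subsampling is assumed here, since interpolating class ids would produce meaningless values. The function names and the `(U, V, D)` feature layout are hypothetical.

```python
import numpy as np

def downsample_labels(labels, factor=8):
    """Nearest-neighbour subsampling of an (H, W) integer label map."""
    # Take the pixel near the centre of each factor x factor cell.
    return labels[factor // 2::factor, factor // 2::factor]

def pixel_embeddings(feature_map, labels, factor=8):
    """Flatten a (U, V, D) feature map and its labels into matching rows."""
    small = downsample_labels(labels, factor)
    assert small.shape == feature_map.shape[:2]
    return feature_map.reshape(-1, feature_map.shape[-1]), small.reshape(-1)
```

For a \(256 \times 512\) label map this yields \(32 \times 64 = 2048\) labelled embeddings per image.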

An important consideration when adapting for dense prediction is the choice of affinity matrix \(A \in \mathbb {R}^{N_S \times N_T}\) between embeddings, where \(a_{ij} \propto p(x^T_j|x^S_i)\). In [15], \(p(x^T_j|x^S_i)\) is computed as softmax over rows of A, i.e.  \(p(x^T_j|x^S_i) = \exp (a_{ij}) / \sum _{j'} \exp (a_{ij'})\), where \(a_{ij}= x^S_i \cdot x^T_j\) is the dot product between embedding vectors. The unnormalized dot product as an affinity is unbounded and can cause very small probability values for the softmax, which may lead to exploding gradients. We mitigate this by using an affinity measure based on Euclidean distance. In addition, we observe that the dimensionality of pixel embeddings for semantic segmentation is crucial for convergence. If too large, the gradients propagated are noisy and adaptation not very effective. However, dimensionality has to be large enough to allow for similar embeddings to group together but still preserve discriminable structures in latent space. For this, we add an embedding layer in the decoder where dimensionality can be adjusted for the task.
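One possible distance-based affinity is sketched below. The exact form used in the experiments is not given in the text, so the negative squared Euclidean distance is an assumption made purely for illustration, as are the function names. Unlike the raw dot product, this affinity is bounded above by zero, and subtracting the row maximum keeps the softmax numerically stable.

```python
import numpy as np

def euclidean_affinity(xs, xt):
    """a_ij = -||x^S_i - x^T_j||^2, so closer pairs get a larger affinity."""
    sq = (xs ** 2).sum(1)[:, None] - 2.0 * xs @ xt.T + (xt ** 2).sum(1)[None, :]
    return -np.maximum(sq, 0.0)       # clip tiny negatives caused by round-off

def transition_probs(a):
    """Row-wise softmax p(x^T_j | x^S_i), shifted by the row max for stability."""
    z = a - a.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)
```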

4 Experiments and Results

We validate the performance of the proposed domain adaptation method under different settings for domain class distribution divergence. First, we show the effect of increased class distribution divergence on associative domain adaptation [15] and how we can recover accuracy drops with our formulation. Second, we evaluate on a visual stream classification setting, where data and class distributions change over time. Third, we further validate the proposed method on domain adaptation for semantic segmentation. The code, models and datasets will all become available upon publication.

4.1 Classification Under Different Class Distributions Between Domains

We report our results on several image classification adaptation benchmarks. For digit classification we adapt on MNIST [21] \(\rightarrow \) MNISTM [10], SVHN (Street View House Numbers) [26] \(\rightarrow \) MNIST and Synthetic Digits [10] \(\rightarrow \) SVHN [26]. Next, we adapt for street sign classification from the Synthetic Signs dataset [25] to the German Traffic Sign Recognition Benchmark [29]. As a last benchmark, CIFAR-10 [20] \(\rightarrow \) STL-10 [7] adaptation is performed. Of the 10 classes present in STL-10 and CIFAR-10, 9 overlap, so these can be used for domain adaptation.

We report experiments after changing KL-divergence between the source and target class distributions, to quantify the effect of unequal class distributions for domain adaptation. In Table 1 we report the accuracies over the datasets when class distribution divergence increases for associative domain adaptation, as well as the proposed method.
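For reference, the divergence between two batch-level class distributions can be computed as in the sketch below. The direction of the divergence is an assumption here (source relative to target), and the function name `kl_divergence` is hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats between two discrete class distributions."""
    # Terms with p_i = 0 contribute nothing; q_i is floored to avoid log(0).
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)
```

A uniform two-class source against a 90/10 target gives a divergence of about 0.51 nats, while identical distributions give zero.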

Table 1. Adaptation accuracy as KL-divergence of source to target class distributions in a batch increases. The oracle version uses the true target class probabilities and serves as an upper bound.

First, as expected, larger KL-divergence between source and target usually leads to worse accuracy for associative domain adaptation. Second, the proposed method improves recognition after domain adaptation, especially for larger class distribution divergence, and especially for tasks where the classifiers are not already near maximal adaptation capacity.

To further derive insights, we also include results with an oracle-weighted visit loss that uses the target class distributions (theoretical upper bound). Although our off-the-shelf agglomerative clustering does not always approximate the batch statistics perfectly, it does come considerably close to the oracle-weighted score and almost always outperforms the unweighted approach. In addition, using oracle test statistics the proposed method often comes close to the recognition accuracies of classifiers trained directly on the target domain, indicating that our theoretical reasoning is correct. We conclude that when we expect a dynamical domain shift, where class distributions between the source and target change, our approach is more robust for domain alignment.

4.2 Streaming Data Classification

Next, we evaluate the method on a streaming data scenario, where the class distributions are expected to be different between source and target. To simulate a streaming data scenario, we note that the popular synthetically generated and finely annotated GTA5 dataset [27] is in fact ordered sequentially. Video-like fragments can be observed throughout the dataset, and a shift in distribution over time can also be observed, as shown in Fig. 1. We therefore extract patches from GTA5 frame sequences and adapt to a patch-wise classification task, where the label for each patch is equivalent to the dense label for the middle pixel. We use \(65 \times 65\) patches cropped from a \(256 \times 512\) downsampled version of the original GTA5 dataset.
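The patch sampling step might look like the following sketch. It is an illustration only: random centre positions are an assumption, since the text does not say how patch locations are chosen, and the function name and parameters are hypothetical.

```python
import numpy as np

def sample_patches(image, label_map, n_patches, size=65, seed=0):
    """Crop random size x size patches, each labelled by its centre pixel."""
    rng = np.random.default_rng(seed)
    h, w = label_map.shape
    half = size // 2
    patches, labels = [], []
    for _ in range(n_patches):
        cy = rng.integers(half, h - half)   # centre row; crop stays inside the image
        cx = rng.integers(half, w - half)   # centre column
        patches.append(image[cy - half:cy + half + 1, cx - half:cx + half + 1])
        labels.append(label_map[cy, cx])    # dense label of the middle pixel
    return np.stack(patches), np.array(labels)
```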

We consider a streaming data scenario where a small set of stationary labeled data is pre-collected and available for training. For the stationary data, we sample patches from the first 5,000 images in the GTA5 dataset. About 32,000 patches of \(65 \times 65\) dimensions are sampled. For the incoming stream we sample patches from bundles of 1,000 images each, collected sequentially. 6,000 patches are sampled from every bundle of images and accumulated in a streaming batch. We experiment with adapting six of these sets following the stationary training set.

Several observations follow from the results in Table 2. First, there is indeed a dynamical domain shift when considering visual streams instead of static datasets. When considering the classifiers trained only on the source, there is considerable fluctuation on the recognition accuracy over time. Note that this is not always harmful, e.g. for streaming batches 5 and 6 accuracy improves, presumably because the shift between target and source is smaller.

Table 2. Streaming classification accuracy per adaptation round. Cells marked “-” indicate the batch hasn’t yet entered the stream.

Second, the proposed streaming adaptation method yields considerable and constant accuracy improvements over the source-only scores, no matter the source-only recognition accuracy. Also, the proposed method yields modest but consistent improvements over standard associative domain adaptation [15].

Third, as expected, best adaptation is achieved when adapting and testing on the same stream batch (lag = 0). However, adapting with some lag allows for accurate adaptation as well. We conclude that for visual streams, where we cannot store the data and we cannot always immediately adapt, dynamical domain adaptation is valuable.

4.3 Semantic Segmentation

Last, we validate the proposed method on the task of domain adaptation for semantic segmentation of urban street scenes. This is an application where source and target class statistics cannot be expected to align, especially on batch level where adaptation happens.

We adapt on the GTA5 \(\rightarrow \) Cityscapes adaptation benchmark, which is important to domain adaptation as adapting from synthetic to real data provides potential for exploiting very easily rendered synthetic sets. GTA5 contains 24,966 images with resolution \(1914 \times 1052\), of which 12,500 are used for training and around 6,800 for validation. Cityscapes contains 5,000 pixel-level annotated images of \(2048 \times 1024\) resolution, of which 2,975 images for the training set and 500 images for validation are available. We run our experiments with images from both datasets downsampled to \(512 \times 256\) size.

As a base segmentation network we use DeepLab-V2 [4] with a ResNet-50 [17] backbone. We extend the original DeepLab-V2 architecture with a D-dimensional embedding layer that can be adjusted for experiment purposes and report results with \(D=64\). The embedding layer is placed before the bilinear upsampling part of the decoder, yielding embeddings that are 8 times downsampled in each spatial dimension. In this way we can not only adapt to more compressed information on pixel level embeddings, but also fit embedding metrics in reasonable memory even for large datasets.

We use \(\beta =0.5\) for the visit loss, adjusted for the magnitude of the loss values. We use the respective training sets of GTA5 and Cityscapes as the domains for training, test on the Cityscapes val set, and report the results in Table 3.

Table 3. GTA5 to Cityscapes domain adaptation. The last two rows show results on adapting with the unweighted version of the method and the distribution independent one.

First, we train for 30K iterations on source only for GTA5, using pre-trained ImageNet [9] weights for the ResNet-50 encoder part of the network. We observe that the proposed distribution independent approach consistently improves standard associative domain adaptation, both in terms of mIoU and pixel accuracy.
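For readers unfamiliar with the two metrics, the sketch below computes mIoU and pixel accuracy from a confusion matrix. It is a generic illustration rather than the evaluation script used for Table 3. In particular, it does not handle ignore or void labels, and the function name is hypothetical.

```python
import numpy as np

def segmentation_scores(pred, target, n_classes):
    """Return (mIoU, pixel accuracy) for integer label maps of equal shape."""
    pred, target = pred.ravel(), target.ravel()
    conf = np.bincount(target * n_classes + pred, minlength=n_classes ** 2)
    conf = conf.reshape(n_classes, n_classes)   # rows: ground truth, cols: prediction
    tp = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - tp
    present = union > 0                          # skip classes absent from both maps
    miou = (tp[present] / union[present]).mean()
    return miou, tp.sum() / conf.sum()
```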

Further, the proposed method improves over standard associative domain adaptation on 15 out of the 19 categories. Standard associative domain adaptation is still better for large classes with near constant class frequency (e.g. vegetation, terrain, sky), since adaptation over these would overrule smaller classes in a batch. Interestingly, the proposed method seems to improve significantly (6–10%) on mid-size classes, such as car, bus and person, where indeed we expect larger class frequency fluctuations. We conclude that our approach is promising for domain adaptation of complex dense prediction tasks such as semantic segmentation and, potentially, when integrated with the streaming techniques above, for video semantic segmentation.

5 Conclusion

We have presented a robust and distribution independent associative learning method for domain adaptation. Our formulation accounts for realistic scenarios where source and target data distribution in a batch cannot be approximated to be equal. A novel setup for dynamic domain adaptation that adapts over unlabeled data in order to improve classifier prediction over time for streaming data has been proposed. We have shown that we can exploit unsupervised data to achieve improvements over several streaming batches without additionally annotated samples. Using our associative domain adaptation formulation and architecture considerations we achieve competitive results for semantic segmentation.

Having considered a dynamic time-shifting distribution setup and shown dense prediction adaptation results, we lay the grounds for a framework that can potentially work well with dense prediction tasks for streaming video data such as video segmentation.