
1 Introduction

Deep convolutional neural networks have brought impressive advances to the state of the art across a multitude of tasks in computer vision [1,2,3]. At the same time, these significant leaps require a large amount of labeled data. For some pixel-level tasks, e.g., semantic segmentation, obtaining fine-grained labels is expensive and time-consuming. The authors of [4] report that manually labeling a single image takes more than 90 min. Recent advances in computer graphics [5] offer an alternative solution to the data issue. The authors of [5] automatically capture both images and fine-grained labels from the game GTAV, several orders of magnitude faster than human annotation.

However, models trained on synthetic data fail to perform well on real-world images. The main reason is the shift between the training and test domains [6]. In the presence of domain shift, a model trained on synthetic data tends to be biased towards the source domain (synthetic images), which makes it unable to generalize to the target domain (real images).

Traditional approaches to domain adaptation mainly focus on the image classification task and follow two lines: (1) minimizing the distance between the source and target distributions [7,8,9]; (2) explicitly ensuring that the two distributions are close to each other through adversarial learning [10, 11]. Existing works [12, 13] on domain adaptation for image classification use an idea similar to our proposed loss, namely the gradient reversal layer, which multiplies the gradient by a negative scalar during backpropagation. However, since there are large category discrepancies between the pixels of one image, uniformly reversing the gradients of all pixels with the same scalar is not suitable for the structured prediction required in segmentation. This drawback prevents the gradient reversal layer from generalizing to segmentation adaptation.

Fig. 1. The tendency of mIoU on the source domain and the target domain. The curves indicate the trends and the points denote the actual mIoU. We also display samples from the source domain (GTAV) and the target domain (Cityscapes)

Semantic segmentation provides a pixel-level label for the input image, which carries denser and more structured information than image classification and thus makes its domain adaptation difficult. Hence, domain adaptation techniques for classification, which focus on sparse high-level features, do not translate well to segmentation adaptation [14]. Few works have explored domain adaptation for segmentation [14,15,16]. Orthogonal to works that manipulate the data statistics [15] or apply curriculum learning [14], we propose the novel Conservative Loss, which achieves adaptation without introducing extra computational overhead.

We observe that, as training proceeds, the performance on the target domain first rises and then falls. Fig. 1 shows the trends of mIoU for synthetic (GTAV data [5]) to real (Cityscapes data [4]) segmentation adaptation. Because of the domain shift, the performance on the source domain and the performance on the target domain do not peak at the same time. Since no ground truth is available for the target domain during training, the saddle point of the target domain has to be located on the source domain. It is noteworthy that the saddle point for the target domain leans towards the best score on the source domain but does not reach it, which reflects a balance between discriminativeness and domain invariance. This phenomenon is consistent with many domain adaptation theories [17,18,19]. Therefore, we focus on learning representations with the following two characteristics: (i) discriminative for semantic segmentation on the source domain (corresponding to the ‘first rises’) and (ii) invariant to the change of domains.

In this paper, this is achieved by training with the Conservative Loss in an adversarial framework. The Conservative Loss is extremely simple. It has two attributes that correspond to the properties of the desired representations. First, when the probability of the ground truth label on the source domain is low, the Conservative Loss forces the network to learn more discriminative features via gradient descent, which corresponds to the first property, discriminativeness. Second, when the probability of the ground truth label is very high, our loss penalizes this case by taking a negative value, which prevents the model from biasing further towards the source domain training data and thereby increases its generalization capability. This corresponds to the second property, domain invariance. Our loss function can be seen as seeking the parameters that deliver a saddle point of those two objectives. Furthermore, a generative adversarial network (GAN) [20] is introduced into our model. Unlike works [10, 15] that apply a feature-level discriminator, we use the GAN to further supplement the domain alignment by making the reconstructed images indistinguishable to the discriminator.

We conduct extensive experiments on synthetic-to-real segmentation adaptation. The proposed method considerably improves over the previous state of the art and achieves a 9.3-point mIoU gain on the Synthia [21] to Cityscapes [4] experiment without introducing any extra computational overhead during evaluation. Ablation studies verify the contribution of each component to our performance and give more insight into the properties of the Conservative Loss. Further discussion and visualization demonstrate that the Conservative Loss is flexible rather than limited to a fixed instantiation.

2 Related Work

Semantic Segmentation. Semantic segmentation, the task of assigning an object label to each pixel of an image, is a highly active field. With the surge of deep segmentation models [3], most recent top-performing methods are built on CNNs [1, 22, 23].

A huge amount of human effort is required to annotate fine-grained semantic segmentation ground truth. According to [5], it took about 60 min to manually segment each image. In contrast, collecting data from video games such as GTAV [5] is much faster and cheaper than human annotation. For example, [5] extracted 24,966 GTAV images with annotations within 49 hrs by using a GPU parallel method. However, it is hard to apply a model trained on synthetic images to real-world images because of their discrepant data distributions.

Domain Adaptation. Many machine learning methods rely on the assumption that the training and test data follow the same distribution. However, discrepancies often exist between them [17, 19], which lead to a significant performance drop on the test data. Domain adaptation aims to alleviate the impact of the discrepancy between training and test data.

Domain Adaptation for Image Classification. Existing works on domain adaptation mostly focus on image classification problem. Conventional methods include Maximum Mean Discrepancy (MMD) [7,8,9], geodesic flow kernel [8], sub-space alignment [24], asymmetric metric learning [25], etc. Recently, domain adaptation approaches aim to improve the adaptability of deep neural networks [7, 13, 26,27,28,29,30,31].

Domain Adaptation for Semantic Segmentation. Much less attention has been given to domain adaptation for the semantic segmentation task. The pioneering work on this task is [15], which combines global and local alignment methods with domain adversarial training. Another work [14] applies curriculum learning to solve domain adaptation from easy to hard. The authors of [16] propose an unsupervised method to adapt road scene segmenters across different cities. In [32], output space adaptation is performed at the feature level by an adversarial module. Unlike these works, which constrain the distribution [15] or the output of the network [32], we propose the Conservative Loss to naturally seek discriminative and domain-invariant representations.

Adversarial Learning. Recently, the Generative Adversarial Network (GAN) [20] has attracted great attention. Some works extend this framework to domain adaptation. CoGAN [11] achieves domain adaptation by generating cross-domain instances. Domain adversarial neural networks [12] use adversarial training to suppress domain biases. The authors of [10] incorporate an adversarial discriminative setting to help mitigate performance degradation. In our work, we also incorporate a GAN into our model, whose discriminator drives the source image towards the target one to promote domain alignment.

3 Methodology

As presented above, the key to unsupervised domain adaptation is learning discriminative and domain-invariant representations. The Conservative Loss is proposed to penalize the extreme cases, and its goal is to balance the discriminative and the domain-invariant properties of the representations. Furthermore, we introduce generative adversarial networks to align the source and target embeddings. Below, we first describe the framework of our model and its network blocks. Then, the Conservative Loss and its background are presented in detail. Finally, the alternating optimization is described.

Fig. 2. The pipeline of our framework. E denotes the encoder, G denotes the generator, and D is the discriminator. S is the pixel-wise classifier for semantic segmentation. The red color represents the network blocks for the source domain, and the blue for the target domain. We also display the Conservative Loss and its backpropagation, with separate markers for gradient ascent and gradient descent (Color figure online)

3.1 Framework Overview

Our framework is illustrated in Fig. 2. In our setting, there are two domains: source domain (image and label) and target domain (image only). Our framework aims to achieve good performance on the target domain by applying the model trained on the source domain.

Our model consists of two major parts, i.e., GAN and Segmentation part. The GAN aims to align the source and target embedding. More specifically, the generator and discriminator are playing a minimax game [20], in which the generator takes source embedding as input and generates the target-like image to fool the discriminator, while the discriminator tries to classify the reconstructed image [10, 11]. The segmentation part can be seen as a regular segmentation model. For each part, the detailed components are shown in the following:

  • The encoder (E), a fully convolutional network, computes the feature embedding of a source or target image. The generator (G) reconstructs the image from the embedding. The discriminator (D) classifies the reconstructed images as real or fake. S is the pixel-wise classifier.

  • The GAN consists of encoder, generator and discriminator.

  • The segmentation part consists of the encoder and the pixel-level classifier. Note that the encoder is shared by both the GAN and the Segmentation part.

The detailed architecture of generators and discriminators is described in the supplementary material because of the limited page space.

3.2 Background

In this section, we briefly introduce the theory of domain adaptation and present its relation to our proposed loss.

Many theoretical analyses of domain adaptation [17,18,19] offer an upper bound on the expected risk of the target domain, which depends on the source domain error (test-time) and the divergence between the two domains. Formally,

$$\begin{aligned} \epsilon _{\mathcal {T}} \le \epsilon _{\mathcal {S}} + \frac{1}{2}d(\mathcal {S}, \mathcal {T}) + \mathcal {C}, \end{aligned}$$
(1)

where \(\mathcal {S}\) and \(\mathcal {T}\) denote the source domain and target domain, respectively. \(\epsilon \) is the expected risk. d is the domain divergence, which has different notions, for example \(\mathcal {H}\)-divergence [19]. \(\mathcal {C}\) is a constant term.

It can be observed that the two terms \(\epsilon _{\mathcal {S}}\) and \(d(\mathcal {S}, \mathcal {T})\) closely relate to the properties of the desired representations. The first term \(\epsilon _{\mathcal {S}}\) indicates that the model should produce discriminative representations to obtain a smaller expected risk on the source domain, which corresponds to the first property, discriminativeness. The second term \(d(\mathcal {S}, \mathcal {T})\) measures the discrepancy between the two distributions: the more similar the representations of both domains are, the smaller it is. This corresponds to the second property, domain invariance. More theoretical analyses are given in the supplementary material.

Fig. 3. The proposed Conservative Loss with different a. It can be observed that the Conservative Loss keeps low values in the middle level and punishes the extremely good or bad cases

3.3 Conservative Loss

As explained above, the desired representations should be discriminative for the main task on the source domain and should generalize well rather than overfit. We thus propose the Conservative Loss for semantic segmentation on the source domain, which has the following two properties:

  • When the probability of the ground truth class is low, the loss function gives a positive value, which enables the network to learn more discriminative features by gradient descent.

  • When the probability is high, the loss function gives a negative value, which makes the network avoid bias towards the source domain via gradient ascent and thus generalize better.

The Conservative Loss is formulated as:

$$\begin{aligned} \text {CL}(p_t) = (1+\log _{a}(p_t))^2 * \log _a({-\log _{a}(p_t)}), \end{aligned}$$
(2)

where \(p_t\) is the predicted probability of the ground truth class. a is the base of the logarithm, which also determines the intersection point with the x-axis, namely \(\frac{1}{a}\). The Conservative Loss is visualized for several values of \(a\in \{2, \mathrm {e}, 3, 4\}\) in Fig. 3, in which \(\mathrm {e}\) is Euler’s number and \(\mathrm {e}\approx 2.718\). Specifically, \((1+\log _{a}(p_t))^2\) acts as a modulating factor, which delivers large values when \(p_t\) is very low or very high. \(\log _a(-\log _a(p_t))\) is designed as the switch of the gradient direction: it is negative when \(p_t>\frac{1}{a}\) and positive when \(p_t<\frac{1}{a}\).
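A minimal Python sketch of Eq. 2 may help make the roles of the two factors concrete. It assumes a single scalar probability rather than a per-pixel map, and the function name `conservative_loss` is illustrative rather than taken from any released code.

```python
import math


def conservative_loss(p_t: float, a: float = math.e) -> float:
    """Evaluate Eq. 2 for one ground-truth probability p_t in (0, 1)."""
    if not 0.0 < p_t < 1.0:
        raise ValueError("p_t must lie strictly between 0 and 1")
    log_p = math.log(p_t, a)          # log_a(p_t), negative on (0, 1)
    modulating = (1.0 + log_p) ** 2   # vanishes at p_t = 1/a
    switch = math.log(-log_p, a)      # > 0 below 1/a, < 0 above 1/a
    return modulating * switch


if __name__ == "__main__":
    for a in (2.0, math.e, 3.0, 4.0):
        low = conservative_loss(0.5 / a, a)             # a point below 1/a
        high = conservative_loss(0.5 * (1 + 1 / a), a)  # a point above 1/a
        print(f"a={a:.3f}  CL(below 1/a)={low:+.3f}  CL(above 1/a)={high:+.3f}")
```

For every base in the loop, the value below \(\frac{1}{a}\) comes out positive and the value above it comes out negative, which is the sign switch described in the text.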

In the following, we state two lemmas to analyze the appealing properties of our Conservative Loss.

Lemma 1

The objective function of domain adaptation system contains a saddle point, which relates to the zero point of Conservative Loss.

As the pipeline in Fig. 2 shows, the full objective consists of two parts: the loss \(\mathcal {L}_{seg, p_t}^s\) for Segmentation and the loss \(\mathcal {L}_{GAN}\) for GAN. The sign of \(\mathcal {L}_{seg, p_t}^s\) depends dynamically on \(p_t\). When \(p_t\) is very high, the negative value leads to gradient ascent, which escapes the bias towards the source domain. Otherwise, the positive value makes the features discriminative. It can be seen that our loss balances the two objectives (discriminativeness and domain invariance) that shape the representations during learning, and its zero point acts as the saddle point. More details are given in the supplementary material.

Lemma 2

Our loss encourages moderate examples over a large range, which makes the overall optimization more stable.

From the form of the loss, it can be observed that the loss focuses on the hard negatives and positives and tends to give low values for probabilities in the middle range. For instance, with \(a = \mathrm {e}\), the loss values at \(p_t = 0.9\) and \(p_t = 0.1\) are \(-1.8\) and 1.4, respectively, while the loss values at \(p_t = 0.5\) and \(p_t = 0.6\) are \(-0.03\) and \(-0.16\). In this setting, the loss extends the range in which an example receives a low loss, which keeps the optimization stable even when gradient descent and ascent alternate frequently due to the joint optimization of \(\mathcal {L}_{seg, p_t}^s\) and \(\mathcal {L}_{GAN}\).
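As a worked check of the two extreme values quoted above, Eq. 2 with \(a = \mathrm {e}\) can be evaluated by hand (intermediate values rounded to four decimals):

$$\begin{aligned} \text {CL}(0.9)&= (1+\ln 0.9)^2 \cdot \ln (-\ln 0.9) \approx (0.8946)^2 \cdot (-2.2504) \approx -1.80, \\ \text {CL}(0.1)&= (1+\ln 0.1)^2 \cdot \ln (-\ln 0.1) \approx (-1.3026)^2 \cdot 0.8340 \approx 1.42. \end{aligned}$$

Both factors are far from zero at these points, whereas near \(p_t = \frac{1}{\mathrm {e}}\) the squared factor and the logarithmic factor vanish together, which is why the loss stays small over the middle range.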

In practice we use a \(\lambda \)-balanced variant of the Conservative Loss:

$$\begin{aligned} \text {CL}(p_t) = \lambda (1+\log _{a}(p_t))^2 * \log _a({-\log _{a}(p_t)}). \end{aligned}$$
(3)

As our experiments will show, different balancing factors \(\lambda \) yield slightly different performance. While our main experiments use the Conservative Loss defined above, its exact form is not crucial. In Sect. 4.5 we offer other forms of our loss which also maintain the two properties, and experimental results demonstrate that they can also be effective.

3.4 Model Objective

Our full objective alternately updates the three network blocks, i.e., discriminators (D), generators (G) and encoder (E). Note that S is a pixel-level classifier which has no learnable parameters in our model. Hence, the objective contains three terms: \(\mathcal {L}_{D}\), \(\mathcal {L}_G\) and \(\mathcal {L}_E\). We now explain the losses used in our method and describe the alternating optimization scheme.

Adversarial Loss. Following GAN [20], we apply adversarial losses derived from the discriminator to all three blocks. We denote them by \(\mathcal {L}_{\text {GAN,D}}, \mathcal {L}_{\text {GAN,G}}\) and \(\mathcal {L}_{\text {GAN,E}}\). Each adversarial loss consists of two parts, i.e., \(\mathcal {L}_{GAN}^s\) for the source image and \(\mathcal {L}_{GAN}^t\) for the target image. Thus the adversarial loss is \(\mathcal {L}_{GAN} = \mathcal {L}_{GAN}^s + \mathcal {L}_{GAN}^t\). Note that for the encoder, the adversarial loss performs a cross-domain update (i.e., classifying the image as real or fake from source domain to target domain and vice versa), which forces the network to generate similar embeddings for the two domains.

Reconstructed Loss. The generator performs the image reconstruction. We use L1 distance as \(\mathcal {L}_{rec}\) because L1 encourages less blurring.

Segmentation Loss. As introduced in Sect. 3.3, the Conservative Loss is applied to semantic segmentation in the domain adaptation setting.

During training, we iteratively optimize all three learnable parts (Encoder, Generator and Discriminator). During inference, only the encoder and the segmentation classifier are used to produce the results on the target domain. The alternating update scheme is as follows:

  1. Update discriminators: the overall loss is \(\mathcal {L}_D = \mathcal {L}_{GAN,D}\).

  2. Update generators: the loss involves the adversarial loss and the reconstructed loss. The overall loss is \(\mathcal {L}_{G} = \mathcal {L}_{GAN, G} + \mathcal {L}_{rec}\).

  3. Update encoder: since the encoder works in both components, i.e., GAN and Segmentation, the overall loss combines the adversarial loss and the segmentation loss on the source domain: \(\mathcal {L}_{E} = \mathcal {L}_{GAN, E} + \mathcal {L}_{seg}^s\).
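The three-step schedule above can be sketched as a framework-agnostic skeleton. Everything in it is an assumption made for illustration: the function `alternating_step`, the dictionary keys, and the `apply_update` callback are hypothetical names, and the loss terms are passed as callables so that each one is evaluated only when its block is about to be updated. A real implementation would compute these terms with a deep learning framework and call the matching optimizer.

```python
from typing import Callable, Dict, List, Tuple

LossFn = Callable[[], float]


def alternating_step(
    terms: Dict[str, LossFn],
    apply_update: Callable[[str, float], None],
) -> List[Tuple[str, float]]:
    """Run one D -> G -> E round and return the (block, loss) pairs in order."""
    history: List[Tuple[str, float]] = []

    # (1) Discriminators: L_D = L_GAN,D
    loss_d = terms["gan_d"]()
    apply_update("D", loss_d)
    history.append(("D", loss_d))

    # (2) Generators: L_G = L_GAN,G + L_rec
    loss_g = terms["gan_g"]() + terms["rec"]()
    apply_update("G", loss_g)
    history.append(("G", loss_g))

    # (3) Encoder: L_E = L_GAN,E + L_seg^s (segmentation loss on source data)
    loss_e = terms["gan_e"]() + terms["seg_s"]()
    apply_update("E", loss_e)
    history.append(("E", loss_e))
    return history


if __name__ == "__main__":
    # Placeholder constants stand in for real loss computations.
    dummy = {"gan_d": lambda: 0.75, "gan_g": lambda: 0.5, "rec": lambda: 0.25,
             "gan_e": lambda: 0.125, "seg_s": lambda: 1.0}
    for block, loss in alternating_step(dummy, lambda block, loss: None):
        print(block, loss)
```

Only the encoder and the classifier S would be kept at inference time, so nothing in this loop adds cost to evaluation.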

4 Experiments

4.1 Dataset

Following previous works [14, 15], we use the GTAV [5] or Synthia [21] dataset as the source domain with pixel-level labels, and the Cityscapes [4] dataset as the target domain. We briefly introduce the datasets as follows:

GTAV has 24,966 urban scene images rendered by the gaming engine GTAV. The semantic categories are compatible with the Cityscapes dataset. We take the whole GTAV dataset with labels as the source domain data.

Synthia is a large dataset which contains different video sequences rendered from a virtual city. We take SYNTHIA-RAND-CITYSCAPES [21] as the source domain data, which provides 9,400 images from all the sequences with Cityscapes-compatible annotations. Following existing methods [14], we take 16 common object categories for the evaluation.

Cityscapes is a real-world image dataset focused on the urban scene, which consists of 2,975 images in training set and 500 images for validation. The resolution of images is \(2048 \times 1024\) and 19 semantic categories are provided with pixel-level labels. We take the unlabeled training set as the target domain data. The adaptation results are reported on the validation set.

4.2 Training Setup

In our experiments, we use FCN8s [3] as the semantic segmentation model. The backbone is VGG16 [2], pretrained on the ImageNet dataset [33]. We apply PatchGAN [34] as the discriminator, which tries to classify whether overlapping image patches are real or fake. Similar to EBGAN [35], we add Gaussian noise to the generator. During training, Adam [36] optimization is applied with \(\beta _1 = 0.9\) and \(\beta _2 =0.999\). For the Conservative Loss, we set \(a = \mathrm {e}\) and the balancing weight \(\lambda = 5\); the ablation study examines these choices in more detail. Due to the GPU memory limitation, the images used in our experiments are resized and cropped to \(1024 \times 512\) and the batch size is 1. More experimental settings are given in the supplementary material.

Warm Start. In our experiments, two training strategies are employed: cold start and warm start. In the cold start, the whole model is trained with the Conservative Loss from scratch. In the warm start, the model is first trained with the cross entropy loss and then with our Conservative Loss. Many works [37,38,39] demonstrate that a warm start provides more stable training than a cold start. As the ablation study will show, the warm start performs better than the cold start. In the following domain adaptation experiments, the model is trained with the warm start strategy for fairness.

4.3 Results

In this section, we provide a quantitative evaluation by performing two adaptation experiments, i.e., from GTAV to Cityscapes and from Synthia to Cityscapes. We compare our method with several existing models, including FCNWild [15], CDA [14] and [32]. FCNWild [15] uses the dilated network [40] as the backbone, and the base model of [14] is FCN8s-VGG19 [3]. Tsai et al. [32] adopt adversarial learning in the output space to perform feature adaptation. The detailed results for each category are available in the supplementary material.

Table 1. Results of domain adaptation from GTAV \(\rightarrow \) Cityscapes. The bold values denote the best scores in the column.
Table 2. Results of domain adaptation from Synthia \(\rightarrow \) Cityscapes.

GTAV \(\rightarrow \) Cityscapes. For fairness, the result is evaluated over the 19 common classes. As shown in Table 1, our proposed method achieves the best performance (mIoU = 38.1), which is 9.2 points higher than [14] and 11 points higher than [15]. Due to the different experimental settings and backbone network (the baseline method [14] also mentions this difference), our own baseline performance is higher than that of other methods. However, the highlight is the performance gain: the proposed method yields an improvement of 8.1 points, which is higher than the 6.0 of [15] and the 6.6 of [14].

Synthia \(\rightarrow \) Cityscapes. We report the mIoU results in Table 2. Note that [32] reported results on Synthia [21] to Cityscapes adaptation with only 13 object categories (excluding wall, fence and pole). We also report this result as mIoU-2. Our proposed model achieves an mIoU of 34.2 and, more importantly, obtains a performance gain of 9.3 points, which is higher than the gains of [14] (7.0) and [15] (2.8). Compared with [32] on 13 categories, our method also achieves better performance. In particular, our model does not use any additional scene parsing data beyond the source domain and target domain data, while [14] uses another dataset, i.e., the PASCAL CONTEXT dataset, to obtain the superpixel labels.

Table 3. Results of ablation study for different components in the proposed model. CL means the Conservative Loss. CE means the cross entropy loss
Table 4. Results of ablation experiments for a and \(\lambda \) in the Conservative Loss

4.4 Ablation Study

In this section, we perform thorough ablation experiments, covering different components, different factors in the Conservative Loss and different training strategies. These experiments show the contribution of each component and provide more insight into our method.

Effect of Different Components. In this experiment, we show how each component in our model affects the final performance. We consider the following cases: (1) the baseline model, which contains only the base segmentation model (FCN8s in our model) and is trained using source data only; (2) the FCN8s and GAN component, which consists of the base model and the GAN and is trained using both source data and target data with the cross entropy loss; (3) the full model, which involves three parts, i.e., base model, GAN and Conservative Loss. We perform the ablation experiments on the GTAV \(\rightarrow \) Cityscapes setting.

The results of the ablation study are shown in Table 3. It can be observed that each component plays an important role in the performance improvement. More specifically, our full model achieves the best results and obtains a gain of 8.1 points. The GAN part also brings a gain of 4.4 points compared with FCN8s+CE. Note that the GAN component is what introduces the unlabeled target domain data into the model, so the Conservative Loss is applied on top of the GAN and there is no FCN8s+CL variant.

Effect of a and \(\lambda \) in the Conservative Loss. In this part, we design the ablation experiments for a and \(\lambda \) in the Conservative Loss. As shown in Eq. 2, a is the base of logarithm and denotes the intersection point with x-axis. \(\lambda \) is a balanced factor. We show the impacts of different a and \(\lambda \) in Table 4.

Since there are two variables, we perform the ablation study for one variable with the other fixed. For the ablation of a (with fixed \(\lambda = 5\)), it can be observed that \(a = \mathrm {e}\) achieves the best result. Furthermore, all tested values of a obtain much better performance than the cross entropy loss (34.4 in Table 3), which demonstrates that our loss performs consistently better and is robust to this choice. For the ablation of \(\lambda \) (with fixed \(a = \mathrm {e}\)), different \(\lambda \) show slightly different results and \(\lambda = 5\) obtains the best performance.

Warm Start and Cold Start. As described in Sect. 4.2, we use a warm start strategy to train the proposed model. In this experiment, we compare the two training strategies. For the cold start strategy, we clamp the Conservative Loss to [\(\min =-10\), \(\max =10\)], while this constraint is not applied in the warm start. We use the \(\lambda \)-balanced Conservative Loss with \(\lambda =5\) and \(a=\mathrm {e}\).
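A short Python sketch of the clamped, \(\lambda \)-balanced loss used in the cold start may be useful here. It assumes that the clamp is applied to the value after scaling by \(\lambda \), which the text does not state explicitly; the function name and the `clamp` argument are illustrative.

```python
import math
from typing import Optional, Tuple


def balanced_conservative_loss(
    p_t: float,
    a: float = math.e,
    lam: float = 5.0,
    clamp: Optional[Tuple[float, float]] = None,
) -> float:
    """Lambda-balanced Conservative Loss (Eq. 3) with an optional clamp."""
    log_p = math.log(p_t, a)
    value = lam * (1.0 + log_p) ** 2 * math.log(-log_p, a)
    if clamp is not None:
        low, high = clamp
        # Assumption: clamping acts on the lambda-scaled value.
        value = max(low, min(high, value))
    return value


if __name__ == "__main__":
    for p in (0.001, 0.5, 0.999):
        raw = balanced_conservative_loss(p)
        cold = balanced_conservative_loss(p, clamp=(-10.0, 10.0))
        print(f"p_t={p}: unclamped={raw:+.2f}, clamped={cold:+.2f}")
```

The loss is unbounded as \(p_t\) approaches 0 or 1, so the clamp only changes values at the two extremes and leaves the middle range untouched.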

Table 5. Results of two training strategies, i.e., cold start and warm start. CL means the Conservative Loss

In Table 5, it can be observed that the Conservative Loss with cold start outperforms [14] by a large margin (6.3 points). The warm start performs better than the cold start because it enables the network to train stably.

4.5 Discussion

In this section, we design several experiments to verify the capability of the proposed method. We show the effect of adaptation on the feature distribution to measure how much the domain gap is reduced at the feature level. We also compare with several classification losses and homogeneous losses to show the superiority and flexibility of the Conservative Loss.

Visualizations. To verify the effect of adaptation on the distribution, we use t-SNE [41] to visualize the feature distributions in Fig. 4. We randomly select 100 images from each domain and, for each image, extract the features from the last convolutional layer (whose channel size equals the number of class categories). We compare the distributions of our model with those of FCN8s (no adaptation). Four categories are sampled for display to keep the visualization clear. We observe that, with adaptation applied, the distance between the two domains for the same class becomes smaller and the separation between different classes becomes clearer.

Fig. 4. We show the effect of adaptation on the distribution of the extracted features. \(\spadesuit \) denotes the point from source domain and \(\bigstar \) is from target domain

Fig. 5. The left figure shows three classification losses, including Cross Entropy loss (CE) in blue, Focal Loss (FL) in green and Conservative Loss (CL) in red. The right table shows the results of all three losses on GTAV \(\rightarrow \) Cityscapes adaptation experiment (Color figure online)

Comparison with Other Classification Losses. In this experiment, we compare the Conservative Loss with the Cross Entropy Loss and the Focal Loss [42]. The Cross Entropy Loss is given by \(\text {CE}(p_t) = -\log (p_t)\), which is plotted in Fig. 5. To ensure fairness, we use the \(\alpha \)-balanced Focal Loss \(\text {FL}(p_t) = -\alpha _t(1-p_t)^2\log (p_t)\) with warm start in the Focal Loss experiment, and set \(\alpha _t = 5\) by cross-validation.
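To compare the three losses numerically, the following Python sketch evaluates them at a few probabilities. It assumes natural logarithms for CE and FL, uses \(\alpha _t = 5\) as stated above, and takes \(a = \mathrm {e}\) and \(\lambda = 5\) for the Conservative Loss from Sect. 4.2; the function names are illustrative.

```python
import math


def cross_entropy(p_t: float) -> float:
    """CE(p_t) = -log(p_t)."""
    return -math.log(p_t)


def focal_loss(p_t: float, alpha_t: float = 5.0) -> float:
    """Alpha-balanced Focal Loss with the exponent fixed to 2, as in the text."""
    return -alpha_t * (1.0 - p_t) ** 2 * math.log(p_t)


def conservative_loss(p_t: float, a: float = math.e, lam: float = 5.0) -> float:
    """Lambda-balanced Conservative Loss (Eq. 3)."""
    log_p = math.log(p_t, a)
    return lam * (1.0 + log_p) ** 2 * math.log(-log_p, a)


if __name__ == "__main__":
    print(f"{'p_t':>5} {'CE':>8} {'FL':>8} {'CL':>8}")
    for p in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(f"{p:>5.1f} {cross_entropy(p):>8.3f} "
              f"{focal_loss(p):>8.3f} {conservative_loss(p):>8.3f}")
```

CE and FL stay non-negative over the whole interval, while the Conservative Loss is the only one of the three that changes sign, at \(p_t = \frac{1}{\mathrm {e}}\).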

From the right table in Fig. 5, it can be observed that the Focal Loss performs better than the cross entropy loss because it focuses learning on hard negative examples. However, in domain adaptation, domain-invariant representations are crucial for good adaptation performance. The Conservative Loss makes the network insensitive to domain changes by punishing the extreme cases. It can be seen that the Conservative Loss yields a higher result (38.1) and a larger gain over the cross entropy loss (3.7) than the Focal Loss (1.4).

Effect of Homogeneous Losses. As shown in Sect. 3.3, the Conservative Loss has two properties: (1) when \(p_t\) is low, the Conservative Loss forces the network to learn discriminative features; (2) when \(p_t\) is high, the loss enables the network to learn domain-invariant features by gradient ascent, which penalizes the extremely good cases. Several other losses also maintain these two properties, for example a cubic function. In this experiment, we propose several homogeneous losses to verify the effect of these two properties, which are given by:

$$\begin{aligned} \text {Loss}_1&= -\lambda _1(p_t-0.5)^3, \end{aligned}$$
(4)
$$\begin{aligned} \text {Loss}_2&= -\lambda _2(p_t-\frac{1}{\mathrm {e}})^3, \end{aligned}$$
(5)
$$\begin{aligned} \text {Loss}_3&=\left\{ \begin{array}{rcl} -\alpha *(p_t - \frac{1}{\mathrm {e}})^3,&{} &{}{p_t < \frac{1}{\mathrm {e}}}, \\ \\ -\beta *(p_t - \frac{1}{\mathrm {e}})^3,&{} &{}{p_t \ge \frac{1}{\mathrm {e}}}. \end{array} \right. \end{aligned}$$
(6)

Equations 4 and 5 are \(\lambda \)-balanced cubic functions with different intersection points, i.e., 0.5 and \(\frac{1}{\mathrm {e}}\), respectively. Equation 6 is a piecewise function, which is more similar to the Conservative Loss because of its two balancing factors.
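The three homogeneous losses of Eqs. 4 to 6 can be written down directly, as in the Python sketch below. The default values of `lam_1`, `lam_2`, `alpha` and `beta` are placeholders only, since the cross-validated values are not given in the text.

```python
import math

INV_E = 1.0 / math.e  # intersection point shared by Eqs. 5 and 6


def loss_1(p_t: float, lam_1: float = 1.0) -> float:
    """Eq. 4: cubic loss crossing zero at p_t = 0.5."""
    return -lam_1 * (p_t - 0.5) ** 3


def loss_2(p_t: float, lam_2: float = 1.0) -> float:
    """Eq. 5: cubic loss crossing zero at p_t = 1/e."""
    return -lam_2 * (p_t - INV_E) ** 3


def loss_3(p_t: float, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Eq. 6: piecewise cubic with separate weights below and above 1/e."""
    scale = alpha if p_t < INV_E else beta
    return -scale * (p_t - INV_E) ** 3


if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9):
        print(f"p_t={p}: {loss_1(p):+.4f} {loss_2(p):+.4f} "
              f"{loss_3(p, alpha=2.0, beta=0.5):+.4f}")
```

Each function is positive below its intersection point and negative above it, so all three share the sign pattern of the Conservative Loss while differing in how steeply the extremes are weighted.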

Table 6. Results of homogeneous losses

We run the GTAV \(\rightarrow \) Cityscapes adaptation experiment to verify their capabilities. The results are reported in Table 6. To ensure fairness, all experiments use the warm start and the hyper-parameters (\(\lambda _1, \lambda _2, \alpha , \beta \)) are chosen by cross-validation. We observe that all homogeneous losses perform better than the cross entropy loss (34.4) and the Focal Loss (35.8). This indicates that the exact form of the Conservative Loss is not crucial: several homogeneous losses yield results comparable to it. Generally, we expect any loss function with properties similar to those of the Conservative Loss to be equally effective.

5 Conclusion

In this paper, we have proposed a novel loss, the Conservative Loss, for semantic segmentation adaptation. To force the network to learn discriminative and domain-invariant representations, our loss combines gradient descent and gradient ascent, penalizing the extreme cases and encouraging moderate ones. We further introduce adversarial networks into our full model to supplement the domain alignment. Extensive experiments demonstrate that our model achieves state-of-the-art performance. Exploratory experiments show that the Conservative Loss is highly flexible and is not limited to an exact form.