
1 Introduction

Deep learning pipelines have driven substantial progress in general object detection, and detectors keep pushing the boundaries on several detection benchmarks. However, despite being able to efficiently detect objects seen from arbitrary viewing angles, CNN-based detectors are still limited in that they cannot function properly when faced with domains significantly different from those of the original training data. The most common way to regain performance is to go through the troublesome data collection/annotation process. Nevertheless, the recent successes of Generative Adversarial Networks (GANs) on image-to-image translation have opened up the possibility of generating large-scale detection training data without the need for object annotation.

Generative adversarial networks [1], which pit two networks (i.e., a generator and a discriminator) against each other, have emerged as a powerful framework for learning generative models of data distributions. While expecting GANs to produce an RGB image and its associated bounding boxes from a random noise vector still sounds like a fantasy, training GANs to translate images from one scenario to another can help skip the tedious data annotation process. In the past, GAN-based image-to-image translation methods, such as Pix2Pix [2], were considered to have limited applications due to their requirement for pairwise training data. Although these methods yield impressive results, their need for pairwise training images largely reduces their practicality for the problem that we aim to solve.

Recently, unpaired image-to-image translation methods have achieved astonishing results on various domain adaptation challenges. With almost identical architectures, CycleGAN [3], DiscoGAN [4], and DualGAN [5] made unpaired image-to-image translation possible by introducing the cycle consistency constraint. CoGAN [6] also works on unpaired images, using two weight-sharing generators to generate images of the two domains from a single random noise vector. UNIT [7] is an extension of CoGAN: aside from hard weight-sharing constraints similar to CoGAN's, Liu et al. further implement the shared latent space assumption by encouraging two encoders to map images from the two domains into the same latent space, which largely increases translation consistency. These methods all demonstrate compelling visual results on several image-to-image translation tasks; however, what hinders their ability to provide large-scale detection training data, especially for translation tasks with a large domain shift, is that they often arrive at solutions whose results are indistinguishable from the target domain in terms of style but usually contain corrupted image-objects.

In this paper we propose a structure-aware image-to-image translation network, which directly benefits object detection by translating existing RGB detection data from its original domain to other scenarios. The contribution of this work is three-fold: (1) we train the encoder networks to extract structure-aware information through the supervision of a segmentation subtask, (2) we experiment with different weight-sharing strategies to ensure the preservation of image-objects during image translation, and (3) our object-preserving network provides a significant performance gain on night-time vehicle detection.

We focus particularly on day-to-night image translation not only because of the importance of night-time detection, but also because day/night translation is one of the most difficult domain transformations; our method is nevertheless capable of handling various domain pairs. We train our network on synthetic data (i.e., SYNTHIA [8] and the GTA dataset [9]). Compared to the competing methods, the domain translation results of our network significantly enhance the capability of the object detector on both synthetic (i.e., SYNTHIA, GTA) and real-world (i.e., KITTI [10], ITRI) data. In addition, we welcome those who are interested in the ITRI dataset to email us for provision.

Fig. 1.

Overall structure of the proposed image-to-image translation network. X, Y: image domains X and Y; Z: feature domain; \({\hat{X}}_{pred}\), \({\hat{Y}}_{pred}\): predicted segmentation masks; \({\bar{X}}\), \({\bar{Y}}\): translated results; dotted lines indicate soft sharing, solid lines indicate hard sharing.

2 Proposed Framework

In unsupervised image-to-image translation, models learn a joint distribution by encoding images from the two domains into a shared feature space. We assume that, for an image to be properly translated to the other domain, the encoded information must contain (1) mutual style information between domains A and B, and (2) structural information of the given input image, as illustrated in Fig. 1. Based on this assumption, we design our network to jointly optimize image translation and semantic segmentation. Through our weight-sharing strategy, the segmentation subtask serves as an auxiliary regularization for image translation.

Let X and Y denote the two image domains, \({\hat{X}}\) and \({\hat{Y}}\) denote the corresponding segmentation masks, and Z represent the encoded feature space. Our network, as depicted in Fig. 1, consists of two encoders, \(E_{x}:\mathrm X \rightarrow \mathrm Z\) and \(E_{y}:\mathrm Y \rightarrow \mathrm Z\), two generators, \(G_{x}:\mathrm Z \rightarrow \mathrm {\bar{Y}}\) and \(G_{y}: \mathrm Z \rightarrow \mathrm {\bar{X}}\), two segmentation generators, \(P_{x}:\mathrm Z \rightarrow \mathrm {\hat{X}}_{pred}\) and \(P_{y}:\mathrm Z \rightarrow \mathrm {\hat{Y}}_{pred}\), and two discriminators, \(D_{x}\) and \(D_{y}\), for the two image domains, respectively. Our network learns image domain translation in both directions and the segmentation subtasks simultaneously. For an input \(x\in X\), \(E_{x}\) first encodes x into the latent space, and the 256-channel feature vector is then processed to produce (1) the translated output \({\bar{y}}\) via \(G_{x}\), and (2) the semantic representation \({\hat{x}}_{pred}\) via \(P_{x}\). The translated output \({\bar{y}}\) is then fed through the inverse encoder-generator pair \(\{E_{y}, G_{y}\}\) to yield the reconstructed image \(x_{rec}\). The detailed architecture of our network is given in Table 1.

Table 1. Network architecture for the image-to-image translation experiments. N, K, and S denote the number of convolution filters, kernel size, and stride, respectively
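To make the data flow described above concrete, the following is a minimal PyTorch-style sketch of one translation direction (X to Y). The module bodies are stand-ins only: the class names, layer settings, and class count are illustrative assumptions and do not correspond to a released implementation.

```python
import torch
import torch.nn as nn

# Minimal stand-in modules; the real conv/residual/deconv stacks follow Table 1.
class Encoder(nn.Module):            # X (or Y) -> 256-channel latent Z
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 256, kernel_size=7, stride=4, padding=3)
    def forward(self, img):
        return self.net(img)

class Generator(nn.Module):          # Z -> translated RGB image
    def __init__(self):
        super().__init__()
        self.net = nn.ConvTranspose2d(256, 3, kernel_size=8, stride=4, padding=2)
    def forward(self, z):
        return torch.tanh(self.net(z))

class ParsingNet(nn.Module):         # Z -> per-pixel segmentation logits
    def __init__(self, n_classes=13):  # class count is an assumed placeholder
        super().__init__()
        self.net = nn.ConvTranspose2d(256, n_classes, kernel_size=8, stride=4, padding=2)
    def forward(self, z):
        return self.net(z)

E_x, E_y = Encoder(), Encoder()
G_x, G_y = Generator(), Generator()  # G_x: Z -> Y-bar, G_y: Z -> X-bar
P_x, P_y = ParsingNet(), ParsingNet()

def forward_x_to_y(x):
    """x in X -> translated y_bar, predicted mask x_seg, reconstruction x_rec."""
    z = E_x(x)                 # 256-channel feature vector
    y_bar = G_x(z)             # translated image in domain Y
    x_seg = P_x(z)             # segmentation prediction for x
    x_rec = G_y(E_y(y_bar))    # cycle back through the inverse pair {E_y, G_y}
    return y_bar, x_seg, x_rec
```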

2.1 Structure-Aware Encoding and Segmentation Subtask

We actively guide the encoder networks to extract context-aware features by regularizing them with a segmentation subtask, so that the extracted 256-channel feature vector contains not only mutual style information between the X and Y domains, but also the intricate low-level semantic features of the input image that are valuable for preserving image-objects during translation. The segmentation loss is formulated as:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{seg-x}(P_{x},E_{x},X,{\hat{X}})&= \lambda _{seg-L1}{\mathbb {E}}_{x \sim p_{data(x)}}[\Vert P_{x}(E_{x}(x))-{\hat{x}}\Vert _{1}] \; \\&\quad + \lambda _{seg-CE}{\mathbb {E}}_{x \sim p_{data(x)}}[\Vert \log (P_{x}(E_{x}(x))-{\hat{x}}) \Vert _{1}] \end{aligned} \end{aligned}$$
(1)
$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{seg-y}(P_{y}, E_{y}, Y, {\hat{Y}})&= \lambda _{seg-L1}{\mathbb {E}}_{y \sim p_{data(y)}}[\Vert P_{y}(E_{y}(y))-{\hat{y}}\Vert _{1}] \; \\&\quad + \lambda _{seg-CE}{\mathbb {E}}_{y \sim p_{data(y)}}[\Vert \log (P_{y}(E_{y}(y))-{\hat{y}}) \Vert _{1}] \end{aligned} \end{aligned}$$
(2)
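A possible implementation of Eqs. (1)-(2) is sketched below, reading the second term as a standard pixel-wise cross-entropy. The function name seg_loss, the one-hot mask convention, and the default weights (standing in for \(\lambda _{seg-L1}\) and \(\lambda _{seg-CE}\)) are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def seg_loss(pred_logits, target_onehot, lam_l1=1.0, lam_ce=1.0):
    """Segmentation loss in the spirit of Eqs. (1)-(2): an L1 term on the
    predicted mask plus a pixel-wise cross-entropy term.
    pred_logits:   (B, C, H, W) raw outputs of P_x(E_x(x))
    target_onehot: (B, C, H, W) one-hot ground-truth mask (float tensor)
    """
    probs = torch.softmax(pred_logits, dim=1)
    l1_term = F.l1_loss(probs, target_onehot)
    ce_term = F.cross_entropy(pred_logits, target_onehot.argmax(dim=1))
    return lam_l1 * l1_term + lam_ce * ce_term
```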

2.2 Weight Sharing for Multi-task Network

Sharing weights between the generator and the parsing network allows the generator to fully exploit the context-aware feature vector. We hard-share the first six residual blocks and soft-share the subsequent two deconvolution blocks between the generators and parsing networks. We experimented with different weight-sharing strategies, as discussed in Sect. 4.2, such as full hard sharing, not sharing the deconvolution blocks, and not sharing the residual blocks, and arrived at the strategy above. For the soft-shared layers, we measure the discrepancy between the deconvolution weights of the two networks and penalize it with a cosine-similarity-based loss. The mathematical expression for the soft weight-sharing loss is given by

$$\begin{aligned} {\mathcal {L}}_{\omega }(\omega _{G},\omega _{P}) = -\log \left( \left( \frac{\omega _{G} \cdot \omega _{P}}{\Vert \omega _{G} \Vert _{2}\,\Vert \omega _{P} \Vert _{2}} \right) ^{2} \right) \end{aligned}$$
(3)

where \(\omega _{G}\) and \(\omega _{P}\) denote the weight vectors formed by the deconvolution layers of the generator and parsing networks, respectively.
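Equation (3) can be computed directly on the flattened layer weights, as in the sketch below; the layer pairing shown in the usage comment is hypothetical.

```python
import torch

def soft_sharing_loss(w_G, w_P, eps=1e-8):
    """Eq. (3): negative log of the squared cosine similarity between the
    flattened deconvolution weights of the generator (w_G) and the parsing
    network (w_P). Minimizing it pushes the two weight vectors to align."""
    w_G, w_P = w_G.flatten(), w_P.flatten()
    cos = torch.dot(w_G, w_P) / (w_G.norm(2) * w_P.norm(2) + eps)
    return -torch.log(cos.pow(2) + eps)

# Hypothetical usage: sum over the corresponding soft-shared deconvolution layers.
# loss_w_x = sum(soft_sharing_loss(g.weight, p.weight)
#                for g, p in zip(G_x_deconv_layers, P_x_deconv_layers))
```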

2.3 Cycle Consistency

The cycle consistency loss has proven quite effective in preventing the network from generating arbitrary images in the target domain. We also enforce the cycle-consistency constraint in the proposed framework to further regularize the ill-posed unsupervised image-to-image translation problem. The loss function is given by

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{cyc}(E_{x}, G_{x}, E_{y},G_{y}, X, Y)&= {\mathbb {E}}_{x \sim p_{data(x)}}[\Vert G_{y}(E_{y}(G_{x}(E_{x}(x))))-x\Vert _{1}] \; \\&\quad + {\mathbb {E}}_{y \sim p_{data(y)}}[\Vert G_{x}(E_{x}(G_{y}(E_{y}(y))))-y \Vert _{1}]. \end{aligned} \end{aligned}$$
(4)
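In code, Eq. (4) amounts to two round trips and two L1 penalties; a minimal sketch, with the encoders and generators passed in as modules:

```python
import torch.nn.functional as F

def cycle_loss(x, y, E_x, G_x, E_y, G_y):
    """Eq. (4): L1 distance between each input and its reconstruction after
    a full round trip through both encoder-generator pairs."""
    x_rec = G_y(E_y(G_x(E_x(x))))   # X -> Y-bar -> X_rec
    y_rec = G_x(E_x(G_y(E_y(y))))   # Y -> X-bar -> Y_rec
    return F.l1_loss(x_rec, x) + F.l1_loss(y_rec, y)
```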

2.4 Adversarial Learning

Our network contains two Generative Adversarial Networks: \(GAN_{1}\): \(\{E_{x}, G_{x}, D_{x}\}\), and \(GAN_{2}\): \(\{E_{y}, G_{y}, D_{y}\}\). We apply adversarial losses to both GANs, and formulate the objective loss functions as:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{GAN_{1}}(E_{x}, G_{x}, D_{x}, X, Y)&= {\mathbb {E}}_{y \sim p_{data(y)}}[\log D_{x}(y)] \; \\&\quad + {\mathbb {E}}_{x \sim p_{data(x)}}[\log (1-D_{x}(G_{x}(E_{x}(x))))] \\ \end{aligned} \end{aligned}$$
(5)
$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{GAN_{2}}(E_{y}, G_{y},D_{y}, Y, X)&= {\mathbb {E}}_{x \sim p_{data(x)}}[\log D_{y}(x)] \; \\&\quad + {\mathbb {E}}_{y \sim p_{data(y)}}[\log (1-D_{y}(G_{y}(E_{y}(y))))] \end{aligned} \end{aligned}$$
(6)
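In practice, minimax objectives such as Eqs. (5)-(6) are commonly optimized with binary cross-entropy on the discriminator outputs. The sketch below shows one of the two symmetric GANs and assumes the discriminator returns raw logits; it is an illustrative reading, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def adversarial_losses(real_target, fake_target, D):
    """Losses for one GAN, e.g. GAN_1 with D = D_x, real_target = y, and
    fake_target = G_x(E_x(x)). Returns the discriminator and generator terms."""
    d_real = D(real_target)
    d_fake = D(fake_target.detach())          # stop gradients into the generator
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    d_fake_g = D(fake_target)                 # generator tries to fool D
    g_loss = F.binary_cross_entropy_with_logits(d_fake_g, torch.ones_like(d_fake_g))
    return d_loss, g_loss
```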

2.5 Network Learning

To train the proposed network, we jointly solve the learning problems of the image-translation streams \(\{E_{x}, G_{x}\}\) and \(\{E_{y}, G_{y}\}\), the image-parsing streams \(\{E_{x}, P_{x}\}\) and \(\{E_{y}, P_{y}\}\), and the two GANs, \(GAN_{1}\) and \(GAN_{2}\). The integrated objective function is given as follows:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{full} =\,&{\mathcal {L}}_{GAN_{1}}(E_{x}, G_{x}, D_{x}, X, Y) + {\mathcal {L}}_{GAN_{2}}(E_{y}, G_{y}, D_{y}, Y, X) \; \\&+ \lambda _{cyc} \, {\mathcal {L}}_{cyc}(E_{x}, G_{x}, E_{y}, G_{y}, X, Y) \; \\&+ \lambda _{seg} \, ({\mathcal {L}}_{seg-x}(P_{x}, E_{x}, X, {\hat{X}}) + {\mathcal {L}}_{seg-y}(P_{y}, E_{y}, Y, {\hat{Y}})) \; \\&+ \lambda _{\omega } \, ({\mathcal {L}}_{\omega }(\omega _{G_{x}},\omega _{P_{x}}) + {\mathcal {L}}_{\omega }(\omega _{G_{y}},\omega _{P_{y}})) \end{aligned} \end{aligned}$$
(7)
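The weighted combination of Eq. (7) can then be assembled from the individual terms. In this sketch, \(\lambda _{\omega }=0.02\) follows Table 7, while the other default weights are assumptions for illustration.

```python
def full_objective(gan_x, gan_y, cyc, seg_x, seg_y, w_x, w_y,
                   lambda_cyc=10.0, lambda_seg=1.0, lambda_w=0.02):
    """Eq. (7): weighted sum of the adversarial, cycle, segmentation, and
    soft weight-sharing terms (all scalar tensors)."""
    return (gan_x + gan_y
            + lambda_cyc * cyc
            + lambda_seg * (seg_x + seg_y)
            + lambda_w * (w_x + w_y))
```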

3 Experimental Results

Though many works have been dedicated to providing large-scale vehicle datasets for the research community [11,12,13,14,15], most public datasets are collected in the daytime. Considering that CNN-based detectors rely heavily on data augmentation techniques to boost performance, training detectors with both day and night images is necessary to make them more general. Synthetic datasets, such as SYNTHIA and GTA, provide diverse on-road synthetic sequences as well as segmentation masks in scenarios such as day, night, snow, etc. As our network requires both segmentation masks and nighttime images, we trained it with the SYNTHIA and GTA datasets. For evaluation purposes, however, we utilize real-world data such as the KITTI and our ITRI datasets.

The performance of the network was further analyzed by training YOLO [16] and Faster R-CNN (VGG-16 based) [17] detectors with the generated image sets. Aside from revising both detectors to perform 1-class vehicle detection, all hyper-parameters were the same as those used for training on the PASCAL VOC challenge. The IoU threshold for objects to be considered true positives is 0.5, following the standard of common object detection datasets. When converting the segmentation ground truth to its detection counterpart, we exclude bounding boxes whose height is lower than 40 pixels or whose occlusion exceeds 75% from the subsequent AP estimation.
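The filtering rule above could be applied when deriving boxes from instance masks, for example as in this illustrative sketch (not the authors' conversion script); the visibility dictionary is a hypothetical input.

```python
import numpy as np

def masks_to_boxes(instance_mask, visible_frac=None,
                   min_height=40, max_occlusion=0.75):
    """Derive detection boxes from an (H, W) instance-id mask (0 = background),
    dropping instances shorter than 40 px or occluded by more than 75%.
    visible_frac optionally maps instance id -> visible fraction in [0, 1]."""
    boxes = []
    for inst_id in np.unique(instance_mask):
        if inst_id == 0:
            continue
        ys, xs = np.nonzero(instance_mask == inst_id)
        x1, y1, x2, y2 = xs.min(), ys.min(), xs.max(), ys.max()
        if (y2 - y1 + 1) < min_height:
            continue
        if visible_frac is not None and visible_frac.get(inst_id, 1.0) < 1.0 - max_occlusion:
            continue
        boxes.append((int(x1), int(y1), int(x2), int(y2)))
    return boxes
```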

3.1 Synthetic Datasets

We first assess the effectiveness of training detectors with transformed images in both day and night scenarios. We evaluated our network, trained with SYNTHIA, by training detectors with the transformed images it produces. As shown in Table 2, AugGAN outperforms the competing methods in both day and night scenarios. AugGAN also surpasses its competitors when trained with the GTA dataset, see Table 3. Visually, the transformation results of AugGAN are clearly better in terms of image-object preservation and artifact suppression, as shown in Figs. 2 and 3.

Fig. 2.

SYNTHIA day-to-night transformation results - GANs trained with SYNTHIA: first row: SYNTHIA daytime testing images; second row: results of CycleGAN; 3rd row: results of UNIT; 4th row: results of AugGAN

Fig. 3.

GTA day-to-night transformation results - GANs trained with GTA: first row: GTA daytime testing images; second row: outputs of CycleGAN; 3rd row: outputs of UNIT; 4th row: outputs of AugGAN.

Table 2. Detection accuracy comparison (AP) - GANs trained with SYNTHIA. SDTrain/SNTrain: SYNTHIA daytime/nighttime training set; SDTest/SNTest: SYNTHIA daytime/nighttime testing set.
Table 3. Detection accuracy comparison (AP) - detectors trained with transformed images produced by GANs (trained with GTA dataset), and tested with real images. GTA-D-Train: transformed data with GTA training daytime images as input; GTA-N-Test: GTA testing nighttime data.
Table 4. Detection accuracy comparison (AP) - detectors trained with transformed images produced by GANs (trained with GTA dataset and SYNTHIA), and tested with real images. KITTI-D2N-S/KITTI-D2N-G: KITTI day-to-night training data generated by GANs; ITRIN: ITRI-Night dataset.

3.2 KITTI and ITRI-Night Datasets

Aside from testing on the SYNTHIA and GTA datasets, we also assess the capability of our network on real-world data such as KITTI, which has been widely used to assess the performance of on-road object detectors for autonomous driving systems. With the previously trained AugGAN, whether trained with SYNTHIA or GTA, we transformed the KITTI dataset (7481 images, 6686 of which contain vehicle instances) [18] into its nighttime version and evaluated the translation results via detector training. We trained vehicle detectors with the translated KITTI dataset and tested them on our ITRI-Night testing set (9366 images with 20833 vehicle instances). As the experimental results indicate, real-world data transformed by AugGAN achieves better results both quantitatively and visually, even though AugGAN was trained with synthetic data; see Table 4, Figs. 4 and 5.

Table 5. Detection accuracy comparison (AP) - detectors trained with transformed images produced by GANs (trained with SYNTHIA/GTA dataset). ITRID-D2N-S/ITRID-D2N-G: ITRI-day day-to-night training data generated by GANs trained with SYNTHIA/GTA datasets; ITRIN: ITRI-Night dataset.
Fig. 4.

KITTI day-to-night transformation results - GANs trained with SYNTHIA: first row: KITTI images; second row: result of CycleGAN; 3rd row: result of UNIT; 4th row: result of AugGAN.

Fig. 5.

KITTI dataset day-to-night transformation results - GANs trained with GTA dataset: first row: input images from KITTI dataset; second row: outputs of CycleGAN; 3rd row: outputs of UNIT; 4th row: outputs of AugGAN

3.3 ITRI Daytime and Nighttime Datasets

We collected a real-driving daytime dataset (25104 images/87374 vehicle instances), captured mostly in the same scenarios as our nighttime dataset (9366 images/20833 vehicle instances). In Table 5, the experiments demonstrate results similar to those on the other datasets. The transformed day-to-night training images prove helpful for vehicle detector training, and training images generated by AugGAN outperform those generated by the competing methods thanks to its preservation of image-objects, with some examples shown in Figs. 6 and 7.

Fig. 6.

ITRI-Day dataset day-to-night transformation results - GANs trained with SYNTHIA: first row: input images from ITRI-Day dataset; second row: outputs of CycleGAN; 3rd row: outputs of UNIT; 4th row: outputs of AugGAN

Fig. 7.

ITRI-Day dataset day-to-night transformation results - GANs trained with GTA dataset: first row: input images from ITRI-Day dataset; second row: outputs of CycleGAN; 3rd row: outputs of UNIT; 4th row: outputs of AugGAN

3.4 Transformations Other Than Daytime and Nighttime

AugGAN is capable of learning transformations across unpaired synthetic and real domains, and segmentation supervision is required only in domain A. This increases the flexibility of learning cross-domain adaptation for subsequent detector training. As shown in Fig. 8, our method can learn image translation not only for synthetic-to-synthetic but also for synthetic-to-real domain pairs.

Fig. 8.

More image translation cases: 1st column: GTA-day to SYNTHIA; 2nd column: GTA-day to GTA-sunset; 3rd column: GTA-day to GTA-rain; 4th column: SYNTHIA-day to ITRI-night

4 Model Analysis

4.1 Segmentation Subtask

In our initial experiment on introducing the segmentation subtask, the parsing network was only utilized in the forward cycle (i.e., only day-to-night). We later discovered that the results improve when the parsing network regularizes both the forward and inverse cycles. As can be seen in Table 6, adding regularization to the inverse cycle leads to better transformation results, which make detectors more accurate. Although using only single-sided segmentation already outperforms the previous works, introducing segmentation in both forward and backward cycles brings a further accuracy improvement for object detection.

Table 6. Detection accuracy comparison (AP) - detectors trained with transformed data produced by GANs (trained with SYNTHIA). SDTrain: SYNTHIA daytime training set, transformed into nighttime; SNTest: SYNTHIA nighttime testing set.
Table 7. Weight-sharing strategy comparison: \(\lambda _{w}\) denotes the cosine similarity loss multiplier, with \(\lambda _{w}=0.02\) yielding the best result. The metric in this table is the average precision of Faster R-CNN.
Fig. 9.

Style transfer and segmentation results for different weight-sharing strategies: 1st row: input images; 2nd row: style transfer and segmentation results for hard weight sharing, hard weight sharing in the encoder only (\(\lambda _{w}=0\)), and hard weight sharing in the encoder with soft weight sharing (\(\lambda _{w}=0.02\)) in the decoder.

4.2 Weight-Sharing Strategy

Our network design is based on the assumption that, through proper weight sharing, the extracted semantic segmentation features of individual layers can serve as auxiliary regularization for image-to-image translation. Finding the proper weight-sharing policy was therefore the most important factor in our design. Weight-sharing mechanisms in neural networks can be roughly categorized into soft and hard weight sharing. Soft weight sharing [19] was originally proposed for regularization and can be applied to network compression [20]. Recently, hard weight sharing has proven useful for generating images with similar high-level semantics [6]. The policy that we adopt is two-fold: (1) hard-share the encoders and the residual blocks of the generator-parsing net pairs, and (2) soft-share the deconvolution layers of the generator-parsing net pairs. We arrived at this setting through extensive trial and error, and during the process we found that both policies are integral to the optimization of our network: without hard-sharing the layers in (1), image-objects tend to be distorted; without (2), the network tends to optimize only one of the tasks, see Table 7 and Fig. 9. In short, our network surpasses the competing methods because our multi-task network maintains a realistic transformation style while preserving image-objects with the help of the segmentation subtask.
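The two policies translate to code in different ways: hard sharing means the generator and the parsing network literally reuse the same module instance (one set of parameters), while soft sharing keeps separate deconvolution layers that are tied only through the loss of Eq. (3). A hedged PyTorch sketch with a simplified residual block, not the released architecture:

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Simplified residual block standing in for those in Table 1."""
    def __init__(self, ch=256):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

# Hard sharing: one instance of the residual stack, referenced by both heads.
shared_res = nn.Sequential(*[ResBlock() for _ in range(6)])

class GeneratorHead(nn.Module):      # translation head
    def __init__(self):
        super().__init__()
        self.res = shared_res        # hard-shared parameters
        self.deconv = nn.ConvTranspose2d(256, 3, 8, stride=4, padding=2)   # soft-shared via Eq. (3)
    def forward(self, z):
        return self.deconv(self.res(z))

class ParsingHead(nn.Module):        # segmentation head
    def __init__(self, n_classes=13):
        super().__init__()
        self.res = shared_res        # same object, same weights
        self.deconv = nn.ConvTranspose2d(256, n_classes, 8, stride=4, padding=2)
    def forward(self, z):
        return self.deconv(self.res(z))
```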

5 Conclusion and Future Work

In this work, we proposed an image-to-image translation network for generating large-scale trainable data for vehicle detection algorithms. Our network is especially adept at preserving image-objects, thanks to the extra guidance of the segmentation subtask. Our method, though far from perfect, quantitatively surpasses the competing methods in boosting vehicle detection accuracy. In the future, we will continue to experiment with different tasks based on this framework, and our pursuit of innovative solutions will continue.