
1 Introduction

We present an unsupervised data-driven approach for video retargeting that enables the transfer of sequential content from one domain to another while preserving the style of the target domain. Such a content translation and style preservation task has numerous applications, including human motion and face translation from one person to another, teaching robots from human demonstration, and converting black-and-white videos to color. This work also finds application in creating visual content that is hard to capture or label in real-world settings, e.g., aligning human motion and facial data of two individuals for virtual reality, or labeling night data for a self-driving car. Above all, the notion of content translation and style preservation transcends pixel-to-pixel operations toward more semantic, abstract, and human-understandable concepts, thereby paving the way for advanced machines that can directly collaborate with humans.

Current approaches for retargeting can be broadly classified into three categories. The first set of works is specifically designed for domains such as human faces [5, 41, 42]. While these approaches work well when faces are fully visible, they fail when applied to occluded faces (e.g., in virtual reality), and they do not generalize to other domains. The second, work on paired image-to-image translation [23], attempts to generalize across domains but requires manual supervision for labeling and alignment. This requirement limits the applicability of such approaches, as manual alignment or labeling is not possible for many (in-the-wild) domains. The third category attempts unsupervised and unpaired image translation [26, 53]. These works enforce cyclic consistency [51] on unpaired 2D images to learn a transformation from one domain to another. However, unpaired images alone are not sufficient for video retargeting. First, they do not impose enough constraints on the optimization, which often leads to a bad local minimum or a perceptual mode collapse, making it hard to generate the required output in the target domain. Second, spatial information in 2D images alone makes it hard to learn the style of a particular domain, as stylistic information also requires temporal knowledge.

Fig. 1. Our approach for video retargeting applied to faces and flowers. The top row shows translation from John Oliver to Stephen Colbert. The bottom row shows a synthesized flower following the blooming process of the input flower. The corresponding videos are available on the project webpage.

In this work, we make two specific observations: (i) the use of temporal information provides more constraints to the optimization for transforming one domain to another and helps reach a better local minimum; (ii) the combined influence of spatial and temporal constraints helps in learning the style characteristics of an identity in a given domain. More importantly, we do not require manual labels, as temporal information is freely available in videos (which are abundant on the web). Shown in Fig. 1 are examples of translation for human faces and flowers. Without any manual supervision or domain-specific knowledge, our approach learns this retargeting from one domain to the other using publicly available video data from both domains.

Our Contributions: We introduce a new approach that incorporates spatiotemporal cues along with conditional generative adversarial networks [15] for video retargeting. We demonstrate the advantages of spatiotemporal constraints over spatial constraints alone for image-to-labels and labels-to-image translation under varying environmental conditions. We then show the importance of the proposed approach in learning better associations between two domains, and its usefulness for self-supervised content alignment of visual data. Inspired by the ever-present nature of space-time, we qualitatively demonstrate the effectiveness of our approach for various natural processes such as face-to-face translation, flower-to-flower translation, synthesizing clouds and winds, and aligning sunrise and sunset.

2 Related Work

A variety of work dealing with image-to-image translation [11, 17, 23, 40, 53] and style translation [4, 10, 19] exists. In fact, a large body of work in computer vision and computer graphics concerns image-to-image operations. While early efforts focused on inferring semantic [30], geometric [1, 9], or low-level cues [48], there is renewed interest in synthesizing images with data-driven approaches, spurred by the introduction of generative adversarial networks [15]. This formulation has been used to generate images from cues such as a low-resolution image [8, 28], class labels [23], and various other input priors [21, 35, 49]. These approaches, however, require input-output pairs to train a model. While it is feasible to label data for a few image-to-image operations, there are numerous tasks for which it is non-trivial to generate input-output pairs for training supervision. Recently, Zhu et al. [53] proposed to use the cycle-consistency constraint [51] in an adversarial learning framework to deal with this problem of unpaired data, and demonstrated effective results for various tasks. Cycle consistency [26, 53] enabled many image-to-image translation tasks without any expensive manual labeling. Similar ideas have also found application in learning depth cues in an unsupervised manner [14], machine translation [47], shape correspondences [20], point-wise correspondences [51, 52], and domain adaptation [18].

Fig. 2. Spatial cycle consistency is not sufficient: We show two examples illustrating why spatial cycle consistency alone is not sufficient for the optimization. (a) shows an example of perceptual mode collapse when using Cycle-GAN [53] for Donald Trump to Barack Obama translation. The first row shows inputs of Donald Trump, and the second row shows the generated outputs. The third row shows the reconstruction that takes the second row as input. The outputs in the second row look similar despite different inputs, yet the third row closely reconstructs the first. On very close inspection, we found that a few pixels in the second row differed (though not perceptually noticeably), and that was sufficient to produce different reconstructions. (b) shows another example for image2labels and labels2image: while the generator fails to produce the required output for the given input in both cases, it can still perfectly reconstruct the input. Both examples suggest that the spatial cyclic loss is not sufficient to ensure the required output in the other domain, because the overall optimization focuses on reconstructing the input. However, as shown in (c) and (d), we obtain better outputs with our approach, which combines spatial and temporal constraints. Videos for the face comparison are available on the project webpage.

Variants of Cycle-GAN [53] have been applied to various temporal domains [14, 18]. However, these works consider only the spatial information in 2D images and ignore temporal information during optimization. We observe two major limitations: (1) Perceptual mode collapse: there is no guarantee that cycle consistency produces perceptually distinct outputs for distinct inputs. In Fig. 2, we show the outputs of a model trained for Donald Trump to Barack Obama translation, along with an example for image2labels and labels2image. We find that for different inputs of Donald Trump, we get perceptually similar outputs of Barack Obama; yet these outputs carry some hidden encoding that allows reconstruction of images similar to the inputs. We see similar behavior for image2labels and labels2image in Fig. 2-(b). (2) Tied spatially to the input: because of the reconstruction loss on the input itself, the optimization is forced to learn a solution that is closely tied to the input. While this is reasonable for problems where only a spatial transformation matters (such as horse-to-zebra, apples-to-oranges, or paintings), it is limiting for problems where temporal and stylistic information is required for synthesis (most prominently face-to-face translation). In this work, we propose a new formulation that utilizes both spatial and temporal constraints along with the adversarial loss to overcome these two problems. Shown in Fig. 2-(c, d) are the outputs of the proposed approach overcoming the above-mentioned problems. We posit that this is due to the additional constraints available for an under-constrained optimization.

GANs [15] and variational auto-encoders [27] have also been used for synthesizing videos and temporal information. Walker et al. [45] use temporal information to predict future trajectories from a single image. Recent work [16, 44, 46] used temporal models to predict long-term future poses from a single 2D image. MoCoGAN [43] decomposes motion and content to control video generation. Similarly, Temporal GAN [39] employs a temporal generator and an image generator that produce a set of latent variables and image sequences, respectively. While relevant, this prior work mostly focuses on predicting future intent from single images at test time or on generating videos from random noise. Concurrently, MoCoGAN [43] shows an example of image-to-video translation using its formulation. Our focus, however, is on general video-to-video translation, where the input video controls the output in a spirit similar to image-to-image translation. To this end, our approach can generate high-resolution videos of arbitrary length, whereas prior work [39, 43] has only been shown to generate 16 frames at \(64 \times 64\) resolution.

Fig. 3. We contrast our work with two prominent directions in image-to-image translation. (a) Pix2Pix [23]: paired data is available; a simple function (Eq. 1) can be learned via regression to map \(X \rightarrow Y\). (b) Cycle-GAN [53]: the data is not paired in this setting; Zhu et al. [53] proposed a cycle-consistency loss (Eq. 3) to deal with the problem of unpaired data. (c) Recycle-GAN: the approaches so far have considered independent 2D images only. Suppose we have access to unpaired but ordered streams \((x_1,x_2,\ldots ,x_t,\ldots )\) and \((y_1,y_2,\ldots , y_s,\ldots )\). We present an approach that combines spatiotemporal constraints (Eq. 5). See Sect. 3 for more details.

Spatial and Temporal Constraints: The spatial and temporal information is known to be an integral sensory component that guides human action [12]. There exists a wide literature utilizing these two constraints for various computer vision tasks such as learning better object detectors [34], action recognition [13] etc. In this work, we take a first step to exploit spatiotemporal constraints for video retargeting and unpaired image-to-image translation.

Learning Association: Much of computer vision is about learning association, be it high-level image classification [38], object relationships [32], or point-wise correspondences [2, 24, 29, 31]. However, there has been relatively little work on learning association for aligning the content of different videos. In this work, we use our model trained with spatiotemporal constraints to align the semantic content of two videos in a self-supervised manner, and to automatically align visual data without any additional supervision.

3 Method

Assume we wish to learn a mapping \(G_Y : X \rightarrow Y\). The classic approach tunes \(G_Y\) to minimize reconstruction error on paired data samples \(\{(x_i,y_i)\}\) where \(x_i \in X\) and \(y_i \in Y\):

$$\begin{aligned} \min _{G_Y} \sum _i ||y_i - G_Y(x_i)||^2. \end{aligned}$$
(1)

Adversarial Loss: Recent work [15, 23] has shown that one can improve the learned mapping by tuning it with a discriminator \(D_Y\) that is adversarially trained to distinguish real samples of y from generated samples \(G_Y(x)\):

$$\begin{aligned} \min _{G_Y} \max _{D_Y} L_{g}(G_Y,D_Y) = \sum _s \log D_Y(y_s) + \sum _t \log (1 - D_Y(G_Y(x_t))), \end{aligned}$$
(2)

Importantly, we use a formulation that does not require paired data and only requires access to individual samples \(\{x_t\}\) and \(\{y_s\}\), where different subscripts are used to emphasize the lack of pairing.
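
For concreteness, the following is a minimal PyTorch sketch of the adversarial objective in Eq. (2); the single-layer networks standing in for \(G_Y\) and \(D_Y\) are illustrative placeholders rather than the architectures used in our experiments (described under the implementation details below).

```python
# A minimal sketch of Eq. (2). G_Y and D_Y are single-layer stand-ins
# (hypothetical placeholders) for the ResNet generator and 70x70 PatchGAN
# discriminator described later in this section.
import torch
import torch.nn as nn

G_Y = nn.Conv2d(3, 3, 3, padding=1)            # stand-in generator, X -> Y
D_Y = nn.Conv2d(3, 1, 4, stride=2, padding=1)  # stand-in patch discriminator
bce = nn.BCEWithLogitsLoss()

def adversarial_losses(x_t, y_s):
    """Returns (discriminator loss, generator loss) for unpaired batches x_t, y_s."""
    fake_y = G_Y(x_t)
    real_logits, fake_logits = D_Y(y_s), D_Y(fake_y.detach())
    # D_Y: push real samples toward 1 and generated samples toward 0.
    d_loss = bce(real_logits, torch.ones_like(real_logits)) + \
             bce(fake_logits, torch.zeros_like(fake_logits))
    # G_Y: try to fool D_Y on generated samples.
    gen_logits = D_Y(fake_y)
    g_loss = bce(gen_logits, torch.ones_like(gen_logits))
    return d_loss, g_loss

# Mini-batches drawn independently from the two domains (no pairing assumed).
d_loss, g_loss = adversarial_losses(torch.randn(2, 3, 256, 256),
                                    torch.randn(2, 3, 256, 256))
```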

Cycle Loss: Zhu et al. [53] use cycle consistency [51] to define a reconstruction loss when the pairs are not available. Popularly known as Cycle-GAN (Fig. 3-b), the objective can be written as:

$$\begin{aligned} L_c(G_{X}, G_{Y}) = \sum _t ||x_t - G_{X}(G_{Y}(x_t))||^2. \end{aligned}$$
(3)
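
A minimal sketch of this loss, assuming \(G_X\) and \(G_Y\) are generator modules mapping \(Y \rightarrow X\) and \(X \rightarrow Y\) and operating on batched image tensors:

```python
# A minimal sketch of the cycle-consistency loss in Eq. (3); G_X and G_Y are
# assumed to be generator modules operating on torch tensors.
def cycle_loss(x_batch, G_X, G_Y):
    reconstruction = G_X(G_Y(x_batch))      # X -> Y -> X
    return ((x_batch - reconstruction) ** 2).mean()
```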

Recurrent Loss: So far we have considered the setting where only static data is available. Instead, assume that we have access to unpaired but ordered streams \((x_1,x_2,\ldots ,x_t,\ldots )\) and \((y_1,y_2,\ldots , y_s,\ldots )\). Our motivating application is learning a mapping between two videos from different domains. One option is to ignore the stream indices and treat the data as an unpaired and unordered collection of samples from X and Y (e.g., learn mappings between shuffled video frames). We demonstrate that a much better mapping can be learned by taking advantage of the temporal ordering. To describe our approach, we first introduce a recurrent temporal predictor \(P_X\) that is trained to predict future samples in a stream given its past:

$$\begin{aligned} L_\tau (P_X) = \sum _t ||x_{t+1} - P_X(x_{1:t})||^2, \end{aligned}$$
(4)

where we write \(x_{1:t} = (x_1 \ldots x_t)\).
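
A sketch of this loss is shown below; following our implementation details, the predictor is approximated by a network that looks only at the last two frames, and the single convolution here is a placeholder for the U-Net used in practice.

```python
# A sketch of the recurrent loss in Eq. (4). The predictor sees only the last
# two frames, concatenated along the channel dimension; the single convolution
# below is a placeholder for the U-Net used in the actual implementation.
import torch
import torch.nn as nn

class TwoFramePredictor(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, frames):               # frames: list of (B, C, H, W) tensors
        return self.net(torch.cat(frames[-2:], dim=1))

def recurrent_loss(stream, P):
    """Sum of squared errors between x_{t+1} and P(x_{1:t}) over a stream."""
    loss = 0.0
    for t in range(1, len(stream) - 1):      # needs at least two past frames
        loss = loss + ((stream[t + 1] - P(stream[:t + 1])) ** 2).mean()
    return loss

P_X = TwoFramePredictor()
stream_x = [torch.randn(1, 3, 256, 256) for _ in range(5)]
loss = recurrent_loss(stream_x, P_X)
```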

Recycle Loss: We use this temporal prediction model to define a new cycle loss across domains and time (Fig. 3-c), which we refer to as a recycle loss:

$$\begin{aligned} L_{r}(G_X, G_Y, P_Y) = \sum _t ||x_{t+1} - G_X(P_Y(G_Y(x_{1:t})))||^2, \end{aligned}$$
(5)

where \(G_Y(x_{1:t}) = (G_Y(x_1), \ldots , G_Y(x_t))\). Intuitively, the above loss requires sequences of frames to map back to themselves. We demonstrate in Fig. 4 that this is a much richer constraint when learning from unpaired data streams.
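
A sketch of the recycle loss, reusing the generator and predictor modules introduced above:

```python
# A sketch of the recycle loss in Eq. (5): translate the stream to domain Y,
# predict the next frame there with P_Y, and map the prediction back to X,
# where it should match the true future frame x_{t+1}.
def recycle_loss(stream_x, G_X, G_Y, P_Y):
    loss = 0.0
    for t in range(1, len(stream_x) - 1):
        translated = [G_Y(x) for x in stream_x[:t + 1]]   # G_Y(x_1), ..., G_Y(x_t)
        predicted_next_y = P_Y(translated)                # predicted y_{t+1}
        loss = loss + ((stream_x[t + 1] - G_X(predicted_next_y)) ** 2).mean()
    return loss
```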

Recycle-GAN: We now combine the recurrent loss, recycle loss, and adversarial loss into our final Recycle-GAN formulation:

$$\begin{aligned}&\min _{G,P} \max _{D} L_{rg}(G,P,D) = L_{g}(G_X,D_X) + L_{g}(G_Y,D_Y) + \\&\lambda _{rx} L_{r}(G_X, G_Y, P_Y) + \lambda _{ry} L_{r}(G_Y, G_X, P_X) + \lambda _{{\tau }x} L_{\tau }(P_{X}) + \lambda _{{\tau }y} L_{\tau }(P_{Y}). \end{aligned}$$
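
The sketch below illustrates how the generator-side objective can be assembled from these terms, reusing the recycle and recurrent loss helpers sketched earlier; the adversarial term is written in a least-squares form for brevity, all \(\lambda \) weights are set to 10 as in our implementation details, and the alternating discriminator updates of full GAN training are omitted.

```python
# A hedged sketch of the combined generator-side Recycle-GAN objective,
# reusing the recycle_loss and recurrent_loss helpers above. The adversarial
# term is a least-squares variant used for brevity; discriminator updates and
# optimizer bookkeeping of a full training loop are omitted.
def recycle_gan_generator_objective(stream_x, stream_y,
                                    G_X, G_Y, P_X, P_Y, D_X, D_Y, lam=10.0):
    # Adversarial terms: translated frames should look real to the discriminators.
    adv = sum(((D_Y(G_Y(x)) - 1) ** 2).mean() for x in stream_x) + \
          sum(((D_X(G_X(y)) - 1) ** 2).mean() for y in stream_y)
    return (adv
            + lam * recycle_loss(stream_x, G_X, G_Y, P_Y)   # lambda_rx term
            + lam * recycle_loss(stream_y, G_Y, G_X, P_X)   # lambda_ry term
            + lam * recurrent_loss(stream_x, P_X)           # lambda_tau_x term
            + lam * recurrent_loss(stream_y, P_Y))          # lambda_tau_y term
```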

Inference: At test time, given an input video with frames \(\{x_t\}\), we would like to generate an output video. The simplest strategy is to directly use the trained \(G_Y\) to generate a video frame-by-frame, \(y_t = G_Y(x_t)\). Alternatively, one could use the temporal predictor \(P_Y\) to smooth the output:

$$\begin{aligned} y_{t} = \frac{G_{Y}(x_{t}) + P_{Y}(G_{Y}(x_{1:t-1}))}{2}, \end{aligned}$$

where the linear combination could be replaced with a nonlinear function, possibly learned with the original objective. However, for simplicity, we produce the output video by simple single-frame generation. This allows our framework to be applied to both videos and single images at test time, and produces a fairer comparison to the spatial approach.
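
A minimal sketch of this smoothed inference rule; the first two frames fall back to direct translation, since the two-frame predictor sketched earlier needs at least two past frames.

```python
# A minimal sketch of the smoothed inference rule above: average the direct
# per-frame translation with the predictor's extrapolation from the frames
# already translated.
def smoothed_inference(stream_x, G_Y, P_Y):
    outputs = [G_Y(x) for x in stream_x[:2]]      # first frames: direct translation only
    for t in range(2, len(stream_x)):
        direct = G_Y(stream_x[t])                 # G_Y(x_t)
        predicted = P_Y(outputs[:t])              # P_Y(G_Y(x_{1:t-1}))
        outputs.append(0.5 * (direct + predicted))
    return outputs
```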

Implementation Details: We adopt much of the training detail from Cycle-GAN [53] for our spatial translation model and from Pix2Pix [23] for our temporal prediction model. The generator network consists of two convolutions (downsampling with stride 2), six residual blocks, and two upsampling convolutions (each with a stride of 0.5). We use the same network architecture for \(G_X\) and \(G_Y\). The image resolution for all experiments is set to \(256 \times 256\). The discriminator is a \(70 \times 70\) PatchGAN [23, 53] that classifies whether a \(70 \times 70\) image patch is real or fake. We set all \(\lambda \)s to 10. To implement our temporal predictors \(P_X\) and \(P_Y\), we concatenate the last two frames as input to a network whose architecture is identical to the U-Net architecture [23, 37].
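
A hedged sketch of this generator architecture (two stride-2 downsampling convolutions, six residual blocks, two upsampling convolutions), following the Cycle-GAN design; reflection padding and other details of the reference implementation are simplified.

```python
# A hedged sketch of the ResNet-style generator described above; normalization
# choices and padding are simplified relative to the reference implementation.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.InstanceNorm2d(dim), nn.ReLU(True),
            nn.Conv2d(dim, dim, 3, padding=1), nn.InstanceNorm2d(dim))

    def forward(self, x):
        return x + self.block(x)

class Generator(nn.Module):
    def __init__(self, channels=3, dim=64, n_blocks=6):
        super().__init__()
        layers = [nn.Conv2d(channels, dim, 7, padding=3), nn.ReLU(True),
                  nn.Conv2d(dim, 2 * dim, 3, stride=2, padding=1), nn.ReLU(True),      # downsample
                  nn.Conv2d(2 * dim, 4 * dim, 3, stride=2, padding=1), nn.ReLU(True)]  # downsample
        layers += [ResidualBlock(4 * dim) for _ in range(n_blocks)]                     # 6 residual blocks
        layers += [nn.ConvTranspose2d(4 * dim, 2 * dim, 3, stride=2, padding=1, output_padding=1), nn.ReLU(True),  # upsample
                   nn.ConvTranspose2d(2 * dim, dim, 3, stride=2, padding=1, output_padding=1), nn.ReLU(True),      # upsample
                   nn.Conv2d(dim, channels, 7, padding=3), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

G_X, G_Y = Generator(), Generator()            # identical architectures for both directions
print(G_Y(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 3, 256, 256])
```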

4 Experiments

We now study the influence of spatiotemporal constraints over spatial cyclic constraints. Because our key technical contribution is the introduction of temporal constraints in learning unpaired image mappings, the natural baseline is Cycle-GAN [53], a widely adopted approach that exploits spatial cyclic consistency alone for unpaired image translation. We first present quantitative results on domains where ground-truth correspondence between input and output videos is known (e.g., a video where each frame is paired with a semantic label map). Importantly, this correspondence is not available to either Cycle-GAN or Recycle-GAN, and is used only for evaluation. We then present qualitative results on a diverse set of videos with unknown correspondence, including video translation across different human faces and temporally intricate events found in nature (flowers blooming, sunrise/sunset, time-lapsed weather progressions).

Fig. 4. We compare the performance of our approach for image2labels and labels2image with Cycle-GAN [53] on held-out data from the Viper dataset [36] under various environmental conditions.

4.1 Quantitative Analysis

We use the publicly available Viper [36] dataset for the image2labels and labels2image evaluation. This dataset is collected using a computer game with varied, realistic content and provides densely annotated pixel-level labels. Out of the 77 video sequences covering varying environmental conditions, we use 57 sequences to train our model and the baselines. The held-out 20 sequences are used for evaluation. The goal of this evaluation is not to achieve state-of-the-art performance, but to compare and understand the advantage of spatiotemporal cyclic consistency over spatial cyclic consistency [53]. For our approach, we select the model corresponding to the minimum reconstruction loss.

While prior work [23, 53] has mostly used the Cityscapes dataset [7], we could not use it for our evaluation. The labeled images in Cityscapes do not come from continuous video sequences, and consecutive frames differ drastically from the initial labeled frame, making it non-trivial to use a temporal predictor. We therefore use Viper as a proxy for Cityscapes, since the task is similar and the dataset contains dense video annotations. Additionally, concurrent work [3] on unsupervised video-to-video translation also uses the Viper dataset for evaluation, although it restricts itself to a small subset of daylight and walking sequences, whereas we use all the varying environmental conditions available in the dataset.

Table 1. Image2Labels (Semantic Segmentation): We use the Viper [36] dataset to evaluate the performance improvement from using spatiotemporal constraints as opposed to spatial cyclic consistency alone [53]. We report results using three criteria: (1) Mean Pixel Accuracy (MP); (2) Average Class Accuracy (AC); and (3) Intersection over Union (IoU). Our approach achieves significantly better performance than prior work across all criteria and all conditions.

Image2Labels: In this setting, we use the real-world image as input to the generator, which outputs a segmentation label map. We compute three statistics to compare the outputs of the two approaches: (1) Mean Pixel Accuracy (MP); (2) Average Class Accuracy (AC); and (3) Intersection over Union (IoU). These statistics are computed against the ground truth for the held-out sequences under varying environmental conditions; a sketch of their computation is given below. Table 1 contrasts the performance of our approach (Recycle-GAN) with Cycle-GAN. We observe that Recycle-GAN achieves significantly better performance than Cycle-GAN over all criteria and under all conditions.
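
A hedged sketch of how these three statistics can be computed from a confusion matrix between predicted and ground-truth label maps:

```python
# A hedged sketch of the three evaluation statistics (MP, AC, IoU) computed
# from a confusion matrix between predicted and ground-truth label maps.
import numpy as np

def segmentation_scores(pred, gt, num_classes):
    """pred, gt: integer label maps of the same shape."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt.ravel(), pred.ravel()), 1)       # conf[g, p] += 1
    tp = np.diag(conf).astype(np.float64)
    class_acc = tp / np.maximum(conf.sum(axis=1), 1)     # per-class recall
    iou = tp / np.maximum(conf.sum(axis=1) + conf.sum(axis=0) - tp, 1)
    return {"mean_pixel_acc": tp.sum() / max(conf.sum(), 1),
            "avg_class_acc": class_acc.mean(),
            "iou": iou.mean()}
```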

Labels2Image: In this setting, we use the segmentation label map as input to the generator, which outputs an image that should be close to a real image. The goal of this evaluation is to compare the quality of the output images from both approaches. Following Pix2Pix [23], we run the images generated by each algorithm through a pre-trained FCN-style segmentation model and compare its performance on synthesized images against its performance on real images to compute a normalized FCN-score; a sketch of this protocol is given below. Higher performance on this criterion suggests that the generated images are closer to real images. Table 2 compares the performance of our approach with Cycle-GAN. We observe that our approach achieves better performance overall, and is competitive with Cycle-GAN under some conditions. Figure 4 qualitatively compares our approach with Cycle-GAN.
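
A hedged sketch of this normalized FCN-score protocol; `pretrained_fcn` is a placeholder for any off-the-shelf FCN-style segmenter returning integer label maps.

```python
# A hedged sketch of the normalized FCN-score protocol described above: the
# segmenter's score on synthesized images is normalized by its score on the
# corresponding real images. Reuses segmentation_scores sketched earlier.
def normalized_fcn_score(synth_images, real_images, gt_labels,
                         pretrained_fcn, num_classes, metric="iou"):
    synth = [segmentation_scores(pretrained_fcn(im), gt, num_classes)[metric]
             for im, gt in zip(synth_images, gt_labels)]
    real = [segmentation_scores(pretrained_fcn(im), gt, num_classes)[metric]
            for im, gt in zip(real_images, gt_labels)]
    return sum(synth) / max(sum(real), 1e-8)
```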

Table 2. Normalized FCN score for Labels2Image: We use a pre-trained FCN-style model to evaluate the quality of synthesized images against real images on the Viper [36] dataset. Higher performance on this criterion suggests that the approach produces images that look closer to real images.

In these experiments, we make two observations: (i) Cycle-GAN learns a good translation model within the first few iterations (after seeing only a few examples), but this model degrades as the reconstruction loss continues to decrease. We believe that minimizing the reconstruction loss on the input alone leads it to a bad local minimum, and that the combined spatiotemporal constraint avoids this behavior; (ii) Cycle-GAN learns a better translation model for Cityscapes than for Viper. Cityscapes consists mostly of images taken in daylight and agreeable weather. This is not the case for Viper: being rendered, it has a large and varied distribution of sunlight and weather conditions such as day, night, snow, and rain. This makes it harder to learn a good mapping, because for each labeled input there are potentially many valid output images. We find that standard conditional GANs suffer from mode collapse in such scenarios, producing “average” outputs (as pointed out in prior work [2]). Our experiments suggest that spatiotemporal constraints help ameliorate such challenging translation problems.

Fig. 5. Face to Face: The top row shows multiple examples of face-to-face translation between John Oliver and Stephen Colbert using our approach. The bottom row shows examples of translation from John Oliver to a cartoon character, Barack Obama to Donald Trump, and Martin Luther King Jr. (MLK) to Barack Obama. Without any input alignment or manual supervision, our approach captures stylistic expressions of these public figures, e.g., John Oliver's dimple while smiling, the characteristic shape of Donald Trump's mouth, and the mouth lines and smile of Stephen Colbert. More results and videos are available on our project webpage.

4.2 Qualitative Analysis

Face to Face: We use publicly available videos of various public figures for the face-to-face translation task. The faces are extracted using facial keypoints generated with the OpenPose library [6], and minor manual effort is made to remove false positives. Figure 5 shows examples of face-to-face translation between John Oliver and Stephen Colbert, Barack Obama to Donald Trump, Martin Luther King Jr. (MLK) to Barack Obama, and John Oliver to a cartoon character. Note that without any additional supervisory signal or manual alignment, our approach learns face-to-face translation and captures stylistic expressions of these personalities, such as the dimple on John Oliver's face while smiling, the characteristic shape of Donald Trump's mouth, the facial expressions of Bill Clinton, and the mouth lines of Stephen Colbert.

Flower to Flower: Extending beyond faces and other traditional translation settings, we demonstrate our approach on flowers. We extracted time-lapse footage of various flowers from publicly available videos. The time-lapses show different flowers blooming, but without any synchronization. We use our approach to align the content, i.e., both flowers bloom or wither together. Figure 6 shows how our video retargeting approach can be viewed as a way of learning associations between events in the lives of different flowers.

Fig. 6. Flower to Flower: We show two examples of flower-to-flower translation. Note the smooth transition from left to right. These results are best visualized using the videos on our project webpage.

4.3 Video Manipulation via Retargeting

Clouds and Wind Synthesis: Our approach can be used to synthesize a new video with a desired environmental condition, such as clouds and wind, without physically recapturing the scene. We use the given video and video data from the required environmental condition as the two domains in our experiment. The conditioning video and the trained translation model are then used to generate the required output.

For this experiment, we collected video data for various wind and cloud conditions, such as a calm day or a windy day. Using our approach, we can convert a calm day to a windy day, and a windy day to a calm day, without modifying the aesthetics of the place. Shown in Fig. 7 is an example of synthesizing clouds and wind for a windy day at a place where the only available information was a video captured at the same place with a light breeze. More videos of cloud and wind synthesis are available on our project webpage.

Fig. 7. Synthesizing Clouds & Winds: We use our approach to synthesize clouds and wind. The top row shows example frames of a video captured on a day with a light breeze. We condition it on video data from a windy day (shown in the second row) by learning a transformation between the two domains using our approach. The last row shows the synthesized output video, with the clouds and trees moving faster (giving the impression of wind blowing). Refer to the videos on our project webpage for better visualization and more examples.

Sunrise and Sunset: We extracted sunrise and sunset data from various web videos, and show how our approach can be used for both video manipulation and content alignment. The setting is similar to our clouds and wind synthesis experiments. Figure 8 shows an example of synthesizing a sunrise video from an original sunset video by conditioning it on a sunrise video. We also show examples of alignment between various sunrise and sunset scenes.

Note: We refer the reader to our project webpage for the various videos synthesized using our approach, and for an extension of our work that utilizes both 2D images and videos by combining the cycle loss and recycle loss in a generative adversarial formulation.

4.4 Human Studies

We performed human studies on the synthesized output, particularly faces and flowers, following the protocol of MoCoGAN [43], which also evaluates videos. Our analysis consists of three parts: (1) In the first study, we showed videos synthesized by Cycle-GAN and by our approach individually to 15 sequestered human subjects, and asked whether each was a real or a generated video. The subjects misclassified videos generated by our approach as real 28.3% of the time, compared to 7.3% for Cycle-GAN. (2) In the second study, we showed the videos synthesized by Cycle-GAN and by our approach side by side, and asked which looked more natural and realistic. Human subjects chose the videos synthesized by our approach 76% of the time, Cycle-GAN 8% of the time, and were undecided 16% of the time. (3) The final study extends (2) to video-to-video translation: we also showed the input and asked which output looked like a more realistic and natural translation. Each video was shown to 15 human subjects. The subjects selected our approach 74.7% of the time, Cycle-GAN 13.3% of the time, and were undecided 12% of the time. The human studies clearly show that combining spatial and temporal constraints leads to better retargeting.

Fig. 8. Sunrise & Sunset: We use our approach to manipulate and align videos of sunrise and sunset. The top row shows example frames from a sunset video. We condition it on video data of a sunrise (shown in the second row) by learning a transformation between the two domains using our approach. The third row shows example frames of the newly synthesized sunrise video. Finally, the last row shows random examples of input-output pairs from different sunrise and sunset videos. Videos and more examples are available on our project webpage.

4.5 Failure Example: Learning Association Beyond Data Distribution

We show an example of transformation from a real bird to an origami bird to demonstrate a case where our approach fails to learn the association. The real-bird data was extracted from web videos, and we used the origami bird from the synthesis of Kholgade et al. [25]. Shown in Fig. 9 is the synthesis of the origami bird conditioned on the real bird. While the real bird is sitting, the origami bird stays and attempts to imitate its actions. The problem arises when the bird begins to fly: the initial frames after take-off are fine, but after some time the origami bird reappears. From an association perspective, the origami bird should not have reappeared. Looking back at the training data, we found that the origami-bird footage does not contain a single frame without the origami bird, so our approach cannot associate an output with the case where the real bird is no longer visible. Our approach can, perhaps, only interpolate over the given data distribution and fails to capture anything beyond it. One possible way to address this problem is to use enough training data that the distribution encapsulates all relevant scenarios, allowing effective interpolation.

Fig. 9. Failure Example: We present a failure of association/synthesis for our approach using a transformation from a real bird to an origami bird. The origami bird (output) imitates the real bird (input) while it is sitting (Columns 1-4) and flies away when the real bird flies (Columns 5-6). However, it reappears after some time (red bounding box in Column 7) in a flying pose even though the real bird is no longer present in the input. Our algorithm cannot carry the association across the point where the real bird becomes completely invisible, and so it generates a random flying origami bird. (Color figure online)

5 Discussion and Future Work

In this work, we explore the influence of spatiotemporal constraints on learning video retargeting and image translation. Unpaired video/image translation is a challenging task because it is unsupervised, lacking any correspondence between training samples from the input and output spaces. We point out that many natural visual signals are inherently spatiotemporal, which provides strong temporal constraints for free and results in significantly better mappings. We also note that unpaired and unsupervised video retargeting and image translation is an under-constrained problem, so additional constraints from auxiliary tasks on the visual data itself (as used for other vision tasks [33, 50]) could help in learning better transformation models.

Recycle-GANs learn both a mapping function and a recurrent temporal predictor. Thus far, our results make use of only the mapping function, so as to facilitate fair comparison with previous work, but it is natural to synthesize target videos using both the single-image translation model and the temporal predictor. Additionally, the notion of style in video retargeting could be captured more precisely with spatiotemporal generative models, as this would also allow learning the speed of the generated output. For example, two people may have different styles of content delivery, and one person may take longer than the other to say the same thing; a true notion of style should capture this variation in delivery time as well. We believe that better spatiotemporal neural network architectures could address this problem in the near future. Finally, our work could also utilize the concurrent approach of Huang et al. [22] to learn a one-to-many translation model.