1 Introduction

Style transfer can be approached as reconstructing or synthesizing texture conditioned on the semantic content of a target image [1]. Many pioneering works achieved success in classic texture synthesis, starting with methods that resample pixels [2,3,4,5] or match multi-scale feature statistics [6,7,8]. These methods employ traditional image pyramids obtained from handcrafted multi-scale linear filter banks [9, 10] and perform texture synthesis by matching the feature statistics to the target style. In recent years, the concepts of texture synthesis and style transfer have been revisited within the context of deep learning. Gatys et al. [11] show that the feature correlations (i.e. the Gram Matrix) of a convolutional neural network (CNN) successfully capture image style. This framework has brought a surge of interest in texture synthesis and style transfer using iterative optimization [1, 11, 12] or feed-forward networks [13,14,15,16]. Recent work extends style flexibility using feed-forward networks and achieves multi-style or arbitrary style transfer [17,18,19,20]. These approaches typically encode image style in a 1-dimensional embedding, i.e. by tuning the feature map mean and variance (bias and scale) for different styles. However, the comprehensive appearance of an image style is fundamentally difficult to represent in a 1D embedding space. Figure 3 shows style transfer results using the optimization-based approach [12]: the Gram Matrix representation produces more appealing image quality than the mean and variance of the CNN feature maps.

Fig. 1. Examples of transferred images and the corresponding styles using the proposed MSG-Net.

In addition to image quality, concerns about the flexibility of current feed-forward generative models have been raised by Jing et al. [21], who point out that no generative method can adjust the brush-stroke size in real time. Feeding a generative network a high-resolution content image usually results in unsatisfying output, as shown in Fig. 6. A generative network, being a fully convolutional network (FCN), can accept arbitrary input image sizes. Resizing the style image changes the relative brush size, so a multi-style generative network that matches the image style at run time should naturally enable brush-size control simply by changing the size of the input style image. What prevents current generative models from being aware of the brush size? The 1D style embedding (feature map mean and variance) fundamentally limits the potential for exploring finer behavior in style representations. Therefore, a 2D method is desired for a finer representation of image styles.

Fig. 2. An overview of MSG-Net, the Multi-style Generative Network. The transformation network explicitly matches the feature statistics of the style targets captured by a Siamese network using the proposed CoMatch Layer (introduced in Sect. 3). A pre-trained loss network provides the supervision for MSG-Net learning by minimizing the content and style differences with the targets, as discussed in Sect. 4.2.

As the first contribution of this paper, we introduce a CoMatch Layer, which embeds style in a 2D representation and learns during training to match the second-order feature statistics (Gram Matrix) of the style targets. The CoMatch Layer is differentiable and end-to-end learnable with existing generative network architectures without additional supervision. The proposed CoMatch Layer enables multi-style generation from a single feed-forward network (Fig. 2).

Fig. 3. Comparing 1D and 2D style representation using an optimization-based approach [12]. (a) Input image and style. (b) Style transfer result minimizing the difference of CNN feature map mean and variance. (c) Style transfer result minimizing the difference in Gram Matrix representation.

The second contribution of this paper is the Multi-style Generative Network (MSG-Net), built with the proposed CoMatch Layer and a novel upsampled convolution. MSG-Net, as a feed-forward network, runs in real time after training. Generative networks typically have a decoder part that recovers image details from downsampled representations. Learning a fractionally-strided convolution [22] typically brings checkerboard artifacts. To improve image quality, we employ a strategy we call upsampled convolution, which avoids the checkerboard artifacts by applying an integer-stride convolution that outputs an upsampled feature map (details in Sect. 4.1). In addition, we extend the bottleneck architecture [23] to an upsampling residual block, which reduces computational complexity without losing style versatility by preserving a larger number of channels. Passing identity all the way through the generative network enables the network to extend deeper and converge faster. The experimental results show that MSG-Net achieves superior image fidelity and test speed compared to previous work. We also study the scalability of the model by extending the 100-style MSG-Net to 1K styles using a larger model size and longer training time, and we observe no obvious quality differences. In addition, MSG-Net as a general multi-style strategy is compatible with most existing techniques and progress in style transfer, such as content-style trade-off and interpolation [17], spatial control, color preservation and brush-size control [24, 25].

To our knowledge, MSG-Net is the first to achieve real-time brush-size control in a purely feed-forward manner for multistyle transfer.

1.1 Related Work

Relation to Pyramid Matching. Early methods for texture synthesis were developed using multi-scale image pyramids [4, 6,7,8]. The key discovery of these earlier methods was that realistic texture images could be synthesized by manipulating a white-noise image so that its feature statistics matched those of the target at each pyramid level. Our approach is inspired by these classic methods in that it matches feature statistics within a feed-forward network, but it leverages the advantages of deep learning while moving the computational cost into the training process (feed-forward vs. optimization-based).

Relation to Fusion Layers. Our proposed CoMatch Layer is a kind of fusion layer that takes two inputs (content and style representations). Current work on fusion layers with CNNs includes feature map concatenation and element-wise sum [26,27,28]. However, these approaches are not directly applicable, since they provide no separation of style from content. For style transfer, the generated images should carry neither the semantic information of the style target nor the style of the content image.

Relation to Generative Adversarial Training. The Generative Adversarial Network (GAN) [29], which jointly trains an adversarial generator and discriminator, has catalyzed a surge of interest in the study of image generation [26, 27, 30,31,32,33,34,35,36,37,38,39]. Recent work on image-to-image GANs [26] adopts a conditional GAN to provide a general solution for image-to-image generation problems for which it was previously hard to define a loss function. However, the style transfer problem cannot be tackled with the conditional GAN framework, due to the absence of ground-truth image pairs. Instead, we follow prior work [13, 14] and adopt a discriminator/loss network that minimizes the perceptual difference between synthesized images and the content and style targets, providing the supervision for learning the generative network. The initial idea of employing the Gram Matrix to trigger style synthesis is inspired by recent work [30] that suggests using an encoder instead of a random vector in the GAN framework.

Recent Work in Multiple or Arbitrary Style Transfer. Recent and concurrent work explores multiple or arbitrary style transfer [17, 18, 20]. A style-swap layer is proposed in [20], but it yields lower quality and slower speed than existing feed-forward approaches. Adaptive instance normalization is introduced in [18] to match the mean and variance of the feature maps with the style target. In contrast, our CoMatch Layer matches the second-order statistics (Gram Matrices) of the feature maps. We also explore the scalability of our approach in the experiments (Sect. 5).

2 Content and Style Representation

CNNs pre-trained on a very large dataset such as ImageNet can be regarded as descriptive representations of image statistics containing both semantic content and style information. Gatys et al. [12] provide explicit representations that independently model the image content and style from CNN features, which we briefly describe in this section for completeness.

The semantic content of the image can be represented as the activations of the descriptive network at the i-th scale, \(\mathcal {F}^i(x)\in \mathbb {R}^{C_i\times H_i\times W_i}\), for a given input image x, where \(C_i\), \(H_i\) and \(W_i\) are the number of feature map channels, the feature map height and the feature map width. The texture or style of the image can be represented as the distribution of the features using the Gram Matrix \(\mathcal {G}(\mathcal {F}^i(x))\in \mathbb {R}^{C_i\times C_i}\), given by

$$\begin{aligned} \mathcal {G}\left( \mathcal {F}^i(x)\right) = \sum _{h=1}^{H_i}\sum _{w=1}^{W_i}\mathcal {F}^i_{h,w}(x){\mathcal {F}^i_{h,w}(x)}^T . \end{aligned}$$
(1)

The Gram Matrix is orderless and describes the feature distributions. For zero-centered data, the Gram Matrix is the same as the covariance matrix scaled by the number of elements \(C_i\times H_i\times W_i\). It can be computed efficiently by first reshaping the feature map as \(\varPhi \left( \mathcal {F}^i(x)\right) \in \mathbb {R}^{C_i\times (H_iW_i)}\), where \(\varPhi ()\) is a reshaping operation. Then the Gram Matrix can be written as \(\mathcal {G}\left( \mathcal {F}^i(x)\right) = \varPhi \left( \mathcal {F}^i(x)\right) \varPhi \left( \mathcal {F}^i(x)\right) ^T\).
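For concreteness, the following is a minimal sketch of this reshaping-based Gram Matrix computation in PyTorch (one of the frameworks used for our implementation). The batch dimension and the function name are illustrative additions, not taken from the released code.

```python
import torch

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    """Gram Matrix of Eq. 1 for a batch of feature maps of shape B x C x H x W."""
    b, c, h, w = feat.size()
    phi = feat.view(b, c, h * w)                 # reshape Phi(F^i(x)) to B x C x (H*W)
    return torch.bmm(phi, phi.transpose(1, 2))   # B x C x C, summing over spatial positions
```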

Fig. 4. Left: fractionally-strided convolution. Right: upsampled convolution, which reduces the checkerboard artifacts by applying an integer-stride convolution and outputting an upsampled feature map.

3 CoMatch Layer

In this section, we introduce the CoMatch Layer, which explicitly matches second-order feature statistics of the given styles. For a given content target \(x_c\) and style target \(x_s\), the content and style representations at the i-th scale using the descriptive network can be written as \(\mathcal {F}^i(x_c)\) and \(\mathcal {G}(\mathcal {F}^i(x_s))\), respectively. A desirable solution \(\hat{\mathcal {Y}^i}\) preserves the semantic content of the input image while matching the target style feature statistics:

$$\begin{aligned} \begin{aligned} \hat{\mathcal {Y}^i} = \mathop {\text {argmin}}\limits _{\mathcal {Y}^i} \{ \Vert \mathcal {Y}^i-\mathcal {F}^i(x_c)\Vert ^2_F \quad \\ +\, \alpha \Vert \mathcal {G}(\mathcal {Y}^i)-\mathcal {G}\left( \mathcal {F}^i(x_s)\right) \Vert ^2_F \}. \end{aligned} \end{aligned}$$
(2)

where \(\alpha \) is a trade-off parameter balancing the contribution of the content and style targets.

Fig. 5. We extend the original down-sampling residual architecture (left) to an up-sampling version (right). We use a 1 \(\times \) 1 fractionally-strided convolution as a shortcut and adopt reflectance padding.

Fig. 6. Comparison of brush-size control. (a) High-resolution input image and dense styles. (b) Style transfer results using MSG-Net with brush-size control. (c) Standard generative network [14] without brush-size control. See also Fig. 8.

The above minimization problem can be solved with an iterative approach, but this is infeasible in real time and does not yield a differentiable model. However, we can still approximate the solution and shift the computational burden to the training stage. We introduce an approximation which tunes the feature map based on the target style:

$$\begin{aligned} \hat{\mathcal {Y}^i} = \varPhi ^{-1}\left[ \varPhi \left( \mathcal {F}^i(x_c)\right) ^TW\mathcal {G}\left( \mathcal {F}^i(x_s)\right) \right] ^T, \end{aligned}$$
(3)

where \(W\in \mathbb {R}^{C_i\times C_i}\) is a learnable weight matrix and \(\varPhi ()\) is a reshaping operation to match the dimension, so that \(\varPhi \left( \mathcal {F}^i(x_c)\right) \in \mathbb {R}^{C_i\times (H_iW_i)}\). For intuition on the functionality of W, suppose \(W={\mathcal {G}\left( \mathcal {F}^i(x_s)\right) }^{-1}\); then the first term in Eq. 2 (the content term) is minimized. Now let \(W=\varPhi \left( \mathcal {F}^i(x_c)\right) ^{-T}{\mathcal {L}(\mathcal {F}^i(x_s))}^{-1}\), where \(\mathcal {L}\left( \mathcal {F}^i(x_s)\right) \) is obtained by the Cholesky decomposition of \(\mathcal {G}\left( \mathcal {F}^i(x_s)\right) =\mathcal {L}\left( \mathcal {F}^i(x_s)\right) {\mathcal {L}\left( \mathcal {F}^i(x_s)\right) }^T\); then the second term of Eq. 2 (the style term) is minimized. We let W be learned directly from the loss function to dynamically balance the trade-off. The CoMatch Layer is differentiable, can be inserted into existing generative networks, and is learned directly from the loss function without any additional supervision.
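A minimal sketch of the CoMatch Layer of Eq. 3 as a PyTorch module is shown below, reusing the gram_matrix helper from Sect. 2. The batch handling, the identity initialization of W and the set_target interface are illustrative assumptions; the released implementation may differ in detail.

```python
import torch
import torch.nn as nn

class CoMatch(nn.Module):
    """Implements Eq. 3: Y = Phi^{-1}[ Phi(F(x_c))^T  W  G(F(x_s)) ]^T."""

    def __init__(self, channels: int):
        super().__init__()
        # Learnable W in R^{C x C}, trained end-to-end from the loss in Eq. 4.
        self.weight = nn.Parameter(torch.eye(channels).unsqueeze(0))
        self.gram_target = None   # G(F^i(x_s)), set from the Siamese style branch

    def set_target(self, gram: torch.Tensor):
        self.gram_target = gram   # B x C x C

    def forward(self, content_feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = content_feat.size()
        phi = content_feat.view(b, c, h * w)                             # Phi(F^i(x_c)): B x C x (HW)
        wg = torch.bmm(self.weight.expand(b, -1, -1), self.gram_target)  # W G: B x C x C
        y = torch.bmm(phi.transpose(1, 2), wg)                           # B x (HW) x C
        return y.transpose(1, 2).reshape(b, c, h, w)                     # Phi^{-1}([...]^T)
```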

4 Multi-style Generative Network

4.1 Network Architecture

Prior feed-forward single-style transfer work learns a generator network that takes only the content image as input and outputs the transferred image, i.e. the generator network can be expressed as \(G(x_c)\), and it implicitly learns the feature statistics of the style image from the loss function. We introduce a Multi-style Generative Network which takes both the content and the style target as inputs, i.e. \(G(x_c, x_s)\). The proposed network explicitly matches the feature statistics of the style targets at run time.

As part of the generator network, we adopt a Siamese network sharing weights with the encoder part of the transformation network; it captures the feature statistics of the style image \(x_s\) at different scales and outputs the Gram Matrices \(\{\mathcal {G}(\mathcal {F}^i(x_s))\}\) \((i=1,\ldots ,K)\), where K is the total number of scales. The transformation network then takes the content image \(x_c\) and matches the feature statistics of the style image at multiple scales with CoMatch Layers.
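This two-input structure can be sketched as follows, combining a shared encoder, CoMatch Layers at K scales and a decoder. It is a simplified illustration under the assumption that the encoder returns a list of K feature maps; the class and method names (MSGNetSketch, set_style) are our own, and gram_matrix and CoMatch refer to the sketches above.

```python
import torch
import torch.nn as nn

class MSGNetSketch(nn.Module):
    def __init__(self, encoder: nn.Module, decoder: nn.Module, channels_per_scale):
        super().__init__()
        self.encoder = encoder    # shared by the content branch and the Siamese style branch
        self.decoder = decoder
        self.comatch = nn.ModuleList([CoMatch(c) for c in channels_per_scale])

    def set_style(self, x_s: torch.Tensor):
        # Siamese pass: capture {G(F^i(x_s))} at K scales and store them as CoMatch targets.
        for layer, feat in zip(self.comatch, self.encoder(x_s)):
            layer.set_target(gram_matrix(feat))

    def forward(self, x_c: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(x_c)
        matched = [layer(f) for layer, f in zip(self.comatch, feats)]
        return self.decoder(matched[-1])   # simplified: decode from the deepest matched scale
```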

Upsampled Convolution. Standard CNNs for image-to-image tasks typically adopt an encoder-decoder framework, because it is efficient to put heavy operations (style switching) on smaller feature maps and it is also important to keep a large receptive field for preserving semantic coherence. The decoder part learns a fractionally-strided convolution to recover detail from the downsampled feature maps. However, fractionally-strided convolution [22] typically introduces checkerboard artifacts [40]. Prior work suggests replacing the standard fractionally-strided convolution with upsampling followed by convolution [40]. However, this strategy decreases the receptive field, and it is inefficient to apply convolution on an upsampled area. Instead, we use an upsampled convolution, which has an integer stride and outputs upsampled feature maps. For an upsampling factor of 2, the upsampled convolution produces a 2 \(\times \) 2 output for each convolutional window, as visualized in Fig. 4. Compared to fractionally-strided convolution, this method has the same computational complexity and 4 times the parameters. This strategy successfully avoids upsampling artifacts in the network decoder.
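One way to realize such an upsampled convolution is to let an ordinary stride-1 convolution emit 4× the output channels (hence 4 times the parameters) and rearrange each 2 \(\times \) 2 group of outputs spatially, as in sub-pixel convolution. The sketch below illustrates this interpretation and is not necessarily the exact implementation.

```python
import torch
import torch.nn as nn

class UpsampledConv2d(nn.Module):
    """Integer-stride convolution that outputs a feature map upsampled by `scale`."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3, scale: int = 2):
        super().__init__()
        self.pad = nn.ReflectionPad2d(kernel_size // 2)        # reflectance padding (Sect. 4.1)
        self.conv = nn.Conv2d(in_ch, out_ch * scale * scale, kernel_size)
        self.shuffle = nn.PixelShuffle(scale)                  # 2x2 outputs per window -> space

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.shuffle(self.conv(self.pad(x)))            # B x out_ch x (scale*H) x (scale*W)
```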

Fig. 7. Content and style trade-off and interpolation.

Upsample Residual Block. Deep residual learning has achieved great success in visual recognition [23, 41]. The bottleneck residual architecture plays an important role by reducing computational complexity while preserving a large number of feature map channels, and thus without losing diversity. We extend the original architecture to an upsampling version, as shown in Fig. 5 (right), which uses a fractionally-strided convolution [22] as the shortcut and adopts reflectance padding to avoid artifacts in the generative process. This upsampling residual architecture allows us to pass identity all the way through the network, so that the network converges faster and can be extended deeper.
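A hedged sketch of the upsampling residual (bottleneck) block of Fig. 5 (right) is given below, using a 1 \(\times \) 1 fractionally-strided convolution as the shortcut and the UpsampledConv2d sketch above in the residual branch. The exact channel widths and normalization placement are assumptions.

```python
import torch
import torch.nn as nn

class UpBottleneck(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, scale: int = 2):
        super().__init__()
        mid = out_ch // 4
        # 1x1 fractionally-strided convolution carries the identity/shortcut path.
        self.shortcut = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=1,
                                           stride=scale, output_padding=scale - 1)
        self.residual = nn.Sequential(
            nn.Conv2d(in_ch, mid, kernel_size=1),
            nn.InstanceNorm2d(mid), nn.ReLU(inplace=True),
            UpsampledConv2d(mid, mid, kernel_size=3, scale=scale),
            nn.InstanceNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_ch, kernel_size=1),
            nn.InstanceNorm2d(out_ch),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.shortcut(x) + self.residual(x)   # identity passes through the whole block
```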

Brush Stroke Size Control. Feeding the generative model a high-resolution image usually results in unsatisfying style transfer outputs, as shown in Fig. 6(c). Controlling the brush stroke size can be achieved with an optimization-based approach [25]. Resizing the style image changes the brush size, so a feed-forward generative model that matches feature statistics at run time should naturally achieve brush-stroke size control. However, prior work is mainly limited by the 1D style embedding, because this finer style behavior cannot be captured by feature map mean and variance alone. With MSG-Net, the CoMatch Layer, which matches second-order statistics, elegantly solves brush-size control. During training, we feed the network with different style image sizes so that it learns from different brush stroke sizes. After training, the user can control the brush stroke size by changing the size of the style input image. Note that MSG-Net can accept different input sizes for the style and content images. Example results are shown in Fig. 8.
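At test time, brush-size control then reduces to resizing the style image before setting the style target, as in the hypothetical helper below; net and set_style follow the sketch earlier in this section and are assumptions.

```python
import torch.nn.functional as F

def stylize(net, x_c, x_s, style_size: int = 512):
    # Changing style_size changes the relative brush-stroke scale in the output.
    x_s = F.interpolate(x_s, size=(style_size, style_size),
                        mode='bilinear', align_corners=False)
    net.set_style(x_s)     # Siamese pass captures the Gram Matrices at this scale
    return net(x_c)        # the content image may have a different, arbitrary size
```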

Fig. 8. Brush-size control using MSG-Net. Top left: High-resolution input image and dense style. Others: Style transfer results using MSG-Net with brush-size control.

Other Details. We only use in-network downsampling (convolution) and upsampling (upsampled convolution) in the transformation network. We use reflectance padding to avoid artifacts at the border. Instance normalization [16] and ReLU are used after weight layers (convolution, fractionally-strided convolution and the CoMatch Layer), which improves the generated image quality and makes the network robust to image contrast changes.

4.2 Network Learning

Style transfer is an open problem, since there is no gold-standard ground truth to follow. We follow previous work in minimizing a weighted combination of the style and content differences between the generator network outputs and the targets for a given pre-trained loss network \(\mathcal {F}\) [13, 14]. Let the generator network be denoted by \(G(x_c,x_s)\), parameterized by weights \(W_G\). Learning proceeds by sampling content images \(x_c\sim X_c\) and style images \(x_s\sim X_s\) and then adjusting the parameters \(W_G\) of the generator \(G(x_c,x_s)\) to minimize the loss:

$$\begin{aligned} \hat{W}_G = \mathop {\text {argmin}}\limits _{W_G} E_{x_c,x_s}\Big \{ \,&\lambda _c\Vert \mathcal {F}^c\left( G(x_c,x_s)\right) -\mathcal {F}^c(x_c)\Vert _F^2 \\&+ \lambda _s\sum _{i=1}^K\Vert \mathcal {G}\left( \mathcal {F}^i(G(x_c,x_s))\right) -\mathcal {G}\left( \mathcal {F}^i(x_s)\right) \Vert ^2_F \\&+ \lambda _{TV}\,\ell _{TV}\left( G(x_c,x_s)\right) \Big \} , \end{aligned}$$
(4)

where \(\lambda _c\) and \(\lambda _s\) are the balancing weights for the content and style losses. We consider image content at scale c and image style at scales \(i\in \{1,\ldots ,K\}\). \(\ell _{TV}()\) is the total variation regularization used in prior work to encourage smoothness of the generated images [14, 42, 43].
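A minimal sketch of this objective, reusing the gram_matrix helper from Sect. 2, is shown below. The loss_net is assumed to return the K feature maps used for the style term, with the content scale picked out by an index; the total-variation form shown is one common choice and is illustrative.

```python
import torch

def total_variation(img: torch.Tensor) -> torch.Tensor:
    # One common TV regularizer: squared differences of neighboring pixels.
    return ((img[:, :, 1:, :] - img[:, :, :-1, :]) ** 2).sum() + \
           ((img[:, :, :, 1:] - img[:, :, :, :-1]) ** 2).sum()

def msg_loss(loss_net, y, x_c, x_s, lam_c=1.0, lam_s=5.0, lam_tv=1e-6, content_idx=1):
    f_y, f_c, f_s = loss_net(y), loss_net(x_c), loss_net(x_s)   # lists of K feature maps each
    content = ((f_y[content_idx] - f_c[content_idx]) ** 2).sum()            # content term
    style = sum(((gram_matrix(a) - gram_matrix(b)) ** 2).sum()              # style term, K scales
                for a, b in zip(f_y, f_s))
    return lam_c * content + lam_s * style + lam_tv * total_variation(y)
```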

Fig. 9. The tradeoff between style-flexibility and output-image quality is challenging for generative models. Our approach enables multi-style transfer and has minimal difference in quality compared to the optimization-based Gatys approach [12].

5 Experimental Results

5.1 Style Transfer

Baselines. We use the implementation of Gatys et al. [12] as a gold-standard baseline for style transfer (technical details will be included in the supplementary material). We also compare our approach with state-of-the-art multi-style or arbitrary style transfer methods, including a patch-based approach [20] and 1D style embeddings [17, 18]. The original authors' implementations are used in these experiments.

Table 1. Comparison of model size on disk and inference/test speed in fps (frames/sec) for images of size 256 \(\times \) 256 and 512 \(\times \) 512 on an NVIDIA Titan Xp GPU, averaged over 50 samples. MSG-Net-100 and MSG-Net-1K have 2.3M and 8.9M parameters, respectively.

Method Details. We adopt the 16-layer VGG network [44] pre-trained on ImageNet as the loss network in Eq. 4, because network features learned from a diverse set of images are likely to be generic and informative. We consider the style representation at 4 different scales using the layers ReLU1_2, ReLU2_2, ReLU3_3 and ReLU4_3, and use the content representation at the layer ReLU2_2. The Microsoft COCO dataset [45], which contains around 80,000 natural images, is used as the content image set \(X_c\). We collect 100 style images, chosen from previous work in style transfer. A further 900 real paintings are selected from the open-source artistic dataset wikiart.org [46] as additional style images for training MSG-Net-1K. We follow prior work [13, 14] and adopt Adam [47] to train the network with a learning rate of \(1\times 10^{-3}\). We use the loss function described in Eq. 4 with balancing weights \(\lambda _c=1,\lambda _s=5,\lambda _{TV}=1\times 10^{-6}\) for the content, style and total variation terms. We resize the content images \(x_c\sim X_c\) to \(256\times 256\) and train the network with a batch size of 4 for 80,000 iterations. We cycle the style image size through {256, 512, 768} at every iteration for runtime brush-size control. After training, MSG-Net as a fully convolutional network [22] can accept arbitrary input image sizes. When comparing style transfer approaches, we use the same content image size, resizing the image to 512 along the long side. Our implementations are based on Torch [48], PyTorch [49] and MXNet [50]. Training the MSG-Net-100 model takes roughly 8 h on a Titan Xp GPU.
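The training procedure described above can be summarized by the hedged loop below; the data loader, net, loss_net and msg_loss follow the earlier sketches and are assumptions rather than the released training script.

```python
import itertools
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
style_sizes = itertools.cycle([256, 512, 768])     # cycled every iteration for brush-size control

for it, (x_c, x_s) in enumerate(loader):           # COCO content batch + a sampled style image
    size = next(style_sizes)
    x_s = F.interpolate(x_s, size=(size, size), mode='bilinear', align_corners=False)
    net.set_style(x_s)
    loss = msg_loss(loss_net, net(x_c), x_c, x_s)  # lambda_c=1, lambda_s=5, lambda_TV=1e-6
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if it + 1 >= 80000:                            # 80K iterations with batch size 4
        break
```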

Model Size and Speed Analysis. For mobile applications and cloud services, model size and test speed are crucial. We compare the model size and inference/test speed of style transfer approaches in Table 1. Our proposed MSG-Net-100 has a model size and speed comparable to single-style networks [13, 14]. MSG-Net is faster than the arbitrary style transfer work [18] because it uses a compact learned encoder instead of a pre-trained VGG network.

Fig. 10. Color control using MSG-Net. Left: content and style images. Right: color-preserving transfer result.

Fig. 11. Spatial control using MSG-Net. Left: input image, middle: foreground and background styles, right: style transfer result. (Input image and segmentation mask from Shen et al. [51, 52].)

Qualitative Comparison. Our proposed MSG-Net achieves superior performance compared to state-of-the-art generative network approaches, as shown in Fig. 9. One may argue that the arbitrary style work has better scalability/capacity [18, 20]. Style flexibility and image quality are always a hard trade-off for generative models, and we particularly focus on image quality in this work. More examples of images transferred using MSG-Net are shown in Fig. 12.

Fig. 12. Diverse images generated using a single MSG-Net-100 (2.3M parameters). The first row shows the input content images and the other rows are generated images with different style targets (first column).

Model Scalability. Prior work using 1D style embeddings has achieved success in scaling style transfer towards the goal of arbitrary style transfer [18]. To test the scalability of MSG-Net, we augment the style set to 1K images by adding 900 extra images from wikiart.org [46]. We also build a larger-capacity model, MSG-Net-1K, by increasing the width/channels of the model at the middle stage (64 \(\times \) 64) by a factor of 2, resulting in 8.9M parameters. We increase the training iterations by a factor of 4 (to 320K) and otherwise follow the same training procedure as for MSG-Net-100. We observe no quality degradation when increasing the number of styles (examples are shown in the supplementary material).

5.2 Runtime Manipulation

MSG-Net, as a general approach for real-time style transfer, is compatible with recent progress in both feed-forward and optimization methods, including but not limited to: content-style trade-off and interpolation (Fig. 7), color-preserving transfer (Fig. 10), spatial manipulation (Fig. 11) and brush stroke size control (Figs. 6 and 8). For style interpolation, we use an affine interpolation of our style embedding, following prior work [17, 18]. For color preservation, we match the color of the style image to the content image as in Gatys et al. [24]. Brush-size control has been discussed in Sect. 4.1. We use the segmentation mask provided by Shen et al. [51] for spatial control. The source code and technical details of runtime manipulation will be included in our PyTorch implementation.
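Since the style embedding here is the set of Gram Matrices, style interpolation amounts to setting an affine combination of two styles' Gram Matrices as the CoMatch targets. The snippet below is illustrative only and assumes the MSGNetSketch interface and gram_matrix helper sketched in earlier sections.

```python
def interpolate_styles(net, x_s1, x_s2, alpha: float = 0.5):
    # Affine interpolation of the 2D style embedding (Gram Matrices) at every scale.
    g1 = [gram_matrix(f) for f in net.encoder(x_s1)]
    g2 = [gram_matrix(f) for f in net.encoder(x_s2)]
    for layer, a, b in zip(net.comatch, g1, g2):
        layer.set_target(alpha * a + (1 - alpha) * b)
```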

6 Conclusion and Discussion

To improve the quality and flexibility of generative models in style transfer, we introduce a novel CoMatch Layer that learns to match second-order statistics as the image style representation. The Multi-style Generative Network achieves superior image quality compared to state-of-the-art approaches. In addition, the proposed MSG-Net is compatible with most existing techniques and recent progress in style transfer, including style interpolation, color preservation and spatial control. Moreover, MSG-Net is the first to enable real-time brush-size control in a fully feed-forward manner. The compact MSG-Net-100 model has only 2.3M parameters and runs at more than 90 fps (frames/sec) on an NVIDIA Titan Xp for input images of size 256 \(\times \) 256, and at 15 fps on a laptop GPU (GTX 750M-2GB).