1 Introduction

Style transfer can be approached as reconstructing or synthesizing texture conditioned on the semantic content of a target image [1]. Many pioneering works achieved success in classic texture synthesis, starting with methods that resample pixels [2,3,4,5] or match multi-scale feature statistics [6,7,8]. These methods employ traditional image pyramids obtained from handcrafted multi-scale linear filter banks [9, 10] and perform texture synthesis by matching the feature statistics to the target style. In recent years, the concepts of texture synthesis and style transfer have been revisited within the context of deep learning. Gatys et al. [11] show that the feature correlations (i.e. the Gram Matrix) of a convolutional neural network (CNN) successfully capture image style. This framework has brought a surge of interest in texture synthesis and style transfer using iterative optimization [1, 11, 12] or feed-forward networks [13,14,15,16]. Recent work extends style flexibility using feed-forward networks and achieves multi-style or arbitrary style transfer [17,18,19,20]. These approaches typically encode image style in a 1-dimensional embedding, i.e. by tuning the feature map mean and variance (bias and scale) for different styles. However, the comprehensive appearance of an image style is fundamentally difficult to represent in a 1D embedding space. Figure 3 shows style transfer results using the optimization-based approach [12]: the Gram Matrix representation produces more appealing image quality than the mean and variance of the CNN feature maps.

Fig. 1. Examples of transferred images and the corresponding styles using the proposed MSG-Net.

In addition to image quality, concerns about the flexibility of current feed-forward generative models have been raised by Jing et al. [21], who point out that no generative method can adjust the brush-stroke size in real time. Feeding a generative network a high-resolution content image usually results in unsatisfying output, as shown in Fig. 6. A generative network, being a fully convolutional network (FCN), can accept arbitrary input image sizes. Resizing the style image changes the relative brush size, so a multi-style generative network that matches the image style at run time should naturally enable brush-size control simply by changing the size of the input style image. What prevents current generative models from being aware of the brush size? The 1D style embedding (feature map mean and variance) fundamentally limits the potential for exploring finer behavior in style representations. Therefore, a 2D method is desired for a finer representation of image styles.

Fig. 2. An overview of MSG-Net, the Multi-style Generative Network. The transformation network explicitly matches the feature statistics of the style targets captured by a Siamese network using the proposed CoMatch Layer (introduced in Sect. 3). A pre-trained loss network provides the supervision for MSG-Net learning by minimizing the content and style differences with the targets, as discussed in Sect. 4.2.

As the first contribution of this paper, we introduce a CoMatch Layer, which embeds style in a 2D representation and learns during training to match the second-order feature statistics (Gram Matrix) of the style targets. The CoMatch Layer is differentiable and end-to-end learnable with existing generative network architectures without additional supervision. The proposed CoMatch Layer enables multi-style generation from a single feed-forward network (Fig. 2).

Fig. 3. Comparing 1D and 2D style representation using an optimization-based approach [12]. (a) Input image and style. (b) Style transfer result minimizing the difference of CNN feature map mean and variance. (c) Style transfer result minimizing the difference in Gram Matrix representation.

The second contribution of this paper is the Multi-style Generative Network (MSG-Net), built with the proposed CoMatch Layer and a novel upsampled convolution. MSG-Net, as a feed-forward network, runs in real time after training. Generative networks typically have a decoder part that recovers image details from downsampled representations. Learning a fractionally-strided convolution [22] typically brings checkerboard artifacts. To improve image quality, we employ a strategy we call upsampled convolution, which avoids the checkerboard artifacts by applying an integer-stride convolution that outputs an upsampled feature map (details in Sect. 4.1). In addition, we extend the bottleneck architecture [23] to an upsampling residual block, which reduces computational complexity without losing style versatility by preserving a larger number of channels. Passing identity all the way through the generative network enables the network to extend deeper and converge faster. The experimental results show that MSG-Net achieves superior image fidelity and test speed compared to previous work. We also study the scalability of the model by extending the 100-style MSG-Net to 1K styles using a larger model size and longer training time, and we observe no obvious quality differences. In addition, MSG-Net as a general multi-style strategy is compatible with most existing techniques and progress in style transfer, such as content-style trade-off and interpolation [17], spatial control, color preservation and brush-size control [24, 25].

To our knowledge, MSG-Net is the first to achieve real-time brush-size control in a purely feed-forward manner for multistyle transfer.

1.1 Related Work

Relation to Pyramid Matching. Early methods for texture synthesis were developed using multi-scale image pyramids [4, 6,7,8]. The key discovery of these earlier methods was that realistic texture images could be synthesized by manipulating a white-noise image so that its feature statistics matched those of the target at each pyramid level. Our approach is inspired by these classic methods in that it matches feature statistics within a feed-forward network, but it leverages the advantages of deep learning while moving the computational cost into the training process (feed-forward vs. optimization-based).

Relation to Fusion Layers. Our proposed CoMatch Layer is a kind of fusion layer that takes two inputs (content and style representations). Current work on fusion layers with CNNs includes feature map concatenation and element-wise sum [26,27,28]. However, these approaches are not directly applicable, since they provide no separation of style from content. For style transfer, the generated images should carry neither the semantic information of the style target nor the style of the content image.

Relation to Generative Adversarial Training. The Generative Adversarial Network (GAN) [29], which jointly trains an adversarial generator and discriminator, has catalyzed a surge of interest in the study of image generation [26, 27, 30,31,32,33,34,35,36,37,38,39]. Recent work on image-to-image GANs [26] adopts a conditional GAN to provide a general solution for image-to-image generation problems for which it was previously hard to define a loss function. However, the style transfer problem cannot be tackled with the conditional GAN framework, due to the absence of ground-truth image pairs. Instead, we follow prior work [13, 14] and adopt a discriminator/loss network that minimizes the perceptual difference between synthesized images and the content and style targets, providing the supervision for learning the generative network. The initial idea of employing the Gram Matrix to trigger style synthesis is inspired by recent work [30] that suggests using an encoder instead of a random vector in the GAN framework.

Recent Work in Multiple or Arbitrary Style Transfer. Recent and concurrent work explores multiple or arbitrary style transfer [17, 18, 20]. A style-swap layer is proposed in [20], but it yields lower quality and slower speed than existing feed-forward approaches. Adaptive instance normalization is introduced in [18] to match the mean and variance of the feature maps with the style target. In contrast, our CoMatch Layer matches the second-order statistics (Gram Matrices) of the feature maps. We also explore the scalability of our approach in the experiments (Sect. 5).

2 Content and Style Representation

CNNs pre-trained on a very large dataset such as ImageNet can be regarded as descriptive representations of image statistics containing both semantic content and style information. Gatys et al. [12] provide explicit representations that independently model the image content and style from CNN features, which we briefly describe in this section for completeness.

The semantic content of the image can be represented as the activations of the descriptive network at the i-th scale, \(\mathcal {F}^i(x)\in \mathbb {R}^{C_i\times H_i\times W_i}\), for a given input image x, where \(C_i\), \(H_i\) and \(W_i\) are the number of feature map channels, the feature map height and the feature map width. The texture or style of the image can be represented as the distribution of the features using the Gram Matrix \(\mathcal {G}(\mathcal {F}^i(x))\in \mathbb {R}^{C_i\times C_i}\), given by

$$\begin{aligned} \mathcal {G}\left( \mathcal {F}^i(x)\right) = \sum _{h=1}^{H_i}\sum _{w=1}^{W_i}\mathcal {F}^i_{h,w}(x){\mathcal {F}^i_{h,w}(x)}^T . \end{aligned}$$
(1)

The Gram Matrix is orderless and describes the feature distributions. For zero-centered data, the Gram Matrix is the same as the covariance matrix scaled by the number of elements \(C_i\times H_i\times W_i\). It can be computed efficiently by first reshaping the feature map as \(\varPhi \left( \mathcal {F}^i(x)\right) \in \mathbb {R}^{C_i\times (H_iW_i)}\), where \(\varPhi ()\) is a reshaping operation. Then the Gram Matrix can be written as \(\mathcal {G}\left( \mathcal {F}^i(x)\right) = \varPhi \left( \mathcal {F}^i(x)\right) \varPhi \left( \mathcal {F}^i(x)\right) ^T\).
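For concreteness, the following is a minimal sketch of this reshaping-based Gram Matrix computation in PyTorch (one of the frameworks used for our implementation). The batch dimension and the function name are illustrative additions, not taken from the released code.

```python
import torch

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    """Gram Matrix of Eq. 1 for a batch of feature maps of shape B x C x H x W."""
    b, c, h, w = feat.size()
    phi = feat.view(b, c, h * w)                 # reshape Phi(F^i(x)) to B x C x (H*W)
    return torch.bmm(phi, phi.transpose(1, 2))   # B x C x C, summing over spatial positions
```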

Fig. 4. Left: fractionally-strided convolution. Right: upsampled convolution, which reduces the checkerboard artifacts by applying an integer-stride convolution and outputting an upsampled feature map.

3 CoMatch Layer

In this section, we introduce the CoMatch Layer, which explicitly matches second-order feature statistics of the given styles. For a given content target \(x_c\) and style target \(x_s\), the content and style representations at the i-th scale using the descriptive network can be written as \(\mathcal {F}^i(x_c)\) and \(\mathcal {G}(\mathcal {F}^i(x_s))\), respectively. A desirable solution \(\hat{\mathcal {Y}^i}\) preserves the semantic content of the input image while matching the target style feature statistics:

$$\begin{aligned} \begin{aligned} \hat{\mathcal {Y}^i} = \mathop {\text {argmin}}\limits _{\mathcal {Y}^i} \{ \Vert \mathcal {Y}^i-\mathcal {F}^i(x_c)\Vert ^2_F \quad \\ +\, \alpha \Vert \mathcal {G}(\mathcal {Y}^i)-\mathcal {G}\left( \mathcal {F}^i(x_s)\right) \Vert ^2_F \}. \end{aligned} \end{aligned}$$
(2)

where \(\alpha \) is a trade-off parameter balancing the contribution of the content and style targets.

Fig. 5. We extend the original down-sampling residual architecture (left) to an up-sampling version (right). We use a 1 \(\times \) 1 fractionally-strided convolution as a shortcut and adopt reflectance padding.

Fig. 6. Comparison of brush-size control. (a) High-resolution input image and dense styles. (b) Style transfer results using MSG-Net with brush-size control. (c) Standard generative network [14] without brush-size control. See also Fig. 8.

The above minimization problem can be solved with an iterative approach, but this is infeasible in real time and does not yield a differentiable model. However, we can still approximate the solution and shift the computational burden to the training stage. We introduce an approximation which tunes the feature map based on the target style:

$$\begin{aligned} \hat{\mathcal {Y}^i} = \varPhi ^{-1}\left[ \varPhi \left( \mathcal {F}^i(x_c)\right) ^TW\mathcal {G}\left( \mathcal {F}^i(x_s)\right) \right] ^T, \end{aligned}$$
(3)

where \(W\in \mathbb {R}^{C_i\times C_i}\) is a learnable weight matrix and \(\varPhi ()\) is a reshaping operation to match the dimension, so that \(\varPhi \left( \mathcal {F}^i(x_c)\right) \in \mathbb {R}^{C_i\times (H_iW_i)}\). For intuition on the functionality of W, suppose \(W={\mathcal {G}\left( \mathcal {F}^i(x_s)\right) }^{-1}\); then the first term in Eq. 2 (the content term) is minimized. Now let \(W=\varPhi \left( \mathcal {F}^i(x_c)\right) ^{-T}{\mathcal {L}(\mathcal {F}^i(x_s))}^{-1}\), where \(\mathcal {L}\left( \mathcal {F}^i(x_s)\right) \) is obtained by the Cholesky decomposition of \(\mathcal {G}\left( \mathcal {F}^i(x_s)\right) =\mathcal {L}\left( \mathcal {F}^i(x_s)\right) {\mathcal {L}\left( \mathcal {F}^i(x_s)\right) }^T\); then the second term of Eq. 2 (the style term) is minimized. We let W be learned directly from the loss function to dynamically balance the trade-off. The CoMatch Layer is differentiable, can be inserted into existing generative networks, and is learned directly from the loss function without any additional supervision.
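A minimal sketch of the CoMatch Layer of Eq. 3 as a PyTorch module is shown below, reusing the gram_matrix helper from Sect. 2. The batch handling, the identity initialization of W and the set_target interface are illustrative assumptions; the released implementation may differ in detail.

```python
import torch
import torch.nn as nn

class CoMatch(nn.Module):
    """Implements Eq. 3: Y = Phi^{-1}[ Phi(F(x_c))^T  W  G(F(x_s)) ]^T."""

    def __init__(self, channels: int):
        super().__init__()
        # Learnable W in R^{C x C}, trained end-to-end from the loss in Eq. 4.
        self.weight = nn.Parameter(torch.eye(channels).unsqueeze(0))
        self.gram_target = None   # G(F^i(x_s)), set from the Siamese style branch

    def set_target(self, gram: torch.Tensor):
        self.gram_target = gram   # B x C x C

    def forward(self, content_feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = content_feat.size()
        phi = content_feat.view(b, c, h * w)                             # Phi(F^i(x_c)): B x C x (HW)
        wg = torch.bmm(self.weight.expand(b, -1, -1), self.gram_target)  # W G: B x C x C
        y = torch.bmm(phi.transpose(1, 2), wg)                           # B x (HW) x C
        return y.transpose(1, 2).reshape(b, c, h, w)                     # Phi^{-1}([...]^T)
```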

4 Multi-style Generative Network

4.1 Network Architecture

Prior feed-forward single-style transfer work learns a generator network that takes only the content image as input and outputs the transferred image, i.e. the generator network can be expressed as \(G(x_c)\), and it implicitly learns the feature statistics of the style image from the loss function. We introduce a Multi-style Generative Network which takes both the content and the style target as inputs, i.e. \(G(x_c, x_s)\). The proposed network explicitly matches the feature statistics of the style targets at run time.

As part of the generator network, we adopt a Siamese network sharing weights with the encoder part of the transformation network; it captures the feature statistics of the style image \(x_s\) at different scales and outputs the Gram Matrices \(\{\mathcal {G}(\mathcal {F}^i(x_s))\}\) \((i=1,\ldots ,K)\), where K is the total number of scales. The transformation network then takes the content image \(x_c\) and matches the feature statistics of the style image at multiple scales with CoMatch Layers.
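This two-input structure can be sketched as follows, combining a shared encoder, CoMatch Layers at K scales and a decoder. It is a simplified illustration under the assumption that the encoder returns a list of K feature maps; the class and method names (MSGNetSketch, set_style) are our own, and gram_matrix and CoMatch refer to the sketches above.

```python
import torch
import torch.nn as nn

class MSGNetSketch(nn.Module):
    def __init__(self, encoder: nn.Module, decoder: nn.Module, channels_per_scale):
        super().__init__()
        self.encoder = encoder    # shared by the content branch and the Siamese style branch
        self.decoder = decoder
        self.comatch = nn.ModuleList([CoMatch(c) for c in channels_per_scale])

    def set_style(self, x_s: torch.Tensor):
        # Siamese pass: capture {G(F^i(x_s))} at K scales and store them as CoMatch targets.
        for layer, feat in zip(self.comatch, self.encoder(x_s)):
            layer.set_target(gram_matrix(feat))

    def forward(self, x_c: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(x_c)
        matched = [layer(f) for layer, f in zip(self.comatch, feats)]
        return self.decoder(matched[-1])   # simplified: decode from the deepest matched scale
```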

Upsampled Convolution. Standard CNNs for image-to-image tasks typically adopt an encoder-decoder framework, because it is efficient to put heavy operations (style switching) on smaller feature maps and it is also important to keep a large receptive field for preserving semantic coherence. The decoder part learns a fractionally-strided convolution to recover detail from the downsampled feature maps. However, fractionally-strided convolution [22] typically introduces checkerboard artifacts [40]. Prior work suggests replacing the standard fractionally-strided convolution with upsampling followed by convolution [40]. However, this strategy decreases the receptive field, and it is inefficient to apply convolution on an upsampled area. Instead, we use an upsampled convolution, which has an integer stride and outputs upsampled feature maps. For an upsampling factor of 2, the upsampled convolution produces a 2 \(\times \) 2 output for each convolutional window, as visualized in Fig. 4. Compared to fractionally-strided convolution, this method has the same computational complexity and 4 times the parameters. This strategy successfully avoids upsampling artifacts in the network decoder.
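One way to realize such an upsampled convolution is to let an ordinary stride-1 convolution emit 4× the output channels (hence 4 times the parameters) and rearrange each 2 \(\times \) 2 group of outputs spatially, as in sub-pixel convolution. The sketch below illustrates this interpretation and is not necessarily the exact implementation.

```python
import torch
import torch.nn as nn

class UpsampledConv2d(nn.Module):
    """Integer-stride convolution that outputs a feature map upsampled by `scale`."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3, scale: int = 2):
        super().__init__()
        self.pad = nn.ReflectionPad2d(kernel_size // 2)        # reflectance padding (Sect. 4.1)
        self.conv = nn.Conv2d(in_ch, out_ch * scale * scale, kernel_size)
        self.shuffle = nn.PixelShuffle(scale)                  # 2x2 outputs per window -> space

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.shuffle(self.conv(self.pad(x)))            # B x out_ch x (scale*H) x (scale*W)
```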

Fig. 7. Content and style trade-off and interpolation.

Upsample Residual Block. Deep residual learning has achieved great success in visual recognition [23, 41]. The bottleneck residual architecture plays an important role by reducing computational complexity while preserving a large number of feature map channels, and thus without losing diversity. We extend the original architecture to an upsampling version, as shown in Fig. 5 (right), which uses a fractionally-strided convolution [22] as the shortcut and adopts reflectance padding to avoid artifacts in the generative process. This upsampling residual architecture allows us to pass identity all the way through the network, so that the network converges faster and can be extended deeper.
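A hedged sketch of the upsampling residual (bottleneck) block of Fig. 5 (right) is given below, using a 1 \(\times \) 1 fractionally-strided convolution as the shortcut and the UpsampledConv2d sketch above in the residual branch. The exact channel widths and normalization placement are assumptions.

```python
import torch
import torch.nn as nn

class UpBottleneck(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, scale: int = 2):
        super().__init__()
        mid = out_ch // 4
        # 1x1 fractionally-strided convolution carries the identity/shortcut path.
        self.shortcut = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=1,
                                           stride=scale, output_padding=scale - 1)
        self.residual = nn.Sequential(
            nn.Conv2d(in_ch, mid, kernel_size=1),
            nn.InstanceNorm2d(mid), nn.ReLU(inplace=True),
            UpsampledConv2d(mid, mid, kernel_size=3, scale=scale),
            nn.InstanceNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_ch, kernel_size=1),
            nn.InstanceNorm2d(out_ch),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.shortcut(x) + self.residual(x)   # identity passes through the whole block
```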

Brush Stroke Size Control. Feeding the generative model a high-resolution image usually results in unsatisfying style transfer outputs, as shown in Fig. 6(c). Controlling the brush stroke size can be achieved with an optimization-based approach [25]. Resizing the style image changes the brush size, so a feed-forward generative model that matches feature statistics at run time should naturally achieve brush-stroke size control. However, prior work is mainly limited by the 1D style embedding, because this finer style behavior cannot be captured by feature map mean and variance alone. With MSG-Net, the CoMatch Layer, which matches second-order statistics, elegantly solves brush-size control. During training, we feed the network with different style image sizes so that it learns from different brush stroke sizes. After training, the user can control the brush stroke size by changing the size of the style input image. Note that MSG-Net can accept different input sizes for the style and content images. Example results are shown in Fig. 8.
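At test time, brush-size control then reduces to resizing the style image before setting the style target, as in the hypothetical helper below; net and set_style follow the sketch earlier in this section and are assumptions.

```python
import torch.nn.functional as F

def stylize(net, x_c, x_s, style_size: int = 512):
    # Changing style_size changes the relative brush-stroke scale in the output.
    x_s = F.interpolate(x_s, size=(style_size, style_size),
                        mode='bilinear', align_corners=False)
    net.set_style(x_s)     # Siamese pass captures the Gram Matrices at this scale
    return net(x_c)        # the content image may have a different, arbitrary size
```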

Fig. 8. Brush-size control using MSG-Net. Top left: High-resolution input image and dense style. Others: Style transfer results using MSG-Net with brush-size control.

Other Details. We only use in-network downsampling (convolution) and upsampling (upsampled convolution) in the transformation network. We use reflectance padding to avoid artifacts at the border. Instance normalization [16] and ReLU are used after weight layers (convolution, fractionally-strided convolution and the CoMatch Layer), which improves the generated image quality and makes the network robust to image contrast changes.

4.2 Network Learning

Style transfer is an open problem, since there is no gold-standard ground truth to follow. We follow previous work in minimizing a weighted combination of the style and content differences between the generator network outputs and the targets for a given pre-trained loss network \(\mathcal {F}\) [13, 14]. Let the generator network be denoted by \(G(x_c,x_s)\), parameterized by weights \(W_G\). Learning proceeds by sampling content images \(x_c\sim X_c\) and style images \(x_s\sim X_s\) and then adjusting the parameters \(W_G\) of the generator \(G(x_c,x_s)\) to minimize the loss:

$$\begin{aligned} \hat{W}_G = \mathop {\text {argmin}}\limits _{W_G} E_{x_c,x_s}\Big \{ \,&\lambda _c\Vert \mathcal {F}^c\left( G(x_c,x_s)\right) -\mathcal {F}^c(x_c)\Vert _F^2 \\&+ \lambda _s\sum _{i=1}^K\Vert \mathcal {G}\left( \mathcal {F}^i(G(x_c,x_s))\right) -\mathcal {G}\left( \mathcal {F}^i(x_s)\right) \Vert ^2_F \\&+ \lambda _{TV}\,\ell _{TV}\left( G(x_c,x_s)\right) \Big \} , \end{aligned}$$
(4)

where \(\lambda _c\) and \(\lambda _s\) are the balancing weights for the content and style losses. We consider image content at scale c and image style at scales \(i\in \{1,\ldots ,K\}\). \(\ell _{TV}()\) is the total variation regularization used in prior work to encourage smoothness of the generated images [14, 42, 43].
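A minimal sketch of this objective, reusing the gram_matrix helper from Sect. 2, is shown below. The loss_net is assumed to return the K feature maps used for the style term, with the content scale picked out by an index; the total-variation form shown is one common choice and is illustrative.

```python
import torch

def total_variation(img: torch.Tensor) -> torch.Tensor:
    # One common TV regularizer: squared differences of neighboring pixels.
    return ((img[:, :, 1:, :] - img[:, :, :-1, :]) ** 2).sum() + \
           ((img[:, :, :, 1:] - img[:, :, :, :-1]) ** 2).sum()

def msg_loss(loss_net, y, x_c, x_s, lam_c=1.0, lam_s=5.0, lam_tv=1e-6, content_idx=1):
    f_y, f_c, f_s = loss_net(y), loss_net(x_c), loss_net(x_s)   # lists of K feature maps each
    content = ((f_y[content_idx] - f_c[content_idx]) ** 2).sum()            # content term
    style = sum(((gram_matrix(a) - gram_matrix(b)) ** 2).sum()              # style term, K scales
                for a, b in zip(f_y, f_s))
    return lam_c * content + lam_s * style + lam_tv * total_variation(y)
```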

Fig. 9. The tradeoff between style-flexibility and output-image quality is challenging for generative models. Our approach enables multi-style transfer and has minimal difference in quality compared to the optimization-based Gatys approach [12].

5 Experimental Results

5.1 Style Transfer

Baselines. We use the implementation of Gatys et al. [12] as a gold-standard baseline for style transfer (technical details will be included in the supplementary material). We also compare our approach with state-of-the-art multi-style or arbitrary style transfer methods, including a patch-based approach [20] and 1D style embeddings [17, 18]. The original authors' implementations are used in these experiments.

Table 1. Comparison of model size on disk and inference/test speed in fps (frames/sec) for images of size 256 \(\times \) 256 and 512 \(\times \) 512 on an NVIDIA Titan Xp GPU, averaged over 50 samples. MSG-Net-100 and MSG-Net-1K have 2.3M and 8.9M parameters, respectively.

Method Details. We adopt the 16-layer VGG network [44] pre-trained on ImageNet as the loss network in Eq. 4, because network features learned from a diverse set of images are likely to be generic and informative. We consider the style representation at 4 different scales using the layers ReLU1_2, ReLU2_2, ReLU3_3 and ReLU4_3, and use the content representation at the layer ReLU2_2. The Microsoft COCO dataset [45], which contains around 80,000 natural images, is used as the content image set \(X_c\). We collect 100 style images, chosen from previous work in style transfer. A further 900 real paintings are selected from the open-source artistic dataset wikiart.org [46] as additional style images for training MSG-Net-1K. We follow prior work [13, 14] and adopt Adam [47] to train the network with a learning rate of \(1\times 10^{-3}\). We use the loss function described in Eq. 4 with balancing weights \(\lambda _c=1,\lambda _s=5,\lambda _{TV}=1\times 10^{-6}\) for the content, style and total variation terms. We resize the content images \(x_c\sim X_c\) to \(256\times 256\) and train the network with a batch size of 4 for 80,000 iterations. We cycle the style image size through {256, 512, 768} at every iteration for runtime brush-size control. After training, MSG-Net as a fully convolutional network [22] can accept arbitrary input image sizes. When comparing style transfer approaches, we use the same content image size, resizing the image to 512 along the long side. Our implementations are based on Torch [48], PyTorch [49] and MXNet [50]. Training the MSG-Net-100 model takes roughly 8 h on a Titan Xp GPU.
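The training procedure described above can be summarized by the hedged loop below; the data loader, net, loss_net and msg_loss follow the earlier sketches and are assumptions rather than the released training script.

```python
import itertools
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
style_sizes = itertools.cycle([256, 512, 768])     # cycled every iteration for brush-size control

for it, (x_c, x_s) in enumerate(loader):           # COCO content batch + a sampled style image
    size = next(style_sizes)
    x_s = F.interpolate(x_s, size=(size, size), mode='bilinear', align_corners=False)
    net.set_style(x_s)
    loss = msg_loss(loss_net, net(x_c), x_c, x_s)  # lambda_c=1, lambda_s=5, lambda_TV=1e-6
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if it + 1 >= 80000:                            # 80K iterations with batch size 4
        break
```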

Model Size and Speed Analysis. For mobile applications and cloud services, model size and test speed are crucial. We compare the model size and inference/test speed of style transfer approaches in Table 1. Our proposed MSG-Net-100 has a model size and speed comparable to single-style networks [13, 14]. MSG-Net is faster than the arbitrary style transfer work [18] because it uses a compact learned encoder instead of a pre-trained VGG network.

Fig. 10. Color control using MSG-Net. Left: content and style images. Right: color-preserving transfer result.

Fig. 11. Spatial control using MSG-Net. Left: input image, middle: foreground and background styles, right: style transfer result. (Input image and segmentation mask from Shen et al. [51, 52].)

Qualitative Comparison. Our proposed MSG-Net achieves superior performance compared to state-of-the-art generative network approaches, as shown in Fig. 9. One may argue that the arbitrary style work has better scalability/capacity [18, 20]. Style flexibility and image quality are always a hard trade-off for generative models, and we particularly focus on image quality in this work. More examples of images transferred using MSG-Net are shown in Fig. 12.

Fig. 12. Diverse images generated using a single MSG-Net-100 (2.3M parameters). The first row shows the input content images and the other rows are generated images with different style targets (first column).

Model Scalability. Prior work using 1D style embeddings has achieved success in scaling style transfer towards the goal of arbitrary style transfer [18]. To test the scalability of MSG-Net, we augment the style set to 1K images by adding 900 extra images from wikiart.org [46]. We also build a larger-capacity model, MSG-Net-1K, by increasing the width/channels of the model at the middle stage (64 \(\times \) 64) by a factor of 2, resulting in 8.9M parameters. We increase the training iterations by a factor of 4 (to 320K) and otherwise follow the same training procedure as for MSG-Net-100. We observe no quality degradation when increasing the number of styles (examples are shown in the supplementary material).

5.2 Runtime Manipulation

MSG-Net, as a general approach for real-time style transfer, is compatible with recent progress in both feed-forward and optimization methods, including but not limited to: content-style trade-off and interpolation (Fig. 7), color-preserving transfer (Fig. 10), spatial manipulation (Fig. 11) and brush stroke size control (Figs. 6 and 8). For style interpolation, we use an affine interpolation of our style embedding, following prior work [17, 18]. For color preservation, we match the color of the style image to the content image as in Gatys et al. [24]. Brush-size control has been discussed in Sect. 4.1. We use the segmentation mask provided by Shen et al. [51] for spatial control. The source code and technical details of runtime manipulation will be included in our PyTorch implementation.
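Since the style embedding here is the set of Gram Matrices, style interpolation amounts to setting an affine combination of two styles' Gram Matrices as the CoMatch targets. The snippet below is illustrative only and assumes the MSGNetSketch interface and gram_matrix helper sketched in earlier sections.

```python
def interpolate_styles(net, x_s1, x_s2, alpha: float = 0.5):
    # Affine interpolation of the 2D style embedding (Gram Matrices) at every scale.
    g1 = [gram_matrix(f) for f in net.encoder(x_s1)]
    g2 = [gram_matrix(f) for f in net.encoder(x_s2)]
    for layer, a, b in zip(net.comatch, g1, g2):
        layer.set_target(alpha * a + (1 - alpha) * b)
```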

6 Conclusion and Discussion

To improve the quality and flexibility of generative models in style transfer, we introduce a novel CoMatch Layer that learns to match second-order statistics as the image style representation. The Multi-style Generative Network achieves superior image quality compared to state-of-the-art approaches. In addition, the proposed MSG-Net is compatible with most existing techniques and recent progress in style transfer, including style interpolation, color preservation and spatial control. Moreover, MSG-Net is the first to enable real-time brush-size control in a fully feed-forward manner. The compact MSG-Net-100 model has only 2.3M parameters and runs at more than 90 fps (frames/sec) on an NVIDIA Titan Xp for input images of size 256 \(\times \) 256, and at 15 fps on a laptop GPU (GTX 750M-2GB).