
1 Introduction

Single image super-resolution (SISR) is the task of restoring the original high-resolution (HR) image from a single low-resolution (LR) counterpart. Successful super-resolution (SR) is of great value in that it can be effectively utilized for diverse applications such as surveillance imaging, medical imaging, and ultra-high-definition content generation. However, SISR remains a challenging problem despite decades of extensive research because of its inherent ill-posedness, i.e., for a given LR image, there exist numerous HR images that can be downsampled to the same LR image.

Most existing SISR approaches try to minimize pixel-wise mean squared errors (MSEs) between the super-resolved image and the target image. Minimizing pixel-wise errors inherently maximizes the peak signal-to-noise ratio (PSNR), which is commonly used to compare different methods. However, it is well known that measuring pixel-wise differences can hardly capture perceptual differences between images [17, 48, 49], so a higher PSNR does not necessarily lead to a perceptually better image. Rather, minimizing pixel-wise errors favors blurry results without high-frequency details, as the minimization regresses to an average of possible solutions.

Fig. 1. Our SR results. The final result from our network trained with GAN (right) is much more perceptually realistic than the result obtained by our network trained with MSE only (left).

Recently, Goodfellow et al. [14] introduced a novel framework called the generative adversarial network (GAN), which consists of two neural networks competing with each other: a generator and a discriminator. The generator tries to fool the discriminator by generating realistic images, while the discriminator tries to distinguish generated fake images from real ones. Joint training of these two networks leads to a generator that is able to produce remarkably realistic fake images. Thanks to its effectiveness in image generation, GAN has been widely applied to various tasks such as image synthesis, style transfer, image inpainting, and object detection [19, 20, 23, 25, 28, 30, 37, 55].

Recently, GAN has also been applied to SISR to overcome the aforementioned limitation and produce super-resolved images with synthesized high-frequency details. Ledig et al. proposed SRGAN [27], which employs an adversarial loss term together with a data term to obtain visually pleasing results rather than maximizing PSNR. Sajjadi et al. proposed EnhanceNet [40], which is also based on GAN. EnhanceNet additionally adopts a texture matching loss inspired by Gatys et al. [13] to encourage super-resolved results to have the same textures as the ground truth HR images.

While GAN-based SISR methods show dramatic improvements over previous approaches in terms of perceptual quality, they often tend to produce meaningless high-frequency noise in super-resolved images. We argue that this is because the most dominant difference between super-resolved images and real HR images is high-frequency information, as super-resolved images obtained by minimizing pixel-wise errors lack high-frequency details. The simplest way for a discriminator to distinguish super-resolved images from real HR images is thus simply to inspect the presence of high-frequency components in a given image, and the simplest way for a generator to fool the discriminator is to put arbitrary high-frequency noise into the result images.

In this paper, we propose a novel GAN-based SISR method that can produce perceptually pleasing images (Fig. 1). To overcome the limitation of previous GAN-based SISR approaches and produce more realistic results, our method adopts two discriminators, unlike previous approaches: an image discriminator and a feature discriminator. The image discriminator takes an image in the pixel domain as input, as done in previous approaches. The feature discriminator, on the other hand, feeds an image into a VGG network and extracts an intermediate feature map. It then tries to distinguish super-resolved images from real HR images based on the extracted feature map. As the feature map encodes structural information, the feature discriminator separates super-resolved images from real HR images based not simply on high-frequency components but on structural components. Eventually, our generator is trained to synthesize realistic structural features rather than arbitrary high-frequency noise.

To achieve high-quality SR, we also propose a novel generator network with long-range skip connections. Skip connections were first introduced in [18] to enable efficient propagation of information between neural network layers, and have been shown to be effective in training very deep networks. We further extend the idea of skip connections and introduce long-range skip connections to our generator network so that information in distant layers can be propagated more effectively. Our novel network architecture enables our generator to achieve state-of-the-art PSNRs when it is trained alone without discriminators, as well as perceptually pleasing results when trained with our discriminators.

Our contributions can be summarized as follows.

  • We propose a new SISR framework that employs two different discriminators: an image discriminator working in the image domain, and a feature discriminator in the feature domain. Thanks to our feature discriminator, our generator network can produce perceptually realistic SR results. To our knowledge, this is the first attempt to apply GAN to the feature domain for SISR.

  • We propose a novel generator with long-range skip connections for SISR. Our generator achieves state-of-the-art performance in PSNR when compared to existing methods with a similar number of parameters.

2 Related Work

SISR has been intensively studied in computer vision and image processing. Early methods are based on simple interpolation such as bicubic and Lanczos interpolation [11]. While interpolation-based methods are efficient, they cannot restore fine textures and structures, producing over-smoothed images. To overcome this limitation and to enhance edges, edge-preserving interpolation [3, 29] and edge prior-based approaches [4, 8, 43] were proposed. However, because of the complexity of natural images, modeling global priors is not sufficient to handle the fine structures of various natural images.

To more effectively restore high-frequency details, a number of methods utilizing external information have been proposed. Freeman et al. [12] proposed to collect LR and HR patch pairs from a set of training images, and to directly replace patches in an input LR image with collected HR patches. To further improve the quality, several other approaches along this line have been proposed, such as neighborhood embedding [7, 36, 45, 46], sparse coding [16, 51, 52, 54], and local mapping function regression [15, 38, 50]. All these approaches collect pairs of LR and HR patches from a set of training images, and learn a mapping function between LR and HR patches in a low-dimensional space. While these approaches show substantial quality improvement, their quality is still limited by the restricted capacity of their mapping models between LR and HR images.

Recent advances in deep learning have enabled learning a more powerful mapping function from an LR image to an HR image. Dong et al. [9, 10] trained a shallow convolutional neural network (CNN) with three layers using pairs of LR and HR image patches, and showed comparable performance to contemporary state-of-the-art methods. To further improve accuracy as well as speed and memory efficiency, a number of CNN models have been proposed since then [6, 24, 31, 41, 44, 47]. Specifically, Kim et al. [24] proposed very deep neural networks with one long skip connection and showed that deeper networks can achieve better accuracy. Shi et al. [41] proposed a sub-pixel convolution layer that aggregates feature maps from the LR space to the HR space. Their sub-pixel convolution layer makes it possible to directly feed an LR image into a network, instead of a bicubically upsampled LR image, reducing memory usage and processing time. Thanks to the modeling power of CNNs, these methods have achieved high performance in terms of PSNR. However, they are still unable to restore high-frequency information because they rely on minimizing MSE losses, which results in blurry images as the minimization regresses to an average of solutions.

Recently, a few methods have been proposed to overcome the limitation of MSE losses and to produce perceptually more pleasing results. Johnson et al. [22] proposed a perceptual loss inspired by the content loss of [13]. A perceptual loss measures the difference between feature maps of two images extracted from image recognition networks such as VGG networks [42]. They showed that minimizing a perceptual loss results in lower PSNRs but perceptually more pleasing results. However, their method is not able to restore high-frequency details completely lost in input images. GANs have also been recently employed for SISR [27, 40] to synthesize perceptually pleasing high-frequency details in super-resolved images. Ledig et al. [27] introduced an adversarial loss in addition to a perceptual loss. Sajjadi et al. [40] extended Ledig et al.'s work by introducing a texture matching loss inspired by the style loss in [13] in order to encourage super-resolved images to have the same texture styles as the ground truth HR images. While these methods are not able to restore high-frequency details completely lost in input images, they instead synthesize high-frequency details so that the results look perceptually pleasing. However, they tend to produce arbitrary high-frequency artifacts as discussed in Sect. 1. In addition, these GAN-based SR methods adopt a perceptual loss that minimizes the MSE of VGG features. Similarly to MSE on pixels, simply minimizing the MSE of VGG features is not enough to fully capture the actual characteristics of feature maps. To remedy this, we adopt a feature discriminator to better regress to the real distribution of features and to generate perceptually more pleasing high-frequency details.

3 Super-Resolution with Feature Discrimination

Our goal is to generate an HR image \(I^g\) from a given LR image \(I^l\) that is as similar to the original HR image \(I^h\) as possible and, at the same time, perceptually pleasing. The LR image \(I^l\) of size \(W'\times H'\times C\) can be obtained by applying various downsampling operations to the HR image \(I^h\) of size \(W\times H\times C\), where \(W=sW'\), \(H=sH'\), and s is the scaling factor. In this paper, we assume only bicubic downsampling without loss of generality, i.e., we assume that \(I^l\) is obtained by downsampling \(I^h\) with bicubic interpolation.

To recover \(I^h\) from \(I^l\), we design a new deep CNN (DCNN)-based generator utilizing multiple long-range skip connections. The network generates an HR image \(I^g\) from \(I^l\), where \(I^g\) has the same dimensions as \(I^h\). The network is first trained to reduce the pixel-wise difference between \(I^g\) and \(I^h\). A pixel-wise loss reproduces \(I^h\) well in terms of PSNR, but generally results in a blurry and visually unsatisfactory image \(I^g\).

To improve the visual quality of \(I^g\), we employ a perceptual loss and propose additional GAN-based loss functions. These losses enable the network to generate a visually more realistic image by approximating the distributions of natural HR images and their feature maps.

In the following subsections, we first describe the architecture of our generator. Then, we explain training loss functions in detail.

3.1 Architecture

We design a DCNN generator as illustrated in Fig. 2. The network consists of residual blocks [18] and multiple long-range skip connections. Specifically, the network takes \(I^l\) as input and first applies a \(9\times 9\) convolution layer to extract low-level features. The network then employs multiple residual blocks, similarly to previous works [27, 40], to learn higher-level features with more nonlinearities and larger receptive fields. Residual blocks have been successfully applied in various recent architectures [18, 32, 35], as they have been shown to enable an efficient training process. Each block has a short-range skip connection as an identity mapping that preserves the signal from the previous layer and lets the network learn only the residual, while allowing back-propagation of gradients through the skip-connection path. Inspired by SRResNet [27], our residual block consists of multiple successive layers: \(3\times 3\) convolution, batch normalization, leaky ReLU [33], \(3\times 3\) convolution, and batch normalization layers. We use 16 residual blocks in our experiments to extract deep features. All the residual blocks are applied to features at the LR spatial resolution for efficient memory usage and fast inference. All the convolution layers in our generator network except the sub-pixel convolution layers have the same number of filters. In our experiments, we tried 64 and 128 filters for each convolution layer to analyze the performance of different network configurations.
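The following is a minimal PyTorch sketch of the residual block just described (\(3\times 3\) convolution, batch normalization, leaky ReLU, \(3\times 3\) convolution, batch normalization, plus an identity skip). The paper does not specify a framework or the leaky ReLU slope, so both are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block: conv-BN-LeakyReLU-conv-BN with a short-range identity skip."""
    def __init__(self, channels=64, negative_slope=0.2):  # slope value is an assumption
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(negative_slope, inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # Short-range skip connection: the block learns only the residual.
        return x + self.body(x)
```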

We utilize additional long-range skip connections to aggregate features from different residual blocks. Specifically, we connect the output of each residual block to the end of the residual blocks through one \(1\times 1\) convolution layer. The purpose of the long-range skip connections is to further facilitate back-propagation of gradients and to allow intermediate features to be reused to improve the final feature. As the outputs of different residual blocks correspond to different levels of abstraction of image features, we apply a \(1\times 1\) convolution to each long-range skip connection to adjust the outputs and balance them. The effect of this \(1\times 1\) convolution will be discussed in Sect. 4.3.
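A sketch of how this aggregation could be realized: each residual block's output is passed through its own \(1\times 1\) convolution and summed into the final feature. Summation of all projected outputs is our reading of the description, not a verbatim reproduction of the authors' implementation.

```python
import torch.nn as nn

class ResidualTrunk(nn.Module):
    """Residual blocks whose outputs are aggregated via long-range skip connections."""
    def __init__(self, block_factory, num_blocks=16, channels=64):
        super().__init__()
        self.blocks = nn.ModuleList([block_factory() for _ in range(num_blocks)])
        # One 1x1 convolution per long-range skip connection to balance the features.
        self.skip_convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=1) for _ in range(num_blocks)]
        )

    def forward(self, x):
        aggregated = 0
        for block, skip_conv in zip(self.blocks, self.skip_convs):
            x = block(x)
            aggregated = aggregated + skip_conv(x)  # long-range skip to the end of the trunk
        return aggregated
```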

To upsample the feature map obtained by the residual blocks to the target resolution, we use sub-pixel convolution layers (also known as pixel shuffler layers) proposed in [41]. Specifically, a sub-pixel convolution layer consists of two sub-modules: one convolution layer with \(s'^2N_c\) filters where \(N_c\) is the number of input channels, and a shuffling layer that rearranges data from channels into different spatial locations. A sub-pixel convolution layer enlarges an input feature map by the scale factor \(s'\) in each spatial dimension. In our experiments, we consider only \(4\times \) upsampling, so we use two sub-pixel convolution layers with \(s'=2\) in a row. Finally, the upsampled feature map goes into a \(3\times 3\) convolution layer with three filters to obtain a 3-channel color image.
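Below is a small sketch of the upsampling tail under the configuration described above: two sub-pixel convolution stages with \(s'=2\), each realized as a convolution producing \(s'^2N_c\) channels followed by a pixel shuffler, and a final \(3\times 3\) convolution with three filters. The 64-channel width and the input size follow our experimental settings; the code is illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

def sub_pixel_upsampler(channels=64, scale=2):
    # Convolution with scale^2 * N_c filters, then a shuffle that moves
    # channel data into a (scale x) larger spatial grid.
    return nn.Sequential(
        nn.Conv2d(channels, channels * scale * scale, kernel_size=3, padding=1),
        nn.PixelShuffle(scale),  # (B, s^2*C, H, W) -> (B, C, s*H, s*W)
    )

tail = nn.Sequential(
    sub_pixel_upsampler(64, 2),
    sub_pixel_upsampler(64, 2),                  # two 2x stages give 4x upscaling
    nn.Conv2d(64, 3, kernel_size=3, padding=1),  # 3-channel color output
)

lr_features = torch.randn(1, 64, 74, 74)  # features at the LR resolution
sr_image = tail(lr_features)              # shape: (1, 3, 296, 296)
```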

Fig. 2. Architecture of our generator network with short and long-range skip connections. We use 16 residual blocks for our experiments.

3.2 Pre-training of the Generator Network

We train our generator network in two steps: pre-training and adversarial training. In the pre-training step, we train the network by minimizing an MSE loss defined as:

$$\begin{aligned} L_{MSE}=\frac{1}{WHC}\sum _{i}^{W}\sum _{j}^{H}\sum _{k}^{C}(I_{i,j,k}^h-I_{i,j,k}^g)^2 . \end{aligned}$$
(1)

The resulting network obtained from the pre-training step is already able to achieve high PSNRs. However, it cannot produce perceptually pleasing results with desirable high-frequency information.
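Since the \(1/(WHC)\) normalization in Eq. (1) is simply a mean over all pixels and channels, the pre-training loss reduces to a mean-reduced MSE; a one-line sketch:

```python
import torch.nn.functional as F

def pretraining_loss(sr_img, hr_img):
    # Eq. (1): mean squared error averaged over width, height, and channels.
    return F.mse_loss(sr_img, hr_img, reduction='mean')
```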

3.3 Adversarial Training with a Feature Discriminator

To improve perceptual quality, we employ the GAN framework [14]. The GAN framework solves a minimax problem defined as:

$$\begin{aligned} \min _{g}\max _{d}\left( \mathbb {E}_{{\varvec{y}}\sim p_{data}({\varvec{y}})}\left[ \log \left( d\left( {\varvec{y}}\right) \right) \right] +\mathbb {E}_{{\varvec{x}}\sim p_{{\varvec{x}}}({\varvec{x}})}\left[ \log \left( 1-d\left( g\left( {\varvec{x}}\right) \right) \right) \right] \right) , \end{aligned}$$
(2)

where \(g({{\varvec{x}}})\) is the output of a generator network for \({{\varvec{x}}}\), and d is a discriminator network. \({\varvec{y}}\) is a sample from a real data distribution and \({{\varvec{x}}}\) is random noise.

Fig. 3. Architecture of our discriminator network. The number above a convolution layer represents the number of filters, while s2 below represents the stride of 2.

While the conventional GAN framework consists of a single generator and a single discriminator, we use two discriminators: an image discriminator \(d^i\) and a feature discriminator \(d^f\). Our image discriminator \(d^i\) discriminates real HR images from fake SR images by inspecting their pixel values. On the other hand, our feature discriminator \(d^f\) discriminates real HR images from fake SR images by inspecting their feature maps, so that the generator can be trained to synthesize more meaningful high-frequency details.

To train our pre-trained generator network with discriminators, we minimize a loss function defined as:

$$\begin{aligned} L_g = L_p + \lambda \left( L_a^i + L_a^f\right) , \end{aligned}$$
(3)

where \(L_p\) is a perceptual similarity loss that enforces SR results to look similar to the ground truth HR images in the training set. \(L_a^i\) is an image GAN loss for the generator to synthesize high-frequency details in the pixel domain. \(L_a^f\) is a feature GAN loss for the generator to synthesize structural details in the feature domain. \(\lambda \) is a weight for the GAN loss terms. While \(L_g\) looks similar to the loss functions of previous methods, it has an additional feature GAN loss term \(L_a^f\) that makes a significant difference in terms of perceptual quality, as shown in our experiments. To train the discriminators \(d^i\) and \(d^f\), we minimize loss functions \(L_d^i\) and \(L_d^f\), which correspond to \(L_a^i\) and \(L_a^f\), respectively. The generator and discriminators are trained by alternately minimizing \(L_g\), \(L_d^i\), and \(L_d^f\). In the following, we describe each of the loss terms in more detail.
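The following sketch shows how the generator loss in Eq. (3) could be assembled and how the generator and the two discriminators are updated in alternation. All networks, loss helpers, and optimizers are passed in as arguments; the helpers correspond to the terms defined in Eqs. (4)–(8) below, and \(\lambda =10^{-3}\) follows Sect. 4.2. This is a schematic, not the authors' training code.

```python
def train_step(generator, d_image, d_feature, vgg_features,
               perceptual_loss, gan_g_loss, gan_d_loss,
               opt_g, opt_di, opt_df, lr_img, hr_img, lam=1e-3):
    """One alternating update of the generator and both discriminators."""
    # Generator update: L_g = L_p + lambda * (L_a^i + L_a^f), Eq. (3).
    sr_img = generator(lr_img)
    L_g = (perceptual_loss(sr_img, hr_img)
           + lam * (gan_g_loss(d_image(sr_img))
                    + gan_g_loss(d_feature(vgg_features(sr_img)))))
    opt_g.zero_grad(); L_g.backward(); opt_g.step()

    # Discriminator updates on real vs. generated samples, Eqs. (6) and (8).
    sr_img = sr_img.detach()
    L_d_i = gan_d_loss(d_image(hr_img), d_image(sr_img))
    opt_di.zero_grad(); L_d_i.backward(); opt_di.step()

    L_d_f = gan_d_loss(d_feature(vgg_features(hr_img)),
                       d_feature(vgg_features(sr_img)))
    opt_df.zero_grad(); L_d_f.backward(); opt_df.step()
```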

Perceptual Similarity Loss \({\varvec{L}}_{\varvec{p}}\). The perceptual similarity loss measures the difference between two images in the feature domain instead of the pixel domain, so that minimizing it leads to perceptually consistent results [22]. The perceptual similarity loss \(L_p\) between \(I^h\) and \(I^g\) is defined in the following manner. First, \(I^h\) and \(I^g\) are fed into a pre-trained recognition network such as a VGG network. Then, the feature maps of the two images at the m-th layer are extracted. The MSE between the extracted feature maps is then defined as the perceptual similarity loss. Mathematically, \(L_p\) is defined as:

$$\begin{aligned} L_p=\frac{1}{W_mH_mC_m}\sum _{i}^{W_m}\sum _{j}^{H_m}\sum _{k}^{C_m}\left( \phi _{i,j,k}^m(I^h)-\phi _{i,j,k}^m(I^g)\right) ^2, \end{aligned}$$
(4)

where \(W_m, H_m,\) and \(C_m\) denote the dimensions of the m-th feature map \(\phi ^m\). In our experiments, we use VGG-19 [42] as the recognition network. Here, \(\phi ^m\) represents the output of the ReLU layer following the last convolution layer before the m-th pooling layer.
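A sketch of Eq. (4) using torchvision's pre-trained VGG-19: both images are passed through the network truncated at the last ReLU before the fifth pooling layer (our assumption for the conv5-level features used in Sect. 4.2), and the MSE of the resulting feature maps is returned. The 1/12.75 feature scaling from Sect. 4.2 is included; input normalization to VGG's expected range is omitted for brevity.

```python
import torch.nn.functional as F
from torchvision.models import vgg19

# VGG-19 truncated after the ReLU of conv5_4 (the last ReLU before pool5).
vgg_features = vgg19(weights="IMAGENET1K_V1").features[:36].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def perceptual_loss(sr_img, hr_img):
    # Eq. (4): MSE between feature maps, with the 1/12.75 scaling of Sect. 4.2.
    return F.mse_loss(vgg_features(sr_img) / 12.75,
                      vgg_features(hr_img) / 12.75)
```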

Image GAN Losses \({\varvec{L}}_{\varvec{a}}^{\varvec{i}}\) and \({\varvec{L}}_{\varvec{d}}^{\varvec{i}}\). The image GAN loss term \(L_a^i\) for the generator and the loss function \(L_d^i\) for the image discriminator are defined as:

$$\begin{aligned} L_a^i=-\log \left( d^i\left( I^g\right) \right) ,~~~~~~~\text {and} \end{aligned}$$
(5)
$$\begin{aligned} L_d^i=-\log \left( d^i\left( I^h\right) \right) -\log \left( 1-d^i\left( I^g\right) \right) , \end{aligned}$$
(6)

where \(d^i(I)\) is the output of the image discriminator \(d^i\), i.e., the probability that the image I is an image sampled from the distribution of natural HR images. Note that we minimize \(-\log (d^i(I^g))\) instead of \(\log (1-d^i(I^g))\) for stable optimization [14]. For the image discriminator \(d^i\), we use the same discriminator network used in [27] following the guidelines proposed by [37] (Fig. 3).
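A minimal sketch of the image GAN losses in Eqs. (5) and (6), using the non-saturating \(-\log (d^i(I^g))\) form for the generator. The discriminator is assumed to output a probability in (0, 1); a small epsilon guards the logarithms.

```python
import torch

def gan_g_loss(d_fake, eps=1e-8):
    # Eq. (5): L_a = -log(d(generated))
    return -torch.log(d_fake + eps).mean()

def gan_d_loss(d_real, d_fake, eps=1e-8):
    # Eq. (6): L_d = -log(d(real)) - log(1 - d(generated))
    return (-torch.log(d_real + eps) - torch.log(1.0 - d_fake + eps)).mean()
```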

Feature GAN Losses \({\varvec{L}}_{\varvec{a}}^{\varvec{f}}\) and \({\varvec{L}}_{\varvec{d}}^{\varvec{f}}\). The feature GAN loss term \(L_a^f\) for the generator and the loss function \(L_d^f\) for the feature discriminator are defined as:

$$\begin{aligned} L_a^f=-\log \left( d^f\left( \phi ^m\left( I^g\right) \right) \right) ,~~~~~\text {and} \end{aligned}$$
(7)
$$\begin{aligned} L_d^f=-\log \left( d^f\left( \phi ^m\left( I^h\right) \right) \right) -\log \left( 1-d^f\left( \phi ^m\left( I^g\right) \right) \right) , \end{aligned}$$
(8)

where \(d^f(\phi ^m)\) is the output of the feature discriminator \(d^f\), i.e., the probability that the feature map \(\phi ^m\) is sampled from the distribution of natural HR image feature maps. As features correspond to abstracted image structures, we can encourage the generator to produce realistic structural high-frequency details rather than noisy artifacts. Both the perceptual similarity loss and the feature GAN losses are based on feature maps. However, in contrast to the perceptual similarity loss that promotes perceptual consistency between \(I^g\) and \(I^h\), the feature GAN losses \(L_a^f\) and \(L_d^f\) enable synthesis of perceptually valid image details. We use the network architecture in Fig. 3 for the feature discriminator \(d^f\) in our experiments. We also tried variations of the network architecture, but observed no significant performance difference between them, while all the variations showed a similar tendency of improvement. We refer the reader to our supplementary material for the results with other variations.
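The feature GAN losses in Eqs. (7) and (8) reuse the same non-saturating losses, but applied to a discriminator that sees VGG feature maps rather than pixels. The sketch below reuses the gan_g_loss / gan_d_loss helpers and the vgg_features extractor defined above; d_feature stands for a discriminator with an architecture like Fig. 3 adapted to feature-map inputs.

```python
def feature_gan_g_loss(d_feature, sr_img):
    # Eq. (7): L_a^f = -log(d^f(phi^m(I^g)))
    return gan_g_loss(d_feature(vgg_features(sr_img)))

def feature_gan_d_loss(d_feature, hr_img, sr_img):
    # Eq. (8): L_d^f = -log(d^f(phi^m(I^h))) - log(1 - d^f(phi^m(I^g)))
    return gan_d_loss(d_feature(vgg_features(hr_img)),
                      d_feature(vgg_features(sr_img.detach())))
```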

4 Experiments

In this section, we first present details about our dataset and training process. We then analyze the performance of a pre-trained generator network and a fully trained version with the feature discriminator.

4.1 Dataset

We used the ImageNet dataset [39] for pre-training the generator, as done in [27]. The dataset contains millions of images in 1000 categories. We randomly sampled about 120 thousand images whose width and height are larger than 400 pixels, and took center-cropped versions of the sampled images for pre-training. For evaluation, we use three widely used datasets: Set5 [5], Set14 [53], and the 100 test images of BSD300 [34].

To train our final GAN-based model, we used the DIV2K dataset [2], which consists of 800 HR training images and 100 HR validation images. In our experiments, we observed that training our GAN-based model with the DIV2K dataset is faster and more stable than with ImageNet. We conjecture that this is partly because DIV2K images are in the lossless PNG format while ImageNet images are in the lossy JPEG format. To expand the volume of training data, we applied data augmentation to the DIV2K images. Specifically, we applied random flipping, rotation, and cropping to the images to make target HR images. We additionally sampled a small number of training images and included their versions downscaled by 1/2 and 1/4 for data augmentation, in order to train the network to deal with contents of different scales.

4.2 Training Details

Here we explain the training details of our experiments. We obtained the target HR images by cropping the HR images into \(296\times 296\) sub-images. We downsampled the images using bicubic interpolation to obtain the \(74\times 74\) low-resolution input training images. We normalized the intensity ranges of \(I^h\) and \(I^l\) to \(\left[ -1, 1\right] \). We set the weight \(\lambda \) in Eq. (3) to \(10^{-3}\). Regarding \(\phi ^m\) in Eqs. (4), (7) and (8), we used the conv5 layer of VGG-19 in our experiments, as we found that conv5 generally produces better results than other layers. To balance different loss terms, we scaled the feature map \(\phi ^m\) by a scaling factor of 1/12.75 before computing the loss terms.
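A sketch of the per-image preparation just described: crop a \(296\times 296\) HR patch, bicubically downsample it by \(4\times \) to the \(74\times 74\) LR input, and normalize both to \([-1, 1]\). PIL's bicubic filter stands in for the exact downsampling implementation, and a center crop is used here instead of the random crops applied during training.

```python
import numpy as np
from PIL import Image

def make_training_pair(img: Image.Image, hr_size=296, scale=4):
    # Center crop for simplicity; training uses random crops (and flips/rotations).
    left = (img.width - hr_size) // 2
    top = (img.height - hr_size) // 2
    hr = img.crop((left, top, left + hr_size, top + hr_size))
    lr = hr.resize((hr_size // scale, hr_size // scale), Image.BICUBIC)

    to_array = lambda im: np.asarray(im, dtype=np.float32) / 127.5 - 1.0  # [-1, 1]
    return to_array(lr), to_array(hr)
```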

For both pre-training and adversarial training, we used the Adam optimizer [26] with the momentum parameter \(\beta _1=0.9\). For pre-training, we performed about 280 thousand iterations, which is roughly 20 epochs for our randomly sampled ImageNet dataset. We set the initial learning rate for pre-training to \(10^{-4}\) and decreased it by a factor of 10 whenever the training loss stopped decreasing. After the learning rate reached \(10^{-6}\), we kept it fixed. We performed adversarial training for about five epochs, which is roughly 100,000 iterations. We used a learning rate of \(10^{-4}\) for the first two epochs, \(10^{-5}\) for the next two epochs, and \(10^{-6}\) for the final epoch of adversarial training. We fixed the parameters of the batch-normalization layers during the test phase. All the models were trained on an NVIDIA Titan Xp with 12 GB memory.

Table 1. Quantitative comparison of SISR methods for \(\times 4\) upscaling; A+ [46], SRCNN [10], VDSR [24], Enhance [40], SRDense [47], SRRes [27]. Our network (SRFeat\(_\text {M}\)) obtains the best accuracy in terms of PSNR and SSIM. With a similar number of parameters, our network with 64 feature channels (SRFeat\(_\text {M}\)-64) shows better accuracy than SRResNet.

4.3 Evaluation of the Pre-trained Generator

As our pre-trained network is trained using only an MSE loss, it is supposed to maximize PSNR. To evaluate the performance of the pre-trained network, we measure PSNR and SSIM [48] on the Y channel and compare them with those of other state-of-the-art methods. For fair comparison, we excluded four pixels from the image boundaries, as most existing SISR methods are not able to restore image boundaries properly. For our network, we tested two different configurations, one with 128 channels and the other with 64 channels. We denote them as SRFeat\(_\text {M}\) and SRFeat\(_\text {M}\)-64, respectively. SRFeat\(_\text {M}\)-64 has a similar number of parameters to SRResNet [27]. Specifically, the difference between the model sizes of SRFeat\(_\text {M}\)-64 and SRResNet is less than 0.06 MB. Table 1 shows that SRFeat\(_\text {M}\) achieves state-of-the-art accuracy and outperforms all the other methods. SRFeat\(_\text {M}\)-64 also achieves higher PSNR and SSIM than SRResNet [27], while the two have similar numbers of parameters.
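A sketch of the evaluation protocol above: convert both images to the luma (Y) channel, crop four pixels from every border, and compute PSNR. The BT.601 luma conversion shown here is a common choice for SR benchmarks; whether it matches the authors' exact conversion is an assumption.

```python
import numpy as np

def psnr_y(sr, hr, border=4):
    # sr, hr: uint8 RGB arrays of shape (H, W, 3)
    def to_y(img):
        img = img.astype(np.float64)
        return 16.0 + (65.481 * img[..., 0] + 128.553 * img[..., 1]
                       + 24.966 * img[..., 2]) / 255.0
    y_sr = to_y(sr)[border:-border, border:-border]
    y_hr = to_y(hr)[border:-border, border:-border]
    mse = np.mean((y_sr - y_hr) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```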

In Table 2, we compare variations of our architecture to see the effect of each component. We first verify the necessity of the \(1\times 1\) convolution in the long-range skip connections. Without the \(1\times 1\) convolution (w/o Conv), features from different residual blocks contribute equally to the final feature, regardless of whether they are high-level or low-level features. Table 2 shows that long-range skip connections without the \(1\times 1\) convolution result in worse quality than SRFeat\(_\text {M}\)-64. The table also shows that the network with long-range skip connections with the \(1\times 1\) convolution achieves higher quality than the network without long-range skip connections (w/o Skip), which verifies the effectiveness of long-range skip connections with the \(1\times 1\) convolution.

Table 2. Comparison between variations of our generator network.

4.4 Evaluation of the Fully Trained Generator

We evaluate the performance of our GAN-based final generator. Existing quantitative assessment measures such as PSNR and SSIM are not appropriate for measuring the perceptual quality of images. To provide a measure reasonably correlated with human perception, Sajjadi et al. [40] used object recognition performance. They first downsample original HR images and perform SISR on those images. Then, they apply a state-of-the-art object recognition model to the SR results as well as the original HR images. They assume that the gap between the object recognition accuracies on those results reflects the degradation of perceptual quality. We also adopt this approach to validate the perceptual quality of our method.

We used the official Caffe model of ResNet-50 [18] as the recognition model, which achieves state-of-the-art classification accuracy. For evaluation, we used the first 1000 images from the validation set of the ILSVRC2016 CLS-LOC dataset, as done in [40]. To compute the baseline accuracy, we resized the images to have 256 pixels along the shorter side and cropped the center \(224\times 224\) pixels, as done in [18]. Then, we made four different degraded versions of the dataset by downsampling the images to \(56\times 56\) and applying four different versions of our generator network: SRFeat\(_\text {M}\) trained with MSE, SRFeat\(_\text {I}\) trained with the perceptual loss and the image GAN loss but without the feature GAN loss, and SRFeat\(_\text {IF}\)-64 and SRFeat\(_\text {IF}\) trained with all loss terms. All the networks use 128 filters in their convolution layers except SRFeat\(_\text {IF}\)-64, which uses 64 filters. We also report the error rates of [40] taken from their paper, although the baseline error rates reported in that paper using the same ResNet-50 network are slightly different from ours (e.g., Top-5 error rate: 7.1% in ours and 7.2% in [40]). We suspect that the gap comes from differences in deep learning platforms such as Caffe [21] and Tensorflow [1].
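A sketch of this classification-based protocol with torchvision's ResNet-50 standing in for the Caffe model actually used: resize the shorter side to 256, center-crop \(224\times 224\) for the baseline, optionally degrade the crop to \(56\times 56\) and super-resolve it back to \(224\times 224\), then check Top-5 agreement with the ground-truth label. ImageNet mean/std normalization and the SR model's expected input range are omitted for brevity, and the downsampling interpolation is an assumption.

```python
import torch
from torchvision import transforms
from torchvision.models import resnet50

resnet = resnet50(weights="IMAGENET1K_V1").eval()

baseline_tf = transforms.Compose([
    transforms.Resize(256),        # shorter side to 256 pixels
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

def top5_correct(img_pil, label, sr_model=None):
    x = baseline_tf(img_pil)
    if sr_model is not None:
        lr = transforms.functional.resize(x, [56, 56])        # degrade to 56x56
        x = sr_model(lr.unsqueeze(0)).squeeze(0).clamp(0, 1)  # super-resolve to 224x224
    with torch.no_grad():
        logits = resnet(x.unsqueeze(0))
    return label in logits.topk(5, dim=1).indices[0].tolist()
```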

Table 3. Performances of classification tests using images from the validation dataset of ILSVRC 2016. The baseline error rate was calculated from the inference results of ResNet-50 for the original \(224\times 224\) cropped images. SRFeat\(_\text {I}\) and SRFeat\(_\text {IF}\) denote our networks trained using GAN-based perceptual losses without and with the feature GAN loss, respectively.
Fig. 4. Samples of original input and SR images used in the classification test. Top row: original images (\(224\times 224\)). Bottom row: SR images (\(224\times 224\)) and the LR images (\(56\times 56\)) at the lower right corners.

The results are shown in Table 3. Obviously, our SRFeat\(_\text {M}\) without GAN shows much worse accuracy than the baseline obtained using the original images, as it generates blurry images without high-frequency details. However, our SRFeat\(_\text {I}\) with the image GAN loss considerably improves the accuracy by restoring textures lost in downsampling. With our feature GAN loss (SRFeat\(_\text {IF}\)), the gap between the baseline and ours shrinks to 3.9% in the case of the Top-5 error. Figure 4 shows some samples drawn from the validation dataset. From the samples, we can see that this accuracy is reasonable, as the perceptual quality difference between the original images and our results is not significant. The gap between SRFeat\(_\text {I}\) and SRFeat\(_\text {IF}\) in Top-5 error (0.9) is larger than the gap between SRFeat\(_\text {IF}\)-64 and SRFeat\(_\text {IF}\) in Top-5 error (0.8), which implies the effectiveness of our feature GAN loss. There is also a large gap between EnhanceNet [40] and all our networks except SRFeat\(_\text {M}\), which clearly shows the effectiveness of our method.

We also qualitatively exhibit the improvement in perceptual quality obtained by employing the feature GAN loss. As shown in Fig. 5, our feature GAN loss suppresses noisy high frequencies, while generating perceptually plausible structured textures. Figure 6 shows a qualitative comparison of GAN-based SR methods. EnhanceNet results have high-frequency artifacts around edges, and SRGAN results have blurry structural textures. On the other hand, our results have naturally synthesized sharp details without blurriness or high-frequency artifacts thanks to our feature GAN loss. We refer the readers to the supplementary material for more results including a user study.

Fig. 5. Qualitative comparison between our models without the feature GAN loss (SRFeat\(_\text {I}\)) and with the feature GAN loss (SRFeat\(_\text {IF}\)). In all examples, SRFeat\(_\text {IF}\) generates more realistic textures than SRFeat\(_\text {I}\) while suppressing arbitrary high-frequency artifacts.

Fig. 6. Qualitative comparison of GAN-based SR methods with our results at scaling factor 4. Result images of the other methods are taken from their websites.

5 Discussion and Conclusion

We proposed a novel SISR method that can produce perceptually pleasing images by employing two discriminators: an image discriminator and a feature discriminator. In particular, our feature discriminator encourages the generator to produce structural high-frequency details rather than noisy artifacts. We also proposed a novel generator network architecture employing long-range skip connections for more effective propagation of information between distant layers. Experiments showed that our method achieves state-of-the-art performance both quantitatively and qualitatively.

For the feature GAN loss and perceptual similarity loss, our network uses features from only one fixed layer. However, we found that the optimal layer for the feature GAN loss and perceptual similarity loss depends on image content. Therefore, we may further improve perceptual quality if we can adaptively choose a layer according to image content. We leave this content-dependent SR as future work. Applying the GAN framework to feature maps may also be beneficial to other problems besides SR. Exploring other applications would be another interesting direction for future work.