
1 Introduction

In recent years, Single Image Super Resolution (SISR) has received considerable attention for its applications, which include surveillance imaging [1, 2], medical imaging [3, 4], and object recognition [5, 6]. Given a low-resolution (LR) image, SISR aims to reconstruct a super-resolved (SR) image that is as similar as possible to the original high-resolution (HR) image. This is an ill-posed problem since there are many possible ways to generate SR from LR.

Recent example-based methods using deep convolutional neural networks (CNNs) have achieved significant performance gains. However, most of these methods aim to maximize the peak signal-to-noise ratio (PSNR) between SR and HR, which tends to produce blurry and overly-smoothed reconstructions. In order to obtain non-blurry and realistic reconstructions, this paper considers the following three issues. First, standard GAN [7] (SGAN) based SISR methods, which are known to be effective in reconstructing natural images, are notoriously difficult to train and unstable. One reason might be that the generator is generally trained without taking real high-resolution images into account. Second, texture-rich high-resolution samples, which are generally difficult to reconstruct from low-resolution images, should be emphasized during training. Third, trading off between PSNR and perceptual quality at test time is impossible with existing methods without retraining. Existing methods are commonly trained to improve either PSNR or perceptual quality, and depending on the application, one objective might be better suited than the other.

Fig. 1. Super-resolution result comparison on image lenna from the Set14 dataset. Our method exhibits more convincing textures and perceptual quality compared to those of the state-of-the-art PSNR-based method.

To address these issues, this paper proposes a GAN based SISR method referred to as Perception-Enhanced Super-Resolution (PESR), which aims to enhance the perceptual quality of the reconstruction and to allow users to flexibly control the perceptual degree at test time. In order to improve GAN performance, PESR is trained to minimize a relativistic loss instead of an absolute one. While SGAN aims to generate data that looks real, PESR attempts to generate fake data that is more real than real data; this philosophy is extensively studied in [9] with the Relativistic GAN (RGAN). In PESR, valuable texture-rich samples are emphasized during training. It is observed that texture-rich patches, which play an important role in user-perceived quality, are more difficult to reconstruct. In training PESR, easy examples with smooth texture are therefore deemphasized by combining the GAN loss with a focal loss function. Furthermore, at test time, we propose a quality-control mechanism: the perceptual degree is controlled by interpolating between a perception-optimized model and a distortion-optimized model. Experimental results show that the proposed PESR achieves significant improvements compared to other state-of-the-art SISR methods.

The rest of this paper is organized as follows. Section 2 reviews various SISR methods. Section 3 presents the proposed networks and the loss functions used to train them. Section 4 presents extensive experimental results on six benchmark datasets. Finally, Sect. 5 summarizes and concludes the paper.

2 Related Work

2.1 Single Image Super-Resolution

To address the super-resolution problem, early methods were mostly based on interpolation, such as bilinear, bicubic, and Lanczos [10]. These methods are simple and fast but usually produce overly-smoothed reconstructions. To mitigate this problem, some edge-directed interpolation methods have been proposed [11, 12]. More advanced methods such as dictionary learning [13,14,15,16], neighborhood embedding [17,18,19] and regression trees [20, 21] aim to learn a complex mapping between low- and high-resolution image features. Although these methods have shown better results compared to their predecessors, their performance leaves much to be desired compared to that of recent deep architectures.

Deep architectures have made great strides in SISR. Dong et al. [22, 23] first introduced SRCNN for learning the LR-HR mapping in an end-to-end manner. Although SRCNN is only a three-convolutional-layer network, it outperformed previous methods. As expected, SISR also benefits from very deep networks. The 5-layer FSRCNN [24], 20-layer VDSR [25], and 52-layer DRRN [26] have shown significant improvements in terms of accuracy. Lim et al. [8] proposed a very deep modified ResNet [27] to achieve state-of-the-art PSNR performance.

Besides building very deep networks, utilizing advanced deep learning techniques leads to more robust, stable, and compact networks. Kim et al. [25] introduced residual learning for SISR, showing promising results simply by predicting the residual high-frequency components. Tai et al. [26] and Kim et al. [28] investigated recursive networks for SISR, which share parameters among recursive blocks and show superior performance with fewer parameters compared to previous work. Densely connected networks [29] have also been shown to be conducive to SISR [30, 31].

2.2 Loss Functions

The most common loss function used to maximize PSNR is the mean-squared error (MSE). Other losses such as L1 or Charbonnier (a differentiable variant of L1) have also been studied to improve PSNR. It is well known that pixel-wise loss functions produce blurry and overly-smoothed output as a result of averaging all possible solutions in the pixel space. As shown in Fig. 1, natural textures are missing even in the state-of-the-art PSNR-based method. In [32], Zhao et al. studied Structural Similarity (SSIM) and its variants as measures for evaluating the quality of the reconstruction in SISR. Although SSIM takes the image structure into account, this approach remains limited in recovering realistic textures.

Instead of using pixel-wise errors, high-level feature distances have been considered for SISR [5, 33,34,35]. The distance is measured based on feature maps extracted with a pre-trained VGG network [36]. Blau et al. [37] demonstrated that the distance between VGG features is well correlated with human opinion based quality assessment. Relying on the VGG features, a number of perceptual loss functions have been proposed. Instead of measuring the Euclidean distance between VGG features, Sajjadi et al. [5] proposed a Gram loss function which exploits correlations between feature activations. Meanwhile, Mechrez et al. [35] introduced the contextual loss, which aims to maintain the natural statistics of images.

To enhance training efficiency, images are cropped into multiple small patches. However, the training samples are usually dominated by a large number of easily reconstructable patches. When these easy samples overwhelm the generator, reconstructed results tend to be blurry and smooth. This is analogous to an observation in dense object detection [38], where background samples overwhelm the detector. A focal loss, which emphasizes difficult examples, should therefore be considered for SISR.

2.3 Adversarial Learning

Ever since they were first proposed by Goodfellow et al. [7], GANs have been incorporated into various tasks such as image generation, style transfer, domain adaptation, and super-resolution. The general idea of GANs is to train a generative model G to produce real-like fake data with the goal of fooling a discriminator D, while D is trained to distinguish between the generated data and real data. The generator G and the discriminator D compete with each other in an adversarial manner to achieve their individual objectives; thus, the generator mimics the real data distribution. In SISR, adversarial loss was introduced by Ledig et al. [34], generating images with convincing textures. Since then, GANs have emerged as the most common architecture for photo-realistic SISR [5, 35, 39,40,41]. Wang et al. [41] proposed a conditional GAN for SISR, where semantic segmentation probability maps are exploited as the prior. Yuan et al. [40] investigated cycle-in-cycle GANs for SISR, where HR labels are not available and LR images are further degraded by noise, showing promising results. In a recent study, Blau et al. [37] demonstrated that GANs provide a principled way to enhance perceptual quality for SISR.

2.4 Contribution

The four main contributions of this paper are as follows:

1. We demonstrate that stabilizing GAN training plays a key role in enhancing perceptual quality for SISR. When GAN performance is improved, the generated images are closer to the natural manifold.

2. We replace the SGAN loss function with the RGAN loss function to fully utilize the data at training time. A focal loss is used to emphasize valuable examples, and a total variance loss is added to mitigate the high-frequency noise amplification of adversarial training.

3. We propose a quality control scheme at test time that allows users to adaptively trade off between perception and fidelity.

4. We evaluate the proposed method using the recently-proposed quality metric [37] that encourages the SISR prediction to be close to the natural manifold. We quantitatively and qualitatively show that the proposed method achieves better perceptual quality compared to other state-of-the-art SISR algorithms.

3 Proposed Method

3.1 Network Architecture

The proposed PESR method utilizes the SRGAN architecture [34] with its generator replaced by the EDSR [8]. As shown in Fig. 2, a low-resolution image is first embedded by a convolutional layer before being fed into a series of 32 residual blocks. The spatial dimensions are maintained until the very end of the generator so that the computational cost is kept low. The output of the 32 residual blocks is summed with the embedded input, upsampled to the high-resolution space, and finally reconstructed into the super-resolved image.
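As an illustration, a minimal PyTorch sketch of this generator layout is given below. The 32-block depth, global skip connection, and late \(\times \)4 upsampling follow the description above; the channel width, kernel sizes, and PixelShuffle-based upsampling are assumptions of this sketch rather than the exact published configuration.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """EDSR-style residual block: conv-ReLU-conv plus a skip connection."""
    def __init__(self, n_feats=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(n_feats, n_feats, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(n_feats, n_feats, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class Generator(nn.Module):
    """LR embedding -> 32 residual blocks -> global skip -> x4 upsampling."""
    def __init__(self, n_feats=64, n_blocks=32, scale=4):
        super().__init__()
        self.embed = nn.Conv2d(3, n_feats, 3, padding=1)
        self.blocks = nn.Sequential(*[ResBlock(n_feats) for _ in range(n_blocks)])
        # Spatial size is kept at LR resolution; upsample only at the end.
        up = []
        for _ in range(scale // 2):
            up += [nn.Conv2d(n_feats, 4 * n_feats, 3, padding=1),
                   nn.PixelShuffle(2)]
        self.upsample = nn.Sequential(*up)
        self.reconstruct = nn.Conv2d(n_feats, 3, 3, padding=1)

    def forward(self, lr):
        x = self.embed(lr)
        x = x + self.blocks(x)        # global residual connection
        return self.reconstruct(self.upsample(x))
```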

Fig. 2. Architecture of the Generator and Discriminator networks.

The discriminator is trained to discriminate between generated and real high-resolution images. An image is fed into four basic blocks, each of which contains two convolutional layers followed by batch normalization and leaky ReLU activations. After the four blocks, a binary classifier consisting of two dense layers predicts whether the input is generated or real.
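A matching sketch of this discriminator, under the same caveats: the four two-convolution blocks with batch normalization and leaky ReLU and the two-layer classifier follow the description above, while the channel progression, strides, and global pooling are illustrative assumptions.

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Four conv blocks (conv-BN-LeakyReLU x2 each), then a 2-layer classifier."""
    def __init__(self, n_feats=64):
        super().__init__()
        layers, c_in = [], 3
        for i in range(4):
            c_out = n_feats * (2 ** i)
            # First conv keeps resolution; second downsamples by stride 2.
            layers += [nn.Conv2d(c_in, c_out, 3, padding=1),
                       nn.BatchNorm2d(c_out), nn.LeakyReLU(0.2, inplace=True),
                       nn.Conv2d(c_out, c_out, 3, stride=2, padding=1),
                       nn.BatchNorm2d(c_out), nn.LeakyReLU(0.2, inplace=True)]
            c_in = c_out
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(c_in, 256), nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(256, 1),   # critic score C(x); sigmoid applied in the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```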

The generator and discriminator are trained by alternating gradient updates based on their individual objectives, denoted as \(\mathcal {L}_G\) and \(\mathcal {L}_D\) respectively. To enhance stability and improve texture rendering, the generator loss is a linear combination of three loss functions: the focal RGAN loss \(\mathcal {L}_{FRG}\), the content loss \(\mathcal {L}_C\), and the total variance loss \(\mathcal {L}_{TV}\), as shown below:

$$\begin{aligned} \mathcal {L}_G = \alpha _{FRG} \mathcal {L}_{FRG} + \alpha _C \mathcal {L}_C + \alpha _{TV}\mathcal {L}_{TV}. \end{aligned}$$
(1)

Here \(\alpha _{FRG}\), \(\alpha _C\), and \(\alpha _{TV}\) are trade-off parameters. The three loss functions are described in more detail in the following subsections.
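For clarity, Eq. (1) can be written as a simple weighted sum; the default weights in this sketch anticipate the settings reported later in Sect. 4.3.

```python
def generator_loss(l_frg, l_c, l_tv, a_frg=1.0, a_c=50.0, a_tv=1e-6):
    """Total generator loss of Eq. (1); default weights from Sect. 4.3."""
    return a_frg * l_frg + a_c * l_c + a_tv * l_tv
```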

3.2 Loss Functions

Focal RGAN Loss. In the GAN setting, the input and output of the generator and the real samples are, respectively, the low-resolution image \(I^{LR}\), the generated super-resolved image \(I^{SR}\), and the original high-resolution image \(I^{HR}\). As in SGAN, a generator \(G_\theta \) and a discriminator \(D_\varphi \) are trained to optimize a min-max problem:

$$\begin{aligned} \min _\theta \max _\varphi \mathbb {E}_{I^{HR} \sim \mathbb {P}^{HR}}\log D_\varphi (I^{HR} ) + \mathbb {E}_{I^{LR} \sim \mathbb {P}^{LR}} \log (1 - D_\varphi (G_\theta (I^{LR}))). \end{aligned}$$
(2)

Here \(\mathbb {P}^{HR}\) and \(\mathbb {P}^{LR}\) are the distributions of the real data (original high-resolution images) and the fake data (low-resolution images), respectively. This min-max problem can be interpreted as minimizing explicit loss functions \(\mathcal {L}_{SG}\) and \(\mathcal {L}_{SD}\) for the generator and the discriminator, respectively, as follows:

$$\begin{aligned} \mathcal {L}_{SG} = - \mathbb {E}_{I^{LR} \sim \mathbb {P}^{LR}} \log ( D_\varphi (G_\theta (I^{LR}))), \end{aligned}$$
(3)

and

$$\begin{aligned} \mathcal {L}_{SD} = -\mathbb {E}_{I^{HR} \sim \mathbb {P}^{HR}}\log D_\varphi (I^{HR} ) - \mathbb {E}_{I^{LR} \sim \mathbb {P}^{LR}} \log (1 - D_\varphi (G_\theta (I^{LR}))). \end{aligned}$$
(4)

It is well known that SGAN is notoriously difficult and unstable to train, which results in low reconstruction performance. Furthermore, Eq. 3 shows that the generator loss function does not explicitly depend on \(I^{HR}\); in other words, the SGAN generator completely ignores the high-resolution image in its updates. Instead, the loss functions of both the generator and the discriminator should exploit the information provided by both the real high-resolution images and the synthesized images. The proposed method considers the relative discriminative score between \(I^{HR}\) and \(I^{SR}\) such that training becomes easier. This can be achieved by increasing the probability of classifying the generated high-resolution image as being real while simultaneously decreasing the probability of classifying the original high-resolution image as being real. Inspired by RGAN [9], the following loss functions for the generator and discriminator can be considered,

$$\begin{aligned} \mathcal {L}_{RG} = -\,\mathbb {E}_{(I^{LR}, I^{HR}) \sim (\mathbb {P}^{LR}, \mathbb {P}^{HR})} \log \left[ \sigma (C_\varphi (G_\theta (I^{LR})) - C_\varphi (I^{HR}))\right] , \end{aligned}$$
(5)

and

$$\begin{aligned} \mathcal {L}_{RD} = -\,\mathbb {E}_{(I^{LR}, I^{HR}) \sim (\mathbb {P}^{LR}, \mathbb {P}^{HR})} \log \left[ \sigma ( C_\varphi (I^{HR}) - C_\varphi (G_\theta (I^{LR})))\right] . \end{aligned}$$
(6)

Here \(C_\varphi \), referred to as the critic function [42], is the output of the discriminator taken before its last sigmoid function \(\sigma \).
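Since \(-\log \sigma (x)\) equals the binary cross-entropy with logits against a target of one, Eqs. (5) and (6) admit the following compact PyTorch sketch, where averaging over the mini-batch approximates the expectations:

```python
import torch
import torch.nn.functional as F

def rgan_losses(critic_real, critic_fake):
    """Relativistic losses of Eqs. (5)-(6).

    critic_real = C(I_HR), critic_fake = C(G(I_LR)): pre-sigmoid scores.
    """
    ones = torch.ones_like(critic_fake)
    # Generator (Eq. 5): push fake critic scores above real ones.
    loss_rg = F.binary_cross_entropy_with_logits(critic_fake - critic_real, ones)
    # Discriminator (Eq. 6): push real critic scores above fake ones.
    loss_rd = F.binary_cross_entropy_with_logits(critic_real - critic_fake, ones)
    return loss_rg, loss_rd
```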

The generator loss can be further enhanced to emphasize texture-rich patches, which tend to be difficult samples to reconstruct and thus incur a high loss \(\mathcal {L}_{RG}\). Emphasizing difficult samples and down-weighting easy ones leads to better texture reconstruction. This can be achieved by minimizing the focal loss function with a focusing parameter \(\gamma \):

$$\begin{aligned} \mathcal {L}_{FRG} = -\sum _i (1-p_i)^\gamma \log (p_i), \end{aligned}$$
(7)

where \(p_i = \sigma (C_\varphi (G_\theta (I_i^{LR})) - C_\varphi (I_i^{HR}))\).
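A corresponding sketch of Eq. (7); the small epsilon is a numerical-stability detail not present in the equation, and \(\gamma = 1\) follows the setting in Sect. 4.3.

```python
import torch

def focal_rgan_loss(critic_fake, critic_real, gamma=1.0, eps=1e-8):
    """Focal RGAN generator loss of Eq. (7)."""
    p = torch.sigmoid(critic_fake - critic_real)            # p_i
    return -((1.0 - p) ** gamma * torch.log(p + eps)).sum()
```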

Content Loss. Besides enhancing realistic textures, the reconstructed image should be similar to the original high-resolution image, which serves as the ground truth. Instead of considering pixel-wise accuracy, a perceptual loss that measures distance in a high-level feature space [33] is considered. The feature map, denoted as \(\phi \), is obtained using a pre-trained 19-layer VGG network. Following [34], the feature map is extracted right before the fifth max-pooling layer. The content loss function is defined as,

$$\begin{aligned} \mathcal {L}_C = \sum _i \Vert \phi (I_i^{HR}) - \phi (I_i^{SR}) \Vert ^2_2. \end{aligned}$$
(8)
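A possible PyTorch realization of Eq. (8) is sketched below, assuming torchvision's VGG-19 layout, in which the first 36 modules of the features block end right before the fifth max-pooling layer:

```python
import torch.nn as nn
from torchvision.models import vgg19

class ContentLoss(nn.Module):
    """VGG feature distance of Eq. (8)."""
    def __init__(self):
        super().__init__()
        # features[:36] ends right before pool5 in torchvision's VGG-19.
        self.phi = vgg19(pretrained=True).features[:36].eval()
        for p in self.phi.parameters():   # the feature extractor is frozen
            p.requires_grad = False

    def forward(self, sr, hr):
        return ((self.phi(hr) - self.phi(sr)) ** 2).sum()
```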

Total Variance Loss. High-frequency noise amplification is inevitable with GAN based synthesis, and in order to mitigate this problem, the total variance loss function [43] is considered. It is defined as

$$\begin{aligned} \mathcal {L}_{TV} = \sum _{i,j,k}\left( \left| I_{i, j+1, k}^{SR} - I_{i, j, k}^{SR} \right| + \left| I_{i, j, k+1}^{SR} - I_{i, j, k}^{SR} \right| \right) . \end{aligned}$$
(9)
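For an NCHW image batch, Eq. (9) reduces to summed absolute differences between neighboring pixels along the two spatial axes:

```python
def tv_loss(sr):
    """Anisotropic total variance loss of Eq. (9)."""
    d_h = (sr[:, :, 1:, :] - sr[:, :, :-1, :]).abs().sum()
    d_w = (sr[:, :, :, 1:] - sr[:, :, :, :-1]).abs().sum()
    return d_h + d_w
```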

4 Experiments

4.1 Dataset

The proposed networks are trained on DIV2K dataset [44], which consists of 800 high-quality (2K resolution) images. For testing, 6 standard benchmark datasets are used, including Set5 [17], Set14 [16], B100 [45], Urban100 [46], DIV2K validation set [44], and PIRM self-validation set [47].

4.2 Evaluation Metrics

To demonstrate the effectiveness of PESR, we measure GAN training performance and SISR image quality. The Fréchet Inception Distance (FID) [48] is used to measure GAN performance, where lower FID values indicate better image quality. In FID, feature maps \(\varvec{\psi }(I)\) are obtained by extracting the pool_3 layer of a pre-trained Inception V3 model [49]. Then, the extracted features are modeled under a multivariate Gaussian distribution with mean \(\varvec{\mu }\) and covariance \(\varvec{\varSigma }\). The FID \(d(\varvec{\psi }(I^{SR}), \varvec{\psi }(I^{HR}))\) between generated features \(\varvec{\psi }(I^{SR})\) and real features \(\varvec{\psi }(I^{HR})\) is given by [50]:

$$\begin{aligned} d^2( \varvec{\psi }(I^{SR}), \varvec{\psi }(I^{HR})) = \left\| \varvec{\mu }^{SR} - \varvec{\mu }^{HR} \right\| ^2_2 + \text {Tr}\left( \varvec{\varSigma }^{SR} + \varvec{\varSigma }^{HR} - 2\left( \varvec{\varSigma }^{SR}\varvec{\varSigma }^{HR}\right) ^{1/2}\right) . \end{aligned}$$
(10)
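Given pre-extracted pool_3 feature matrices with one row per image, Eq. (10) can be computed as follows; the SciPy matrix square root and the real-part cast are standard implementation details of this sketch:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_sr, feats_hr):
    """FID of Eq. (10) from N x 2048 Inception-V3 pool_3 features."""
    mu_sr, mu_hr = feats_sr.mean(axis=0), feats_hr.mean(axis=0)
    cov_sr = np.cov(feats_sr, rowvar=False)
    cov_hr = np.cov(feats_hr, rowvar=False)
    covmean = sqrtm(cov_sr @ cov_hr)
    if np.iscomplexobj(covmean):        # drop tiny imaginary residues
        covmean = covmean.real
    return float(np.sum((mu_sr - mu_hr) ** 2)
                 + np.trace(cov_sr + cov_hr - 2.0 * covmean))
```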

To evaluate SISR performance, we use a recently-proposed perceptual metric in [37]:

$$\begin{aligned} \text {Perceptual index} = \frac{(10 - \text {NRQM})+ \text {NIQE}}{2}, \end{aligned}$$
(11)

where NRQM and NIQE are the quality metrics proposed by Ma et al. [51] and Mittal et al. [52], respectively. Lower perceptual indexes indicate better perceptual quality. It is noted that the perceptual index in Eq. 11 is a no-reference metric, which does not reflect the distortion of SISR results. Therefore, the conventional PSNR metric is also used as a distortion reference.

4.3 Experiment Settings

Throughout the experiments, LR images are obtained by bicubically down-sampling HR images with a scaling factor of \(\times \)4 using the MATLAB imresize function. We pre-process all images by subtracting the mean RGB value of the DIV2K dataset. At training time, to enhance computational efficiency, the LR and HR images are cropped into patches of size \(48\times 48\) and \(192\times 192\), respectively. It is noted that our generator network is fully convolutional; thus, it can take input of arbitrary size at test time.

We train our networks with the Adam optimizer [53], setting \(\beta _1 = 0.9\), \(\beta _2 = 0.999\), and \(\epsilon = 10^{-8}\). The batch size is set to 16. We initialize the generator using the L1 loss for \(2\times 10^5\) iterations, then alternately optimize the generator and discriminator with our full loss for another \(2\times 10^5\) iterations. The trade-off parameters of the loss function are set to \(\alpha _{FRG}=1\), \(\alpha _{C}=50\), and \(\alpha _{TV}=10^{-6}\). We use a focusing parameter of 1 for the focal loss. The learning rate is initialized to \(10^{-4}\) for pretraining and \(5\times 10^{-5}\) for GAN training, and is halved after \(1.2\times 10^5\) batch updates.
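A sketch of this schedule is given below; generator, discriminator, and the loss functions are assumed from the earlier sketches, and loader stands for a hypothetical data loader yielding LR/HR patch pairs.

```python
import torch

# Optimizer and learning-rate schedule from the settings above (GAN phase).
g_opt = torch.optim.Adam(generator.parameters(), lr=5e-5,
                         betas=(0.9, 0.999), eps=1e-8)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=5e-5,
                         betas=(0.9, 0.999), eps=1e-8)
# Halve the learning rate after 1.2e5 batch updates.
g_sched = torch.optim.lr_scheduler.StepLR(g_opt, step_size=120_000, gamma=0.5)
d_sched = torch.optim.lr_scheduler.StepLR(d_opt, step_size=120_000, gamma=0.5)
content_loss = ContentLoss()

for lr_img, hr_img in loader:            # loader is a hypothetical DataLoader
    # Discriminator step (Eq. 6), generator frozen.
    with torch.no_grad():
        sr = generator(lr_img)
    _, loss_d = rgan_losses(discriminator(hr_img), discriminator(sr))
    d_opt.zero_grad()
    loss_d.backward()
    d_opt.step()

    # Generator step: full loss of Eq. (1) with the weights given above.
    sr = generator(lr_img)
    loss_g = generator_loss(
        focal_rgan_loss(discriminator(sr), discriminator(hr_img)),
        content_loss(sr, hr_img), tv_loss(sr))
    g_opt.zero_grad()
    loss_g.backward()
    g_opt.step()
    g_sched.step()
    d_sched.step()
```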

Our model is implemented in the PyTorch [54] deep learning framework and run on Titan Xp GPUs; the networks take 20 h to converge.

4.4 GAN Performance Measurement

To avoid underestimated FID values of the generator, the number of samples should be at least \(10^4\) [48]; hence the images are cropped into patches of \(32\times 32\). The proposed method is compared with the standard GAN (SGAN) [7], least-squares GAN (LSGAN) [55], hinge-loss GAN (HingeGAN) [56], and improved Wasserstein GAN (WGAN-GP) [57]. All the considered GANs are combined with the content and total variance losses. Table 1 shows that LSGAN performs the worst, with an FID of 18.5. HingeGAN, WGAN-GP, and SGAN show better results compared to LSGAN. Our method, which relies on RGAN, shows the best performance.

Table 1. FID comparison of RGAN with other GANs on DIV2K validation set.

4.5 Ablation Study

The effectiveness of the proposed method is demonstrated through an ablation analysis. As reported in Table 2, the perceptual index of L1 loss training is limited to 5.41; after training with the VGG content loss, the performance improves dramatically to 3.32. When adversarial training (RGAN) is added, the performance is further improved to 2.28. The total variance loss and focal loss yield slight additional improvements in perceptual index. The proposed method with the default setting (e) obtains the best performance of 2.25.

The effect of each component of the proposed loss function is also visually compared in Fig. 3. As expected, the L1 loss produces blurry and overly-smooth images. Although the VGG loss improves perceptual quality, the reconstruction results are still unnatural since they exhibit square patterns. When RGAN is added, the reconstruction results are more visually pleasing, with more natural textures and edges, and no square patterns are observed.

Fig. 3. Effect of each component in our loss function on the B100 dataset (images 163085, 38082, 19021, 351093 from top to bottom rows). Each column from (a) to (e) represents the setting described in Table 2.

Table 2. Ablation analysis in terms of perceptual index on B100 dataset.

4.6 Comparison with State-of-the-Art SISR Methods

In this subsection, we quantitatively and qualitatively compare our PESR with other state-of-the-art SISR algorithms. Here, PESR is benchmarked against SRCNN [23], VDSR [25], DRCN [28], EDSR [8], SRGAN [34], ENET [5], and CX [35]. The performance of bicubic interpolation is also reported as the baseline. The results of SRGAN are obtained from a TensorFlow implementation. For CX, the source code for the super-resolution task was unavailable; however, the authors of CX provided the generated images at our request. For the other methods, the results were obtained using publicly available source code.

Table 3. Perceptual index comparison of the proposed PESR with recent state-of-the-art SISR methods. Bold and italic indicate best and second best results, respectively.
Fig. 4. Qualitative comparison between our PESR and the others. Colored markers indicate the best and second best perceptual index. (Color figure online)

Quantitative Results. Table 3 lists the perceptual indexes of PESR and the seven other state-of-the-art SISR methods. As expected, GAN-based methods, including SRGAN [34], ENET [5], CX [35], and the proposed PESR, outperform the PSNR-based methods in terms of perceptual index by a large margin. SRGAN and ENET achieve the best results on the Set5 and Urban100 datasets, respectively; however, their performance is relatively limited on the other datasets. It is noted that ENET is trained on 200k images, far more than the other methods (at most 800 images). Our PESR achieves the best performance on 4 out of 6 benchmark datasets.

Qualitative Results. The visual comparison of our PESR with other state-of-the-art SISR methods is illustrated in Fig. 4. Overall, PSNR-based methods produce blurry and smooth images, while GAN-based methods synthesize more realistic textures. However, SRGAN, ENET, and CX exhibit limitations when textures are densely and structurally repeated, as in image 0804 from the DIV2K dataset. Meanwhile, our PESR provides sharper and more natural textures compared to the others.

4.7 Perception-Distortion Control at Test Time

In a number of applications such as medical imaging, synthesized textures are not desirable. To make our model robust and flexible, we propose a quality control scheme that interpolates between a perception-optimized model \(G_{\theta _P}\) and a distortion-optimized model \(G_{\theta _D}\). The \(G_{\theta _P}\) and \(G_{\theta _D}\) models are obtained by training our network with the full loss function and the L1 loss function, respectively. The perceptual quality degree is controlled by adjusting the parameter \(\lambda \) in the following equation:

$$\begin{aligned} I^{SR} = \lambda G_{\theta _P}(I^{LR}) + (1-\lambda )G_{\theta _D}(I^{LR}). \end{aligned}$$
(12)

Here, the network predicts the most accurate results when \(\lambda = 0\) and synthesizes the most perceptually-plausible textures when \(\lambda = 1\).
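Eq. (12) amounts to a convex combination of the two generators' outputs; a minimal sketch, with g_perception and g_distortion denoting the two trained instances:

```python
import torch

def controlled_sr(g_perception, g_distortion, lr_img, lam):
    """Blend of Eq. (12); lam = 0 favors fidelity, lam = 1 perception."""
    with torch.no_grad():
        return lam * g_perception(lr_img) + (1.0 - lam) * g_distortion(lr_img)
```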

We demonstrate that a flexible SISR method is effective in a number of cases. In Fig. 5, two types of textures are presented: a wire entanglement with sparse textures, and a shutter with dense textures. The results show that high perceptual quality weights provide more plausible visualizations for the dense textures, while reducing the weight is preferable for the easier, sparse ones. We also compare our interpolated results with the others, as shown in Fig. 6. It is clear that we can obtain better perceptual quality at the same PSNR, and vice versa, compared to the other methods.

Fig. 5. Perception-distortion trade-off with different perceptual quality weights.

Fig. 6. Our interpolated results in comparison with the others on the Set14 dataset. Left- and right-most triangle markers indicate \(\lambda \) being 1 and 0, respectively.

4.8 PIRM 2018 Challenge

The Perceptual Image Restoration and Manipulation (PIRM) 2018 challenge aims to produce images that are visually appealing to human observers. The authors participated in the super-resolution challenge, which seeks to improve perceptual quality while constraining the root-mean-squared error (RMSE) to be less than 11.5 (region 1), between 11.5 and 12.5 (region 2), or between 12.5 and 16 (region 3).

Our main target is region 3, which aims to maximize perceptual quality. We ranked 4th, with a perceptual index only 0.04 away from those of the top-ranking teams. For regions 1 and 2, we used interpolated results without any fine-tuning and ranked 5th and 6th, respectively. We believe further improvements can be achieved with fine-tuning and more training data.

5 Conclusion

We have presented a deep Generative Adversarial Network (GAN) based method, referred to as Perception-Enhanced Super-Resolution (PESR), for Single Image Super Resolution (SISR) that enhances the perceptual quality of the reconstructed images by addressing the following three issues: (1) easing GAN training by replacing the absolute discriminator with a relativistic one, (2) including in the loss function a mechanism to emphasize difficult training samples, which are generally rich in texture, and (3) providing a flexible quality control scheme at test time to trade off between perception and fidelity. Each component of the proposed method is demonstrated to be effective through the ablation analysis. Based on extensive experiments on six benchmark datasets, PESR outperforms recent state-of-the-art SISR methods in terms of perceptual quality.