
1 Introduction

Single image super-resolution (SR) aims to reconstruct a high-resolution (HR) image from a single low-resolution (LR) image [20]. It allows a system to overcome the limitations of LR imaging sensors or of intermediate image processing steps in multimedia systems. Several SR algorithms [17, 22, 24, 28, 29] have been proposed and applied in computer vision, image processing, surveillance systems, etc. However, SR remains challenging due to its ill-posedness: multiple HR images are valid solutions for a single LR image. Furthermore, the reconstructed HR image should be close to the real one and, at the same time, visually pleasant.

In recent years, various deep learning-based SR algorithms have been proposed in the literature. Many of them adopt convolutional neural network architectures following the super-resolution convolutional neural network (SRCNN) [5], which outperformed classical SR methods. They typically consist of two parts, a feature extraction part and an upscaling part. By improving these parts in various ways, recent deep learning-based SR algorithms have achieved significant gains in distortion-based quality measures such as root mean squared error (RMSE) and peak signal-to-noise ratio (PSNR) [7,8,9, 12, 14, 15].

However, it has recently been shown that there exists a trade-off between distortion and perception for image restoration problems, including SR [4]. In other words, as the mean distortion decreases, the probability of correctly discriminating the output image from the real one increases. Generative adversarial networks (GANs) are a way to approach the perception-distortion bound. This is achieved by controlling the relative contributions of the two losses popularly employed in GAN-based SR methods, namely a content loss and an adversarial loss [14]. For the content loss, a reconstruction loss such as the L1 or L2 loss is used. However, optimizing for the content loss usually leads to unnatural, blurry reconstruction, which improves the distortion-based performance but decreases the perceptual quality. On the other hand, focusing on the adversarial loss leads to perceptually better reconstruction but tends to decrease the distortion-based quality.

One of the keys to improving both distortion and perception is to incorporate a perceptual component into the content loss. In this regard, proper consideration of high frequency components would be helpful, because many perceptual quality metrics operate in the frequency domain [16, 19]. Not only traditional SR algorithms such as [10, 26] but also deep learning-based methods [6, 13] focus on restoring high frequency components. However, there have been few attempts to compare the real and fake (i.e., super-resolved) images in the frequency domain in GAN-based SR.

In this study, we propose a novel GAN model for SR that considers the trade-off between perception and distortion. Building on the good distortion-based performance of our base model, the deep residual network using enhanced upscale modules (EUSR) [9], the proposed GAN model is trained to improve both perception and distortion. Together with the conventional content loss for deep networks, we employ two additional loss functions, namely the discrete cosine transform (DCT) loss and the differential content loss. These losses directly consider the high frequency parts of the super-resolved images, which are relevant to how the human visual system perceives image quality. The proposed model ranked second among 13 participants in Region 1 of the PIRM Challenge [3] on perceptual super-resolution at ECCV 2018.

The rest of the paper is organized as follows. We first describe the base model of the proposed method and the proposed loss functions in Sect. 2. Then, in Sect. 3, we explain the experiments conducted for this study. The results and analysis are given in Sect. 4. Finally, we conclude the study in Sect. 5.

2 Proposed Method

2.1 Super-Resolution with Enhanced Upscaling Modules

As the generator in the proposed model, we employ the recently developed EUSR model [9]. Its overall structure is shown in Fig. 1. It is a multi-scale approach that performs reconstruction at three scales (\(\times 2\), \(\times 4\), and \(\times 8\)) simultaneously. Low-level features for each scale are extracted from the input LR image by two residual blocks (RBs). Higher-level features are then extracted by the residual module (RM), which consists of several local RBs, one convolution layer, and a global skip connection. Finally, for each scale, the extracted features are upscaled by enhanced upscaling modules (EUMs). This model showed good performance on several benchmark datasets in the NTIRE 2018 Challenge [23] in terms of PSNR and structural similarity (SSIM) [25]. We set the number of RBs in each RM to 80, which is larger than that used in [9] (i.e., 48), in order to enhance the learning capability of the network.

Fig. 1. Overall structure of the EUSR model [9]
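To make the structure concrete, the following Keras-style sketch shows one way to assemble an RM from local RBs. It is a minimal sketch only: the filter count and kernel size are our assumptions for illustration, as the exact layer configurations are given in [9].

```python
import tensorflow as tf

def residual_block(x, filters=64):
    # One local residual block (RB): conv-ReLU-conv with a skip connection.
    # The filter count and kernel size are assumptions, not taken from [9].
    y = tf.keras.layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    y = tf.keras.layers.Conv2D(filters, 3, padding='same')(y)
    return tf.keras.layers.Add()([x, y])

def residual_module(x, num_blocks=80, filters=64):
    # RM: a chain of local RBs, one convolution layer, and a global skip
    # connection; num_blocks=80 as used in this work.
    skip = x
    for _ in range(num_blocks):
        x = residual_block(x, filters)
    x = tf.keras.layers.Conv2D(filters, 3, padding='same')(x)
    return tf.keras.layers.Add()([skip, x])
```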

The discriminator network in the proposed method is based on that of the super-resolution using a generative adversarial network (SRGAN) model [14]. The network consists of 10 convolutional layers, each followed by leaky ReLU activation and batch normalization. The resulting feature maps are processed by two dense layers and a final sigmoid activation to estimate the probability that the input image is real (HR) rather than fake (super-resolved).
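A minimal sketch of such a discriminator is given below, assuming SRGAN-like filter counts and strides [14]; the exact configuration in our implementation may differ.

```python
import tensorflow as tf

def build_discriminator(input_shape=(96, 96, 3)):
    # 10 convolutional layers with leaky ReLU and batch normalization,
    # followed by two dense layers and a sigmoid output, as described above.
    # Filter counts and strides are assumptions in the style of SRGAN [14].
    x = inp = tf.keras.Input(shape=input_shape)
    filters = [64, 64, 128, 128, 256, 256, 512, 512, 512, 512]
    strides = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
    for i, (f, s) in enumerate(zip(filters, strides)):
        x = tf.keras.layers.Conv2D(f, 3, strides=s, padding='same')(x)
        if i > 0:  # assumed: no batch normalization after the first convolution
            x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.LeakyReLU(0.2)(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(1024)(x)
    x = tf.keras.layers.LeakyReLU(0.2)(x)
    out = tf.keras.layers.Dense(1, activation='sigmoid')(x)
    return tf.keras.Model(inp, out)
```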

2.2 Loss Functions

In addition to the conventional loss functions of GAN models for SR, i.e., the content loss (\({ l }_{ c }\)) and the adversarial loss (\({ l }_{ D }\)), we consider two more content-related losses to train the proposed model: the DCT loss (\({ l }_{ dct }\)) and the differential content loss (\({ l }_{ d }\)), which we call perceptual content losses (PCL) in this study. We therefore use four loss functions in total in order to improve both the perceptual quality and the distortion-based quality. The details of the loss functions are described below, followed by a code sketch of all four.

  • Content loss (\({ l }_{ c }\)): The content loss is a pixel-based reconstruction loss function. The L1-norm and L2-norm are generally used for SR. We employ the L1-norm between the HR image and SR image:

    $$\begin{aligned} { l }_{ c }=\frac{ 1 }{ WH } \sum _{ w }^{ }{ \sum _{ h }^{ }{ \left| { I }_{ w,h }^{ HR }-{ I }_{ w,h }^{ SR } \right| } }, \end{aligned}$$
    (1)

    where W and H are the width and height of the image, and \({ I }_{ w,h }^{ HR }\) and \({ I }_{ w,h }^{ SR }\) are the pixel values of the HR and SR images at horizontal index w and vertical index h, respectively.

  • Differential content loss (\({ l }_{ d }\)): The differential content loss evaluates the difference between the SR and HR images at a deeper level. It helps reduce over-smoothing and improves the reconstruction of high frequency components in particular. We also employ the L1-norm for the differential content loss:

    $$\begin{aligned} { l }_{ d }=\frac{ 1 }{ WH } \left( \sum _{ w }^{ }{ \left| { d }_{ x }{ I }_{ w }^{ HR }-{ d }_{ x }{ I }_{ w }^{ SR } \right| } +\sum _{ h }^{ }{ \left| { d }_{ y }{ I }_{ h }^{ HR }-{ d }_{ y }{ I }_{ h }^{ SR } \right| } \right) , \end{aligned}$$
    (2)

    where \({d}_{x}\) and \({d}_{y}\) are horizontal and vertical differential operators, respectively.

  • DCT loss (\({ l }_{ dct }\)): The DCT loss evaluates the difference between the DCT coefficients of the HR and SR images. This enables explicit comparison of the two images in the frequency domain. In other words, while different SR images can yield the same value of \({l}_{c}\), the DCT loss forces the model to generate the one whose frequency distribution is as similar as possible to that of the HR image. The L2-norm is employed for the DCT loss:

    $$\begin{aligned} { l }_{ dct }=\frac{ 1 }{ WH } \sum _{ w }^{ }{ \sum _{ h }^{ }{ { \left\| { DCT }({ I }^{ HR })_{ w,h }-{ DCT }({ I }^{ SR })_{ w,h } \right\| }^{ 2 } } } , \end{aligned}$$
    (3)

    where DCT(I) denotes the DCT coefficients of image I.

  • Adversarial loss (\({ l }_{ D }\)): The adversarial loss is used to enhance the perceptual quality. It is calculated as

    $$\begin{aligned} { l }_{ D }=-\log { \left( D({ I }^{ SR }|{ I }^{ HR }) \right) } , \end{aligned}$$
    (4)

    where \(D(\cdot )\) is the output probability of the discriminator, computed by applying a sigmoid to its logits [14]; it represents the probability that the input image is a real image.
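For concreteness, the sketch below implements Eqs. (1)-(4) with NumPy/SciPy on single-channel images; in training, the equivalent TensorFlow ops would be used. The stabilizing epsilon and the use of scipy.fft.dctn for the 2-D DCT are our choices, not prescribed by the equations.

```python
import numpy as np
from scipy.fft import dctn

def content_loss(hr, sr):
    # Eq. (1): pixel-wise L1 difference, averaged over all W*H pixels.
    return np.mean(np.abs(hr - sr))

def differential_content_loss(hr, sr):
    # Eq. (2): L1 difference between horizontal (d_x) and vertical (d_y)
    # differentials of the HR and SR images.
    n = hr.shape[0] * hr.shape[1]  # H * W
    dx = np.sum(np.abs(np.diff(hr, axis=1) - np.diff(sr, axis=1)))
    dy = np.sum(np.abs(np.diff(hr, axis=0) - np.diff(sr, axis=0)))
    return (dx + dy) / n

def dct_loss(hr, sr):
    # Eq. (3): squared difference between 2-D DCT coefficients.
    return np.mean((dctn(hr, norm='ortho') - dctn(sr, norm='ortho')) ** 2)

def adversarial_loss(d_prob_sr):
    # Eq. (4): -log D, where d_prob_sr is the discriminator's sigmoid
    # output for the super-resolved image.
    return -np.log(d_prob_sr + 1e-12)  # epsilon added for stability
```

In training, the generator minimizes a weighted combination of these four terms; as discussed in Sect. 1, the relative contributions control the balance between perception and distortion.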

3 Experiments

3.1 Datasets

We use the DIV2K dataset [1], which consists of 1000 RGB images of 2K resolution, to train the proposed model. LR training images are obtained by downscaling the original images via bicubic interpolation. For testing, we evaluate the performance of the SR models on several datasets: Set5 [2], Set14 [27], BSD100 [18], and the PIRM self-validation set [3]. Set5 and Set14 consist of 5 and 14 images, respectively, while BSD100 and the PIRM self-validation set each contain 100 challenging images. All testing experiments are performed with a scale factor of \(\times 4\), which is the target scale of the PIRM Challenge on perceptual super-resolution.

3.2 Implementation Details

For the EUSR-based generator in the proposed model, each RM contains 80 local RBs, and the upscaling part contains two local RBs. We first pre-train the EUSR model as a baseline on the training set of the DIV2K dataset [1]. In the pre-training phase, we use only the content loss (\({l}_{c}\)) as the loss function.

For each training step, we feed two image patches of size 48 \(\times \) 48, randomly cropped from LR images, into the networks. The patches are augmented by random rotation by three angles (90\(^{\circ }\), 180\(^{\circ }\), and 270\(^{\circ }\)) or horizontal flips. The Adam optimization method [11] with \(\beta _1 = 0.9\), \(\beta _2 = 0.999\), and \(\epsilon = {10}^{-8}\) is used for both the pre-training and training phases. The initial learning rate is set to \({10}^{-5}\), and the learning rate is halved every \({2\times 10}^{5}\) steps. A total of 500,000 training steps are executed. The networks are implemented in the TensorFlow framework. Training takes roughly two days on an NVIDIA GeForce GTX 1080 GPU.
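The optimization settings above can be expressed as follows. This is a sketch assuming current TensorFlow/Keras APIs, which may differ from the original implementation.

```python
import tensorflow as tf

# Initial learning rate 1e-5, halved every 2x10^5 steps (staircase decay).
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-5, decay_steps=200_000,
    decay_rate=0.5, staircase=True)

# Adam with the hyperparameters stated above.
optimizer = tf.keras.optimizers.Adam(
    learning_rate=lr_schedule, beta_1=0.9, beta_2=0.999, epsilon=1e-8)

def augment(patch):
    # Random rotation by a multiple of 90 degrees and a random horizontal
    # flip, matching the patch augmentation described above.
    k = tf.random.uniform([], minval=0, maxval=4, dtype=tf.int32)
    return tf.image.random_flip_left_right(tf.image.rot90(patch, k))
```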

3.3 Performance Measures

As proposed in [4], we measure the performance of the SR methods in terms of both distortion-based quality and perception-based quality. First, we measure the distortion-based quality of the SR images using RMSE, PSNR, and SSIM [25], computed by comparing the SR and HR images. In addition, we measure the perceptual quality of an SR image by [4]

$$\begin{aligned} Perceptual\ index\,({ I }_{ SR }) = \frac{ 1 }{ 2 } \left( \left( 10-Ma({ I }_{ SR }) \right) +NIQE({ I }_{ SR }) \right) , \end{aligned}$$
(5)

where \({I}_{SR}\) is an SR image, \(Ma(\cdot )\) is the quality score measure proposed in [16], and \(NIQE(\cdot )\) is the quality score of the natural image quality evaluator (NIQE) metric [19]. This perceptual index is also adopted to measure the performance of the SR methods in the PIRM Challenge on perceptual super-resolution [3]. The lower the perceptual index, the better the perceptual quality. We compute all metrics on the Y channel of the YCbCr representation converted from RGB, after discarding a 4-pixel border, as in [14].
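Computing Eq. (5) is then straightforward once the two scores are available. In the sketch below, ma_score and niqe_score are assumed to come from the reference implementations of [16] and [19].

```python
def perceptual_index(ma_score, niqe_score):
    # Eq. (5): lower values indicate better perceptual quality.
    # ma_score is from Ma et al. [16]; niqe_score is from NIQE [19].
    return 0.5 * ((10.0 - ma_score) + niqe_score)
```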

4 Results

We evaluate the performance of the proposed method against state-of-the-art SR algorithms: the super-resolution generative adversarial network (SRGAN) [14], SRResNet (the SRGAN model without the adversarial loss) [14], the dense deep back-projection networks (D-DBPN) [7], and the multi-scale deep Laplacian pyramid super-resolution network (MS-LapSRN) [12]. The bicubic upscaling method and the pre-trained EUSR model are also included. Our proposed model, named the deep residual network using enhanced upscale modules with perceptual content losses (EUSR-PCL), and SRGAN are adversarial networks; the others are non-adversarial models. Note that SRResNet and SRGAN have variants that are optimized in terms of MSE or in the feature space of a VGG net [21]. We consider SRResNet-VGG\(_{2,2}\) and SRGAN-VGG\(_{5,4}\) in this study, which show better perceptual quality among their variants. For the Set5, Set14, and BSD100 datasets, the SR images are either obtained from the supplementary materials of the respective methods (SRGAN, SRResNet, and MS-LapSRN) or reproduced from their pre-trained models (D-DBPN). For the PIRM set, the SR images of D-DBPN and EUSR are generated using their own pre-trained models.

Table 1. Performance of the SR methods in terms of distortion (i.e., RMSE, PSNR, and SSIM) and perception (i.e., perceptual index) for Set5 [2], Set14 [27], and BSD100 [18]. The methods are sorted in ascending order of the perceptual index.
Fig. 2. Examples of the HR image and the SR images of the seven methods for butterfly from the Set5 dataset [2]

Fig. 3. Examples of the HR image and the SR images of the seven methods for 86000 from the BSD100 dataset [18]

Table 1 shows the performance of the considered SR methods for the Set5, Set14, and BSD100 datasets. Our proposed model ranks second among the SR methods in terms of perceptual quality. The perceptual index of the proposed method lies between those of SRGAN and SRResNet, which are an adversarial network and the best non-adversarial model, respectively. In terms of PSNR and SSIM, EUSR-PCL outperforms both SRGAN and SRResNet. Compared with the other non-adversarial networks, i.e., EUSR, MS-LapSRN, and D-DBPN, our model shows slightly lower PSNR, while the perceptual quality is significantly improved. These results show that our model achieves a proper balance between the distortion and perception aspects.

Figures 2 and 3 show example images produced by the SR methods for qualitative evaluation. In Fig. 2, except for the bicubic interpolation method, most methods restore the high frequency details of the HR image to some extent. On closer inspection, however, the models show different qualitative results. The SR images in the bottom row (i.e., SRResNet, SRGAN, EUSR, and EUSR-PCL) show relatively better perceptual quality with less blurring, but the reconstructed details differ across methods. The images produced by SRGAN contain noise, although that method shows the best perceptual index for the Set5 dataset in Table 1. Our model scores lower than SRGAN in terms of the perceptual index, but the noise is less visible. In Fig. 3, the details of the SR image of EUSR-PCL are likewise perceptually better than those of SRGAN, although SRGAN shows a better perceptual index than EUSR-PCL for the BSD100 dataset in Table 1. These results imply that a proper balance between perception and distortion is important, and that our proposed model achieves it well.

Table 2. Performance of the SR methods in terms of distortion (i.e., RMSE, PSNR, and SSIM) and perception (i.e., perceptual index) for PIRM [3]. The methods are sorted in ascending order of the perceptual index.

The results for the PIRM dataset [3] are summarized in Table 2. In this case, we also consider variants of the EUSR-PCL model in order to examine the contributions of the losses. In the table, EUSR-PCL denotes the proposed model with all loss functions described in Sect. 2. EUSR-PCL (\({l}_{c}\)) is the basic GAN model based on EUSR; EUSR-PCL (\({l}_{c}+{l}_{dct}\)) adds the DCT loss to the content loss; and EUSR-PCL (\({l}_{c}+{l}_{d}\)) adds the differential content loss instead. In all cases, the adversarial loss is included. EUSR-PCL performs best in terms of perception among all methods in the table. Although the PSNR values of the EUSR-PCL variants are slightly lower than those of EUSR and D-DBPN, their perceptual quality scores are better. Comparing EUSR-PCL and its variants reveals the effectiveness of the perceptual content losses: when both perceptual content losses are included, we obtain the best performance in terms of both perception and distortion.

Figure 4 shows example SR images for the PIRM dataset. The images obtained by the EUSR-PCL models in the bottom row have better perceptual quality and are less blurry than those of the other methods. As mentioned above, these models show lower PSNR values, but the reconstructed images are better in terms of perception. Comparing the variants of EUSR-PCL reveals slight differences in the resulting images, particularly in the details. For instance, EUSR-PCL (\({l}_{c}+{l}_{dct}\)) generates a noisier SR image than EUSR-PCL. Although the differences between their quality scores in Table 2 are small, the resulting images show noticeable perceptual differences. This demonstrates that improving the perceptual quality of SR is important, and that the proposed method achieves good performance for perceptual SR.

Fig. 4. Examples of the HR image and the SR images of the seven methods for 6 from the PIRM self-validation set [3]

5 Conclusion

In this study, we developed perceptual content losses and proposed a GAN model to properly address the trade-off between perception and distortion. We proposed two perceptual content loss functions, the DCT loss and the differential content loss, to train the EUSR-based GAN model. The results showed that the proposed method is effective for SR when both the perception and distortion aspects are considered.