
1 Introduction

Continuous improvements in the quality of tiny camera sensors and lenses have made smartphone photography come into vogue. However, from the viewpoint of aesthetics, photos captured by mobile phones still cannot attain DSLR quality because of their compact sensors and lenses. Larger sensors are conducive to improving image quality, reducing noise and shooting night scenes. In order to automatically translate low-quality mobile phone pictures into high-quality images, Ignatov et al. [11] propose an end-to-end deep learning approach that uses a composite perceptual error function combining content, color, and texture losses, where the content loss is simply defined as the VGG loss based on the ReLU activation layers of the pre-trained 19-layer VGG network described in [25]. The authors also present a weakly-supervised approach in [12] to remove the requirement of matched input/target training image pairs. Though the above methods have achieved remarkable results, they still have deficiencies to be addressed. One limitation of the existing CNN-based methods is that researchers keep deepening the generator network to reach better performance, which leads to substantial computational cost and memory consumption and, in turn, increasing power consumption. Therefore, these methods are not well suited to real mobile phone applications. The other issue is the artifacts and amplified noise that appear in the images processed by [11], which degrade the user experience.

To tackle these issues, we propose a novel CNN-based image enhancement approach, which introduces teacher-student information transfer to boost the performance of a compact student network and the contextual loss proposed in [22, 23] to preserve the nature of images. Moreover, we combine adversarial (GAN) [9], color, and total variation losses to learn photo-realistic image quality. Finally, to guarantee structural preservation of the enhanced images, we employ the SSIM loss as a constraint term. Fig. 1 depicts an example of image enhancement.

Fig. 1. DPED image enhanced by our method.

The main contributions of the perception-preserving CNN are summarized as follows:

  • We propose a novel compact network for single image enhancement as illustrated in Fig. 3, which adopts 1-D separable kernels and dilated convolutions to expand the network receptive field.

  • We exploit knowledge transfer to promote the performance of the student network.

  • We employ contextual and SSIM losses to maintain the nature of the image.

  • An effective network architecture for single image super-resolution is devised as shown in Fig. 2, which can quickly super-resolve low-resolution images.

  • Our proposed method achieves superior performance compared with the state-of-the-art methods.

2 Related Work

The problem of image quality enhancement is part of the image-to-image translation task. In this section, we introduce several related works from the image transformation field.

2.1 Image Enhancement

We build our solution upon recent advances in image-to-image translation networks. Ignatov et al. [11] propose an end-to-end enhancer achieving photo-realistic results for arbitrary image resolutions by combining content, texture and color losses. However, it still has disadvantages, such as slow inference speed and results with artifacts (color deviations and overly high contrast) and noise. The authors also present WESPE [12], a weakly supervised solution for the image quality enhancement problem. This approach is trained to map low-quality photos into the domain of high-quality photos without requiring labeled data; only images from two different domains are needed.

2.2 Image Super-Resolution

Single image super-resolution aims to recover a visually pleasing high-resolution (HR) image from a low-resolution (LR) one. Dong et al. [4, 5] first exploit a three-layer convolutional neural network, named SRCNN, to approximate the complex nonlinear mapping between the LR image and its HR counterpart. To reduce computational complexity, the authors propose a fast SRCNN (FSRCNN) [6], which adopts a transposed convolution to perform the upscaling operation at the output layer. Kim et al. [15] present a very deep super-resolution network (VDSR) with a residual architecture to achieve eminent SR performance, which utilizes broader contextual information with a larger model capacity. Lai et al. propose the Laplacian pyramid super-resolution network (LapSRN) [17] to progressively reconstruct the sub-band residuals of high-resolution images. Tai et al. [26] present a deep recursive residual network (DRRN), which employs a parameter-sharing strategy. The authors also propose a very deep end-to-end persistent memory network (MemNet) [27] for image restoration, which tackles the long-term dependency problem of previous CNN architectures. The aforementioned approaches focus on improving objective evaluation metrics, while Ledig et al. [18] achieve photo-realistic super-resolution results by using a VGG-based loss function [14] and adversarial networks [9].

2.3 Image Deraining

Rain is a common weather condition. Since it can obstruct the line of sight, removing rain streaks and recovering the background from rainy images is a significant task for subsequent image processing. Recently, several deep-learning-based deraining methods have achieved promising performance. Fu et al. [7, 8] first introduce deep learning methods to the deraining problem. Yang et al. [30] design a deep recurrent dilated network to jointly detect and remove rain streaks. Zhang et al. [34] propose a density-aware image deraining method with a multi-stream densely connected network for joint rain-density estimation and deraining. Li et al. [19] design a scale-aware multi-stage recurrent network that estimates rain streaks of different sizes and densities individually.

2.4 Contextual Loss

Mechrez et al. [22, 23] design a loss function that measures the dissimilarity between a generated image x and a target image y, represented by feature sets \(X = \left\{ {{x_i}} \right\} \) and \(Y = \left\{ {{y_i}} \right\} \), respectively. Let \({A_{ij}}\) denote the affinity between features \({x_i}\) and \({y_j}\). The contextual loss is defined as:

$$\begin{aligned} {\mathcal{L}_{CX}}\left( {x,y} \right) = - \log \left( {\frac{1}{M}\sum \limits _j {\mathop {\max }\limits _i {A_{ij}}} } \right) \end{aligned}$$
(1)

The affinities \({{A_{ij}}}\) are defined in a way that promotes a single close match of each feature \({y_j}\) with some feature in X. To implement this, the cosine distances \({d_{ij}}\) are first computed between all pairs \({x_i}\), \({y_j}\). The distances are then normalized: \({\tilde{d}_{ij}} = {d_{ij}}/\left( {{{\min }_k}{d_{ik}} + \epsilon } \right) \) (with \(\epsilon = 10^{-5}\)), and finally the pairwise affinities \({A_{ij}} \in \left[ {0,1} \right] \) are defined as:

$$\begin{aligned} A_{ij} = \frac{\exp {\left( {1 - \tilde{d}_{ij}}/{h} \right) }}{\sum _{l}{\exp {\left( {1 - \tilde{d}_{il}}/{h} \right) }}} = {\left\{ \begin{array}{ll} \approx 1 &{} \text {if } \tilde{d}_{ij}\!\ll \!\tilde{d}_{il} \qquad \forall l\ne j\\ \approx 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(2)

where \(h > 0\) is a bandwidth parameter.
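For concreteness, the following is a minimal TensorFlow sketch of Eqs. (1) and (2), assuming the two feature sets are given as row-wise matrices (e.g., flattened VGG-19 ‘conv4_2’ activations); the bandwidth value and the plain cosine distance are simplifications of the full formulation in [22, 23].

```python
import tensorflow as tf

def contextual_loss(x_feats, y_feats, h=0.5, eps=1e-5):
    """x_feats: [M_x, C] features of the generated image; y_feats: [M_y, C] features of the target."""
    x = tf.nn.l2_normalize(x_feats, axis=1)
    y = tf.nn.l2_normalize(y_feats, axis=1)
    d = 1.0 - tf.matmul(x, y, transpose_b=True)                    # cosine distances d_ij, shape [M_x, M_y]
    d_tilde = d / (tf.reduce_min(d, axis=1, keepdims=True) + eps)  # d~_ij = d_ij / (min_k d_ik + eps)
    w = tf.exp((1.0 - d_tilde) / h)
    a = w / tf.reduce_sum(w, axis=1, keepdims=True)                # affinities A_ij of Eq. (2)
    cx = tf.reduce_mean(tf.reduce_max(a, axis=0))                  # (1/M) sum_j max_i A_ij
    return -tf.math.log(cx)                                        # Eq. (1)
```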

2.5 Knowledge Transfer

This line of research aims at distilling knowledge from a complicated teacher model into a compact student model without a performance drop. Recently, Zagoruyko et al. [32] present several ways of transferring attention from one network to another, evaluated on several image recognition datasets. Yim et al. [31] propose a novel approach to generating distilled knowledge from a DNN, which defines the distilled knowledge as the flow of the solving procedure, calculated with the proposed FSP matrix.

3 Proposed Method

In this section, we first describe the proposed solution for single image super-resolution (SR) task and then introduce the image quality enhancement on smartphones.

3.1 Single Image Super-Resolution

As shown in Fig. 2, the presented SR method first adopts two convolutional layers with stride 2 to reduce the resolution of the feature maps, which dramatically decreases the computational cost during the testing phase. The following operations are two residual blocks, each of which consists of two residual modules and one transition convolution. Finally, we employ a global residual for fast model optimization and an upsampler composed of two convolutions with \(3 \times 3\) kernels and the sub-pixel convolution [24].
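A hedged sketch of this layout is given below; the filter counts, activation functions, the inner structure of each residual module and the exact placement of the global residual are assumptions, and only the overall topology follows the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_module(x, filters):
    # Assumed two-convolution residual module; the paper does not detail its internals.
    y = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    return layers.Add()([x, y])

def build_sr_net(filters=64, scale=4):
    inp = layers.Input(shape=(None, None, 3))       # bicubically upscaled LR image
    x = layers.Conv2D(filters, 3, strides=2, padding='same', activation='relu')(inp)
    x = layers.Conv2D(filters, 3, strides=2, padding='same', activation='relu')(x)  # two stride-2 convolutions
    skip = x
    for _ in range(2):                              # two residual blocks ...
        for _ in range(2):                          # ... each with two residual modules
            x = residual_module(x, filters)
        x = layers.Conv2D(filters, 3, padding='same')(x)  # ... and one transition convolution
    x = layers.Add()([x, skip])                     # global residual (placement assumed)
    # Upsampler: two 3x3 convolutions followed by a sub-pixel (depth-to-space) shuffle [24].
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    x = layers.Conv2D(3 * scale * scale, 3, padding='same')(x)
    out = layers.Lambda(lambda t: tf.nn.depth_to_space(t, scale))(x)
    return tf.keras.Model(inp, out)
```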

Fig. 2. The schematics of the proposed network for image super-resolution.

When it comes to the loss function, a mean absolute error (MAE) loss and a structural similarity index (SSIM) loss are applied to our SR method. Given a training set \(\left\{ {I_{LR}^i,I_{HR}^i} \right\} _{i = 1}^N\) that contains N LR inputs and their HR counterparts, the \({L_1}\) loss can be formulated as follows:

$$\begin{aligned} {\mathcal{L}_{MAE}} = \frac{1}{N}\sum \limits _{i = 1}^N {{{\left\| {I_{HR}^i - G\left( {I_{LR}^i} \right) } \right\| }_1}}, \end{aligned}$$
(3)

where G denotes the proposed SR network. In addition, the SSIM loss is defined as follows:

$$\begin{aligned} {\mathcal{L}_{SSIM}} = \frac{1}{N}\sum \limits _{i = 1}^N {1 - SSIM\left( {I_{HR}^i,G\left( {I_{LR}^i} \right) } \right) }, \end{aligned}$$
(4)

where,

$$\begin{aligned} SSIM\left( {x,y} \right) = \frac{{2{\mu _x}{\mu _y} + {C_1}}}{{\mu _x^2 + \mu _y^2 + {C_1}}} \cdot \frac{{2{\sigma _{xy}} + {C_2}}}{{\sigma _x^2 + \sigma _y^2 + {C_2}}}, \end{aligned}$$
(5)

where \({{\mu _x}}\) and \({{\mu _y}}\) are the means of x and y, \(\sigma _x^2\) and \(\sigma _y^2\) are their variances, \({{\sigma _{xy}}}\) is the covariance of x and y, and \({C_1}\), \({C_2}\) are constants. Therefore, the total loss can be expressed as

$$\begin{aligned} {\mathcal{L}_{total}} = {\mathcal{L}_{MAE}} + 25{\mathcal{L}_{SSIM}} \end{aligned}$$
(6)
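A compact sketch of Eqs. (3)–(6), assuming images are scaled to the range [0, 1]:

```python
import tensorflow as tf

def sr_total_loss(hr, sr):
    """hr, sr: [N, H, W, 3] ground-truth and super-resolved batches in [0, 1]."""
    mae = tf.reduce_mean(tf.abs(hr - sr))                            # Eq. (3)
    ssim = tf.reduce_mean(1.0 - tf.image.ssim(hr, sr, max_val=1.0))  # Eq. (4)
    return mae + 25.0 * ssim                                         # Eq. (6)
```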

3.2 Single Image Enhancement

For image quality enhancement, we aim to adjust the contrast, suppress noise and enhance image details. Considering that runtime is a vital aspect of image processing on smartphones with limited computational resources, the enhancer must be lightweight and efficient. Moreover, since the resolution of the input is arbitrary, the model should be a fully convolutional network. Thus, we prune our generator (student) as much as possible. In Fig. 3, the upper model is the teacher generator with more convolution filters, and the lower one is the more compact student generator. This topology helps elevate the quantitative and qualitative performance of the student generator without increasing its parameters or computational cost.
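As a rough illustration, a generator built from the 1-D separable kernels and dilated convolutions mentioned in Sect. 1 could look as follows; the filter counts, the number of blocks and the exact layer arrangement are assumptions and do not reproduce Fig. 3.

```python
import tensorflow as tf
from tensorflow.keras import layers

def separable_dilated_block(x, filters, dilation=2):
    # 1-D separable kernels: a 1x3 convolution followed by a 3x1 convolution ...
    y = layers.Conv2D(filters, (1, 3), padding='same', activation='relu')(x)
    y = layers.Conv2D(filters, (3, 1), padding='same', activation='relu')(y)
    # ... plus a dilated convolution to enlarge the receptive field.
    y = layers.Conv2D(filters, 3, padding='same', dilation_rate=dilation, activation='relu')(y)
    return layers.Add()([x, y])

def build_generator(filters, num_blocks=4):
    inp = layers.Input(shape=(None, None, 3))   # fully convolutional: arbitrary input resolution
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(inp)
    for _ in range(num_blocks):
        x = separable_dilated_block(x, filters)
    out = layers.Conv2D(3, 3, padding='same', activation='sigmoid')(x)
    return tf.keras.Model(inp, out)

teacher = build_generator(filters=64)   # teacher: more convolution filters
student = build_generator(filters=16)   # student: pruned, more compact
```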

Fig. 3. The structure of the proposed generator for image enhancement.

Fig. 4. The structure of the proposed discriminator for image enhancement.

Another core component of this method is the set of loss functions. To make the enhanced pictures more photo-realistic, we follow the practice of Ignatov et al. [11], i.e., we assume the overall perceptual image quality can be decomposed into three components: (i) content quality, (ii) texture quality and (iii) color quality.

Content Loss. Inspired by [22, 23], we choose the contextual loss based on layer ‘conv4_2’ of the VGG-19 network [25]. In addition, to preserve the structural information of images, the SSIM loss given in Eq. 4 is also utilized. Thus, the content loss can be defined as

$$\begin{aligned} {\mathcal{L}_{content}} = \frac{4}{N}\sum \limits _{i = 1}^N {{\mathcal{L}_{CX}}\left( {G\left( {I_{input}^i} \right) ,I_{target}^i} \right) } + 25{\mathcal{L}_{SSIM}} , \end{aligned}$$
(7)

where \({I_{input}^i}\) and \({I_{target}^i}\) constitute the training pairs \(\left\{ {I_{input}^i,I_{target}^i} \right\} _{i = 1}^N\) and G represents the generator for image quality enhancement.

Texture Loss. Image texture quality is addressed by an adversarial discriminator, depicted in Fig. 4, which simply consists of 6 convolutional layers with leaky ReLU activations, 2 fully connected layers, and a sigmoid function. Following [11, 12], this discriminator is applied to grayscale images and is trained to identify the authenticity of a given image. The texture loss is defined as:

$$\begin{aligned} {\mathcal{L}_{texture}} = - \sum \limits _i {\log D\left( {G\left( {I_{input}^i} \right) } \right) }, \end{aligned}$$
(8)

where D is the discriminator as illustrated in Fig. 4.
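A minimal sketch of Eq. (8), assuming `discriminator` is the network of Fig. 4 with a final sigmoid; the small constant is added only for numerical stability.

```python
import tensorflow as tf

def texture_loss(discriminator, enhanced, eps=1e-8):
    gray = tf.image.rgb_to_grayscale(enhanced)       # the discriminator works on grayscale images [11, 12]
    d_out = discriminator(gray)                      # probability that each image is a real high-quality photo
    return -tf.reduce_sum(tf.math.log(d_out + eps))  # Eq. (8)
```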

Color Loss. Image color quality is measured by the MSE between blurred versions of the enhanced image \(\hat I = G\left( {{I_{input}}} \right) \) and the high-quality target \({I_{target}}\). The blurred enhanced image can be expressed as

$$\begin{aligned} {{\hat I}_b}\left( {i,j} \right) = \sum \limits _{k,l} {\hat I\left( {i + k,j + l} \right) \cdot {G_{k,l}}}, \end{aligned}$$
(9)

where \({G_{k,l}} = A\exp \left( { - \frac{{{{\left( {k - {\mu _x}} \right) }^2}}}{{2{\sigma _x}}} - \frac{{{{\left( {l - {\mu _y}} \right) }^2}}}{{2{\sigma _y}}}} \right) \) denotes the Gaussian blur kernel with \(A=0.053\), \({\mu _{x,y}} = 0\), and \({\sigma _{x,y}} = 3\), as proposed in [11, 12]. The target \({I_{target}}\) is blurred in the same way to obtain \({I_{target\_b}}\). Therefore, the color loss can be written as:

$$\begin{aligned} {\mathcal{L}_{color}} = \left\| {{{\hat I}_b} - {I_{target\_b}}} \right\| _2^2. \end{aligned}$$
(10)
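A minimal sketch of Eqs. (9)–(10); the kernel size is an assumption, while A, \(\mu\) and \(\sigma\) follow the values above, and the exponent uses \(2\sigma\) exactly as written in Eq. (9).

```python
import numpy as np
import tensorflow as tf

def gaussian_kernel(size=21, a=0.053, sigma=3.0):
    # Kernel size is assumed; A, mu = 0 and sigma follow the text.
    ax = np.arange(size) - size // 2
    k, l = np.meshgrid(ax, ax, indexing='ij')
    g = a * np.exp(-(k ** 2) / (2.0 * sigma) - (l ** 2) / (2.0 * sigma))
    return tf.constant(g[:, :, None, None], dtype=tf.float32)       # shape [size, size, 1, 1]

def color_loss(enhanced, target, kernel):
    k = tf.tile(kernel, [1, 1, 3, 1])                                # blur each RGB channel independently
    blur = lambda img: tf.nn.depthwise_conv2d(img, k, strides=[1, 1, 1, 1], padding='SAME')
    return tf.reduce_sum(tf.square(blur(enhanced) - blur(target)))   # squared L2 distance of Eq. (10)
```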

TV Loss. To suppress noise in the generated images, we add a total variation loss [2] defined as follows:

$$\begin{aligned} {\mathcal{L}_{tv}} = \frac{1}{{CHW}}\left\| {{\nabla _x}G\left( {{I_{input}}} \right) + {\nabla _y}G\left( {{I_{input}}} \right) } \right\| , \end{aligned}$$
(11)

where C, H, W are the dimensions of the enhanced image \(G\left( {{I_{input}}} \right) \).
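A short sketch of Eq. (11) in its anisotropic form, averaged over the batch (an implementation choice):

```python
import tensorflow as tf

def tv_loss(enhanced):
    """enhanced: [N, H, W, C] batch of generated images."""
    dy = tf.abs(enhanced[:, 1:, :, :] - enhanced[:, :-1, :, :])        # vertical differences
    dx = tf.abs(enhanced[:, :, 1:, :] - enhanced[:, :, :-1, :])        # horizontal differences
    n = tf.cast(tf.shape(enhanced)[0], tf.float32)
    chw = tf.cast(tf.reduce_prod(tf.shape(enhanced)[1:]), tf.float32)  # C * H * W
    return (tf.reduce_sum(dx) + tf.reduce_sum(dy)) / (n * chw)
```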

KD Loss. The knowledge distillation loss is used to boost the performance of the student model and is defined as follows:

$$\begin{aligned} {\mathcal{L}_{kd}} = \sum \limits _{j \in \mathcal{J}} {\left\| {\frac{{Q_S^j}}{{{{\left\| {Q_S^j} \right\| }_2}}} - \frac{{Q_T^j}}{{{{\left\| {Q_T^j} \right\| }_2}}}} \right\| _2}, \end{aligned}$$
(12)

where \(Q_S^j = vec\left( {F\left( {A_S^j} \right) } \right) \) and \(Q_T^j = vec\left( {F\left( {A_T^j} \right) } \right) \) are, respectively, the j-th pair of student and teacher mean feature maps in vectorized form, \(\mathcal{J}\) denotes the set of selected teacher-student layer pairs, and \(F\left( A \right) = \frac{1}{C}\sum \nolimits _{i = 1}^C {{A_i}} \) averages the activation tensor A over its C channels.
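A minimal sketch of Eq. (12), assuming matched pairs of student/teacher activation tensors are collected during the forward pass; averaging over the batch is an implementation choice.

```python
import tensorflow as tf

def kd_loss(student_acts, teacher_acts):
    """student_acts, teacher_acts: lists of matched [N, H, W, C] activation tensors."""
    loss = 0.0
    for a_s, a_t in zip(student_acts, teacher_acts):
        q_s = tf.reshape(tf.reduce_mean(a_s, axis=-1), [tf.shape(a_s)[0], -1])  # F(A): channel mean, vectorized
        q_t = tf.reshape(tf.reduce_mean(a_t, axis=-1), [tf.shape(a_t)[0], -1])
        q_s = tf.nn.l2_normalize(q_s, axis=1)                # Q / ||Q||_2
        q_t = tf.nn.l2_normalize(q_t, axis=1)
        loss += tf.reduce_mean(tf.norm(q_s - q_t, axis=1))   # Eq. (12), averaged over the batch
    return loss
```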

Sum of Losses. We formulate the total loss as the weighted sum of aforementioned losses as:

$$\begin{aligned} {\mathcal{L}_{total}} = 10{\mathcal{L}_{content}} + {\mathcal{L}_{texture}} + {\mathcal{L}_{color}} + 2 \times {10^3}{\mathcal{L}_{tv}} + 75{\mathcal{L}_{kd}}. \end{aligned}$$
(13)

4 Experiments

4.1 Datasets

Image Super-Resolution Task. Following the instructions of the Perceptual Image Restoration and Manipulation (PIRM) challenge on Perceptual Enhancement on Smartphones [13], we use the DIV2K dataset [1, 28, 29], which consists of 1000 high-quality 2K-resolution RGB images (800 training images, 100 validation images, and 100 test images). HR patches of size \(384 \times 384\) are randomly sampled from the HR images for training. An HR image patch and its corresponding LR image patch are treated as a training pair.

For testing, we evaluate the performance of our network on five widely used benchmark datasets: Set5 [3], Set14 [33], BSD100 [20], Urban100 [10], and Manga109 [21].

Image Enhancement Task. For the image enhancement task, we use the DPED dataset [11], which contains patches of size \(100 \times 100\) pixels for CNN training (139K, 160K and 162K pairs for BlackBerry, iPhone, and Sony, respectively). In this work, in accordance with the challenge guidelines, we consider only the sub-task of improving images from a very low-quality iPhone 3GS device. For testing, we use the 400 patches provided by the challenge.

4.2 Implementation and Training Details

Image Super-Resolution Task. We randomly extract 16 LR RGB patches of size \(96 \times 96\) and interpolate them bicubically with an upscaling factor of 4. We augment the LR patches with random horizontal flips and 90\(^{\circ }\) rotations. Experimentally, we set the initial learning rate to \(5 \times {10^{ - 4}}\) and decrease it by a factor of 5 every 1000 epochs (\(5 \times {10^4}\) iterations). The Adam optimizer [16] with \({\beta _1} = 0.9\), \({\beta _2} = 0.999\) is used to train our model.

Image Enhancement Task. Drawing on the experience of [11], we take 50 image patches of size \(100 \times 100\) as inputs. The learning rate is initialized to \(5 \times {10^{ - 4}}\) for all layers and decreased by a factor of 10 every \({10^4}\) iterations. We use the Adam optimizer [16] with \({\beta _1} = 0.9\), \({\beta _2} = 0.999\), and \(\epsilon = {10^{ - 8}}\) for training. To improve the performance of the student, we first train the teacher with the same training hyper-parameters and then use it to guide the training of the student network via Eq. 12.
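As a hedged illustration only, the optimizer and step-wise learning-rate decay described above could be set up as follows (TensorFlow 2-style API, whereas the original experiments used TensorFlow 1.8):

```python
import tensorflow as tf

# Learning rate: 5e-4, divided by 10 every 10^4 iterations (staircase decay).
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=5e-4,
    decay_steps=10_000,
    decay_rate=0.1,
    staircase=True)

optimizer = tf.keras.optimizers.Adam(
    learning_rate=schedule, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
```

The teacher is trained first with this configuration; the student is then trained with the same optimizer while the KD term of Eq. 12 is computed against the frozen teacher.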

All experiments are conducted on the Ubuntu 16.04 operating system with TensorFlow 1.8, a 3.7 GHz Intel i7-8700K CPU, 64 GB of memory and an Nvidia GTX 1080 Ti GPU.

4.3 Comparison with Baseline Methods

Table 1. Quantitative evaluation results in terms of PSNR and SSIM. Bold and italic indicate the best and second best methods, respectively.
Fig. 5. Visual comparison for \(4 \times \) SR on the Urban100 dataset.

Image Super-Resolution Task. To evaluate the performance of our proposed SR network, we compare it against two baseline approaches, SRCNN [4, 5] and VDSR [15]. Table 1 shows the average PSNR and SSIM values on the five benchmark datasets with a scaling factor of 4. From this table, we can see that the proposed method performs favorably against the benchmark results. Table 2 indicates that our solution better balances execution speed and performance. In Fig. 5, it is obvious that the fidelity of the geometric structure in our result is superior to that of the other methods. From Fig. 6, we can see that the color of the lines is closer to the ground truth.

Fig. 6. Visual comparison for \(4 \times \) SR on the B100 dataset.

Table 2. Performance of our method on track A.
Table 3. The effectiveness of knowledge transfer.

Image Enhancement Task. In order to better transfer the model to practical applications, we must weigh the performance against the speed of image enhancement. From Table 3, the teacher network achieves high performance in terms of PSNR and MS-SSIM, but its execution speed is relatively slow. It is worth noting that the student model with L1 and VGG losses is the version we submitted to the challenge. We experimentally find that when these two losses are removed, the performance of the proposed student network is prominently improved, as shown in the third row of Table 3. Regarding runtime, the three student models have the same computational complexity, and the timing differences in Table 3 are caused by measurement error. In Fig. 7, the result generated by the teacher model appears more realistic and the wood grain is clearer, but in terms of color saturation, the student network performs better. DPED [11] produces color deviations in Fig. 8, whereas the student model successfully suppresses this typical artifact.

The previous results of our student model with L1 and VGG losses are shown in Table 4; this model ranked 2nd in the challenge. Trained with the losses in Eq. 13, we further improve our model, as shown in Table 3.

Fig. 7. Visual qualitative comparison of the “7” image in the DPED test images.

4.4 Ablation Study

Effectiveness of Knowledge Transfer. To demonstrate the effectiveness of the proposed knowledge transfer, we remove the KD loss when training the student model, while the network structure and the other losses remain unchanged. Table 3 shows the effectiveness of knowledge transfer. From the visual assessment shown in Fig. 10, the image generated by the student model trained with knowledge transfer is more saturated in color and more expressive.

4.5 Limitations

Although visually realistic, the reconstructed images may contain emphasized high-frequency noise (see the image generated by the teacher model in Fig. 8). Notably, the image produced by the student model successfully suppresses this noise, but the result appears over-smoothed (Fig. 9).

Fig. 8. Visual qualitative comparison of the “8” image in the DPED test images.

Fig. 9. Visual qualitative comparison of the “10” image in the DPED test images.

5 Conclusions

In this paper, we propose the perception-preserving convolutional network (PPCN) to enhance image quality. Specifically, we devise a novel lightweight architecture that directly maps low-quality images to their DSLR-quality counterparts so as to adapt to environments with limited resources. To attain a more realistic visual effect, we introduce contextual and SSIM losses as the content loss. Furthermore, to improve the ability of the network, we adopt a knowledge transfer strategy, which enables the student model to learn information from the pre-trained teacher network. In addition, we propose a compact network for the super-resolution task. Extensive experiments demonstrate the effectiveness of our proposed models.

Fig. 10. Result images for three student models.

Table 4. Performance of our method on track B.