
1 Introduction

Limited by imaging equipment, digital images are often low-resolution (LR) and lack high-frequency components. Single image super-resolution (SISR) aims to recover a high-resolution (HR) image from its LR input. The technique has been widely adopted in fields such as security monitoring and medical imaging, where extra image detail is required.

In recent years, convolutional neural networks (CNNs) have driven progress in SISR methods. Dong et al. [2] demonstrate the validity of learning a nonlinear LR-to-HR mapping in an end-to-end manner. They further deploy a shrinking layer and an expanding layer [3] in a model termed FSRCNN, which reduces the number of parameters while maintaining accuracy. Kim et al. [6] replace the earlier shallow network with a plain structure composed of 20 convolutional layers. This very deep CNN (VDSR) achieves a substantial increase in accuracy. Subsequent articles present various ingenious methods, but only a few of them surpass VDSR in accuracy.

To further improve the accuracy of current SR methods, we propose a high-accuracy deep convolutional network (HDCN) that reconstructs images at a fixed upscaling factor. By cascading HDCN, we achieve scalable super-resolution and larger magnification. The merits of the proposed HDCN are as follows:

  1. To reduce blurry or over-smooth predictions, a robust penalty function is adopted in place of the \(L_2\) loss function.

  2. During training, a gradual training method is introduced to accelerate convergence. Training samples are divided into separate parts by calculating the average local gray value difference (ALGD) of their original and residual patches.

  3. Experimental results on benchmark datasets demonstrate that the proposed HDCN is more accurate than several state-of-the-art SR methods.

2 Related Work

Cascade network for SISR: Early neural-network-based SISR methods can only reconstruct HR images at a fixed upscaling factor [2, 3]. Changing the scale factor requires fine-tuning the whole model, which is burdensome for non-professional users and costs considerable time on their computers. To improve practicality, researchers have worked on models with flexible upscaling factors. Some scalable SISR methods, such as VDSR, exploit the observation that images at different scales share common structures and textures; images at different scales are therefore combined into one large dataset to train a multi-scale model. Other methods cascade a fixed-scale network until the desired size is reached. Wang et al. observe that a cascade of sparse coding based networks (CSCN) trained for small scaling factors performs better than a single sparse coding based network trained for a large scaling factor. This strategy is simple but carries the risk of error accumulation during repeated upscaling. Fortunately, the risk can be reduced by enhancing the performance of each fixed-scale network. For instance, Zhen et al. employ non-local self-similarity (NLSS) search and restrict the upscaling factor to a small value in their experiment. Compared with a multi-scale model trained directly, a cascaded network simplifies training and can achieve better accuracy. In addition, larger upscaling factors can be achieved by cascading more sub-models.

Residual-like learning: Before residual-like learning was introduced, NN-based SISR methods tended to use shallow networks and cautious step sizes during iteration. The exploding/vanishing gradient problem hampers deeper network structures and slows convergence. Since He et al. built a very deep residual network for image recognition and achieved impressive results, subsequent SISR methods have tended to adopt residual-like structures to improve performance. As a representative example, VDSR learns the residual image, defined as the difference between the input and output images, rather than the original ground-truth image, and substantially increases the accuracy of the reconstructed results. In terms of network structure, VDSR, inspired by the merits of VGG-net, deploys a deep, plain network. To approximate the original structure of the residual network, Yang et al. use identity mapping shortcuts as a projection to change feature dimensions. Considering the deep structure of the proposed model, residual learning is also used to improve training efficiency.

3 Experiment

3.1 Model Structure

Inspired by ‘the deeper, the better’ [6], the proposed network, shown in Fig. 1, deploys a plain structure with d layers (\(d=25\)). The first layer performs feature extraction and representation on the input patches, using 64 filters of size \(3\times 3\). The subsequent layers, except the last, learn the end-to-end mapping from an LR patch to its corresponding HR patch; each uses 64 filters of size \(3\times 3\times 64\). The last layer reconstructs the HR image with a single \(3\times 3\times 64\) filter.
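As a rough check on the size of this architecture, the following sketch lists the filter shapes layer by layer and derives the weight count and receptive field. It assumes a single-channel (luminance) input and stride-1 convolutions, neither of which is stated above; the function names are illustrative only.

```python
def hdcn_layer_shapes(depth=25, width=64, kernel=3, in_channels=1):
    """Return (out_channels, in_channels, k, k) for each conv layer.

    in_channels=1 is an assumption (luminance-only input).
    """
    shapes = [(width, in_channels, kernel, kernel)]            # feature extraction
    shapes += [(width, width, kernel, kernel)] * (depth - 2)   # LR-to-HR mapping
    shapes.append((1, width, kernel, kernel))                  # reconstruction
    return shapes


def count_weights(shapes):
    """Total number of convolution weights, biases excluded."""
    return sum(o * i * kh * kw for o, i, kh, kw in shapes)


def receptive_field(depth=25, kernel=3):
    """Receptive field of `depth` stacked stride-1 convolutions."""
    return depth * (kernel - 1) + 1


if __name__ == "__main__":
    shapes = hdcn_layer_shapes()
    print(len(shapes), count_weights(shapes), receptive_field())
```

Under these assumptions the network has 849,024 convolution weights and a receptive field of 51 pixels per side, which coincides with the \(51\times 51\) sub-image size used for training.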

Fig. 1. HDCN and cascaded HDCN structure.

A single HDCN can only upscale an image by a fixed factor s. To achieve other upscaling factors, HDCN can be cascaded as shown in Fig. 1. For practicality and flexibility, s is set to 2. Thus, \(4\times \) magnification is obtained by cascading two HDCN models, and images with upscaling factor 3 are obtained from the \(4\times \) result by bicubic interpolation. In Fig. 1, I represents the input LR image and x is the bicubic interpolation of I. We denote the residual image as \(r=y-x\), where y is the output of a single HDCN. Likewise, \(x_1\) and \(y_1\) are the input and output images of the next HDCN.
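The cascading rule can be sketched as a small planning function: it counts how many \(2\times \) stages are needed to reach or exceed a target factor, and the resize ratio that a final bicubic step would have to apply. The function name and return format are illustrative assumptions, not part of the proposed model.

```python
def plan_cascade(target, s=2):
    """Return (stages, resize) for a cascade of fixed x`s` models.

    stages: number of cascaded models needed to reach or exceed `target`.
    resize: ratio applied by a final interpolation step (1.0 means none).
    """
    if target <= 1:
        raise ValueError("target factor must be greater than 1")
    stages, reached = 0, 1
    while reached < target:
        reached *= s
        stages += 1
    return stages, target / reached


if __name__ == "__main__":
    for factor in (2, 3, 4, 8):
        print(factor, plan_cascade(factor))
```

For a target factor of 3, the plan is two stages followed by a resize of 0.75, which matches the description of deriving \(3\times \) results from the \(4\times \) output.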

3.2 Enhancement Strategies

Loss function: When a conventional NN-based SR model is used to reconstruct images, the results tend to contain blurry predictions or over-smooth edges. The primary cause is that one LR image patch may correspond to several similar HR patches, and the \(L_2\) loss function fails to capture this complex underlying relation [8]. Hence, the robust penalty function proposed by Charbonnier et al. is adopted in the proposed model in place of the \(L_2\) loss function.

The loss function is defined as (1):

$$\begin{aligned} L(x,R;f)=\frac{1}{N}\sum _{i=1}^N\sqrt{(R_i-f(x_i))^2+\xi ^2} \end{aligned}$$
(1)

In (1), \(R_i\) denotes the residual image of the i-th patch, computed as the difference between the ground-truth image and \(x_i\), and \(f(x_i)\) is the residual predicted by the network. In addition, N is the number of patches in each mini-batch and \(\xi \) is set to \(1e-3\) empirically.
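To illustrate why this penalty is more robust than \(L_2\), the sketch below evaluates both on scalar residuals. It uses the standard Charbonnier form \(\sqrt{e^2+\xi ^2}\) and treats each value as one pixel; a real implementation would operate on whole patches inside the training framework, and the function names here are illustrative.

```python
import math


def charbonnier_loss(targets, predictions, xi=1e-3):
    """Mean Charbonnier penalty sqrt(e^2 + xi^2) over paired values."""
    if len(targets) != len(predictions):
        raise ValueError("inputs must have the same length")
    total = sum(math.sqrt((t - p) ** 2 + xi ** 2)
                for t, p in zip(targets, predictions))
    return total / len(targets)


def l2_loss(targets, predictions):
    """Mean squared error, shown for comparison only."""
    if len(targets) != len(predictions):
        raise ValueError("inputs must have the same length")
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


if __name__ == "__main__":
    targets = [0.0, 0.0, 0.0]
    predictions = [0.01, 0.01, 1.0]   # the last value acts as an outlier
    print(charbonnier_loss(targets, predictions), l2_loss(targets, predictions))
```

For large errors the Charbonnier penalty grows roughly linearly rather than quadratically, so a single outlier dominates the loss less than it does under \(L_2\); for errors near zero the \(\xi \) term keeps the function differentiable.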

Gradual training: The choice of training images is a crucial factor in model performance. As the network becomes deeper, more training images are required to improve performance and avoid overfitting. However, a large training set slows convergence. Meanwhile, the convergence curve of the training process tends to fluctuate, and these small undulations indicate some instability of the model. The primary cause is the inhomogeneous distribution of edge-like patterns: patches with sharp edges are easier to learn than other kinds of patches. Considering the large number of patches (over 700000) prepared for the experiment, gradual learning is adopted to improve training efficiency. In contrast to the gradual upsampling network (GUN) proposed by Zhao et al. [14], residual learning and gradient clipping are retained to speed up convergence. In addition, edge-like patches are classified by means of the ALGD, computed as (2):

$$\begin{aligned} V_{ALGD}=\sum _{p=1}^{N_p}|G_p-\bar{G}| \end{aligned}$$
(2)

where \(N_p\) denotes the number of pixels in an image patch, \(G_p\) (\(p=1,2,\cdots , N_p\)) is the gray value of the corresponding pixel, and \(\bar{G}\) is the average gray value of that patch. The ALGD of the whole training set is also calculated and denoted as \(\overline{V_{ALGD}}\). Hence, the evaluation parameter \(\delta \) of each patch can be calculated as (3):

$$\begin{aligned} \delta =V_{ALGD}/\overline{V_{ALGD}} \end{aligned}$$
(3)
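Equations (2) and (3) can be sketched on patches stored as nested lists of gray values. The code assumes that \(\overline{V_{ALGD}}\) is the mean of \(V_{ALGD}\) over all patches in the set, which is one reading of the text; the function names are illustrative.

```python
def algd(patch):
    """Sum of absolute deviations from the patch mean, as in Eq. (2)."""
    pixels = [g for row in patch for g in row]
    mean = sum(pixels) / len(pixels)
    return sum(abs(g - mean) for g in pixels)


def evaluation_parameters(patches):
    """Return delta = V_ALGD / mean(V_ALGD) for every patch, as in Eq. (3).

    Taking the set-level value as the mean over patches is an assumption.
    """
    values = [algd(p) for p in patches]
    mean_value = sum(values) / len(values)
    if mean_value == 0:
        raise ValueError("all patches are flat; delta is undefined")
    return [v / mean_value for v in values]


if __name__ == "__main__":
    flat = [[5, 5], [5, 5]]      # no local contrast
    edge = [[0, 10], [0, 10]]    # a sharp vertical edge
    print(evaluation_parameters([flat, edge]))
```

A flat patch yields \(\delta =0\), while a patch containing a sharp edge yields a value above 1, so \(\delta \) ranks patches by how edge-like they are relative to the whole set.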

The essence of gradual learning is a learning process that moves from easy to difficult, so that the model improves gradually. Convergence is accelerated because the unstable undulations are reduced. When selecting edge-like patches, the ALGD of the residual images is also taken into consideration to ensure simplicity. We denote by \(\delta _r\) the evaluation parameter of a residual patch and by \(\delta _o\) that of the original patch. The training set is divided into three parts according to the discriminant conditions shown in Table 1. These parameters are set by referring to [14].

Fig. 2 shows several image samples. Patches with a blue border are easier for the model to learn than patches with a green border, although \(\delta _o\) is greater than 1.2 in both cases. This indicates the significance of also using the ALGD of the residual patches.

Table 1. Discriminant condition on division of training set
Fig. 2. \(\delta _o\) and \(\delta _r\) of training samples.

Images with small parameters are combined into one subset because they contribute little to improving performance. During gradual learning, the sharp-edge patches of the first part are used first to form the initial training set. After temporary convergence, we feed the second subset to the network, and it can be observed in Fig. 3 that the reconstructed image becomes clearer. Finally, the remainder with small parameters is added to fine-tune the model. However, this improvement is not perceptible to the human eye. Details are given in the next section.
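The staged schedule can be sketched as follows. The actual discriminant conditions are those of Table 1 and are not reproduced here, so the rule is passed in as a function; `placeholder_rule` and its thresholds are purely hypothetical stand-ins. The sketch also assumes that each stage keeps the samples of earlier stages, which the text does not state explicitly.

```python
def build_stages(samples, assign_part):
    """Group samples into three cumulative training sets.

    samples: iterable of (sample_id, delta_o, delta_r).
    assign_part: function (delta_o, delta_r) -> 1, 2 or 3.
    """
    parts = {1: [], 2: [], 3: []}
    for sample_id, delta_o, delta_r in samples:
        parts[assign_part(delta_o, delta_r)].append(sample_id)
    stage1 = list(parts[1])
    stage2 = stage1 + parts[2]      # assumption: earlier parts are retained
    stage3 = stage2 + parts[3]
    return [stage1, stage2, stage3]


def placeholder_rule(delta_o, delta_r, high=1.2, low=0.8):
    """Hypothetical rule; these thresholds are NOT the values of Table 1."""
    if delta_o > high and delta_r > high:
        return 1
    if delta_o > low and delta_r > low:
        return 2
    return 3


if __name__ == "__main__":
    samples = [("a", 1.5, 1.5), ("b", 1.0, 1.0), ("c", 0.3, 0.6)]
    print(build_stages(samples, placeholder_rule))
```

Keeping the rule as a parameter means the same scaffolding works once the real conditions of Table 1 are substituted.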

Fig. 3. \(2\times \) magnification results at each stage. (a) Bicubic (b) training with Part 1 (c) training with Part 2 (d) training with remainder.

4 Implementation Details

Training dataset: The original training set comprises two datasets: 91 images from Yang et al. [12] and 200 images from the Berkeley Segmentation Dataset [9]. VDSR and RFL [10] employ the identical dataset. Data augmentation is a crucial measure against overfitting. The training data is augmented in three ways. (1) Rotation: images are rotated by 90, 180 or 270 degrees. Other angles are not used, since they would require an unnecessary imresize step. (2) Flipping: images are flipped horizontally and vertically. (3) Scaling: images are randomly downscaled by a factor in [0.5, 1.0]. Limited by hardware, we apply the augmentation to Yang’s dataset and part of the Berkeley dataset. Ultimately, the training set consists of approximately 2000 images.
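The rotation and flipping steps are lossless and can be sketched on images stored as nested lists. Random downscaling is omitted because it requires an interpolation routine; the function names and the choice to return six variants per image are illustrative assumptions.

```python
def rotate90(img):
    """Rotate a 2-D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip_horizontal(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]


def flip_vertical(img):
    """Reverse the order of the rows."""
    return [list(row) for row in img[::-1]]


def augment(img):
    """Return the image, its 90/180/270 degree rotations and both flips."""
    r90 = rotate90(img)
    r180 = rotate90(r90)
    r270 = rotate90(r180)
    return [img, r90, r180, r270, flip_horizontal(img), flip_vertical(img)]


if __name__ == "__main__":
    for variant in augment([[1, 2], [3, 4]]):
        print(variant)
```

Because rotations by multiples of 90 degrees only permute pixels, no resampling is involved, which is the reason given above for excluding other angles.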

Test dataset: SISR experiments are carried out on four datasets: Set5 [1], Set14 [13], Urban100 [4] and B100. Set5 and Set14 are two benchmarks that have been used for years. Urban100 and B100 are widely adopted in recent state-of-the-art papers.

Parameter settings: As in other mature models, the momentum is set to 0.9 and the weight decay to \(1e-4\). The initial learning rate is set to 0.1 and is then decreased by a factor of 10 every 10 epochs (but not below 0.001).
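The step schedule described here can be written as a one-line function. Epochs are assumed to be counted from zero, and the function name is illustrative.

```python
def learning_rate(epoch, base=0.1, drop=10.0, step=10, floor=0.001):
    """Divide `base` by `drop` every `step` epochs, never going below `floor`."""
    return max(base / drop ** (epoch // step), floor)


if __name__ == "__main__":
    for epoch in (0, 9, 10, 19, 20, 34):
        print(epoch, learning_rate(epoch))
```

With these settings the rate is 0.1 for epochs 0 to 9, 0.01 for epochs 10 to 19, and stays at 0.001 from epoch 20 onward.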

The model converges very quickly on the first part of the training set, in approximately 3–5 epochs. We then feed the second part, which consists of over 200000 image patches, to the network and decrease the learning rate to 0.01 during training; this stage takes 12–15 epochs to converge. Finally, the remainder is used to fine-tune the model, and the learning rate drops to 0.001 after 20 epochs. The whole training process requires 30–35 epochs and takes roughly 30 h on a personal computer with a GTX 970. In addition, image patches with \(\delta _r<0.5\) are discarded because their residual images are almost completely dark. It can be observed in Fig. 4 that the model with gradual learning converges faster and performs better than the original model.

Fig. 4. Convergence analysis on the gradual learning.

The proposed model is implemented with Caffe [5]. Each mini-batch consists of 64 sub-images of size \(51\times 51\). An epoch has 7812 iterations.
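The batch size and iteration count imply how many sub-images are visited per epoch. The arithmetic below is a small sanity check; it assumes that incomplete mini-batches are dropped, which is not stated in the text, and the function names are illustrative.

```python
def patches_per_epoch(iterations=7812, batch_size=64):
    """Number of sub-images processed in one epoch."""
    return iterations * batch_size


def iterations_per_epoch(num_patches, batch_size=64):
    """Full mini-batches obtainable from `num_patches` (remainder dropped)."""
    return num_patches // batch_size


def total_iterations(epochs, iterations=7812):
    """Iterations for a whole training run of `epochs` epochs."""
    return epochs * iterations


if __name__ == "__main__":
    print(patches_per_epoch(), iterations_per_epoch(500000))
    print(total_iterations(30), total_iterations(35))
```

An epoch of 7812 iterations therefore covers 499,968 sub-images, consistent with a pool of roughly 500,000 patches, and the reported 30–35 epochs correspond to between 234,360 and 273,420 iterations.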

Table 2. Average PSNR/SSIM for \(2\times \), \(3\times \) and \(4\times \) magnification. The best and second-best performances are highlighted.

Experimental results: The proposed model is compared with six state-of-the-art SR methods: A+ [11], RFL, SRCNN, FSRCNN, DRCN [7] and VDSR.

In Figs. 5 and 6, it can be observed that the proposed model reduces blurry predictions, which supports the effectiveness of the robust loss function. In addition, Table 2 shows that the proposed model achieves higher accuracy on most of the test sets. Meanwhile, the results on B100 also indicate the existence of error accumulation during repeated upscaling. Nevertheless, the cascaded model is more flexible in reconstructing images at larger magnifications. Since the Laplacian Pyramid Networks model [8], termed LapSRN, can also reconstruct images at larger scales, the proposed model is compared with LapSRN and VDSR. In Fig. 7, both the proposed model and LapSRN reconstruct images with less blur by virtue of the robust loss function, while the proposed model performs better owing to its model structure and gradual learning strategy.
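For reference, PSNR, one of the two metrics reported in Table 2, can be computed as in the sketch below. It uses the standard definition \(10\log _{10}(\mathrm{MAX}^2/\mathrm{MSE})\) with an assumed 8-bit peak value of 255; the evaluation protocol used for Table 2 (colour channel, border cropping) is not specified here, so this is not necessarily the exact procedure behind the reported numbers.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2-D images."""
    ref = [v for row in reference for v in row]
    est = [v for row in estimate for v in row]
    if len(ref) != len(est):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(ref, est)) / len(ref)
    if mse == 0:
        return math.inf   # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    hr = [[52, 55], [61, 59]]
    sr = [[53, 55], [60, 58]]
    print(psnr(hr, sr))
```

Higher values indicate a reconstruction closer to the ground truth; identical images give an infinite PSNR.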

Fig. 5. Super-resolution results of ‘ppt3’ (Set14) with scale factor \(\times 3\).

Fig. 6. Super-resolution results of ‘img_055’ (Urban100) with scale factor \(\times 4\).

Fig. 7. Super-resolution results of ‘3096’ (B100) with scale factor \(\times 8\).

5 Conclusions and Future Perspectives

In this article, we present a high-accuracy SR method using a deep convolutional neural network. The proposed model reduces blurry predictions and reconstructs images at larger scales. With a more stable training process, the proposed model achieves higher accuracy than some state-of-the-art works. Our future work is to combine other image processing techniques, such as object recognition and optical character recognition (OCR), with the SISR model. We believe that the more useful information is involved, the better the reconstructed images will be.