1 Introduction

Image quality assessment (IQA) aims to evaluate image quality under the various types of distortion introduced during image acquisition, compression, transmission and restoration [23, 29, 34]. IQA is typically divided into subjective and objective quality assessment. The former requires human raters and is therefore time consuming; ideally, quality scores should be generated automatically while remaining consistent with human perception. Objective image quality assessment has therefore attracted great attention in research. However, objective IQA remains a challenging problem in computer vision due to the variety of image distortion types and the difficulty of understanding the visual mechanisms of human perception.

Generally, objective IQA methods can be divided into three categories: full-reference (FR) IQA, reduced-reference (RR) IQA and no-reference (NR) IQA. FR-IQA methods assume full access to the reference image and use this information to evaluate how far the distorted image has deviated from the original one. State-of-the-art FR-IQA methods include SSIM [24], MS-SSIM [26], FSIM [32], VIF [19] and GMSD [28]. RR-IQA methods, including [25] and [22], extract only partial information from the reference image to predict the quality of the target image. In most practical cases, however, the reference image is not available, which makes NR-IQA methods, which require no reference at all, very attractive in applications. Nevertheless, the lack of prior information forces NR-IQA methods to work in a different manner, making NR-IQA the most challenging of the three categories.

Early NR-IQA research focused on extracting features from distorted images [11,12,13,14, 17], based on the observation that some features of distorted images are distinguishable from those of distortion-free images. Although considerable effort has been devoted to designing such features, performance improved rather slowly, revealing the limitations of hand-crafted features. On the other hand, deep learning methods have shown great ability in many computer vision tasks [3, 7, 18, 30, 33], and they can also be applied to NR-IQA, where they are expected to achieve better performance. Deep learning methods use convolutional and pooling layers to extract features for IQA, and fully connected layers to map those features to a quality score. Because features are learned automatically from data instead of being designed manually, deep learning frameworks are expected to extract more discriminative features with higher efficiency. Not surprisingly, many deep learning based IQA works following [6] have achieved good performance.

The motivation of our method is to learn the complex relationship between visual content and perceived quality via a novel recursive convolutional neural network. It has been demonstrated in [21] that deep convolutional neural networks (CNNs) with more layers outperform shallow network architectures. Based on this, we train a deep neural network with 7 convolutional layers (including the recursive layer) and 3 pooling layers for IQA feature extraction. Since the features are learned in a data-driven manner, they are able to describe the local image changes that are relevant to quality perception.

In this paper we propose a new framework for IQA. The contributions of this work are summarized as follows. First, we propose a deep convolutional neural network that learns, from training samples, effective features for estimating image quality under different distortion types. Second, our network applies the same convolutional layer repeatedly, as many times as desired. Stacking ordinary convolutional layers introduces more parameters, while pooling layers discard too much information; since the parameters of the recursive layer are shared, the depth increases without increasing the number of parameters. Third, we employ skip connections between layers to combine coarse and fine information. Multi-scale feature extraction is the most prevalent approach in IQA, and how to fuse information from different scales is a key problem in quality assessment. The experimental results show that the proposed network is accurate compared with existing IQA methods.

2 Related Work

NR-IQA methods can be generally classified into two groups: natural scene statistics (NSS) approaches and learning based approaches. NSS approaches are based on the observation that the statistical features of an image change with the presence of distortion. These approaches first extract features from the query image; then a regression model, learned beforehand to map the features to the corresponding subjective perception scores, is used to predict the final quality score. In [14], BIQI was proposed to first estimate the distortion type and then apply a distortion-associated metric to evaluate image quality. Later, Moorthy et al. improved BIQI into DIIVINE [13] by extracting features in the wavelet domain. However, such distortion-specific methods may not generalize well, since only certain types of distortions are considered. In [17], Saad et al. proposed BLIINDS-II, which addresses this problem by combining contrast and structure information in the DCT domain. After the potential of spatial features was recognized, BRISQUE [11] was proposed to capture the statistics of locally normalized luminance coefficients. NIQE [12] works in the spatial domain as well, but fits the local features with a multivariate Gaussian (MVG) model.

In learning based approaches, image features are learned and mapped to subjective scores directly. Capturing the relevant features requires a large number of training samples. In [31], spatial features of training images are extracted to construct a codebook, and the image quality is estimated via encoding and pooling. In [27], an FR-IQA method was used to build a training database in which image patches of similar quality are clustered to evaluate the quality of the target image. In [10], a generalized regression neural network was deployed to train the IQA model. Inspired by the recent success of CNNs in classification and detection tasks, Kang [6] proposed a shallow CNN consisting of one convolutional layer with max and min pooling, taking contrast-normalized image patches as input. Gu [4] introduced a sparse autoencoder based image quality index (DIQI) for blind quality assessment. Bianco [1] estimated image quality by average-pooling the scores predicted on multiple sub-regions of the original image. However, these networks cannot make full use of the information in different layers. Hou [5] learned qualitative evaluations directly and output numerical scores for general use; since images are represented by natural scene statistics features, however, some information is lost in this method. Bosse [2] constructed a network consisting of 10 convolutional layers and 5 pooling layers for feature extraction, and 2 fully connected layers for regression; being much deeper than other networks, it introduces many more parameters. In contrast, the proposed method exploits the advantages of features from different layers while reducing the number of parameters.

Fig. 1. The framework of our recursive convolutional neural network. Features are extracted from distorted image patches by a convolutional neural network in order to generate the score of the distorted image. The dashed box represents the recursive convolution layer, whose repeated applications share the same parameters. Layers with different colors capture different information from the distorted patches.

3 Deep Neural Network for NR-IQA

The proposed network takes an RGB image as input. Given a distorted image, our goal is to obtain a quality score by estimating the mapping from images to numerical ratings. The framework of our convolutional neural network is shown in Fig. 1. We sample non-overlapping patches from a given image, and the quality score of each patch is estimated by the multi-scale network with skip connections. The score of the full-size image is calculated by averaging the patch scores.

3.1 Network Architecture for NR-IQA

The proposed network consists of 12 layers as shown in Fig. 1. The layers are organized as conv7-32, max pool, conv5-32, max pool, conv3-32, conv3-32, conv3-32, conv3-32 (four applications of one layer with shared parameters), concatenate, conv3-128, FC-512, FC-1. This makes about 34 thousand trainable parameters in the network.

The convolution layers consist of filter banks. The response of each convolution layer is given by \(f_n^{l+1} = \sum _{m}f_{m}^{l}\,*\, k_{m,n}^{l+1}\,+\, b_n^{l+1}\), where \(k_{m,n}^{l+1}\) is the convolution kernel from the m-th feature map of layer l to the n-th feature map of layer \(l+1\), \(f_{m}^{l}\) denotes the m-th feature map of layer l, and similarly for \(f_{n}^{l+1}\); the bias is added once per output map. The network uses these convolutions to learn effective feature representations. The first part of the network (conv7-32, max pooling, conv5-32, max pooling) captures the effective semantic information, which is then further refined by the recursive convolution layer. To obtain an output of the same size as the input in the recursive convolution layer, the convolutions are padded.
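To make the indexing concrete, the following NumPy sketch evaluates this response for a single layer. `conv_response` is our own illustration, not code from the paper, and it uses a literal 'same'-padded convolution, whereas deep learning libraries typically implement cross-correlation:

```python
import numpy as np
from scipy.signal import convolve2d

def conv_response(feats_in, kernels, biases):
    """f_n^{l+1} = sum_m f_m^l * k_{m,n}^{l+1} + b_n^{l+1}.
    feats_in: (M, H, W), kernels: (M, N, kh, kw), biases: (N,)."""
    M, N = kernels.shape[:2]
    out = np.zeros((N,) + feats_in.shape[1:])
    for n in range(N):
        for m in range(M):
            out[n] += convolve2d(feats_in[m], kernels[m, n], mode='same')
        out[n] += biases[n]  # bias is added once per output map
    return out
```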

Instead of traditional sigmoid or tanh neurons, all convolutional layers are activated through the rectified linear unit (ReLU) activation function:

$$\begin{aligned} g = max(0, \sum _{n}w_nf_n) \end{aligned}$$
(1)

where g, \(w_n\) and \(f_n\) denote the output of the ReLU, the weights and the output of the previous layer, respectively [15]. ReLUs enable the network to train several times faster than with tanh units. The input consists of \(32 \times 32\) image patches. All max pooling layers in the network use \(2 \times 2\) kernels. The network is trained end-to-end; the last layer is a linear regression with a single output, the quality score of the image patch. More details are given in the following subsections.
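As an illustration, the sketch below reconstructs the Sect. 3.1 layer stack in Chainer (the framework named in Sect. 3.3). The class name, kernel paddings and the exact placement of pooling are our assumptions based on the layer list; the recursion is detailed in Sect. 3.2:

```python
import chainer
import chainer.functions as F
import chainer.links as L

class RecursiveIQANet(chainer.Chain):
    """Hypothetical reconstruction of the Sect. 3.1 architecture."""
    def __init__(self):
        super(RecursiveIQANet, self).__init__()
        with self.init_scope():
            self.conv1 = L.Convolution2D(3, 32, ksize=7, pad=3)     # conv7-32
            self.conv2 = L.Convolution2D(32, 32, ksize=5, pad=2)    # conv5-32
            self.conv_r = L.Convolution2D(32, 32, ksize=3, pad=1)   # shared conv3-32
            self.conv3 = L.Convolution2D(128, 128, ksize=3, pad=1)  # conv3-128
            self.fc1 = L.Linear(None, 512)                          # FC-512
            self.fc2 = L.Linear(512, 1)                             # FC-1 (score)

    def __call__(self, x):                              # x: (B, 3, 32, 32) patches
        h = F.max_pooling_2d(F.relu(self.conv1(x)), 2)  # -> (B, 32, 16, 16)
        h = F.max_pooling_2d(F.relu(self.conv2(h)), 2)  # -> (B, 32, 8, 8) = R_0
        feats, r = [], h
        for _ in range(4):                  # recursive conv3-32, weights shared
            r = F.relu(self.conv_r(r))
            feats.append(r)
        h = F.concat(feats, axis=1)         # skip-concat: 4 x 32 = 128 channels
        h = F.relu(self.conv3(h))
        h = F.relu(self.fc1(h))
        return self.fc2(h)                  # predicted patch quality score
```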

3.2 Recursive Convolution Layer

The recursive convolution layer [8] takes the input matrix \(R_0\) (the output of the conv7-32, max pool, conv5-32 and max pool layers) and computes the output matrices \(R_1,R_2,R_3,R_4\). The same weights \(W_r\) and bias \(b_r\) are used for all operations in this step. For example, \(R_1\) is calculated by

$$\begin{aligned} R_1 = max(0, W_r *R_0+b_r) \end{aligned}$$
(2)

Similar operations are performed in the following passes. The recurrence relation is

$$\begin{aligned} R_d = max(0, W_r *R_{d-1}+b_r) \end{aligned}$$
(3)

where \(d = 1,2,3,4\). This yields four feature matrices carrying different kinds of information. The concat layer concatenates \(R_1,R_2,R_3,R_4\) through the skip structure in order to fuse coarse and fine information. As a result, the recursive convolution layer increases the depth of the network while reducing the number of parameters compared with an equally deep stack of independent layers.
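A minimal sketch isolating this recursion, under the same assumptions as the Sect. 3.1 sketch: because a single convolution link supplies \(W_r\) and \(b_r\) on every pass, the depth d can grow without adding parameters. `recursive_block` is a hypothetical helper name:

```python
import chainer.functions as F

def recursive_block(conv_r, r0, depth=4):
    """R_d = max(0, W_r * R_{d-1} + b_r) for d = 1..depth (Eq. 3).
    conv_r is one L.Convolution2D link, so W_r and b_r are reused
    at every depth; all intermediate outputs feed the skip-concat."""
    feats, r = [], r0
    for _ in range(depth):
        r = F.relu(conv_r(r))       # Eqs. (2)-(3) with shared parameters
        feats.append(r)
    return F.concat(feats, axis=1)  # R_1..R_4 -> 4 * 32 = 128 channels
```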

3.3 Training

Since training the network requires a large number of samples, we train on non-overlapping \(32 \times 32\) patches taken from the full images, which yields many patches for training. This raises one problem: ground-truth scores are available only for full images. Fortunately, the training images in our experiments have homogeneous distortions, so we assign each source image's quality score to every patch of that image. During testing, the quality score of a full-size image is calculated by averaging the predicted patch scores:

$$\begin{aligned} q = \frac{1}{N_{p}}\sum _{i=1}^{N_p}f(x_i;w) \end{aligned}$$
(4)

where \(x_i\) denotes the input patch, \(f(x_i;w)\) represents the predicted score of patch \(x_i\) under parameters w, and \(N_p\) is the number of patches sampled from the image.
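A sketch of this test-time procedure, assuming a trained `model` such as the Sect. 3.1 sketch; everything beyond averaging over non-overlapping \(32 \times 32\) patches (helper name, preprocessing) is our assumption:

```python
import numpy as np
import chainer
import chainer.functions as F

def predict_image_score(model, img, patch=32):
    """Eq. (4): q = (1/N_p) * sum_i f(x_i; w) over non-overlapping patches.
    img: float32 array of shape (3, H, W), preprocessed as the network expects."""
    _, h, w = img.shape
    xs = np.stack([img[:, i:i + patch, j:j + patch]
                   for i in range(0, h - patch + 1, patch)
                   for j in range(0, w - patch + 1, patch)])
    with chainer.using_config('train', False), chainer.no_backprop_mode():
        scores = model(xs)               # (N_p, 1) patch scores f(x_i; w)
    return float(F.mean(scores).array)   # average patch score
```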

Learning the mapping between distorted images and scores is achieved by minimizing the loss between the predicted score \(f(x_i;w)\) and the corresponding ground truth \(y_i\). We adopt an objective function similar to [6]:

$$\begin{aligned} \min _w \quad \frac{1}{N} \sum _{i=1}^{N}\Vert f(x_i;w)-y_i \Vert _{\ell _1} \end{aligned}$$
(5)

where N is the number of training patches. We optimize the regression objective using mini-batch gradient descent based on the backpropagation learning rule. We implement our model using the Chainer package.
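A hedged sketch of one training step in Chainer, which the paper names as its framework; the optimizer choice, learning rate and the `patch_minibatches` iterable are our assumptions, since the text specifies only mini-batch gradient descent with backpropagation:

```python
import chainer.functions as F
from chainer import optimizers

model = RecursiveIQANet()          # the Sect. 3.1 sketch
opt = optimizers.SGD(lr=1e-3)      # optimizer and learning rate are assumptions
opt.setup(model)

# patch_minibatches: assumed iterable of (B, 3, 32, 32) patches and (B, 1) scores
for x_batch, y_batch in patch_minibatches:
    pred = model(x_batch)
    loss = F.mean_absolute_error(pred, y_batch)  # the l1 objective of Eq. (5)
    model.cleargrads()
    loss.backward()
    opt.update()
```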

Table 1. LCC on LIVE dataset. The best two results are presented with bold face and italic fonts.

4 Experiments and Results

4.1 Datasets and Evaluation

Datasets: The image quality datasets LIVE [20], TID2013 [16] and CSIQ [9] are used in our experiments. The LIVE dataset comprises 779 distorted images derived from 29 source reference images with five different distortion types: JPEG 2000 compression (JP2K), JPEG compression (JPEG), White Gaussian Noise (WN), Gaussian blur (GBlur) and Fast Fading (FF). Each distortion type has 7-8 degradation levels. Quality ratings were collected using a single-stimulus methodology. Each image is associated with a Differential Mean Opinion Score (DMOS) in the range [0, 100], where a higher DMOS means lower image quality.

The TID2013 image quality dataset includes 3000 distorted images covering 24 different distortion types, derived from 25 reference images at 5 degradation levels each. The distortion types cover a wide range of real-world degradations, which makes TID2013 a challenging database. Each image is associated with a Mean Opinion Score (MOS) in the range [0, 9], where a lower MOS denotes worse visual quality.

The CSIQ image quality dataset contains 866 distorted images based on 30 reference images, with ratings from 35 different observers reported in the form of DMOS. After alignment and normalization the DMOS values lie in the range [0, 1], where a higher DMOS indicates lower quality.

Evaluation: The random splitting of the dataset is repeated 10 times to eliminate bias from any individual split. For each repetition we calculate the Linear Correlation Coefficient (LCC), Root Mean Square Error (RMSE) and Spearman Rank Order Correlation Coefficient (SROCC) between the predicted quality scores and the ground truth, and then average the metrics. Correlation values close to 1, or an RMSE close to 0, indicate high performance.
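These three criteria can be computed, for example, with SciPy; the helper below is our own illustration, not code from the paper:

```python
import numpy as np
from scipy import stats

def evaluate(pred, gt):
    """LCC, SROCC and RMSE between predicted and ground-truth scores."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    lcc, _ = stats.pearsonr(pred, gt)
    srocc, _ = stats.spearmanr(pred, gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    return lcc, srocc, rmse
```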

Table 2. RMSE on LIVE dataset.
Table 3. SROCC on LIVE dataset.

4.2 Consistency Experiment

In this subsection, we examine how well the proposed network corresponds to human assessment on the LIVE database. We train and test on images of all five distortions (JP2K, JPEG, WN, GBlur and FF) together without providing the distortion type. Since machine learning requires training samples, we randomly divide the images into a training set and a test set. To eliminate effects of any particular split, the random division of the dataset is repeated 10 times. The other learning-based BIQA approaches are all executed in this way.

We employ four traditional full-reference IQA methods as benchmarks: PSNR, SSIM, IFC and VIF. In addition, 10 BIQA methods are compared: (1) NSS; (2) BIQI; (3) BLIINDS-II; (4) DIIVINE; (5) SRNSS; (6) BRISQUE; (7) CORNIA; (8) DLIQA; (9) CNN; (10) SOM. All of these methods are based on machine learning and can be found in [5] and [2]. The results are obtained by training on \(90\%\) of the data and testing on the remaining \(10\%\).
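For reference, a sketch of one such repetition; splitting by reference image is a common convention on LIVE but is our assumption here, since the text only says the data are randomly divided:

```python
import numpy as np

def live_split(n_refs=29, train_frac=0.9, seed=0):
    """One random 90/10 split of the LIVE reference images; keeping all
    distorted versions of a reference on one side avoids content overlap
    (an assumption about the protocol, not stated in the paper)."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_refs)
    n_train = int(round(train_frac * n_refs))
    return order[:n_train], order[n_train:]  # train refs, test refs

splits = [live_split(seed=s) for s in range(10)]  # the 10 repetitions
```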

Tables 1, 2 and 3 show the experimental results of the different methods on the LIVE dataset for the individual distortion types. The best two results are shown in bold face and italic fonts. Our proposed network outperforms all previous NR-IQA methods. The LCC and SROCC results show that the proposed network works well on the entire database, especially on JP2K, WN and FF; the SOM method ranks second on the entire database. In particular, our RMSE is significantly lower than that of the other methods. We attribute this to the recursive convolution and the skip structure, and have reason to believe that the proposed network obtains more useful features for describing image quality.

Fig. 2. Performance of the proposed network versus the percentage of data used for training

Figure 2 shows the relationship between the percentage of data used for training and the performance of the proposed network. The random split of the LIVE dataset is repeated 10 times, each split containing training and testing data, and the average LCC, RMSE and SROCC are computed over these splits. As can be seen, the RMSE curve decreases slowly as the training set grows, and the LCC and SROCC curves follow a consistent trend. The network achieves good results even when the training set is small.

Table 4. SROCC results of the cross-dataset evaluations.

4.3 Extensibility Experiment

To evaluate generalization performance, we carry out a cross-dataset evaluation as shown in Table 4. The subsets of CSIQ and TID2013 include only the four distortion types that are shared with the LIVE dataset. Unfortunately, no results are available for the other methods on these subsets. All models are trained on the full LIVE dataset and evaluated on the subsets of CSIQ and TID2013 or on the full sets. Our network is superior to the previous state-of-the-art methods on the full datasets.

5 Conclusion

This paper develops a CNN for no-reference image quality assessment. Our approach is a deep recursive neural network that predicts image quality accurately by learning the mapping between images and their corresponding scores. The recursive convolution layer increases the depth of the network while reducing the number of parameters at the same time. Experimental results on standard IQA datasets demonstrate the method's efficiency and robustness and verify the high consistency between the designed network and human perception.