
Multimedia Tools and Applications, Volume 75, Issue 4, pp 2367–2391

A new method for evaluating the subjective image quality of photographs: dynamic reference

  • Mikko Nuutinen
  • Toni Virtanen
  • Tuomas Leisti
  • Terhi Mustonen
  • Jenni Radun
  • Jukka Häkkinen

Abstract

The Dynamic Reference (DR) method has been developed for subjective image quality experiments in which original or undistorted images are unavailable. The DR method creates reference image series from the test images. The reference images are presented to observers as a slide show prior to the evaluation of the test images. As the observers view the set of reference images, they become familiar with the overall variation in quality within the set of test images. This study compared the performance of the DR method to that of the standardized absolute category rating (ACR) and paired comparison (PC) methods. We measured the performance of each method in terms of time effort and discriminability. The results showed that the DR method is faster than the PC method and more accurate than the ACR method. The DR method is especially suitable for experiments that require highly accurate results in a short time.

Keywords

Image quality · Subjective evaluation method · Performance measure

1 Introduction

In this article, we present a comprehensive performance study of subjective image quality evaluation methods. The Dynamic Reference (DR) method is compared with the standard methods. The DR method is based on the concept of a dynamic reference, in which the test images themselves create a valid representation of the variation in quality within the set of test images. The method was developed for image quality experiments in which original or undistorted images are unavailable or otherwise unusable. The main concept of the DR method is to create a set of reference images from the test images and present it in the form of a slide show. The set of reference images provides a general idea of the range of quality and types of distortion within a set of test images.

1.1 Methods for evaluating image quality

Many methods for evaluating image quality, such as Forced Choice Paired Comparison (PC) [9], Absolute Category Rating (ACR), Double Stimulus Impairment Scale (DSIS), Double Stimulus Categorical Rating (DSCR) [10] and Quality Ruler (QR) [7], have been standardized. One of the fundamental differences between these evaluation methods is their need for reference images.

The performance of the evaluation methods can be characterized in terms of discriminability and speed. The term discriminability refers to the ability of the method to identify statistically significant differences between the test images. The term speed refers to the time effort (as a function of the number of test images) needed to conduct the evaluation test.

Mantiuk et al. [15] compared the performance of these standardized evaluation methods. The results showed that the PC method was the most accurate of the methods. The PC method displays two images side by side, and the observer’s task is to select the preferred image. If the quality differences are clear, the task is simple. However, because the number of image pairs grows quadratically with the number of images, the PC method can be time-consuming. The PC method is therefore best suited for experiments with a relatively small number of test images.

The ACR method (or single stimulus (SS) method) displays one image at a time, and the observer assesses its quality without a reference image. The quality can be categorized as excellent, good, fair, poor or bad, or scored on a continuous scale. The study [15] showed that the ACR method is the fastest method for assessing many test images. However, the ACR method can be inaccurate because observers rate the test images without any external, objective reference, relying only on their own internal reference.

DSIS and DSCR methods require undistorted versions of the test images. The DSIS method compares the reference and test images and defines the quality category of the test image. The DSCR method defines the categories of both the reference and test images.

The QR method is based on a calibrated reference image series. The observer selects from the reference image series a reference image of equal quality to that of the test image. The reference image series was prepared with a complex procedure in which high-quality photographs were blurred to produce degradation levels separated by one just noticeable difference (JND).

Different applications require different methods for evaluating quality. The main factor in selecting the method for a specific image quality evaluation application is the availability of reference images. For traditional image processing applications, such as compression level, image sharpening, noise removal and color boost optimization, the original undistorted image is often available. In these cases, the use of evaluation methods such as DSIS or DSCR is justified and even preferred. If reference images are unavailable, one should use evaluation methods such as ACR or PC.

In this article, we evaluate the performance of the DR method. The DR method has been developed for applications in which reference or undistorted images are unavailable and in which the number of images restricts the use of the PC method, such as camera benchmarking. In a camera benchmarking study, different digital cameras capture test images of typical scenes [6, 17, 18]. For example, the standard [6] proposes six typical scenes for camera benchmarking research. In a camera benchmarking study, the typical number of different cameras (test images) is from 10 to 15. Test images are various reconstructions of a scene [11, 22] that differ in dimensions such as sharpness, brightness and color [16, 21]. Moreover, image processing in cameras introduces computational enhancement operations, such as sharpening, noise removal and color boosting. As a consequence of this multi-dimensionality and the camera processing operations, no undistorted or original reproduction of a captured scene (reference image) is available. One could argue that the researcher could simply select the best picture and use it as a reference. However, that would bias all the evaluations towards the researcher’s preference, which may differ from the average opinion. Professional background can also influence the definition of good image quality [28].

1.2 Contributions

The main contribution of this study is a comprehensive performance analysis of the DR method and the standard image quality evaluation methods. The novelty of the DR method lies in its ability to form reference image sets from the test images and to present them to observers prior to the evaluation of each test image. In this article, we show that the set of reference images increases the accuracy of the quality evaluation. With the aid of the set of reference images, observers can more easily identify small differences between test images. Furthermore, observers can determine the overall variation in quality within a set of test images.

The DR method partly resembles the SAMVIQ (Subjective Assessment Method for Video Quality) method, which offers access to several samples of a video sequence [8, 19]. The observer randomly selects different video samples and can modify the score of each sample of a video sequence as desired. The fundamental difference between the DR and SAMVIQ methods is that the DR method shows reference samples at a constant speed and in a constant style prior to the evaluation of the test images. With the SAMVIQ method, observers explore the samples freely and then score them.

We selected the ACR and PC methods as references for this study because they are applicable to camera benchmarking or similar studies in which reference images are unavailable. The DSIS and DSCR methods were not used because they require a reference image. The QR and SAMVIQ methods were excluded from the study. The QR method requires a set of calibrated reference images or a new calibration process [12]; the amount of work for the calibration process is greater than that required to conduct an image quality evaluation study with the accurate PC method. The presentation style of the SAMVIQ method is designed for video samples.

The first hypothesis (H1) of the study is that the DR method is more accurate than the ACR method. This hypothesis is based on the assumption that the set of reference images incorporated in the DR method helps to ensure more consistent quality evaluation scores, because the quality scale and distortion types are known before the test images are evaluated. With the ACR method, observers must use their own internal references and quality scale when evaluating test images. This can cause unwanted variation, as these internal references are unknown to the researcher and can differ between subjects. Another benefit of the DR method over the ACR method is that subjects can feel confident in their use of the given scale, as they no longer need to reserve the far ends of the scale in case a better or worse image appears later in the experiment. This should prove especially helpful in differentiating stimuli that fall into the highest and lowest quarters of the scale.

The second hypothesis (H2) of this study is that the DR method is faster than the PC method. For the PC method, the goal is to select a higher-quality image from the image pair. The problem with the PC method is the test time, which increases rapidly when the number of test images rises.

The study presented in this article makes two contributions. The first is the performance study in which the performance of the new DR method is demonstrated in terms of accuracy and speed. The second is the subjective quality evaluation data gathered in this study, which are available to the research community at http://www.helsinki.fi/~msjnuuti/DR_data/.

This article is organized as follows: Section 2 defines the function of the DR method and explains the experimental design and measures to compare the performance of the ACR, PC and DR methods. Section 3 analyzes the results, and Section 4 discusses them.

2 Methods and experimental settings

To avoid ambiguity, a consistent naming convention for all subsequent sections is given below. In this study, we evaluate the performance of a new quality evaluation method called the DR method. A quality evaluation method describes how to display test images and possible reference images to observers as well as how to collect evaluation scores. We compared the performance of the DR method with the standardized ACR and PC methods. The performance of a quality evaluation method was defined as a combination of the time effort required to conduct the study and the ability to achieve statistically significant differences between test images. A reference image is a low- and/or high-quality image that aids observers in creating a quality scale ranging from low to high. In this study, the test images were captured with different digital cameras in different scenes, typically outdoor and indoor scenes with landscape, living room and portrait views. Differences between the test images in this study resulted from the use of different camera devices.

2.1 Dynamic reference

The fundamental goal of this study is to test whether the DR method is faster than the PC method, but still more accurate than the ACR method. The PC and ACR methods served as reference methods because both are well known and are applicable to camera benchmarking studies or similar studies in which reference images are unavailable.

The novel feature of the DR method is how the set of reference images is formed from the test images. Let \(L_{s} = \{I_{i} \mid i = 1, \dots, n\}\) be the set of test images from scene \(s\) (\(s = 1, \dots, k\)). The reference image set \(K_{s,i} = \{I_{j} \mid j = 1, \dots, n-1\}\) for test image \(I_{i}\) is formed from the test image set \(L_{s}\) so that \(K_{s,i} \cup \{I_{i}\} = L_{s}\) and \(I_{j} \neq I_{i}\).
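As an illustration only (not part of the published procedure or the VQone toolbox), a minimal MATLAB sketch of this set construction, using hypothetical variable names and placeholder image data, could be:

    n = 10;                                        % test images in one scene
    Ls = cell(1, n);                               % test image set L_s (placeholder data)
    for ii = 1:n
        Ls{ii} = uint8(randi(255, 480, 640, 3));   % stand-in for a captured test image
    end
    i = 3;                                         % image currently under evaluation
    Ksi = Ls(setdiff(1:n, i));                     % reference set K_{s,i}: the other n-1 images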

The DR method is conducted using two displays or two adjacent windows on a single display setup. One display (window) shows reference image set K s, i as a slide show, and the other display shows test image I i . Figure 1 shows an example in which a set of reference images is provided on Display 1 and a test image is provided on Display 2.
Fig. 1

DR method presents a reference image set on one display and the test image on the other

Hypothesis H1 of this study was that the reference image set \(K_{s,i}\) improves accuracy in determining differences between test images. The drawback of the reference image sets is that displaying them lengthens the test duration. The average test durations of the DR, ACR and PC methods, \(T_{DR}\), \(T_{ACR}\) and \(T_{PC}\), can be calculated as:
$$T_{DR}=k*n*(\bar{t}_{R}+t_{DR}*(n-1)) $$
(1)
$$T_{ACR}=k*n*\bar{t}_{R} $$
(2)
$$T_{PC}=0.5*k*\bar{t}_{R}(n(n-1)) $$
(3)
where \(k\) is the number of scenes, \(n\) is the number of test images captured in one scene, and \(\bar{t}_{R}\) is the average observer response time, where \(\bar{t}_{R}=\frac{1}{k*n}{\sum}_{s=1}^{k}{\sum}_{i=1}^{n} t_{R,s,i}\) and \(t_{R,s,i}\) is the average response time for test image \(I_{i}\) from scene \(s\). The average test time of the DR method depends on the components \(\bar{t}_{R}\) and \(t_{DR}\). Component \(t_{DR}\) is the display time of one image in a set of reference images. Figure 1 displays the components \(\bar{t}_{R}\) and \(t_{DR}\).
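As a concrete illustration (a sketch only), Eqs. 1–3 can be evaluated with the per-method average response times reported later in Section 3.3; the DR50 display time of 0.50 s is used here as the example:

    k = 6;  n = 10;                             % scenes and test images per scene
    t_R_PC  = 5.9;                              % average response time per image pair (s)
    t_R_ACR = 10.2;                             % average response time per image, ACR (s)
    t_R_DR  = 14.3;  t_DR = 0.50;               % DR50 response and reference display times (s)
    T_PC  = 0.5 * k * t_R_PC * n * (n - 1);     % Eq. 3
    T_ACR = k * n * t_R_ACR;                    % Eq. 2
    T_DR  = k * n * (t_R_DR + t_DR * (n - 1));  % Eq. 1
    fprintf('T_PC = %.1f, T_ACR = %.1f, T_DR50 = %.1f minutes\n', ...
            T_PC/60, T_ACR/60, T_DR/60);

These estimates (26.6, 10.2 and 18.8 minutes) land close to the measured average durations reported in Table 1 (26.4, 10.2 and 18.6 minutes).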

In this study, we tested three DR setups. The \(t_{DR}\) parameter was set to 0.25 s, 0.50 s or 0.75 s. These parameter values were selected based on pre-tests. The test time of the DR setup with the longest reference image display time (\(t_{DR} = 0.75\) s) was shorter than that of the PC method when the number of test images was 10. If the reference image display time is too short, we can expect observers not to notice differences in quality and types of distortion between test images. If the display time is too long, \(T_{DR}\) approaches \(T_{PC}\); the DR method then cannot be justified because the PC method is presumably more accurate and is thus preferable.

2.2 Observers and procedure

A total of 75 observers (41 women and 34 men) participated in this experiment. In addition, we conducted a supplementary experiment that tested whether the number of test images affects the accuracy of the DR method. The setup and results of the supplementary experiment are presented in Section 3.4.

Observers were recruited from university student mailing lists. The age of the observers ranged from 20 to 40 years (mean 24.22, SD 4.28). Prior to testing, the observers confirmed that they did not work personally or professionally with image processing or image quality evaluation. All observers had normal or corrected-to-normal vision. Near visual acuity, near contrast sensitivity (functional acuity contrast test, F.A.C.T.) and color vision (Farnsworth's dichotomous test for color blindness, D-15) were screened before the experiment. All observers received a movie ticket as compensation for their participation.

The observers were divided into five groups of 15 observers each. The groups evaluated the test images using the DR25, DR50, DR75, ACR or PC setup. The \(t_{DR}\) parameters of the DR25, DR50 and DR75 setups were 0.25 s, 0.50 s and 0.75 s, respectively; this was the time each reference image was visible at a time.

The experiments were performed in a dark room with controlled lighting directed towards a wall behind the displays, which produced an ambient illumination of 20 lux to avoid flare. The setups included two colorimetrically calibrated 24″ 1920 × 1200 displays (Eizo Color Edge CG210). The color calibration assumed that the source material was encoded in the sRGB color space. The luminance level was set to 80 cd/m\(^{2}\), the white point to 6500 K and gamma to 2.2. The experiments were conducted with the VQone MATLAB toolbox, which is available to the research community.

The example application area of this study was camera benchmarking. In a camera benchmarking study, the scenes from which test images are captured should represent environments in which consumers typically shoot photos, ranging from dark indoor to bright outdoor conditions. The choice of the scenes in this study was based on the photospace approach described in [6]. The photospace describes six typical combinations of shooting distance and illuminance level from which photos are captured. According to our long-term experience in camera benchmarking research, the typical number of cameras benchmarked at the same time is between 10 and 15. Based on these requirements, we selected 10 test images captured by different mobile phone or compact cameras from each of six scenes (n = 10, k = 6). For the supplementary experiment, presented in Section 3.4, we selected 15 test images from each of three scenes (n = 15, k = 3).

Test images (DR and ACR methods) or test image pairs (PC method) were presented in random order, one scene at a time, to each observer. Figure 2 shows low- and high-quality example images. Figure 12 shows thumbnails of all the test images. The test images and the subjective data collected in this study are publicly available to the research community.
Fig. 2

Test images were captured from six scenes; high-quality (top) and low-quality (bottom) example images

The observers were asked to read a briefing form, which explained the experiment to them. Before the actual experiment began, the observers received a short demo and completed a short practice session. For the DR and ACR methods, the observers used a continuous quality scale. They used a graphical slider (0–100) to evaluate the general quality of the test images. The observers did not know how many steps the slider had, in order to avoid any tendency to favor certain numbers (e.g., even tens and quartiles). Furthermore, a continuous quality scale is more justifiable than category ranking, because a scale with categorized attributes can introduce bias [27]. The subjective distance between bad and poor, for example, differs from the distance between good and excellent. This results in a situation in which only ordinal data are gathered, which greatly limits the possibility of using more powerful statistical analyses; even calculating an average from an ordinal scale is questionable. When only the extremities are labeled, the scale is more continuous and represents an interval scale.

Table 1 lists the quality evaluation methods, abbreviations and observer information. The average test durations were 10.2, 15.2, 18.6, 23.2 and 26.4 minutes for the ACR, DR25, DR50, DR75 and PC methods, respectively. The number of observers (m = 15) and the duration of the experiments (t < 30 minutes) meet the standard recommendations. The standard [9] recommends that an experiment last up to half an hour. The standards [9, 10] recommend that at least 15 observers participate in the experiment. Furthermore, Winkler [29] determined that with 10–15 observers the standard deviation estimate approaches its final value.
Table 1 Methods for evaluating image quality

  Method                            t_DR     Average test duration   SD test duration   Participants
  DR25 (Dynamic reference)          0.25 s   15.2 min                3.1 min            15 (9 women, 6 men)
  DR50 (Dynamic reference)          0.50 s   18.6 min                2.9 min            15 (8 women, 7 men)
  DR75 (Dynamic reference)          0.75 s   23.2 min                4.3 min            15 (8 women, 7 men)
  ACR (Absolute Category Rating)    –        10.2 min                1.7 min            15 (8 women, 7 men)
  PC (Paired Comparison)            –        26.4 min                5.3 min            15 (8 women, 7 men)

2.3 Analysis

The main goal of this study was to compare the performance of the DR method with the performance of the ACR and PC methods. The standards for evaluating image quality describe experimental procedures, viewing conditions, display calibration parameters and data processing methods, but lack measures for comparing methods. However, studies [3, 13, 15, 23, 26] applied metrics such as ease of evaluation, experiment duration, effect size, mean confidence interval and repeatability when evaluating and comparing methods.

Redi et al. [23] used confidence intervals and root mean square error (RMSE) values as performance measures to compare methods of evaluating quality. Narrower confidence intervals indicated higher consistency between observers. Smaller RMSE values between repeated experiments indicated higher repeatability of the method. Kuang et al. [13] and Kawano et al. [26] computed correlation values between results to compare methods of evaluating quality. Blin [3] and Kawano et al. [26] used the standard deviation as a performance measure to compare evaluation methods. The stability of the methods was related to the standard deviation values as a function of the number of observers and the duration of the experiment [26].

Mantiuk et al. [15] used the metrics of effect size and time effort to compare the methods of evaluating quality. An effect size metric served as a measure of method accuracy. Time effort was measured by computing the time required to compare the given number of test images.

In this study, we used simple descriptive statistics, such as average, standard deviation and correlation values, and three discriminability metrics to evaluate the performance of the ACR, DR and PC methods. The discriminability metrics measured the sensitivity of the methods in finding differences between the test images. The descriptive statistics expressed latent reasons for differences in performance. Moreover, time effort was computed to show the time required to compare the given number of test images. Table 2 lists descriptions of the metrics used in this study.
Table 2 We used descriptive statistics and discriminability metrics to compare methods of evaluating image quality

  Metric                                            Description
  Average MOS                                       A high or low value may indicate an imbalance in evaluations of quality
  Standard deviation                                A high value indicates that evaluations of quality are spread out over a large range of values
  Rank correlation                                  A low value indicates lower similarity with the ground truth
  Histogram of overlapping CI                       The latent reasons for a low value are a low standard deviation and dispersed MOS data → higher discriminability
  Effect size                                       The latent reasons for a high value are a low standard deviation and long distances between neighboring data points → higher discriminability
  Number of statistically significant differences   A high value indicates the statistical power to differentiate test images → higher discriminability

The mean opinion score (MOS) over \(m\) observers was calculated for test image \(I_{i}\) from scene \(s\) with the following equation:
$$MOS_{s,i}=\frac{1}{m}\sum\limits_{j=1}^{m}S_{s,i,j} $$
(4)
where \(S_{s,i,j}\) is the evaluation value of observer \(j\). A low or high average MOS for all test images can occur for two reasons: the observers used the quality scale in an imbalanced manner, or the test image set includes more low-quality images than high-quality images (or vice versa).
The standard deviation \(\sigma_{s,i}\) was calculated for image \(I_{i}\) with the following equation:
$$\sigma_{s,i}=\sqrt{\frac{1}{m}\sum\limits_{j=1}^{m}(S_{s,i,j}-MOS_{s,i})^{2}} $$
(5)
A low standard deviation indicates that the observers' scores are close to the MOS. A high standard deviation indicates that the scores are spread out over a large range of values. The standard deviation depends on the observers, the method for evaluating image quality, and the set of test images; it measures how widely the observers' answers are distributed around the average. When controlling for the effects of the image set, image quality and observers, the standard deviation can serve to compare different evaluation methods. The wider the distribution, the less power the evaluation method has to differentiate the stimuli.
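As a minimal sketch with hypothetical example scores, Eqs. 4 and 5, together with the 95 % confidence intervals used later, amount to:

    m = 15;  n = 10;                     % observers and test images for one scene
    S = randi([0 100], m, n);            % hypothetical scores S_{s,i,j} on the 0-100 slider
    MOS   = mean(S, 1);                  % Eq. 4: mean opinion score per image
    sigma = std(S, 1, 1);                % Eq. 5: standard deviation with 1/m normalization
    CI95  = 1.96 * sigma / sqrt(m);      % 95 % confidence interval half-width per image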

Some test images are more difficult to evaluate and evoke various opinions. For this study, we conducted a pre-test to guarantee that the quality levels in our test image sets vary smoothly and that the test images can be ordered in terms of image quality. In the pre-test we used the PC method to rank 14–15 test images from different scenes in order of quality; from these, we selected 10 test images in terms of discriminability. We wanted to confirm that if the standard deviation is high or the average MOS is low or high, the latent reason is not the set of test images, but is instead the method for evaluating quality.

The PC method provides ranking data on the image pairs, namely, how many times image A is of higher quality than image B. The ranking data are transformed into matrices. Matrix cell \(C_{a,b}\) indicates how many times the row test image \(a\) was of higher quality than the column test image \(b\). The cells for scene \(s\) are then transformed into probability values with the following equation:
$$ p_{s,a,b}=C_{s,a,b}/m $$
(6)
where \(m\) is the number of observers. In this study, the overall probability \(P_{s,a}\) that image \(a\) is of higher quality than the other images in scene \(s\) was obtained by averaging the \(p\) values of each row.

We assume that the PC data are the ground truth of quality order, as all images are compared against each other. If the quality between the test images differs, the PC method indicates this difference. The similarity metric between the ranking data (gathered with the PC method) and MOS data (gathered with the ACR or DR method) was Spearman’s rank correlation coefficient ρ (MATLAB corr-function). A higher correlation value indicates a greater similarity with the PC data (the ground truth).
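A sketch of this step for one scene, with a hypothetical count matrix and hypothetical MOS values (corr is the MATLAB corr function from the Statistics Toolbox; averaging each row over its n-1 opponents is one reading of the row average used above):

    m = 15;  n = 10;
    C = zeros(n);                                % C(a,b): times image a preferred over b
    for a = 1:n
        for b = a+1:n
            C(a,b) = randi([0 m]);               % hypothetical counts
            C(b,a) = m - C(a,b);                 % complementary count for the reverse pair
        end
    end
    p = C / m;                                   % Eq. 6: pairwise preference probabilities
    P = sum(p, 2) / (n - 1);                     % overall probability P_{s,a}, row average
    MOS = 100 * rand(n, 1);                      % hypothetical MOS from the ACR or DR method
    rho = corr(P, MOS, 'type', 'Spearman');      % Spearman rank correlation with the PC data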

The discriminability metrics used in this study were based on the histograms of overlapping 95 % confidence intervals (CI), effect size and the number of significant differences. The number of significant differences between image pairs was based on the linear mixed models (ACR and DR methods) and Chi-square tests (PC method).

The metric of the histograms of overlapping CI, \(D_{HO}\), was derived from a metric proposed by Winkler [30]. Here, the 95 % confidence interval for image \(I_{i}\) was computed as \(1.96 \frac{\sigma_{s,i}}{\sqrt{m}}\), where \(m\) is the number of observers. We created a histogram in which all bins within the range \(MOS \pm CI\) were incremented by one for each test image. Figures 3 and 4 show examples of forming histograms of overlapping CI. The left sides of the figures show MOS values with CI whiskers. The right sides of the figures show histograms in which all bins within the range \(MOS \pm CI\) are incremented by one. The metric value was computed with the equation:
$$ D_{HO}=\sum\limits_{c=1}^{N}bin_{c}(c) $$
(7)
$$bin_{c}(c) = \left\{ \begin{array}{l l} (bin(c))^{2}-1 & \quad \text{if \(bin(c)>0\)}\\ 0 & \quad \text{if \(bin(c) = 0\)} \end{array} \right.$$
where \(bin(c)\) (\(c = 1, \dots, N\)) is the bin value of the histogram of overlapping and \(N\) is the number of bins. A low \(D_{HO}\) value indicates higher discriminability, i.e. a higher probability of finding statistically significant differences between the test images. For example, the \(D_{HO}\) values for the examples shown in Figs. 3 and 4 are 0 and 15, respectively. Thus, Fig. 3 corresponds to the higher discriminability.
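A sketch of the \(D_{HO}\) computation (Eq. 7), assuming integer bins over the 0–100 quality scale and reusing the example values of Fig. 4 (the sketch reproduces the value 15 quoted above):

    MOS = [2 7 11 15 19 22 28];          % example MOS values from Fig. 4
    CI  = 2 * ones(size(MOS));           % constant CI half-width of 2
    edges = 0:100;                       % integer bins on the quality scale (assumption)
    bins = zeros(size(edges));
    for i = 1:numel(MOS)
        inRange = edges >= MOS(i) - CI(i) & edges <= MOS(i) + CI(i);
        bins(inRange) = bins(inRange) + 1;   % increment all bins within MOS +/- CI
    end
    D_HO = sum(max(bins.^2 - 1, 0));     % Eq. 7: (bin^2 - 1) for non-empty bins, 0 otherwise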
Fig. 3

Left side: MOS values \(MOS = 2, 7, 11, 15, 19, 22, 28\) and constant CI whiskers \(CI_{95} = 1\); right side: histogram of overlapping

Fig. 4

Left side: MOS values \(MOS = 2, 7, 11, 15, 19, 22, 28\) and constant CI whiskers \(CI_{95} = 2\); right side: histogram of overlapping

The effect size metric, \(D_{ES}\), was derived from a metric used in [15]. We computed \(D_{ES}\) by sorting the MOS values of the scene-specific test images \(I_{i}\) in ascending order into vectors \(L_{s}\). The effect size was calculated as the difference between sequential vector values divided by the standard deviation, \(\sigma\), of the data. Let \(L_{s} = \{MOS_{s,i} \mid i = 1, \dots, n\}\) and
$$D_{ES,s}=\frac{1}{n-1} \sum\limits_{i=1}^{n-1}\frac{|L_{s,i}-L_{s,i+1}|}{\sqrt{\frac{2*(m-1)*(\sigma_{s,i}+\sigma_{s,i+1})}{2*(m-1)}}} $$
(8)
where \(m\) is the number of observers and \(n\) is the number of test images. A larger \(D_{ES}\) value corresponds to higher statistical power; thus, the probability of finding statistically significant differences between test images is also higher.
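A sketch of Eq. 8 for one scene, with hypothetical MOS and standard deviation values; the denominator is written exactly as printed, with the 2(m-1) factors cancelling:

    m = 15;                                   % number of observers
    MOS   = [22 35 41 48 55 61 67 72 80 88];  % hypothetical scene-specific MOS values
    sigma = 15 * ones(size(MOS));             % hypothetical per-image standard deviations
    [L, order] = sort(MOS, 'ascend');         % sort MOS values in ascending order
    sig = sigma(order);
    D_ES = 0;
    for i = 1:numel(L) - 1
        pooled = sqrt((2*(m-1) * (sig(i) + sig(i+1))) / (2*(m-1)));  % denominator of Eq. 8
        D_ES = D_ES + abs(L(i) - L(i+1)) / pooled;
    end
    D_ES = D_ES / (numel(L) - 1);             % average over the n-1 sequential differences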

The third discriminability metric, \(D_{SD}\), was the number of statistically significant differences. We applied the Linear Mixed Models method (IBM SPSS Statistics 21) to analyze the MOS data gathered with the DR and ACR methods. Because the pre-tests maximized the smoothness of the quality distributions of the test image sets, the MOS distributions were more uniform than normal. Linear Mixed Models can handle non-normal data more effectively than standard methods such as ANOVA or MANOVA [1]. Moreover, we used a Heterogeneous Compound Symmetry (HCS) covariance matrix in our Linear Mixed Models. HCS does not assume similar variance between variables (test images), which fits our data well. Subjective preference data (from the camera benchmarking study) are highly dependent on the test images. The variance values depend on the test image type and distortions: variance can be low for one image but high for another image in the same set of test images, because in a camera benchmarking study different camera devices and scenes can vary considerably.

Pairwise comparisons of test images were performed using the Bonferroni correction (IBM SPSS Statistics 21), which reduces the likelihood of obtaining false-positive results when multiple pairwise tests are performed on a single set of data [2]. The new level of significance is computed as a′ = a/t, where a is the significance level and t is the number of tests. In our case we had 45 hypotheses to test (image pairs) and a significance level of 0.05. Because of the Bonferroni correction, image pairs differed at a significant level if the p-values were less than 0.05/45.
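As a small sketch of this correction for the 45 pairwise comparisons of one scene:

    alpha  = 0.05;                   % overall significance level
    nTests = 0.5 * 10 * (10 - 1);    % 45 pairwise image comparisons per scene
    alphaCorrected = alpha / nTests; % a' = a/t = 0.05/45, roughly 0.0011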

The Chi-square test [5] served to analyze the data obtained with the PC method and to determine the ground-truth values of the metric \(D_{SD}\) within the set of test images. When analyzing forced-choice paired comparison data, we used two classes in which the left or right image is of higher quality (degrees of freedom \(df = 2 - 1 = 1\) and, at the 95 % confidence level, \(X^{2} = 3.841\)), and the expected counts are \(E_{a} = m/2\) and \(E_{b} = m/2\), where \(m\) is the number of observers. The threshold values of the observed counts \(O\) for statistically significant differences can be derived from the equation
$$X^{2}=\frac{(O_{a}-E_{a})^{2}}{E_{a}}+\frac{(O_{b}-E_{b})^{2}}{E_{b}}. $$
(9)
In our experiment, we had 15 observers (m = 15) and assumed a statistically significant difference between images A and B if one of the images was preferred 73 % of the time, i.e. if 11 of the 15 observers preferred image A (or image B).
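A sketch of this test for one hypothetical image pair (the preference counts below are example values, not data from this study):

    m   = 15;                      % number of observers
    O_a = 12;  O_b = m - O_a;      % example observed preference counts for images A and B
    E_a = m / 2;  E_b = m / 2;     % expected counts under "no preference"
    X2  = (O_a - E_a)^2 / E_a + (O_b - E_b)^2 / E_b;   % Eq. 9
    significant = X2 > 3.841;      % chi-square critical value, df = 1, 95 % confidence level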

3 Results

3.1 Descriptive statistics

This sub-section presents and analyzes the results of the descriptive statistics (average MOS, standard deviation and correlation values).

Figure 5 shows the MOS value histograms for the DR25, DR50, DR75 and ACR methods. The skewness values of the data (MATLAB skewness function) were -0.02, -0.01, -0.01 and 0.24 for the DR25, DR50, DR75 and ACR methods, respectively. For the DR methods, the MOS values were distributed smoothly over the scale (except for the start and end peaks of the histograms) and the skewness values were close to zero. For the ACR method, the MOS values clustered toward the lower end of the scale, and the skewness value was positive. With the ACR method, the observers gave the test images more low-quality values than high-quality values.
Fig. 5

Histograms of MOS values for DR25, DR50, DR75 and ACR methods

Table 3 lists the average MOS values computed over all test images. The average MOS values from the DR methods were close to the mean of the scale (M O S = 50). For the ACR method, the observers gave more weight to small values, and the average MOS was lower than the mean of the scale. This result reveals an imbalance in the evaluations of quality with the ACR method. The observers most likely anticipated that the quality of forthcoming images would be higher than the images already evaluated, so they left room on the right side of the scale. For the DR methods, the observers most likely had an idea of the quality differences between the test images prior to the evaluations, and therefore used the entire scale.
Table 3 Descriptive statistics

  Method   Average MOS   Average SD   Rank correlation with the PC method
  ACR      43.6          16.9         0.89
  DR25     48.4          15.0         0.96
  DR50     49.4          15.4         0.96
  DR75     49.9          14.4         0.97

Table 3 shows the average standard deviation values over all test images for the ACR, DR25, DR50 and DR75 methods. Figure 6 presents the fitted polynomial functions (\(y = ax^{2} + bx + c\)) for the quality evaluation methods, where \(y\) is the standard deviation and \(x\) is the MOS. The average standard deviations and the fitted functions show higher values with the ACR method than with the DR methods. Based on these results, the reference image series of the DR method reduces unwanted noise in evaluations of quality. With the DR method, the observers were more unanimous in their quality decisions than with the ACR method.
Fig. 6

Fitted polynomials of the standard deviation values as a function of MOS

As observed in Fig. 6, the standard deviation values decrease as they approach the low or high values of the MOS scales. The inverted U-shape is common in studies of subjective quality evaluation [29]. One reason for this shape is the clipping of the ratings at the far ends of the scale [29]. The higher standard deviation values in the middle of the scale could also be interpreted to mean that it is easier to evaluate images that are either very poor or very good, which would result in less variation at the far ends [4].

Figure 7 shows the MOS values for the ACR, DR25, DR50 and DR75 methods as a function of the PC values. The PC values were calculated with Eq. 6. Figure 7 shows that the different versions of the DR methods correlate strongly and linearly with the PC method. The ACR method does not correlate as well with the PC method as the DR methods do. Table 3 lists the rank correlation values between the PC and the ACR, DR25, DR50 and DR75 methods. The rank correlation values show that the values of the DR method predict the values of the PC method well.
Fig. 7

The MOS values of the ACR, DR25, DR50 and DR75 methods as a function of the PC values

Based on a summary of the descriptive statistics, we can conclude that the series of reference images in the DR method reduces randomness in the answers (lower standard deviations) and spreads the MOS values more smoothly on the quality scale (smoother MOS histograms). Furthermore, the DR method correlates more closely with the values of the ground truth (PC method) than the ACR method does.

3.2 Discriminability

This sub-section presents and analyzes the values of the discriminability metrics: histogram overlapping (\(D_{HO}\)), effect size (\(D_{ES}\)) and the number of statistically significant differences (\(D_{SD}\)).

Figure 8 shows the histograms of overlapping confidence intervals for the DR25, DR50, DR75 and ACR methods. The histogram values of the ACR method are higher than those of the DR methods, especially in the quality range from 30 to 60. The latent reason for the higher values can be derived from the average MOS and standard deviation values. The MOS values of the ACR method were clustered in the lower end of the scale, and the average MOS value was lower than the scale average. Furthermore, for the ACR method, the standard deviation values were high, which led to higher CI values. Table 4 lists the \(D_{HO}\) metric values for the ACR and DR methods calculated with Eq. 7. The value of the ACR method is higher than the values of the DR methods, as Fig. 8 suggests. A smaller \(D_{HO}\) value indicates a higher probability of finding statistically significant differences between a pair of images.
Fig. 8

The histogram of overlapping confidence intervals for the data obtained with the ACR, DR25, DR50 and DR75 methods

Table 4 Discriminability statistics

  Method   Histogram overlapping (D_HO)   Effect size (D_ES)   Number of statistically significant differences (D_SD)
  ACR      13145                          2.01                 23
  DR25     10274                          2.37                 29
  DR50     10300                          2.40                 29
  DR75     9063                           2.53                 30
  PC       –                              –                    39

Table 4 shows the \(D_{ES}\) metric values. The \(D_{ES}\) value of the ACR method is lower than those of the DR methods. A larger effect size indicates higher statistical power; the probability of finding statistically significant differences between a pair of images is thus higher.

The third discriminability metric was the number of statistically significant image pairs. The metric values for the ACR and DR methods were calculated using the Linear Mixed Models with Bonferroni correction. The PC method provides ranking data, and the number of statistically significant differences was calculated using the Chi-square test.

In this study, the number of test images for the different scenes was 10. The different scenes were evaluated individually. The maximum number of statistically significant differences for the 10 test images was calculated as 0.5(10 ∗ (10 − 1)) = 45. Table 4 shows the average number of statistically significant image pairs over six scenes for the ACR, DR25, DR50, DR75 and PC methods. The ACR method differentiated fewer image pairs than the DR methods did. Figure 9 shows the numbers of statistically significant image pairs for different scenes. The PC method differentiated more image pairs than did the DR methods, regardless of the image scene. Moreover, the ACR method differentiated fewer image pairs than did the DR methods, regardless of the image scene.
Fig. 9

The numbers indicate how many of the possible image pair comparisons revealed a statistically significant difference

Based on a summary of the discriminability metrics, we conclude that the DR method is more sensitive in identifying differences between image pairs than the ACR method. Moreover, according to the discriminability metrics, the reference image display time parameter \(t_{DR}\) only slightly affects the sensitivity of the DR method. The performance of the DR75 method was only slightly more accurate than that of the DR25 or DR50 methods. The difference between DR25, DR50 and DR75 is the time effort; the DR25 method is the fastest.

3.3 Time effort

Figure 10 shows an approximation of the test duration as a function of the number of test images. The values are estimated from the average observer response times (\(\bar{t}_{R}\)) using Eqs. 1–3. The values of \(\bar{t}_{R}\) were 5.9, 10.2, 12.9, 14.3 and 16.5 seconds for the PC, ACR, DR25, DR50 and DR75 methods, respectively. According to the data, the value of \(\bar{t}_{R}\) increased as a function of \(t_{DR}\); the observers seemed to respond somewhat more slowly when \(t_{DR}\) was higher.
Fig. 10

The time required to compare a given number of test images

For the ACR method, the approximated test duration increases as a linear function of the number of test images. For the DR and PC methods, the growth is quadratic rather than linear. In particular, for the PC method, the test duration grows rapidly as the number of test images increases. If the test duration is restricted to 30 minutes, as the standard recommends [9], and only one image set is tested, then the theoretical maximum number of test images with the PC method according to this approximation is 24. For the ACR, DR25, DR50 and DR75 methods, the theoretical maximum numbers of test images are 177, 63, 48 and 40, respectively.

It should, however, be pointed out that the approximations of the maximum numbers of test images presented in Fig. 10 are derived from test data in which the number of test images was 10. In addition, the realistic maximum number of test images with the DR method is less than 63: for example, a 15-second slide show of about 60 reference images before evaluating each image is too long. Nevertheless, the typical number of cameras (test images) in a camera benchmarking study, which is the application example of this study, is between 10 and 15. From the application point of view, the possible number of image sets is an even more interesting question than the theoretical maximum number of test images. Figure 11 shows a more realistic approximation of the test duration as a function of the number of image sets (scenes) when the number of test images per set is 10. If the test duration is restricted to 30 minutes and image sets with 10 images are used, then the maximum number of image sets with the PC method is 6 and the maximum number of test images is 60 (6 scenes × 10 test images). For the ACR, DR25, DR50 and DR75 methods, the maximum numbers of image sets are 17, 11, 9 and 7, respectively.
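A sketch of this budget calculation, using the measured response times from Section 3.3 and a 30-minute limit; the rounded results match the set counts given above:

    n = 10;  limitSec = 30 * 60;                  % images per set, time budget in seconds
    t_R  = [10.2 12.9 14.3 16.5 5.9];             % ACR, DR25, DR50, DR75, PC response times (s)
    t_DR = [0    0.25 0.50 0.75 0  ];             % reference display times (s); 0 where not used
    perSet = [n * (t_R(1:4) + t_DR(1:4) * (n - 1)), ...   % ACR and DR setups (Eqs. 1-2)
              0.5 * t_R(5) * n * (n - 1)];                % PC: all pairs within one set (Eq. 3)
    maxSets = floor(limitSec ./ perSet);          % approx. [17 11 9 7 6] image sets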
Fig. 11

The time required to compare a given number of test images if observers evaluate sets of ten test images at a time

3.4 Supplementary experiment

A total of 32 observers (23 women and 9 men) participated in this supplementary experiment, in which we used image sets of 15 images instead of 10 images. We aimed to study whether the same \(t_{DR}\) can be used with a higher number of test images while achieving the same level of accuracy. The observers were divided into two groups of 16 observers each. The groups evaluated the test images using the DR50 or PC method. The viewing environment was the same as described in Section 2.2.

In the experiment, the observers evaluated images from three scenes (n = 15, k = 3). The rank correlation between the PC and DR50 methods when n = 15 was 0.98. This result shows that the values of the DR50 method (when n = 15) predict the values of the PC method well.

The maximum number of statistically significant differences for the 15 test images is calculated as 0.5(15 ∗ (15 − 1)) = 105. Table 5 shows the number of statistically significant image pairs for the DR50 and PC methods when n = 15. According to the results, the accuracy of the DR50 with n = 15 was close to its accuracy with n = 10: 74.4 % (29/39 image pairs) when n = 10 and 71.9 % when n = 15. The test duration of the DR50 was 54 % of that of the PC method (16.3/30.1 minutes, 3 ∗ 15 test images) when n = 15; when n = 10, the test duration of the DR50 was 70 % of that of the PC method.
Table 5 The numbers indicate how many of the possible image pair comparisons revealed a statistically significant difference for the DR50 and PC methods when n = 15

  Method   Scene 1   Scene 2   Scene 3   Average
  PC       91/105    97/105    97/105    95/105
  DR50     75/105    59/105    71/105    68.3/105

The results in this section show that the DR method is the best option when the number of test images is 10–15 and the test setup includes many image sets, as recommended in [6]. However, it should be pointed out that the PC or ACR methods can be better options than the DR method if the number of test images is low (e.g., n = 5) or high (e.g., n = 30).

4 Discussion

4.1 Hypotheses revisited

The first hypothesis of the study (H1) was that the DR method is more accurate than the ACR method. The main metric of accuracy in this study was the number of significantly different image pairs. The results show that the DR method differentiated more image pairs than the ACR method did. Assuming that the PC method identifies differences between the images (if present), we can calculate the accuracy of the ACR and DR methods as a percentage of the ground truth. On average, the PC method differentiated 39 of 45 image pairs. Thus, the accuracy of the ACR method was 60.0 % (23/39 image pairs), the accuracy of the DR25 and DR50 methods was 74.4 % (29/39 image pairs), and the accuracy of the DR75 method was 76.9 % (30/39 image pairs). Furthermore, the rank correlations between the methods and the supplementary experiment (Section 3.4) support hypothesis H1. The rank correlation values of the DR methods were higher than those of the ACR method. The MOS values gathered with the ACR method showed more randomness than the values gathered with the DR methods. The supplementary experiment shows that the accuracy of the DR50 method was at the same level for n = 10 and n = 15.

The second hypothesis of the study (H2) was that the DR method is faster than the PC method. Speed was measured as the average test duration when 15 observers evaluated six sets of ten test images. The results showed that the test duration of the DR25 was 58 % (15.2/26.4 minutes), that of the DR50 was 70 % (18.6/26.4 minutes), and that of the DR75 was 88 % (23.2/26.4 minutes) of the test duration of the PC method. The test duration of the PC method lengthens rapidly as the number of test images grows. If one heeds the standard recommendation of 30 minutes, the maximum number of test images for the PC method is 24. With the DR25 method, for example, the theoretical number of test images can be as high as 63 while the average test time remains under 30 minutes.

It should, however, be noted that a slide show of 60 images before evaluating each image is too long. Test images should be evaluated using smaller sets of test images. If sets of ten test images are used, as in typical camera benchmarking research, the number of test images with the DR25 method can be as high as 110 (11 × 10 test images). We conclude that this study showed the DR method to be the best option when the number of test images is 10–15 and the test setup includes many image sets. If the number of test images is low (e.g., n = 5) or high (e.g., n = 30), then the PC or ACR methods can be better options than the DR method.

4.2 Discriminability issues

The descriptive statistics (mean and standard deviation values) indicated differences between the DR methods (DR25, DR50 versus DR75) with different display times for the reference images. The performance of the DR75 method exceeded the performance of the DR50 method, which exceeded the performance of the DR25 method. The discriminability metrics, however, showed less clear differences than did the descriptive statistics. For example, the number of significantly different image pairs with the DR75 method exceeded that with the DR25 and DR50 methods by only one. Thus, the use of the DR25 or DR50 method instead of the DR75 method is preferable due to shorter test times.

4.3 Cyclic relations

In a camera benchmarking study (or a similar one), one should be concerned that quality estimates can have cyclic relations: image A is judged to be of higher quality than B, image B of higher quality than C, and yet image C of higher quality than A. Cyclic relations can occur especially when images are similar and contain multiple types of distortions and manipulations [14].

It is not clear how images with cyclic relations should be ranked or how the results should be presented. If image A is better than image B, image B is better than image C, and at the same time image C is better than image A, the average result over all observers (mean opinion scores obtained with the ACR or DR method) indicates that the images are in the middle range without statistically significant differences within the group of images A, B and C. Averaging the p values of each row obtained with the PC method for the same test image set shows the same result. However, the PC method compares image pairs, and the cyclic relations can be observed from the rank matrices before averaging. The ACR and DR methods determine the quality scores directly for each image, and these scores cannot show whether cyclic relations occur.

4.4 Method options

The PC method is accurate but can be too slow when the number of test images is high. The literature shows, however, that one can speed up pair comparison tests by reducing the number of test image pairs. Instead of comparing every test image pair (the complete method), Silverstein and Farrell [25], for example, presented a partial method (binary tree sorting) in which more comparisons are made between closer samples than between more distant samples. Further study should be conducted to compare the accuracy and time effort of pair comparison methods with a reduced number of image pairs against the complete PC and DR methods.

The DR method presents as reference images all images of a scene except the one whose quality is being evaluated. An option between the ACR and the DR is to present all the test images only once before evaluating each test image set. Other options would be to present randomly sampled images in a similar way to the SAMVIQ method, or to present all images by arranging them spatially in a single window instead of a slide show. If a randomly sampled subset of the reference images is presented, it should be ensured that the subset is comprehensive enough. A problem with presenting reference images in a window, instead of a slide show, is the small physical size of the reference images. Observers do not get an idea of the range of quality and types of distortion from thumbnail images. In addition, thumbnail-sized images reduce the visibility of some distortion types, such as differences in sharpness, while other differences, such as color cast, might become more visible between images. This could skew the evaluation process, as observers would weight their evaluations towards the distortions that are easy to detect. Some distortions, although very visible in a matrix of thumbnails, might not be that annoying or visible in full-sized images. However, further study should be conducted to compare the accuracy and time effort of these options with the standard ACR, PC and DR methods.

4.5 Supplementary material

The subjective evaluation data collected in this study with the DR, ACR and PC methods are available to the research community. Several image databases [20, 24] have been published previously and are available. To the best of our knowledge, however, this database is the first to provide subjective evaluation data collected with different quality evaluation methods for test images captured with different cameras. The database offers new opportunities for researchers to study differences between scores obtained with various methods. For example, one can investigate whether the pair comparison method ranks images with multiple distortions and modifications differently than the ACR method does. Moreover, an image database with more realistic images (compared to the simulated distortions of many other published databases) is valuable for developing objective algorithms for assessing image quality.

5 Conclusions

In this study, we evaluated the performance of the Dynamic Reference (DR) method. The DR method creates reference image sets from the test images. Observers then view the reference image set prior to evaluating the test images. The reference image set introduces the quality differences and types of distortions present in the test images, and it is the main difference between the DR method and the standardized ACR method. The longer test durations of the DR method, as compared to the ACR method, can be a drawback. Our results showed that the DR method is faster than the PC method and more accurate than the ACR method.

References

  1. Bagiella E, Sloan R, Heitjan D (2000) Mixed-effects models in psychophysiology. Psychophysiology 37(1):13–20
  2. Bland M (1995) An introduction to medical statistics, 2nd edn. Oxford Medical Publications
  3. Blin J (2006) New quality evaluation method suited to multimedia context: SAMVIQ. In: Proceedings of the 2nd International Workshop on Video Processing and Quality Metrics (VPQM)
  4. Engelke U, Maeder A, Zepernick HJ (2012) Human observer confidence in image quality assessment. Signal Process Image Commun 27(9):935–947
  5. Greenwood P, Nikulin M (1996) A guide to chi-squared testing. Wiley-Interscience. http://books.google.fi/books?id=bc8zfQSKOwIC
  6. I3A (2007) CPIQ initiative phase 1 white paper: fundamentals and review of considered test methods
  7. ISO 20462 (2012) Psychophysical experimental methods for estimating image quality – Part 3: Quality ruler method
  8. ITU-R Rec. BT.1788 (2007) Methodology for the subjective assessment of video quality in multimedia applications
  9. ITU-R BT.500 (2012) Methodology for the subjective assessment of the quality of television pictures
  10. ITU-T Rec. P.910 (2008) Subjective video quality assessment methods for multimedia applications
  11. Jianping Z, Glotzbach J (2007) Image pipeline tuning for digital cameras. In: Proceedings of the IEEE International Symposium on Consumer Electronics (ISCE), pp 1–4
  12. Jin E, Keelan B (2010) Slider-adjusted softcopy ruler for calibrated image quality assessment. J Electron Imaging 19(1):011009
  13. Kuang J, Yamaguchi H, Liu C, Johnson G, Fairchild M (2007) Evaluating HDR rendering algorithms. ACM Trans Appl Percept 4(2)
  14. Leisti T, Radun J, Virtanen T, Halonen R, Nyman G (2009) Subjective experience of image quality: attributes, definitions, and decision making of subjective image quality. In: Proceedings of SPIE 7242, Image Quality and System Performance VI, pp 72420D–72420D-9
  15. Mantiuk R, Tomaszewska A, Mantiuk R (2012) Comparison of four subjective methods for image quality assessment. Comput Graph Forum 31(8):2478–2491
  16. Nikkanen J, Gerasimow T, Kong L (2008) Subjective effects of white-balancing errors in digital photography. Opt Eng 47(11):113201. doi:10.1117/1.3013232
  17. Nuutinen M, Orenius O, Säämänen T, Oittinen P (2012) A framework for measuring sharpness in natural images captured by digital cameras based on reference image and local areas. EURASIP J Image Video Process 2012(13640):1–15
  18. Nuutinen M, Virtanen T, Oittinen P (2012) Features for predicting quality of images captured by digital cameras. In: Proceedings of the IEEE International Symposium on Multimedia, pp 165–168
  19. Petit J, Mantiuk R (2013) Assessment of video tone-mapping: are cameras' s-shaped tone-curves good enough? J Vis Commun Image Represent 24(7):1020–1030
  20. Ponomarenko N, Lukin V, Zelensky A, Egiazarian K, Carli M, Battisti F (2009) TID2008 – a database for evaluation of full-reference visual quality assessment metrics. Adv Modern Radioelectron 10:30–45
  21. Radun J, Leisti T, Virtanen T, Häkkinen J, Vuori T, Nyman G (2010) Evaluating the multivariate visual quality performance of image-processing components. ACM Trans Appl Percept 7(3):16:1–16:16
  22. Ramanath R, Snyder W, Yoo Y, Drew M (2005) Color image processing pipeline. IEEE Signal Proc Mag 22(1):34–43
  23. Redi J, Liu H, Alers H, Zunino R, Heynderickx I (2010) Comparing subjective image quality measurement methods for the creation of public databases. In: Proceedings of SPIE 7529, Image Quality and System Performance VII, pp 752903–752903-11
  24. Sheikh HR, Sabir MF, Bovik AC (2006) A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Trans Image Process 15(11):3440–3451
  25. Silverstein D, Farrell J (2001) Efficient method for paired comparison. J Electron Imaging 10(2):394–398
  26. Kawano T, Yamagishi K, Hayashi T (2012) Performance comparison of subjective assessment methods for 3D video quality. In: Proceedings of the International Workshop on Quality of Multimedia Experience (QoMEX), pp 218–223
  27. Teunissen K (1996) The validity of CCIR quality indicators along a graphical scale. SMPTE Motion Imaging J 105(3):144–149
  28. Ween B, Kristoffersen D, Hamilton G, Olsen D (2005) Image quality preferences among radiographers and radiologists. A conjoint analysis. Radiography 11(3):191–197
  29. Winkler S (2009) On the properties of subjective ratings in video quality experiments. In: Proceedings of the International Workshop on Quality of Multimedia Experience (QoMEX), pp 139–144
  30. Winkler S (2012) Analysis of public image and video databases for quality assessment. IEEE J Sel Topics Signal Process 6(6):616–625

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Mikko Nuutinen (1)
  • Toni Virtanen (1)
  • Tuomas Leisti (1)
  • Terhi Mustonen (1)
  • Jenni Radun (1)
  • Jukka Häkkinen (1)

  1. Institute of Behavioural Sciences, University of Helsinki, Helsinki, Finland
