1 Introduction

Visual comparison is an important technique in computer vision: given two images and a specified attribute, the task is to predict whether one image exhibits the attribute more, less, or equally compared with the other. As shown in Fig. 1, the relation of the right image pair can be predicted from the known relations of the left image pairs. Clearly, when some image pairs are nearly indistinguishable, it is of little use to predict only an ordered result.

Fig. 1. Visual comparison with attribute ‘Smile’

Attributes, which are visual properties describable in words, can capture anything from material properties (‘plastic’, ‘wooden’) and shapes (‘pointy’, ‘round’) to facial expressions (‘serious’, ‘smiling’). Since their emergence, attributes have inspired a large body of work in image search [1,2,3,4], biometrics [5, 6], and language-based supervision for recognition [7,8,9,10]. Attribute models mainly take two forms: binary attributes and relative attributes. Whereas binary attributes are suitable only for clear-cut predicates, such as ‘boxy’, relative attributes capture ‘real-valued’ properties that inherently exhibit a range of strengths, such as ‘comfort’. Relative attributes [8] were first obtained by learning global ranking Support Vector Machine (SVM) functions, and much subsequent work on visual comparison builds on ranking SVM functions [3, 11,12,13,14,15]. With relative attributes, originally introduced in [8, 16], images can be compared in terms of how strongly they exhibit a nameable visual property. Given an image pair, relative attributes can indicate which image exhibits an attribute more or less, while the Just Noticeable Differences (JND) method introduced in [15] can indicate at test time whether one image exhibits an attribute equally to the other. Here we propose a novel method that indicates at test time whether one image in a pair exhibits an attribute more, less, or equally compared with the other.

To obtain both ordered pairs and equal pairs at test time for visual comparison, we propose one-versus-one multi-class classification with relative attributes, trained with a linear regression model that learns a mapping function between a vector-valued feature input and a scalar-valued output.

Because multi-class classification problems arise widely across different areas, many methods have been developed to solve them. A wide range of empirical studies has reported that decomposition and ensemble methods can improve performance on multi-class classification problems, and most existing research shows that the design or selection of decomposition and ensemble strategies plays an important role in their performance. Among decomposition strategies, One-vs-One (OVO) [17], One-vs-All (OVA) [18], and Error-Correcting Output Coding (ECOC) [19] are the most widely used. Owing to the nature of relative attributes, visual comparison with relative attributes can be cast as a three-class classification problem. Compared with OVA, the OVO strategy is well suited to cases where the number of categories is small.

The main contribution of this paper is the idea of learning OVO multi-class classification by linear regression for visual comparison, which to our knowledge has not been explored in any prior work. A further contribution is that we predict not only ordered pairs but also equal pairs at test time. Tests on three challenging datasets show that the proposed approach obtains promising results for visual comparison.

2 Related Work

Comparing attributes has gained much interest recently. The relative attributes approach learned a global linear ranking function for each attribute [8], which was extended to non-linear ranking functions in [20, 21] by training a hierarchy of rankers and normalizing predictions at the leaf nodes. Aside from learning-to-rank formulations, researchers have applied the Elo rating system to biometrics [5], and a local learning method based on the ranking SVM [16] was proposed for fine-grained visual comparison. Most prior methods produce a ranking function based on SVM for each attribute, whereas we propose multi-class classification with relative attributes and produce a mapping function based on linear regression for each attribute. In contrast to the proposed approach, all of these prior methods can only predict ordered image pairs.

Regression is one of the critical techniques for visual attribute applications. A number of computer vision problems, such as human age estimation, can be formulated as regression by learning a mapping function between a high-dimensional vector-valued feature input and a scalar-valued output [22,23,24,25,26]. A locally adjusted regression method [24] that searches local regions for adjustment was proposed, followed by bio-inspired features (BIF) for regression [25] in human age estimation. Most of these regression methods have achieved strong performance.

Besides, much prior work [8, 16, 21] for visual comparison predicts ordered image pairs at test time. The JND method [15], proposed to identify equal image pairs at test time, can predict ordered pairs from the learned ranks when the image pairs are not equal, but it costs considerable time to compute the prior probability due to its local learning. We therefore propose a multi-class classification method for visual comparison that not only predicts both ordered pairs and equal pairs at test time but also saves substantial computational time.

3 Approach

We use OVO multi-class classification with relative attributes for visual comparison, and apply linear regression to efficiently train the OVO models. In the following, we first introduce OVO multi-class classification for visual comparison, and then present the linear regression model that realizes it.

3.1 OVO Multi-class Classification for Visual Comparison

Relative attributes are generally obtained from ranking SVM functions, which only predict ordered pairs, whereas our OVO multi-class classification is obtained from a regression model that predicts both ordered pairs and equal pairs at test time. The regression model is introduced in the following section.

The multi-class classification model assigns a class label to each input observation. Given a training data set \(\{(X_1, y_1),...,(X_n, y_n)\}\), where \(X_i\in R^r\) denotes the ith observation feature vector and \(y_i \in \{1,..., K\}\) is its class label, the goal is to infer a mapping function \(f:X\rightarrow \{1,..., K\}\) from the labeled training data. The visual comparison problem can therefore be cast as the following multi-class classification problem. Given a certain attribute \(a_m\), a set of images \(I = \{u_i\}\), each described by an image feature \(u_i\in R^d\), and a set of image pairs \(P_m=\{(s,t)\}\) for attribute \(a_m\), the corresponding class labels are defined as \(l_m\in \{1,2,3\}\): \(l_m=1\) or \(l_m=2\) denotes that image s has the attribute \(a_m\) more or less than image t respectively, while \(l_m=3\) denotes that image s has the attribute \(a_m\) as much as image t. We wish to learn a multi-class classifier that identifies the relation between images s and t given the attribute \(a_m\). In particular, visual comparison is categorized into 3 classes (more, less, or equal) according to the relations of image pairs. To this end, we define the pairwise vector between images s and t as follows:

$$\begin{aligned} x_{st}=p(u_s,u_t),u_s,u_t\in I \end{aligned}$$
(1)

where p is an entry-wise function that outputs a pairwise vector between \(u_s\) and \(u_t\). Therefore, the multi-class training set for visual comparison can be represented as \(\{(x,l_m)\}_{st}, \forall (s,t)\in P_m\).
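For concreteness, the construction of this pairwise training set can be sketched as follows. This is a minimal illustration, assuming concatenation as the entry-wise function p (the choice used in Sect. 3.2) and hypothetical toy features; it is not the authors' reference implementation, which is in Matlab.

```python
import numpy as np

def pairwise_vector(u_s, u_t):
    # Entry-wise function p of Eq. (1); here simple concatenation,
    # the choice used later in Sect. 3.2: x_st = concat(u_s, u_t).
    return np.concatenate([u_s, u_t])

# Hypothetical toy data: 4 images with 30-dim (PCA-reduced) features.
rng = np.random.default_rng(0)
images = rng.normal(size=(4, 30))

# Labeled pairs (s, t, l_m) for one attribute a_m:
# l_m = 1 ("more"), 2 ("less"), 3 ("equal").
pairs = [(0, 1, 1), (2, 3, 3), (1, 2, 2)]

X = np.stack([pairwise_vector(images[s], images[t]) for s, t, _ in pairs])
labels = np.array([l for _, _, l in pairs])
print(X.shape, labels)  # (3, 60) [1 3 2]
```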

The OVO approach [27] divides a multi-class problem with K classes into \(\binom{K}{2}=K(K-1)/2\) binary classification problems, constructing one binary classifier to discriminate each pair of classes. Let \(f_{ij}\) denote the binary classifier that discriminates classes i and j; its output, denoted \(y_{ij}\), is defined as follows.

$$\begin{aligned} y_{ij}=f_{ij}(x) \end{aligned}$$
(2)

where \(f_{ij}(\cdot )\) is realized by the linear regression introduced in the following section. More specifically, \(y_{ij}\) is the confidence score that the image pair x belongs to the ith class, while \(1-y_{ij}\) is the confidence score that x belongs to the jth class. The weighted voting strategy for OVO selects the class with the largest total confidence score over all binary classifiers, defined as [18]:

$$\begin{aligned} class=\mathop {\mathrm {argmax}}\limits _{i=1,\ldots ,K}\mathop {\varSigma }\limits _{1\le j\ne i\le K}(y_{ij}) \end{aligned}$$
(3)
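To illustrate Eq. (3), the sketch below aggregates the binary confidence scores by weighted voting. It assumes each binary regressor \(f_{ij}\) is trained so that its output can be read as a confidence for class i (e.g. with targets 1 for class i and 0 for class j, an assumption not fixed by the text above); the score values here are placeholders.

```python
import numpy as np

K = 3  # classes: 1 = more, 2 = less, 3 = equal

def ovo_vote(scores):
    # Weighted voting of Eq. (3). scores[(i, j)] holds y_ij = f_ij(x),
    # the confidence that pair x belongs to class i when discriminating
    # classes i and j (i < j); class j receives 1 - y_ij.
    total = np.zeros(K + 1)  # index 0 unused; classes are 1..K
    for (i, j), y_ij in scores.items():
        total[i] += y_ij
        total[j] += 1.0 - y_ij
    return int(np.argmax(total[1:])) + 1

# Placeholder outputs of the binary regressors f_12, f_13, f_23.
scores = {(1, 2): 0.8, (1, 3): 0.6, (2, 3): 0.3}
print(ovo_vote(scores))  # -> 1: image s shows the attribute more
```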

3.2 OVO by Linear Regression Model

Given the pairwise visual comparison training set \(\{(x,l_m)\}\) and the attribute \(a_m\), OVO multi-class classification trains \(K(K-1)/2\) binary classification functions by linear regression. Specifically, we select the image pairs of classes i and j to train each linear regression model, so we need to learn the mapping between x and \(l_m\) through a regression function for each binary classification. Most existing relative attribute learning methods establish this mapping by SVM. In contrast, linear ridge regression [28] is a classical statistical method that finds a linear function modeling the dependencies between vectors \(\{x\} \) in \(R^r\) and label variables \(\{l_m\}\) in R. In this paper, we learn the mapping with multivariate linear ridge regression; our goal is to learn \(K(K-1)/2\) regression functions for each attribute:

$$\begin{aligned} f_{ij}(x_{st})=w_m^Tx_{st}+b_m, \quad \forall (s,t)\in P'_m \ \text {and}\ P'_m\subset P_m \end{aligned}$$
(4)

The objective function with ridge regularization [29] can be written as:

$$\begin{aligned} \min \quad \frac{1}{2}||w_m||_2^2+C\sum _{(s,t)\in P'_m}\mathrm {loss}(f_{ij}(x_{st}),l_m(x_{st})) \end{aligned}$$
(5)

where the constant C balances the error term against the regularization, and \(\mathrm {loss}(\cdot )\) denotes the loss function. To simplify the objective function without loss of generality, we adopt the quadratic loss. The objective function then becomes:

$$\begin{aligned} \min \quad \frac{1}{2}||w_m||_2^2+C\sum _{(s, t)\in P'_m}(l_m(x_{st})-(w_m^Tx_{st}+b_m))^2 \end{aligned}$$
(6)

To simplify the notation further, we set \(z_k=x_{st}\) and \(N=\left| P'_m\right| \); the objective function can then be written as

$$\begin{aligned} \min \quad \frac{1}{2}||w_m||_2^2+C\sum _{k=1}^N(l_m(z_{k})-(w_m^Tz_k+b_m))^2 \end{aligned}$$
(7)

where \(z_k\in R^r\) is a training vector after feature reduction, \(w_m\in R^r\) is a weight vector, and \(b_m\in R \) is a bias term. The model parameters are estimated by solving an equality-constrained quadratic programming problem, which has the closed-form globally optimal solution [6]:

$$\begin{aligned} \left[ \begin{array}{l} w_m\\ b_m \end{array} \right] =-(Q^TQ)^{-1}Q^Tp \end{aligned}$$
(8)

where positive semi-definite matrix Q and vector p are given by

$$\begin{aligned} Q=\left[ \begin{array}{cc} 2C\sum _{k=1}^Nz_kz_k^T+E &{}2C\sum _{k=1}^Nz_k \\ 2C\sum _{k=1}^Nz_k^T &{}2CN \end{array} \right] \end{aligned}$$
(9)
$$\begin{aligned} p=\left[ \begin{array}{c} -2C\sum _{k=1}^Nl_m(z_k)z_k \\ -2C\sum _{k=1}^Nl_m(z_k) \end{array} \right] \end{aligned}$$
(10)

where E is an identity matrix.
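As a compact illustration of Eqs. (8)–(10), the following sketch assembles Q and p and solves for \((w_m, b_m)\) in closed form. The NumPy code and the 1/0 regression targets are our assumptions for illustration; C = 1 matches the setting used in Sect. 4.

```python
import numpy as np

def fit_binary_ridge(Z, targets, C=1.0):
    # Closed-form solution of Eqs. (8)-(10) for one binary function f_ij.
    # Z: (N, r) matrix of pairwise vectors z_k; targets: (N,) values l_m(z_k).
    N, r = Z.shape
    Q = np.empty((r + 1, r + 1))               # Eq. (9), symmetric
    Q[:r, :r] = 2 * C * Z.T @ Z + np.eye(r)    # 2C * sum z_k z_k^T + E
    Q[:r, r] = 2 * C * Z.sum(axis=0)           # 2C * sum z_k
    Q[r, :r] = 2 * C * Z.sum(axis=0)
    Q[r, r] = 2 * C * N
    p = np.concatenate([-2 * C * Z.T @ targets,    # Eq. (10)
                        [-2 * C * targets.sum()]])
    # Eq. (8): [w_m; b_m] = -(Q^T Q)^{-1} Q^T p (= -Q^{-1} p, Q symmetric).
    wb = -np.linalg.solve(Q.T @ Q, Q.T @ p)
    return wb[:r], wb[r]

# Toy usage with hypothetical data; targets 1 for class i, 0 for class j.
rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 30))
targets = (rng.random(50) > 0.5).astype(float)
w_m, b_m = fit_binary_ridge(Z, targets)
scores = Z @ w_m + b_m                         # f_ij(z) = w_m^T z + b_m
```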

Therefore, given a test pair (s, t) and an attribute \(a_m\), we compute \(x_{st}=p(u_s, u_t)=concat(u_s, u_t)\), infer \(y_{ij}(x_{st})\), and obtain the class of the image pair by Eq. (3). Finally, we can predict whether image s exhibits the attribute \(a_m\) more, less, or equally compared with image t from the obtained class label.
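Putting the pieces together, test-time inference might look as follows. The sketch reuses the hypothetical `fit_binary_ridge` and `ovo_vote` functions from the earlier snippets and again assumes 1/0 targets per binary problem; it illustrates the procedure just described rather than reproducing the authors' implementation.

```python
from itertools import combinations
import numpy as np

def train_ovo(X, labels, C=1.0, K=3):
    # One ridge regressor per class pair (i, j), i < j, trained only on
    # the training pairs labeled i or j (targets: 1 for i, 0 for j).
    models = {}
    for i, j in combinations(range(1, K + 1), 2):
        mask = (labels == i) | (labels == j)
        t = (labels[mask] == i).astype(float)
        models[(i, j)] = fit_binary_ridge(X[mask], t, C)
    return models

def predict_pair(u_s, u_t, models):
    # Predict 1 (more), 2 (less) or 3 (equal) for a test pair (s, t).
    x = np.concatenate([u_s, u_t])             # x_st = concat(u_s, u_t)
    scores = {ij: float(w @ x + b) for ij, (w, b) in models.items()}
    return ovo_vote(scores)                    # Eq. (3)
```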

4 Experiments

To validate the advantages of the proposed method, we compare it with several state-of-the-art methods on three datasets: UT-Zap50K-1 [16], the Outdoor Scene Recognition dataset [30] (OSR), and a subset of the Public Figures faces dataset [31] (PubFig). All methods run on 10 random train/test splits over all pairs, with 300 pairs selected for testing and the remainder used for training. For all methods, we simply fix \(C = 1\) and use the same labeled data as in [8], and report accuracy (the percentage of correctly predicted pairs) and the macro-average measure (maA) commonly used to evaluate performance on multi-class problems. We compare the following methods on the above datasets:

  • JND [15]: The JND method, which develops a Bayesian local learning strategy to infer whether images are indistinguishable for a given attribute. If the images are distinguishable, the ordered relation is obtained from the learned ranks of the pairs.

  • RSVM + OVA: One-versus-all multi-class classification by ranking SVM (RSVM) for visual comparison.

  • RSVM + OVO: One-versus-one multi-class classification by ranking SVM (RSVM) for visual comparison.

  • LRM + OVA: One-versus-all multi-class classification by the linear regression model (LRM) for visual comparison.

  • LRM + OVO: The proposed approach, which is the first to develop one-versus-one multi-class classification by the linear regression model (LRM) for visual comparison.

4.1 Experiments on Three Benchmark Datasets

Experiment on UT-Zap50K-1. UT-Zap50K-1 contains 50,025 images with 4 attributes (‘Open’, ‘Pointy’, ‘Sporty’, ‘Comfort’) [16]. The image descriptors, kindly provided by the authors of the dataset, are 960-dim GIST and 30-bin Lab color histograms. We reduce their dimensionality to 30 with PCA to prevent overfitting; for a fair comparison, we apply the same feature reduction in all methods. Table 1 reports the results of the proposed method and the other methods on UT-Zap50K-1.

Table 1. Accuracy of visual comparison tested on UT-Zap50K-1

As seen in Table 1, the accuracy of LRM is considerably higher than that of RSVM, which demonstrates that LRM has an advantage over RSVM in visual comparison. This suggests that the RSVM method is not optimal, possibly because its model is more sensitive to training samples. Likewise, the OVO method is superior to OVA for most attributes; only for the attribute ‘Sporty’ are the accuracies comparable. More importantly, Table 1 shows that the accuracy of LRM + OVO is far higher than that of JND [15], which indicates that the proposed method is more effective for visual comparison.

The maA measure is another performance measure for multi-class problems, defined as follows [32]:

$$\begin{aligned} maA=\frac{1}{K} \mathop {\varSigma }\limits _{i=1}^K{\frac{n_{ii}}{n_i}} \end{aligned}$$
(11)

where \(n_{ij}\) denotes the number of observations of the ith class that are predicted as the jth class \((i = 1, ..., K, j = 1, ..., K)\) and \(n_i=\mathop {\varSigma }\limits _{j=1}^K n_{ij}\).
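For reference, a small sketch that computes maA directly from a confusion matrix, following Eq. (11); the matrix values are made up for illustration.

```python
import numpy as np

def macro_average_accuracy(conf):
    # maA of Eq. (11): conf[i, j] = n_ij, the number of observations of
    # class i predicted as class j, so rows sum to n_i.
    return np.mean(np.diag(conf) / conf.sum(axis=1))

# Hypothetical 3-class confusion matrix (more / less / equal).
conf = np.array([[80, 10, 10],
                 [12, 78, 10],
                 [15, 15, 70]])
print(macro_average_accuracy(conf))  # (0.80 + 0.78 + 0.70) / 3 = 0.76
```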

Table 2 shows the maA measure on UT-Zap50K-1. The proposed method outperforms all the other methods on every attribute except JND. The maA measure of JND is slightly higher than that of the proposed method for the attributes ‘Pointy’ and ‘Sporty’, but the accuracy of the proposed method is substantially higher than that of JND. The proposed method is therefore still effective for visual comparison.

Table 2. The maA measure on UT-Zap50K-1

Experiment on OSR. The Outdoor Scene Recognition dataset [30] (OSR) consists of 2,688 images with 8 categories and 6 attributes (‘natural’, ‘Open’, ‘perspective’, ‘size-large’, ‘diagonal-plane’ and ‘depth-close’; the corresponding abbreviations are ‘Natr’, ‘Open’, ‘Persp’, ‘LgSi’, ‘Diag’ and ‘ClsD’). The image pairs are based on category-wise comparisons, yielding over 20,000 pairs per attribute when we select 30 images from each category. Without loss of generality, we randomly select 1000 pairs for training and 300 pairs for testing. Tables 3 and 4 respectively show the accuracy and the maA measure on the OSR dataset for all attributes. Similar to the results on UT-Zap50K-1, the proposed method outperforms the other state-of-the-art methods on OSR, demonstrating that both LRM and OVO are effective for visual comparison.

Table 3. Accuracy of visual comparison tested on OSR
Table 4. The maA measure on OSR

Experiment on PubFig. We select a subset of the Public Figures faces dataset [31] (PubFig), which includes 772 images with 8 categories and 11 attributes (‘Masculine_looking’, ‘White’, ‘Young’, ‘Smiling’, ‘Chubby’, ‘Visible_Forehead’, ‘Bushy_Eyebrows’, ‘Narrow_Eyes’, ‘Pointy_Nose’, ‘Big_Lips’, ‘RoundFace’; the corresponding abbreviations are ‘Male’, ‘White’, ‘Young’, ‘Smil’, ‘Chub’, ‘Foreh’, ‘Eyebrow’, ‘Eye’, ‘Nose’, ‘Lip’, ‘Face’). The image pairs are generated as on OSR. Tables 5 and 6 respectively report the accuracy and the maA measure. As on OSR, the proposed method achieves the best results for nearly all attributes, further validating its effectiveness for visual comparison.

Table 5. Accuracy of visual comparison tested on PubFig
Table 6. The maA measure on PubFig

From the experimental results on the above datasets, we conclude that applying one-versus-one multi-class classification with LRM is an effective approach to visual comparison.

4.2 Time Complexity Analysis

All algorithms are implemented in Matlab on a PC with an Intel i5-4670 CPU @ 3.40 GHz. Without loss of generality, Tables 7 and 8 respectively show the training time and test time of the proposed method and the JND method for one train/test split on the UT-Zap50K-1 dataset under the same setup. From these tables we conclude that the proposed method significantly reduces the computational time compared to the JND method. For training time, the reason is that JND is trained by RSVM, which is solved by an iterative optimization, while the proposed method is realized by the linear regression model, which has a closed-form solution. The large test time of the JND method arises mainly because it uses a local learning strategy that finds the K nearest pairs to obtain the prior probability of each test pair, and the distance between pairs is computed with Information Theoretic Metric Learning (ITML) [33], which is itself solved by an iterative optimization.

Table 7. Train time (s)
Table 8. Test time (s)

5 Conclusion

In this paper, we have proposed a novel visual comparison method that applies one-versus-one multi-class classification and a linear regression model with relative attributes. Comprehensive experimental results on three benchmark datasets verify that the proposed method is an effective approach for visual comparison, while saving considerable computational time.