1 Introduction

Sparse representation has been widely studied due to its promising performance [1–4]. It has been applied to image classification [5–10], face recognition [11–14], image retrieval [15], and image restoration [16]. The basic idea is to represent an input signal as a sparse linear combination of the atoms in a dictionary. Since the quality of the dictionary is critical to the performance of sparse representation, many approaches focus on learning a good dictionary. Aharon et al. [17] presented the K-SVD algorithm, which iteratively updates the sparse codes of the samples based on the current dictionary and optimizes the dictionary atoms to better fit the data. However, this method may ignore the discriminative information contained in the training samples. To solve this problem, some approaches [18–24] aim to learn more discriminative dictionaries. Mairal et al. [22] added a discriminative reconstruction constraint to the dictionary learning model to gain discrimination ability. Pham et al. [23] proposed a joint learning and dictionary construction method that takes the performance of a linear classifier into account. Yang et al. [24] employed the Fisher discrimination criterion to learn a structured dictionary.

However, in these methods, features are used separately while learning the dictionary, so the similarity information between the features is lost. As a result, similar features in the same class may be encoded as dissimilar codes, while features in different classes may be encoded as similar codes with the learned dictionary. To alleviate this problem, we propose a discriminative neighborhood preserving dictionary learning method that explicitly takes the similarity and class information of the features into account. Figure 1 shows the idea of our method. The circle represents the neighborhood of the feature \(x_{i}\), which is composed of the features close to \(x_{i}\). Some of the neighbors have the same label as \(x_{i}\), and others do not. Our method encourages the distance between the codes of \(x_{i}\) and its neighbors in the same class to be as small as possible, while maintaining the distance between the codes of \(x_{i}\) and its neighbors in different classes. The learned dictionary ensures that similar features in the same class are encoded as similar codes and that features in different classes are encoded as dissimilar codes.

Fig. 1.

The basic idea of our method. The left part shows the neighborhood of feature \(x_{i}\), which contains the features close to \(x_{i}\). Different colors represent neighbors of different classes. The blue features in the neighborhood are the neighbors of \(x_{i}\) with the same label. The neighbors with the same label are expected to be encoded close to the code of \(x_{i}\), while the other neighbors are expected to be encoded far from it. Therefore, our method is more discriminative for classification.

Inspired by [25, 26], we construct a Laplacian matrix that expresses the relationship between the features. The dictionary learned with this Laplacian matrix characterizes the similarity of similar features well and preserves the consistency of their sparse codes. Different from [25, 26], our method also takes the class information into account to further enhance the discriminative power of the dictionary. With the class information, the Laplacian matrix not only carries the similarity information of the features in the same class but can also distinguish features in different classes. By adding the Laplacian term to the dictionary learning objective function, our method is able to learn a more discriminative dictionary. The experimental results demonstrate that the encoding step is efficient with the learned discriminative dictionary and that the classification performance is improved.

The rest of this paper is organized as follows. In Sect. 2, we briefly describe the sparse representation problem and introduce our discriminative neighborhood preserving dictionary learning method. In Sect. 3, we present the optimization scheme of our method, including learning the sparse codes and learning the dictionary. The experimental results and discussions are given in Sect. 4. Finally, we conclude the paper in Sect. 5.

2 Discriminative Neighborhood Preserving Dictionary Learning Method

2.1 Sparse Representation Problem

We briefly review sparse representation. Let \(X=[x_{1},\cdots ,x_{n}] \in R^{d \times n}\) be a data matrix, \(D=[d_{1},\cdots ,d_{k}]\in R^{d \times k}\) a dictionary matrix, where each \(d_{i}\) is a basis vector (atom) of the dictionary, and \(V=[v_{1},\cdots ,v_{n}]\in R^{k \times n}\) a coefficient matrix, where each column is the sparse representation of a data point. Each data point \(x_{i}\) can be represented as a sparse linear combination of the basis vectors in the dictionary. The objective function of sparse representation can be formulated as follows:

$$\begin{aligned} \min \sum _{i=1}^{n}\Vert v_{i}\Vert _{0} \qquad s.t. \quad X=DV \end{aligned}$$
(1)

Here \(\Vert v_{i}\Vert _{0}\) is the number of nonzero entries of \(v_{i}\), which measures the sparseness of \(v_{i}\). However, this minimization problem with the \(l_{0}\) norm has been shown to be NP-hard [27]. The most widely used approach is to replace the \(l_{0}\) norm with the \(l_{1}\) norm. Combined with the reconstruction loss, the objective function then becomes

$$\begin{aligned} \min _{D,V}\Vert X-DV\Vert ^{2}_{F}+\lambda \sum _{i=1}^{n}\Vert v_{i}\Vert _{1} \qquad s.t. \Vert d_{i}\Vert ^{2} \le c, \quad i=1,\ldots ,k \end{aligned}$$
(2)

The first term in Eq. (2) represents the reconstruction error, and \(\lambda \) is the parameter used to balance the sparsity and the reconstruction error. The constraint bounds the squared norm of each atom \(d_{i}\) by a constant \(c\).
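As a concrete illustration of Eq. (2), the following minimal NumPy sketch evaluates the objective and checks the atom-norm constraint on a toy problem. The function names and the toy matrices are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def sparse_coding_objective(X, D, V, lam):
    """Value of Eq. (2): squared Frobenius reconstruction error
    plus lam times the sum of the l1 norms of the columns of V."""
    reconstruction = np.linalg.norm(X - D @ V, ord="fro") ** 2
    sparsity = np.abs(V).sum()
    return reconstruction + lam * sparsity

def satisfies_norm_constraint(D, c, tol=1e-12):
    """Check ||d_i||^2 <= c for every atom (column) of D."""
    return bool(np.all((D ** 2).sum(axis=0) <= c + tol))

# Toy example (assumed values): d = 2, k = 3 atoms, n = 2 data points.
D = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
V = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [0.0, 0.0]])
X = D @ V  # exact reconstruction, so only the sparsity term remains

value = sparse_coding_objective(X, D, V, lam=0.5)
```

With exact reconstruction the value reduces to \(\lambda \) times the total absolute value of the coefficients, which makes the trade-off controlled by \(\lambda \) easy to see.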

2.2 Formulation of Discriminative Neighborhood Preserving Dictionary Learning

In most current methods, the features are used separately while learning the dictionary. The similarity information among the features is therefore lost, so similar features may be encoded as totally different codes. To alleviate this problem, we propose a discriminative neighborhood preserving dictionary learning method. The dictionary learned by our method represents the intrinsic geometrical structure of the features, which better characterizes the relationship between them, and it gains additional discriminative power from the class information of the features.

Suppose we are given the training feature set \(X= \{x_{1},x_{2},\ldots ,x_{n}\}\) and the labels of the training features. For each feature \(x_{i}\), we choose the l nearest neighbors of \(x_{i}\) in the same class to form \(\{x_{i^{1}},x_{i^{2}},\ldots ,x_{i^{l}}\}\) and the m nearest neighbors of \(x_{i}\) in different classes to form \(\{x_{i_{1}},x_{i_{2}},\ldots ,x_{i_{m}}\}\). Together these neighbors make up a local neighborhood of \(x_{i}\), which can be represented as \(X_{i}=\{x_{i^{1}},x_{i^{2}},\ldots ,x_{i^{l}},x_{i_{1}},x_{i_{2}},\ldots ,x_{i_{m}}\}\). \(V_{i}=\{v_{i^{1}},v_{i^{2}},\ldots ,v_{i^{l}},v_{i_{1}},v_{i_{2}},\ldots ,v_{i_{m}}\}\) denotes the codes of \(X_{i}\) with respect to the dictionary. As shown in Fig. 1, the purpose of our method is to learn a discriminative dictionary that makes the distance between \(v_{i}\) and the codes of its neighbors in the same class as small as possible and the distance between \(v_{i}\) and the codes of its neighbors in different classes as large as possible:

$$\begin{aligned} \min \sum _{i=1}^{n}(\sum _{j=1}^{l}\Vert v_{i}-v_{i^{j}}\Vert ^{2}-\beta \sum _{p=1}^{m}\Vert v_{i}-v_{i_{p}}\Vert ^{2}) \end{aligned}$$
(3)

Here \(\beta \) is the metric factor that weights the different-class term. We define W as the similarity matrix of the features, whose entry \(W_{ij}\) measures the similarity between \(x_{i}\) and \(x_{j}\). If \(x_{i}\) is among the l nearest same-class neighbors of \(x_{j}\), or \(x_{j}\) is among the l nearest same-class neighbors of \(x_{i}\), then \(W_{ij} = 1\). If \(x_{i}\) is among the m nearest different-class neighbors of \(x_{j}\), or \(x_{j}\) is among the m nearest different-class neighbors of \(x_{i}\), then \(W_{ij} = -\beta \). Otherwise, \(W_{ij} = 0\). With the similarity matrix, Eq. (3) can be written as

$$\begin{aligned} \min \sum _{i=1}^{n}(\sum _{j=1}^{l}\Vert v_{i}-v_{i^{j}}\Vert ^{2}-\beta \sum _{p=1}^{m}\Vert v_{i}-v_{i_{p}}\Vert ^{2})=\min \sum _{i=1}^{n}\sum _{j=1}^{n}\Vert v_{i}-v_{j}\Vert ^{2}W_{ij} \end{aligned}$$
(4)
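The construction of W described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the paper does not state which distance defines the nearest neighbors, so Euclidean distance is assumed here, and the function name and toy data are hypothetical.

```python
import numpy as np

def build_similarity_matrix(X, labels, l, m, beta):
    """X: d x n features (one per column); labels: length-n array.
    Returns the symmetric n x n matrix W with entries 1, -beta, or 0."""
    n = X.shape[1]
    # Pairwise squared Euclidean distances (assumed metric).
    sq = (X ** 2).sum(axis=0)
    dist = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    W = np.zeros((n, n))
    for i in range(n):
        same = np.flatnonzero((labels == labels[i]) & (np.arange(n) != i))
        diff = np.flatnonzero(labels != labels[i])
        same_nn = same[np.argsort(dist[i, same])[:l]]   # l nearest, same class
        diff_nn = diff[np.argsort(dist[i, diff])[:m]]   # m nearest, other classes
        # The "or" rule in the text makes W symmetric.
        W[i, same_nn] = W[same_nn, i] = 1.0
        W[i, diff_nn] = W[diff_nn, i] = -beta
    return W

# Toy data (assumed): four 1-D features in two classes.
X_toy = np.array([[0.0, 1.0, 10.0, 11.0]])
labels_toy = np.array([0, 0, 1, 1])
W_toy = build_similarity_matrix(X_toy, labels_toy, l=1, m=1, beta=0.2)
```

Because same-class and different-class neighbor sets are disjoint, each entry of W receives at most one of the two values.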

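We define the degree of \(x_{i}\) as \(d_{i}=\sum _{j=1}^{n}W_{ij}\), and \(D=diag(d_{1},\ldots ,d_{n})\). Note that in this definition \(d_{i}\) and \(D\) denote the degrees and the diagonal degree matrix, not the dictionary atoms and the dictionary. Eq. (4) can then be rewritten as [28]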

$$\begin{aligned} \frac{1}{2}\min \sum _{i=1}^{n}\sum _{j=1}^{n}\Vert v_{i}-v_{j}\Vert ^{2}W_{ij}=\min Tr(VLV^{T}) \end{aligned}$$
(5)

where \(L=D-W\) is the Laplacian matrix. By adding this Laplacian term to the sparse representation objective, we obtain the objective function of our method:

$$\begin{aligned} \min _{D,V}\Vert X-DV\Vert ^{2}_{F}+\lambda \sum _{i=1}^{n}\Vert v_{i}\Vert _{1}+ \alpha Tr(VLV^{T}) \qquad s.t. \Vert d_{i}\Vert ^{2} \le c, \quad i=1,\ldots ,k \end{aligned}$$
(6)

Due to the Laplacian term, both the similarity among the features and the class information are considered during the process of dictionary learning and the similarity of codes among the similar features can be maximally preserved.
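The identity in Eq. (5) and the objective in Eq. (6) can be checked numerically with a short NumPy sketch. The helper names, the small similarity matrix, and the random codes below are assumptions made only for illustration.

```python
import numpy as np

def laplacian(W):
    """L = Deg - W, where Deg is the diagonal matrix of row sums of W."""
    return np.diag(W.sum(axis=1)) - W

def pairwise_term(V, W):
    """Left-hand side of Eq. (5): 0.5 * sum_ij W_ij * ||v_i - v_j||^2."""
    n = V.shape[1]
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += W[i, j] * np.sum((V[:, i] - V[:, j]) ** 2)
    return 0.5 * total

def objective(X, D, V, L, lam, alpha):
    """Value of Eq. (6) for a given dictionary D and codes V."""
    return (np.linalg.norm(X - D @ V, "fro") ** 2
            + lam * np.abs(V).sum()
            + alpha * np.trace(V @ L @ V.T))

rng = np.random.default_rng(0)
W = np.array([[0.0, 1.0, -0.2],
              [1.0, 0.0, 0.0],
              [-0.2, 0.0, 0.0]])       # symmetric, entries in {1, -beta, 0}
V = rng.standard_normal((4, 3))        # k = 4 coefficients, n = 3 features
L = laplacian(W)

lhs = pairwise_term(V, W)
rhs = np.trace(V @ L @ V.T)            # the two sides agree up to rounding
```

One point worth keeping in mind when experimenting: because W contains negative entries, L is not guaranteed to be positive semidefinite, so the Laplacian term can take negative values for some codes.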

Eq. (6) is not convex in D and V simultaneously, but it is convex in D when V is fixed and convex in V when D is fixed. Motivated by the work in [29], we adopt the following two-stage strategy to solve Eq. (6): learning the codes V while fixing the dictionary D, and learning the dictionary D while fixing the codes V.

3 Optimization

3.1 Learning Codes V

When fixing the dictionary D, Eq. (6) becomes the following optimization problem:

$$\begin{aligned} \min _{V}\Vert X-DV\Vert ^{2}_{F}+\lambda \sum _{i=1}^{n}\Vert v_{i}\Vert _{1}+ \alpha Tr(VLV^{T}) \end{aligned}$$
(7)

Equation (7) is an L1-regularized least squares problem, which can be solved by several approaches [30, 31]. Instead of optimizing the whole code matrix V at once, we follow [26, 32] and optimize each \(v_{i}\) one by one until V converges. The vector form of Eq. (7) can be written as

$$\begin{aligned} \min \sum _{i=1}^{n}\Vert x_{i}-Dv_{i}\Vert ^{2}+\lambda \sum _{i=1}^{n}\Vert v_{i}\Vert _{1}+ \alpha \sum _{i,j=1}^{n}L_{ij}v_{i}^{T}v_{j} \end{aligned}$$
(8)

When updating \(v_{i}\), the other codes \({v_{j}}(j \ne i)\) are fixed. We rewrite the optimization with respect to \(v_{i}\) as follows:

$$\begin{aligned} \min _{v_{i}}f(v_{i})=\Vert x_{i}-Dv_{i}\Vert ^{2}+\lambda \sum _{j=1}^{k}|v_{i}^{(j)}|+ \alpha L_{ii}v_{i}^{T}v_{i}+v_{i}^{T}h_{i} \end{aligned}$$
(9)

where \(h_{i}=2\alpha (\sum _{j\ne i}L_{ij}v_{j})\) and \(v_{i}^{(j)}\) is the j-th coefficient of \(v_{i}\). We use the feature-sign search algorithm of [29] to solve this problem. Define \(h(v_{i})=\Vert x_{i}-Dv_{i}\Vert ^{2}+\alpha L_{ii}v_{i}^{T}v_{i}+v_{i}^{T}h_{i}\), so that \(f(v_{i})=h(v_{i})+\lambda \sum _{j=1}^{k}|v_{i}^{(j)}|\). If we know the signs (positive, zero, or negative) of the \(v_{i}^{(j)}\) at the optimum, we can replace each term \(|v_{i}^{(j)}|\) with \(v_{i}^{(j)}\) (if \(v_{i}^{(j)}>0\)), \(-v_{i}^{(j)}\) (if \(v_{i}^{(j)}<0\)), or 0 (if \(v_{i}^{(j)}=0\)). Restricted to the nonzero coefficients, Eq. (9) then reduces to a standard unconstrained quadratic optimization problem, which can be solved analytically and efficiently. When updating each \(v_{i}\), the algorithm maintains an active set of potentially nonzero coefficients and their corresponding signs (all other coefficients must be zero). The goal is to find the active set and coefficient signs that minimize the objective function. The algorithm proceeds in a series of feature-sign steps: at each step, given the current active set and signs, it computes the analytical solution of Eq. (9) and then updates the solution, the active set, and the signs using an efficient discrete line search between the current solution and the analytical solution. The detailed steps are given in Algorithm 1.

Algorithm 1. Feature-sign search for learning the codes
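The analytical step inside each feature-sign iteration can be sketched as below. This covers only the closed-form solve for a fixed active set and fixed signs, not the full search with its line search and active-set updates. Setting the gradient of the sign-fixed objective to zero gives \((D_{a}^{T}D_{a}+\alpha L_{ii}I)v_{a}=D_{a}^{T}x_{i}-(h_{a}+\lambda \theta )/2\), where the subscript a (active entries) and the sign vector \(\theta \) are notation introduced here for illustration. The sketch assumes the system matrix is invertible.

```python
import numpy as np

def solve_active_set(x, D, h, L_ii, alpha, lam, active, signs):
    """Closed-form minimizer of the sign-fixed version of Eq. (9).
    active: indices of potentially nonzero coefficients;
    signs: their assumed signs (+1 or -1). Other coefficients stay zero."""
    D_a = D[:, active]
    A = D_a.T @ D_a + alpha * L_ii * np.eye(len(active))
    b = D_a.T @ x - 0.5 * (h[active] + lam * signs)
    v = np.zeros(D.shape[1])
    v[active] = np.linalg.solve(A, b)
    return v

def code_objective(v, x, D, h, L_ii, alpha, lam):
    """f(v_i) as defined in the text following Eq. (9)."""
    return (np.sum((x - D @ v) ** 2) + lam * np.abs(v).sum()
            + alpha * L_ii * (v @ v) + v @ h)

# Toy problem (assumed values) with an identity dictionary, so each
# coefficient can be checked by hand.
D = np.eye(3)
x = np.array([2.0, -3.0, 0.1])
h = np.zeros(3)
v = solve_active_set(x, D, h, L_ii=1.0, alpha=0.5, lam=1.0,
                     active=np.array([0, 1]), signs=np.array([1.0, -1.0]))
```

In the full algorithm, the signs of the returned coefficients would be compared with the assumed signs, and any coefficient that changes sign along the way would be handled by the discrete line search described above.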

3.2 Learning Dictionary D

In this section, we present a method for learning the dictionary D while fixing the coefficient matrix V. In this case, Eq. (6) reduces to the following problem:

$$\begin{aligned} \min _{D}\Vert X-DV\Vert ^{2}_{F} \qquad s.t.\Vert d_{i}\Vert ^{2}\le c,i=1,...,k \end{aligned}$$
(12)

Equation (12) is a least squares problem with quadratic constraints. It can be efficiently solved by a Lagrange dual method [29].

Let \(\lambda =[\lambda _{1},...,\lambda _{k}]\), where \(\lambda _{i}\ge 0\) is the Lagrange multiplier associated with the i-th inequality constraint \(\Vert d_{i}{\Vert }^{2}-c\le 0\) (these multipliers are distinct from the sparsity parameter \(\lambda \) in Eq. (6)). With \(d_{ij}\) denoting the (i, j) entry of D, the Lagrangian is

$$\begin{aligned} L(D,\lambda )=Tr((X-DV)^{T}(X-DV))+\sum ^{k}_{j=1}\lambda _{j}(\sum ^{d}_{i=1}d^{2}_{ij}-c) \end{aligned}$$
(13)

Define \(\varLambda =diag(\lambda )\). Minimizing Eq. (13) over D gives the Lagrange dual function

$$\begin{aligned} \min _{D}L(D,\lambda )=Tr(X^{T}X-XV^{T}(VV^{T}+\varLambda )^{-1}(XV^{T})^{T}-c\varLambda ) \end{aligned}$$
(14)

The minimizing dictionary is obtained by setting the first-order derivative of Eq. (13) with respect to D to zero:

$$\begin{aligned} D^{*}=XV^{T}(VV^{T}+\varLambda )^{-1} \end{aligned}$$
(15)

Since the dual function in Eq. (14) is to be maximized over \(\varLambda \) and \(Tr(X^{T}X)\) is a constant, the dual problem is equivalent to:

$$\begin{aligned} \min _{\varLambda }Tr(XV^{T}(VV^{T}+\varLambda )^{-1}VX^{T})+cTr(\varLambda ) \end{aligned}$$
(16)

We optimize the Lagrange dual problem in Eq. (16) using the conjugate gradient method. After obtaining the optimal solution \(\varLambda ^{*}\), the optimal dictionary is given by \(D^{*}=XV^{T}(VV^{T}+\varLambda ^{*})^{-1}\).
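A minimal NumPy sketch of this dictionary update is given below. It is not the solver used in the paper: projected gradient descent with backtracking is substituted for the conjugate gradient method to keep the example short. It relies on the fact that the derivative of Eq. (16) with respect to \(\lambda _{i}\) is \(c-\Vert d_{i}\Vert ^{2}\), where \(d_{i}\) is the i-th column of \(XV^{T}(VV^{T}+\varLambda )^{-1}\). The function names, starting point, and tolerances are assumptions, and \(VV^{T}+\varLambda \) is assumed to stay invertible.

```python
import numpy as np

def dual_value_and_dictionary(X, V, lam_dual, c):
    """Value of Eq. (16) and the dictionary of Eq. (15) for given multipliers."""
    M = V @ V.T + np.diag(lam_dual)
    D = np.linalg.solve(M, V @ X.T).T          # D = X V^T M^{-1} (M symmetric)
    value = np.trace(D @ V @ X.T) + c * lam_dual.sum()
    return value, D

def learn_dictionary(X, V, c, n_iter=200, tol=1e-10):
    """Minimize Eq. (16) over lam_dual >= 0 by projected gradient descent."""
    lam_dual = np.ones(V.shape[0])
    value, D = dual_value_and_dictionary(X, V, lam_dual, c)
    for _ in range(n_iter):
        grad = c - (D ** 2).sum(axis=0)        # derivative of Eq. (16)
        step = 1.0
        while step > 1e-12:                    # backtracking (Armijo) search
            candidate = np.maximum(lam_dual - step * grad, 0.0)
            new_value, new_D = dual_value_and_dictionary(X, V, candidate, c)
            if new_value <= value - 1e-4 * grad @ (lam_dual - candidate):
                break
            step *= 0.5
        else:
            break                              # no acceptable step found
        converged = np.max(np.abs(candidate - lam_dual)) < tol
        lam_dual, value, D = candidate, new_value, new_D
        if converged:
            break
    return D, lam_dual

# Toy case (assumed): the unconstrained least-squares dictionary is 3*I,
# which violates ||d_i||^2 <= 1, so the constraint becomes active.
D_new, multipliers = learn_dictionary(3.0 * np.eye(2), np.eye(2), c=1.0)
```

In this toy case each atom is shrunk from norm 3 to the boundary norm 1, up to solver tolerance, which shows how the multipliers enforce the norm constraint.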

4 Experiments

In this section, we evaluate our method on four public datasets for image classification: Scene 15, UIUC-Sport, Caltech-101, and Caltech-256. For each experiment, we describe the information of datasets and detailed settings. The effectiveness of our method is validated by comparisons with popular methods.

4.1 Parameters Setting

In the experiments, for a fair comparison with other methods, we first extract SIFT descriptors from 16 \(\times \) 16 patches that are densely sampled on a grid with a step size of 8 pixels. We then build the spatial pyramid feature from the extracted SIFT features using three grids of size 1 \(\times \) 1, 2 \(\times \) 2 and 4 \(\times \) 4. In each spatial sub-region of the pyramid, the codes are pooled together by max pooling to form a pooled feature. The pooled features from all sub-regions are concatenated and L2-normalized to form the final spatial pyramid feature of the image. The dictionary in the experiments is learned from these spatial pyramid features.
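The pooling step described above can be sketched as follows. The function name, the input layout (one code per local descriptor, with positions normalized to [0, 1)), and the use of absolute values before taking the maximum are assumptions, since the text only specifies max pooling over 1 \(\times \) 1, 2 \(\times \) 2 and 4 \(\times \) 4 grids followed by L2 normalization.

```python
import numpy as np

def spatial_pyramid_max_pool(codes, xy, grids=(1, 2, 4)):
    """codes: N x k codes of local descriptors;
    xy: N x 2 descriptor positions normalized to [0, 1).
    Returns the L2-normalized concatenation of per-cell max-pooled codes."""
    pooled = []
    for g in grids:
        cell = np.minimum((xy * g).astype(int), g - 1)   # N x 2 cell indices
        cell_id = cell[:, 1] * g + cell[:, 0]
        for c in range(g * g):
            mask = cell_id == c
            if mask.any():
                # Max over absolute values (assumed pooling variant).
                pooled.append(np.abs(codes[mask]).max(axis=0))
            else:
                pooled.append(np.zeros(codes.shape[1]))
    feature = np.concatenate(pooled)
    norm = np.linalg.norm(feature)
    return feature / norm if norm > 0 else feature

# Toy input (assumed): two descriptors in opposite corners of the image.
codes = np.array([[1.0, 0.0],
                  [0.0, -2.0]])
xy = np.array([[0.1, 0.1],
               [0.9, 0.9]])
feature = spatial_pyramid_max_pool(codes, xy)
```

With k coefficients per code, the resulting feature has k \(\times \) (1 + 4 + 16) = 21k dimensions.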

In our method, the weight of the Laplacian term \(\alpha \), the sparsity weight \(\lambda \), and the weight of the different-class neighbors \(\beta \) play important roles in dictionary learning. According to our observations, the performance is good when \(\beta \) is fixed at 0.2 for Scene 15 and UIUC-Sport. For Caltech-101 and Caltech-256, setting \(\beta \) to 0.1 works better. For Scene 15, \(\alpha \) is set to 0.2 and \(\lambda \) to 0.4. For UIUC-Sport, Caltech-101, and Caltech-256, \(\alpha \) is set to 0.1 and \(\lambda \) to 0.3.

4.2 Scene 15 Dataset

The Scene 15 dataset contains 15 categories. Each category contains 200 to 400 images, and the total number of images is 4485. To compare with other work, we use the same setting to choose the training images: we randomly choose 100 images per category for training and test on the rest. This process is repeated ten times to obtain reliable results.
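The evaluation protocol used here and in the following subsections (a fixed number of random training images per category, the remainder for testing, repeated ten times) can be sketched as below. The function names are hypothetical, and `evaluate` is a placeholder callback that is assumed to train and test a classifier on the given index sets and return its accuracy.

```python
import numpy as np

def split_per_class(labels, n_train, rng):
    """Randomly pick n_train indices per class for training; rest for testing."""
    train = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        train.extend(rng.choice(idx, size=n_train, replace=False))
    train = np.sort(np.array(train))
    test = np.setdiff1d(np.arange(len(labels)), train)
    return train, test

def repeated_evaluation(labels, n_train, evaluate, n_runs=10, seed=0):
    """Mean and standard deviation of evaluate(train_idx, test_idx) over runs."""
    rng = np.random.default_rng(seed)
    scores = [evaluate(*split_per_class(labels, n_train, rng))
              for _ in range(n_runs)]
    return float(np.mean(scores)), float(np.std(scores))

# Toy labels (assumed): 3 classes with 5 images each, 3 training images per class.
labels = np.repeat([0, 1, 2], 5)
train_idx, test_idx = split_per_class(labels, 3, np.random.default_rng(0))
```

Reporting the mean together with the standard deviation over the ten runs gives a sense of how sensitive the result is to the random split.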

Table 1 compares the performance of our method with several other methods on the Scene 15 dataset. Our method achieves high performance on scene classification. It outperforms ScSPM by nearly 11% by considering the geometrical structure of the feature space in sparse representation, and it outperforms LScSPM by nearly 2% by adding the class information. Both results demonstrate the effectiveness of our method. Our discriminative neighborhood preserving dictionary learning method not only makes use of the geometrical structure of the feature space to preserve more similarity information, but also makes the final dictionary more discriminative by considering the class information, which improves the image classification performance.

Table 1. Performance comparison on the Scene-15 dataset

4.3 UIUC-Sport Dataset

The UIUC-Sport dataset contains 8 categories for image-based event classification and 1792 images in total. The 8 categories are badminton, bocce, croquet, polo, rock climbing, rowing, sailing, and snowboarding. The number of images per category ranges from 137 to 250. Following the standard setting for this dataset, we randomly choose 70 images from each class for training and test on the remaining images. We repeat this process ten times for a fair comparison.

Table 2 gives the performance comparison of our method and several other methods on the UIUC-Sport dataset. We can see that our method outperforms ScSPM by nearly 5 % and outperforms LScSPM by nearly 2 %. This demonstrates the effectiveness of our proposed method.

Table 2. Performance comparison on the UIUC-Sport dataset

4.4 Caltech-101 Dataset

The Caltech-101 dataset contains 9144 images in 101 classes with high intra-class variability in appearance and shape. The number of images per category varies from 31 to 800. We follow the common experimental setup and randomly choose 30 images per category for training and use the rest for testing. This process is repeated ten times.

The average classification rates of our method and several other methods on the Caltech-101 dataset are reported in Table 3. These results show that our method performs better than most existing methods. Compared with LLC, our method achieves a 2.4% improvement, which demonstrates the effectiveness of our proposed method.

Table 3. Performance comparison on the Caltech-101 dataset

4.5 Caltech-256 Dataset

The Caltech-256 dataset contains 256 categories and a background class in which none of the images belongs to those 256 categories. It contains 29780 images, with much higher intra-class variability and higher object location variability than Caltech-101. Caltech-256 is therefore a very challenging dataset for object recognition and classification. Each category contains no fewer than 80 images. We randomly choose 30 images per category for training and repeat this process ten times.

The average classification rates of our method and several other methods on the Caltech-256 dataset are reported in Table 4. Our method achieves state-of-the-art performance on this dataset.

Table 4. Performance comparison on the Caltech-256 dataset

5 Conclusion

In this paper, we propose a discriminative neighborhood preserving dictionary learning method for image classification. We consider the geometrical structure of the feature space in the process of dictionary learning to preserve the similarity information of the features. By introducing the class information, the discriminative power of the learned dictionary is enhanced. The learned dictionary can ensure that the similar features in the same class are encoded as similar codes and the features in different classes are encoded as dissimilar codes. Experimental results on four public datasets demonstrate the effectiveness of our method.