
1 Introduction

Sparse representation has been successfully applied to a variety of problems in image processing and computer vision, e.g., image denoising, image restoration and image classification. In the framework of sparse representation, an image is represented as a linear combination of a few bases selected sparsely from an over-complete dictionary. The dictionary can be predefined using some off-the-shelf basis, such as the Discrete Fourier Transform (DFT) matrix or a wavelet matrix. However, it has been shown that learning the dictionary from the training data yields a sparser representation of the image than a predefined one, which can lead to improved performance in the reconstruction task. Typical reconstruction-oriented dictionary learning methods include the Method of Optimal Directions (MOD) [1] and K-SVD [2].

Sparse representation has also been considered in pattern recognition applications. For example, it underpins the sparse representation classifier (SRC) [3], which achieves competitive performance in face recognition. In contrast to image reconstruction, which is concerned only with the sparse representation of an image, in pattern recognition the main goal is to find the correct label for the query sample, so the discriminative capability of the learned dictionary is crucial. A variety of discriminative dictionary learning methods have recently been proposed, following two different strategies.

One strategy is to learn a class-specific dictionary, which discriminates between classes via the sparse representation residual. Instead of learning a dictionary shared by all classes, it seeks to learn a sub-dictionary for each class. Yang et al. [4] first sought to learn a dictionary for each class and applied it to image classification. In [5], instead of considering the dictionary atoms individually at the sparse coding stage, atoms are selected in groups according to some priors to guarantee a block-sparse structure in each coding coefficient. In [6], a group-structured dirty model is used to achieve a hierarchical structure in each coding coefficient by estimating a superposition of two coding coefficients and regularising them differently; it is worth noting that [6] adopts a multi-task setting. However, the sub-dictionaries in all these methods are disjoint from each other, and how many and which atoms belong to each class is fixed during the entire dictionary learning process. In addition, although the class-specific setting works well when the number of training samples per class is sufficient, it does not scale to problems with a large number of classes.

Another strategy is to learn a dictionary that is shared by all classes. Commonly, a classifier based on the coding vectors is learned together with the shared dictionary by imposing class-specific constraints on the coding vectors. Rodriguez et al. [7] proposed that samples of the same class should have similar sparse coding vectors, which is achieved by using linear discriminant analysis. Yang et al. [8] proposed Fisher discrimination dictionary learning (FDDL), where the Fisher discrimination criterion is imposed on the coding vectors to enhance class discrimination. Cai et al. proposed the support vector guided dictionary learning method (SVGDL) [9], a generalisation of FDDL that considers the squared distances between all pairs of coding vectors. In all these methods, the similarity between two coding vectors is measured by the Euclidean distance, which still allows two images of different classes to be represented by the same set of dictionary atoms. To our knowledge, no multi-task setting has been used with a shared dictionary, since it is difficult to discriminate groups of coefficients between different classes without prior knowledge of the sub-dictionary structure.

In recent years, it has been shown that adding structural constraints to the supports of coding vectors can result in improved representation robustness and better signal interpretation [10–12]. In this paper, the multi-task setting adopts a shared dictionary; however, instead of learning the dictionary with discrimination based on the Euclidean distance between the coefficients of different classes, we consider a different principle: the support of the coding vectors from one class should be similar, while the support of the coding vectors from different classes should be dissimilar. Here the support of a coding vector denotes the indices of the non-zero elements of the image's sparse representation under some dictionary.

More specifically, we propose a support discrimination dictionary learning method (SDDL) that finds a dictionary under which the coefficients of images from the same class have a common sparse structure, while the size of the overlapped signal support of different classes is minimised. Informed by the multi-task learning framework [13] and the multiple measurement vector (MMV) model [14] in the signal processing field, an effective way to encourage a group of signals to share the same support is to encode those samples simultaneously. Based on this idea, we encode multiple images from the same class jointly, requiring that their coefficient matrix is 'row sparse', i.e., only a few rows have non-zero elements. In addition to the similarity of intra-class coding vectors, the main contribution of our work is a new discriminative term that guarantees the dissimilarity of inter-class coding vectors by reducing the overlapped signal support between different classes. This is achieved by minimisation of the \(\ell _{0}\) norm of the Hadamard product between any pair of coefficients in different classes. An iterative reweighting scheme that produces more focal estimates is adopted as the optimisation progresses.

The SDDL provides the following advantages. Firstly, previous multi-task dictionary learning methods all use disjoint sub-dictionaries, in which how many and which atoms belong to each class is fixed during the entire dictionary learning process. In contrast, SDDL adopts a multi-task setting with a shared dictionary. Our approach can automatically identify overlapped sub-dictionaries for different classes, where the size of each sub-dictionary is adjusted appropriately during the learning process to suit the training dataset. Furthermore, our approach scales to a large number of classes, while the previous sub-dictionary based approaches do not. Secondly, instead of using the Euclidean distance to measure the similarity and dissimilarity between coefficients, we achieve discrimination via the support. The structural sparsity constraints ease the difficulty of solving the ill-posed inverse problem in comparison to the conventional element-sparse structure [15]. The superior performance of the proposed approach in comparison to the state-of-the-art is demonstrated on both face and object datasets.

The paper is organised as follows. In Sect. 2, we propose the novel support discrimination dictionary learning method for classification, including the optimisation algorithm and the classification scheme. In Sect. 3, extensive experiments are performed on both face and object datasets to compare the proposed method with other state-of-the-art dictionary learning methods. Conclusions are drawn in Sect. 4.

2 Support Discrimination Dictionary Learning

2.1 Problem Formulation

Assume that \(\varvec{x}\in \mathbb {R}^{m}\) is an m-dimensional image with class label \(c\in \{1,2,...,C\}\), where C denotes the number of classes. The training set with n images is denoted as \(\varvec{X}=[\varvec{x_{1}}, \varvec{x_{2}},...,\varvec{x_{n}}] =[\varvec{X_{1}}, \varvec{X_{2}},...,\varvec{X_{C}}] \in \mathbb {R}^{m\times n}\), where \(\varvec{X_{c}}\) contains the \(n_{c}\) training images of class c. The learned dictionary is denoted by \(\varvec{D=[d_{1},d_{2},...,d_{K} ]}\in \mathbb {R}^{m \times K}\;(K<n) \), where \(\varvec{d_{k}}\) denotes the \(k^{th}\) atom of the dictionary. \(\varvec{A=[A_{1}, A_{2},...,A_{C}]= [a_{1},a_{2},...,a_{n}]} \in \mathbb {R}^{K \times n}\) are the coding coefficients of \(\varvec{X}\) over \(\varvec{D}\). Our dictionary learning problem can be described as

$$\begin{aligned} \min _{\varvec{D},\varvec{A}}R(\varvec{X,D,A})+w_{1} g(\varvec{A}) +w_{2} f(\varvec{A}), \end{aligned}$$
(1)

where \(R\varvec{(X,D,A)}\) denotes the reconstruction residuals for all the images \(\varvec{X}\) with the sparse representation matrix \(\varvec{A}\) under the dictionary \(\varvec{D}\), \(g(\varvec{A})\) is a regulariser to promote intra-class similarity, \(f(\varvec{A})\) is the inter-class discriminative term based on the coding vectors \(\varvec{A}\), and \(w_{1}>0\) and \(w_{2}>0\) denote the weights for the final two terms in (1). In this optimisation problem, we learn a single dictionary shared among all classes while exploring the discrimination of the coding vectors.

In a common multi-task learning setting, a group of tasks share certain aspects of some underlying distribution. Here we assume the intra-class coding vectors share a similar sparse structure. In our formulation, we apply a joint sparsity regularisation, the \(\ell _{p}/\ell _{q}\) norm, to the coefficient matrix of each class, rather than encoding each training image separately. More specifically, we set \(p=2, q=0\), which means that the intra-class coefficient matrix should be 'row sparse', i.e., only a few rows contain non-zero elements, and each such row is typically dense across the samples of the class. In this way, the shared non-zero support of each class is found automatically, rather than its size and position being predefined. However, minimising the \(\ell _{2}/\ell _{0}\) norm is NP-hard, so in this paper we use the \(\ell _{2}/\ell _{1}\) norm instead. We thus design the regulariser promoting intra-class similarity as

$$\begin{aligned} g(\varvec{A})= \sum _{i=1} ^{C} \left\| \varvec{A_{i}} \right\| _{2,1}= \sum _{i=1} ^{C}\sum _{k=1} ^{K}\left\| \varvec{a^{(ik)}} \right\| _{2}, \end{aligned}$$
(2)

where \(\varvec{A_{i}}\) represents the coefficient matrix for the \(i^{th}\) class and \(\varvec{a^{(ik)}}\) denotes the \(k^{th}\) row of coefficient matrix \(\varvec{A_{i}}\).
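As a concrete illustration, the regulariser of Eq. (2) can be evaluated directly from the class coefficient matrices. The following minimal numpy sketch assumes the coefficients are held in a Python list `A_blocks` with one \(K \times n_{i}\) matrix per class (a convention of ours, not of the paper):

```python
import numpy as np

def l21_norm(A_i):
    """||A_i||_{2,1}: sum of the l2 norms of the rows of the class coefficient matrix (Eq. 2)."""
    return np.sum(np.linalg.norm(A_i, axis=1))

def g(A_blocks):
    """Intra-class regulariser g(A): sum of ||A_i||_{2,1} over all classes.
    A_blocks is a list of K x n_i coefficient matrices, one per class."""
    return sum(l21_norm(A_i) for A_i in A_blocks)
```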

In general, discrimination for different classes can be assessed by the similarity of the intra-class coding vectors and the dissimilarity of inter-class ones. As mentioned previously, the similarity of intra-class coding vectors is promoted by the \(\ell _{2}/\ell _{1}\) regulariser. To encourage dissimilarity of the inter-class coding vectors, we design the following discriminative term:

$$\begin{aligned} f(\varvec{A})= \sum _{i=1} ^{C} \sum _{p} \sum _{q} \left\| \varvec{a_{i,p}}\circ \varvec{a_{/i,q}}\right\| _{0}, \end{aligned}$$
(3)

where \(\circ \) denotes the Hadamard (element-wise) product between two vectors \(\varvec{a_{i,p}}\) and \(\varvec{a_{/i,q}}\), where \(\varvec{a_{i,p}}\) and \(\varvec{a_{/i,q}}\) are the \(p^{th}\) column of \(\varvec{A_{i}}\) and the \(q^{th}\) column of \(\varvec{A_{/i}}\) respectively. \(\varvec{A_{i}} \in \mathbb {R}^{K\times n_{i}}\) represents the coefficient matrix for the \(i^{th}\) class, while \(\varvec{A_{/i}} \in \mathbb {R}^{K \times (n-n_{i})}\) denotes the sub-matrix of \(\varvec{A} \in \mathbb {R}^{K \times n}\) without the columns in \(\varvec{A_{i}}\). Equivalently, the value of \(\left\| \varvec{a_{i,p}}\circ \varvec{a_{/i,q}}\right\| _{0}\) is the size of the overlapped support between the \(p^{th}\) image of the \(i^{th}\) class and the \(q^{th}\) image that is not in class i. Therefore, \(f(\varvec{A})\) is the summation of overlapped supports between images of different classes. However, minimising \(f(\varvec{A})\) in Eq. (3) is an NP-hard problem. Inspired by many recent sparse approximation algorithms that rely on iterative reweighting schemes [16–18] to produce more focal estimates as the optimisation progresses, we use iterative reweighted \(\ell _{2}\) minimisation to approximate the \(\ell _{0}\) norm.
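For illustration only, the unrelaxed term of Eq. (3) can be evaluated by counting the overlapping supports directly; the sketch below (our own, with a `tol` threshold deciding which entries count as non-zero) is not used in the optimisation, which relies instead on the reweighted \(\ell _{2}\) surrogate derived next:

```python
import numpy as np

def f_support_overlap(A_blocks, tol=1e-8):
    """Discriminative term f(A) of Eq. (3): total size of the overlapped support
    between every coefficient vector of class i and every coefficient vector
    outside class i.  A_blocks is a list of K x n_i matrices, one per class."""
    total = 0
    for i, A_i in enumerate(A_blocks):
        # columns of all other classes, i.e. A_{/i}
        A_rest = np.hstack([A_j for j, A_j in enumerate(A_blocks) if j != i])
        S_i = np.abs(A_i) > tol          # K x n_i support indicator
        S_rest = np.abs(A_rest) > tol    # K x (n - n_i) support indicator
        # (p, q) entry of the product counts ||a_{i,p} o a_{/i,q}||_0
        total += int(np.sum(S_i.T.astype(int) @ S_rest.astype(int)))
    return total
```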

We use the vector \(\varvec{a}^{\odot 2}\) to represent the element-wise square of the vector \(\varvec{a}\), which equals \(\varvec{a} \circ \varvec{a}\). We define the weight term \(\varvec{w_{i,p,q}}\) for a given pair of coefficients \((\varvec{a_{i,p}}, \varvec{a_{/i,q}})\) at each iteration as a function of those coefficients from the previous iteration:

$$\begin{aligned} \varvec{w_{i,p,q}}=\frac{1}{(\varvec{a_{i,p}^{'}} \circ \varvec{a_{/i,q}^{'}})^{\odot 2} + \epsilon } \end{aligned}$$
(4)

where \(\varvec{a_{i,p}^{'}}\) and \(\varvec{a_{/i,q}^{'}}\) are the coefficients from the previous iteration and \(\epsilon \) is a regularisation factor that is reduced to zero as the number of iterations increases. In this case, the discrimination term \(f(\varvec{A})\) can be rewritten as

$$\begin{aligned} \begin{aligned} f(\varvec{A})=&\sum _{i=1} ^{C} \sum _{p} \sum _{q} \left\| \varvec{a_{i,p}}\circ \varvec{a_{/i,q}}\right\| _{0} = \sum _{i=1} ^{C} \sum _{p} \sum _{q} \sum _{k} w_{i,p,q}^{(k)} \, \big (a_{i,p}^{(k)} \, a_{/i,q}^{(k)} \big ) ^{2} \\ =&\sum _{i=1} ^{C} \sum _{p} \sum _{q} \sum _{k} \big [w_{i,p,q}^{(k)} \, \big (a_{/i,q}^{(k)}\big )^{2}\big ]\, \big (a_{i,p}^{(k)}\big )^{2}\\ =&\sum _{i=1} ^{C} \sum _{p} \sum _{q} \big [\varvec{w_{i,p,q}}\circ (\varvec{a_{/i,q}})^{\odot 2} \big ]^{T} (\varvec{a_{i,p}})^{\odot 2} =\sum _{i=1} ^{C} \sum _{p} \left\| \varvec{\varOmega _{i,p}} \varvec{a_{i,p}}\right\| _{F}^{2}, \end{aligned} \end{aligned}$$
(5)

where the superscript (k) denotes the \(k^{th}\) element of the corresponding vector and

$$\begin{aligned} \varvec{\varOmega _{i,p}}=diag\left( \sqrt{\sum _{q}\left( \sqrt{\varvec{w_{i,p,q}}}\circ \varvec{a_{/i,q}} \right) ^{\odot 2}}\right) . \end{aligned}$$
(6)

However, minimising the above \(f(\varvec{A})\) is both time and memory consuming, since a weight vector \(\varvec{w_{i,p,q}}\), and thus a distinct weight matrix \(\varvec{\varOmega _{i,p}}\), must be computed for each \(\varvec{a_{i,p}}\). Owing to the effect of the \(\ell _{2}/\ell _{1}\) regulariser, different coefficients in the same class should have a similar sparse pattern, hence we use the class average \((\varvec{\tilde{a_{i}^{'}}})^{\odot 2}\) instead of \((\varvec{a_{i,p}^{'}})^{\odot 2}\) in Eq. (4), where

$$\begin{aligned} \forall p, \;\; (\varvec{a_{i,p}^{'}})^{\odot 2}\approx (\varvec{\tilde{a_{i}^{'}}})^{\odot 2}= \sum _{p} (\varvec{a_{i,p}^{'}})^{\odot 2}/n_{i}. \end{aligned}$$
(7)

That is, all \(n_{i}\) images of class i share the same weight vector \(\varvec{w_{\tilde{i},q}}\), given by

$$\begin{aligned} \varvec{w_{\tilde{i},q}}=\frac{1}{(\varvec{\tilde{a_{i}^{'}}})^{\odot 2} \circ (\varvec{a_{/i,q}^{'}})^{\odot 2} + \epsilon }. \end{aligned}$$
(8)

Finally Eq. (5) can be rewritten as:

$$\begin{aligned} f(\varvec{A})= \sum _{i=1} ^{C} \sum _{p} \left\| \varvec{\varOmega _{i,p}} \varvec{a_{i,p}}\right\| _{F}^{2} =\sum _{i=1} ^{C} \left\| \varvec{\tilde{\varOmega _{i}}} \varvec{A_{i}}\right\| _{F}^{2}, \end{aligned}$$
(9)

where

$$\begin{aligned} \varvec{\tilde{\varOmega _{i}}}=diag\left( \sqrt{\sum _{q}\left( \sqrt{\varvec{w_{\tilde{i},q}}}\circ \varvec{a_{/i,q}}\right) ^{\odot 2}}\right) . \end{aligned}$$
(10)
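A minimal sketch of how \(\varvec{\tilde{\varOmega _{i}}}\) could be assembled from Eqs. (7), (8) and (10) is given below. For brevity it evaluates both the weights and the outer sum from the same (previous-iteration) coefficient matrices, which is a simplification of the scheme described above:

```python
import numpy as np

def omega_tilde(A_i_prev, A_rest_prev, eps):
    """Build the diagonal weighting matrix Omega~_i of Eq. (10) for class i.

    A_i_prev    : K x n_i coefficients of class i from the previous iteration
    A_rest_prev : K x (n - n_i) coefficients of all other classes (A_{/i})
    eps         : regularisation factor, decreased towards zero over the iterations
    """
    # Eq. (7): shared squared profile of class i (average over its columns)
    a_tilde_sq = np.mean(A_i_prev ** 2, axis=1)                   # (K,)

    # Eq. (8): one weight vector per column q of A_{/i}
    W = 1.0 / (a_tilde_sq[:, None] * (A_rest_prev ** 2) + eps)    # K x (n - n_i)

    # Eq. (10): Omega~_i = diag( sqrt( sum_q  w_q o a_{/i,q}^{(.)2} ) )
    omega_diag = np.sqrt(np.sum(W * (A_rest_prev ** 2), axis=1))  # (K,)
    return np.diag(omega_diag)
```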

By substituting the discrimination term given by Eq. (9) into (1), we can rewrite the dictionary learning problem as

$$\begin{aligned} \min _{D,A} \sum _{i=1}^{C} \left\| \varvec{X_{i}}-\varvec{DA_{i}} \right\| _{F}^{2}+w _{1} \left\| \varvec{A_{i}} \right\| _{2,1}+w_{2} \left\| \varvec{\tilde{\varOmega _{i}} A_{i}} \right\| _{F}^{2}. \end{aligned}$$
(11)

Although the objective function in (11) is not jointly convex in \((\varvec{D},\varvec{A})\), it is convex with respect to each of \(\varvec{D}\) and \(\varvec{A}\) when the other is fixed. In the sequel, we provide an algorithm that alternately optimises \(\varvec{D}\) and \(\varvec{A}\).

2.2 Optimisation

Finding the solution of the optimisation problem in (11) involves two sub-problems, i.e., to update the coding coefficients \(\varvec{A}\) with fixed \(\varvec{D}\), and to update \(\varvec{D}\) with fixed coefficients \(\varvec{A}\).

First suppose that \(\varvec{D}\) is fixed; the optimisation problem then reduces to a sparse coding problem for \(\varvec{A}=[\varvec{A_{1}}, \varvec{A_{2}},..,\varvec{A_{C}}]\) with two regularisation terms. We compute the coefficient matrices \(\varvec{A_{i}}\) class by class. More specifically, all \(\varvec{A_{j}}\; (j\ne i)\) are fixed, and thus \(\varvec{\tilde{\varOmega _{i}}}\) is fixed, when computing \(\varvec{A_{i}}\). In this way, the objective function further reduces to

$$\begin{aligned} \min _{\varvec{A_{i}}} \left\| \varvec{X_{i}}-\varvec{DA_{i}} \right\| _{F}^{2}+w _{1} \left\| \varvec{A_{i}} \right\| _{2,1}+w_{2} \left\| \varvec{\tilde{\varOmega _{i}} A_{i}}\right\| _{F}^{2}. \end{aligned}$$
(12)

We choose the alternating direction method of multipliers (ADMM) as the optimisation approach because of its simplicity, efficiency and robustness [15, 19, 20]. By introducing one auxiliary variable \(\varvec{Z_{i}}=\varvec{A_{i}} \in \mathbb {R} ^{K \times n_{i}}\), this problem can be reformulated as

$$\begin{aligned} \begin{aligned} \min _{\varvec{A_{i}},\varvec{Z_{i}}}&\left\| \varvec{X_{i}}-\varvec{DA_{i}} \right\| _{F}^{2}+ w _{1} \left\| \varvec{Z_{i}} \right\| _{2,1}+ w_{2} \left\| \varvec{\tilde{\varOmega _{i}} A_{i}} \right\| _{F}^{2} \;\;\; s.t. \;\varvec{A_{i}}-\varvec{Z_{i}}=0. \end{aligned} \end{aligned}$$
(13)

Therefore, the augmented Lagrangian function with respect to \(\varvec{A_{i}},\varvec{Z_{i}}\) can be formed as

$$\begin{aligned} \begin{aligned}&L_{u} (\varvec{A_{i}},\varvec{Z_{i}}) = \left\| \varvec{X_{i}}-\varvec{DA_{i}} \right\| _{F}^{2}+ w _{1} \left\| \varvec{Z_{i}} \right\| _{2,1}+w_{2} \left\| \varvec{\tilde{\varOmega _{i}} A_{i}} \right\| _{F}^{2}\\&\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;- \left\langle \varvec{\varLambda _{1}}, \varvec{Z_{i}} -\varvec{A_{i}}\right\rangle +\frac{u_{1}}{2} \left\| \varvec{Z_{i}}-\varvec{A_{i}} \right\| _{F}^{2}, \end{aligned} \end{aligned}$$
(14)

where \(\varvec{\varLambda _{1}} \in \mathbb {R}^{K \times n_{i}}\) is the Lagrangian multiplier for the equality constraint and \(u_{1}>0\) is a penalty parameter. The augmented Lagrangian function can be minimised over \(\varvec{A_{i}}\) and \(\varvec{Z_{i}}\) by fixing one variable at a time and updating the other. The entire procedure is summarised in Algorithm 1. The Shrink function in Eq. (17) updates \(\varvec{Z_{i}}\) by row-wise shrinkage, which can be expressed as

$$\begin{aligned} \varvec{z^{r}} =\max \left\{ \left\| \varvec{q^{r}}\right\| _{2} -\frac{w_{1}}{u_{1}}, 0\right\} \frac{\varvec{q^{r}}}{\left\| \varvec{q^{r}}\right\| _{2}}, \quad r =1,...,K, \end{aligned}$$
(15)

where \( \varvec{q^{r}}=\varvec{a^{r}}+ \frac{\varvec{\lambda _{1}^{r}}}{u_{1}}\) and \( \varvec{z^{r}},\varvec{a^{r}},\varvec{\lambda _{1}^{r}}\) represent the \(r^{th}\) rows of the matrices \(\varvec{Z_{i}}, \varvec{A_{i}}, \varvec{\varLambda _{1}}\) respectively.

Algorithm 1
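The sketch below illustrates one plausible realisation of the ADMM iterations for the subproblem in Eq. (13), derived from the augmented Lagrangian in Eq. (14); the fixed penalty \(u_{1}\), the zero initialisation and the fixed iteration count are our own simplifications and are not taken from Algorithm 1:

```python
import numpy as np

def sparse_code_class(X_i, D, Omega_i, w1, w2, u1=1.0, n_iter=100):
    """ADMM sketch for the per-class sparse coding subproblem (Eq. 13).

    X_i     : m x n_i training images of class i
    D       : m x K dictionary
    Omega_i : K x K diagonal discriminative weight matrix (Eq. 10)
    Returns the K x n_i coefficient matrix A_i.
    """
    K, n_i = D.shape[1], X_i.shape[1]
    A = np.zeros((K, n_i))
    Z = np.zeros((K, n_i))
    Lam = np.zeros((K, n_i))

    # The A-update is a regularised least-squares solve with a fixed system matrix.
    H = 2 * D.T @ D + 2 * w2 * (Omega_i.T @ Omega_i) + u1 * np.eye(K)
    DtX2 = 2 * D.T @ X_i

    for _ in range(n_iter):
        # A-update: gradient of the augmented Lagrangian w.r.t. A set to zero
        A = np.linalg.solve(H, DtX2 + u1 * Z - Lam)

        # Z-update: row-wise shrinkage (Eq. 15) applied to Q = A + Lam / u1
        Q = A + Lam / u1
        row_norms = np.linalg.norm(Q, axis=1, keepdims=True)
        scale = np.maximum(row_norms - w1 / u1, 0.0) / (row_norms + 1e-12)
        Z = scale * Q

        # multiplier update for the constraint A_i - Z_i = 0
        Lam = Lam + u1 * (A - Z)

    return A
```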

Since the above ADMM scheme computes the exact solution for each subproblem, its convergence is guaranteed by existing ADMM theory [21, 22]. Having obtained the sparse coding, we then update the dictionary \(\varvec{D}\) column by column with \(\varvec{A}\) fixed. When updating \(\varvec{d_{i}}\), all the other columns \(\varvec{d_{j}}, j\ne i\), are fixed. The objective function in Eq. (11) then reduces to

$$\begin{aligned} \min _{\varvec{D}}{\left\| \varvec{X}-\varvec{DA} \right\| _{F}^{2}},\; s.t.{\left\| \varvec{d_{i}} \right\| _{2}=1}. \end{aligned}$$
(20)

In general, we require each column \(\varvec{d_{i}}\) of the dictionary to be a unit vector. Equation (20) is a quadratic programming problem and can be solved by the K-SVD algorithm, which updates \(\varvec{D}\) atom by atom. In practice, the exact solution by K-SVD can be computationally demanding, especially when the number of training images is large. As an alternative, in the following experiments we use the approximate K-SVD to reduce the complexity of this task [23]. The detailed derivation can be found in Algorithm 5 in [24].
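As an illustration of the atom-by-atom update, the sketch below performs a simple rank-1 refit of each atom on the samples that use it, followed by unit normalisation. It follows the spirit of the (approximate) K-SVD update but omits the accompanying coefficient-row update described in [23, 24]:

```python
import numpy as np

def update_dictionary(X, D, A):
    """Simplified atom-by-atom dictionary update for Eq. (20).

    For each atom d_k, the residual of the samples that use it, with atom k
    removed, is fitted in a least-squares sense by a new atom direction,
    which is then renormalised to unit length.
    """
    D = D.copy()
    for k in range(D.shape[1]):
        idx = np.nonzero(A[k, :])[0]          # samples that actually use atom k
        if idx.size == 0:
            continue
        # residual of the selected samples with the contribution of atom k removed
        E_k = X[:, idx] - D @ A[:, idx] + np.outer(D[:, k], A[k, idx])
        # rank-1 fit: new atom direction, then unit normalisation
        d_k = E_k @ A[k, idx]
        norm = np.linalg.norm(d_k)
        if norm > 1e-12:
            D[:, k] = d_k / norm
    return D
```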

2.3 The Classification Scheme

After obtaining the learned dictionary \(\varvec{D}\), a test sample \(\varvec{y}\) can be classified based on its sparse coefficients over \(\varvec{D}\). We choose a linear classifier both for its simplicity and for fair comparison with other dictionary learning methods, although we note that a better classifier design (e.g. SRC) could potentially improve the performance further. We design the linear classifier \(\varvec{W} \in \mathbb {R}^{C\times K}\) as in [6, 25]:

$$\begin{aligned} \varvec{W^{T}}=(\varvec{AA^{T}}+\eta \varvec{I})^{-1}\varvec{AL^{T}}, \end{aligned}$$
(21)

where \(\varvec{A}\in \mathbb {R}^{K \times n}\) is the final coding coefficient matrix of the training set. The matrix \(\varvec{L} \in \mathbb {R}^{C\times n}\) contains the label information of the training set: if the training sample \(\varvec{x_{i}}\) belongs to class c, the element \(L_{c,i}\) of the \(i^{th}\) column \(\varvec{l_{i}}\) is one and all other elements of that column are zero. The parameter \(\eta \) controls the trade-off between classification accuracy and the smoothness of the classifier.
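Equation (21) is a ridge-regression solve, which can be written in a few lines; the sketch below assumes integer class labels in \(\{0,...,C-1\}\) and builds the one-hot label matrix \(\varvec{L}\) accordingly:

```python
import numpy as np

def train_linear_classifier(A, labels, n_classes, eta=1.0):
    """Ridge-regression classifier of Eq. (21): W^T = (A A^T + eta I)^{-1} A L^T.

    A      : K x n coefficient matrix of the training set
    labels : length-n integer array of class indices in {0, ..., n_classes-1}
    """
    K, n = A.shape
    L = np.zeros((n_classes, n))
    L[labels, np.arange(n)] = 1.0                              # one-hot label matrix
    Wt = np.linalg.solve(A @ A.T + eta * np.eye(K), A @ L.T)   # K x C
    return Wt.T                                                # W is C x K
```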

Next, we compute the sparse coefficients of each test sample \(\varvec{y}\) using the following objective function:

$$\begin{aligned} \min _{\varvec{a}} {\left\| \varvec{y}-\varvec{Da}\right\| _{F}^{2}+w_{3}\left\| \varvec{a}\right\| _{1}}, \end{aligned}$$
(22)

where \(w_{3}\) is a constant. Finally, we apply the linear classifier \(\varvec{W}\) to the sparse coding of the test sample to obtain the label vector \(\varvec{l_{y}}\), and assign the sample to the class \(c= \arg \max _{c} (\varvec{l_{y}})_{c}\). The overall procedure is summarised in Algorithm 2.

Algorithm 2
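A minimal sketch of the test-time procedure (Eq. (22) followed by the linear classifier of Eq. (21)) is shown below. scikit-learn's `Lasso` is used here only as a stand-in \(\ell _{1}\) solver; its `alpha` parameter is related to \(w_{3}\) only up to a scaling of the data-fit term, so the value would need retuning in practice:

```python
import numpy as np
from sklearn.linear_model import Lasso

def classify(y, D, W, w3=0.005):
    """Classify a test sample y: l1-regularised coding over D (Eq. 22),
    then apply the linear classifier W (C x K) and take the arg max."""
    lasso = Lasso(alpha=w3, fit_intercept=False, max_iter=10000)
    a = lasso.fit(D, y).coef_          # sparse coding of the test sample (length K)
    l_y = W @ a                        # label vector (length C)
    return int(np.argmax(l_y))
```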

3 Experimental Validation

In this section, we compare our proposed support discrimination dictionary learning (SDDL) method with other existing dictionary learning (DL) based classification approaches. We verify the classification performance on datasets for both face recognition and object classification, measuring performance as the percentage of correctly classified test data. The public datasets used are the Extended Yale B face dataset [26], the AR face dataset [27] and the Caltech 101 object dataset [28]. The benchmark algorithms for comparison are Sparse Representation Classification (SRC) [3], K-SVD [2], Label-Consistent K-SVD (LC-KSVD) [25], Fisher Discrimination Dictionary Learning (FDDL) [8], Support Vector Guided Dictionary Learning (SVGDL) [9] and the Group-structured Dirty Dictionary Learning method (GDDL) [6]. For all the competing methods, we tune their parameters for the best performance.

3.1 Parameter Selection

Dictionary Size: In all experiments, the dictionary is initialised with atoms randomly selected from the training data. As shown in [8, 25], the larger the dictionary, the better the performance that can be achieved. The disadvantage of a large dictionary is that the problem size grows, which is computationally demanding. The ideal dictionary learning method should therefore achieve an acceptable level of performance using a relatively small dictionary. Here we use the Caltech 101 object dataset as an example. For each class, we randomly choose 30 images for training and the rest for testing, and vary the number of dictionary atoms per class from 10 to 30. As shown in Fig. 1, all the DL methods tested improve as the dictionary size increases. Our proposed SDDL method achieves high classification accuracy and consistently outperforms all the other DL-based methods. The basic reason for its good recognition performance, even with only a small dictionary, is that SDDL learns a shared dictionary for all classes while automatically identifying sub-dictionaries for different classes, with the size of each sub-dictionary adjusted appropriately during the learning process.

Fig. 1. Effect of dictionary size on the classification performance of various DL methods. For the Caltech 101 dataset, the number of training samples per class is fixed to 30 and the number of dictionary atoms per class is varied from 10 to 30. As can be seen, our proposed method outperforms the other DL-based methods.

Regularisation Parameters: There are three regularisation parameters, \(w_{1}, w_{2}\) and \(w_{3}\), to be tuned: two in the dictionary learning stage and one in the classifier. We employ cross-validation to find the regularisation parameters that give the best results.

Stopping Criterion: The proposed algorithm stops either when the values of the objective function in Eq. (11) in adjacent iterations are sufficiently close, or when the maximum number of iterations is reached. Figure 2 shows empirically how the value of the objective function evolves with the number of iterations on the AR dataset; the SDDL method converges rapidly.
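A simple realisation of such a stopping test is a relative-change check on the objective value; the tolerance and iteration cap below are our own illustrative choices, not values from the paper:

```python
def converged(prev_obj, curr_obj, it, tol=1e-4, max_iter=50):
    """Stop when adjacent objective values are sufficiently close
    or the maximum number of iterations is reached."""
    rel_change = abs(prev_obj - curr_obj) / max(abs(prev_obj), 1e-12)
    return rel_change <= tol or it >= max_iter
```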

Fig. 2. The convergence curve of the objective function on the AR database.

3.2 Factors Affecting Performance

We will now investigate how the performance is affected by different factors in the proposed method using the face datasets, i.e., the Extended Yale B dataset and the AR dataset. We will discuss two factors as follows:

Factor 1: Function of the \(\ell _{2}/\ell _{1}\) Regularisation Term. As mentioned in Sect. 2.1, the \(\ell _{2}/\ell _{1}\) regularisation term is adopted to make the coefficients from the same class share a similar sparse structure. In this section, we provide a visual illustration of whether the \(\ell _{2}/\ell _{1}\) regularisation term is truly helpful in representing images from the same class. We compare the sparse codings of the same test samples under two dictionaries, one learned with \(\ell _{1}\) regularisation and the other with \(\ell _{2}/\ell _{1}\) regularisation. Figure 3(a) shows 4 test samples of the \(2^{nd}\) subject in the Extended Yale B database; Figs. 3(b) and (c) show the four corresponding coefficient vectors under the two dictionaries respectively. Looking at the coefficients in Fig. 3(b), where the dictionary is learned with \(\ell _{1}\) regularisation, the coding vector of the fourth image is significantly different from the other three coding vectors of the same class, owing to the poor quality of the image, and is therefore not discriminative. In Fig. 3(c), however, the coding vector of the fourth image looks much more similar to the other coding vectors of the class, and so has a high probability of being classified correctly. A benefit of such a multi-task learning framework is that 'good quality' images help constrain the coding vectors of 'poor quality' ones during training, so that even the 'poor quality' images contribute appropriately to the dictionary update.

Fig. 3. An example of 4 test images and their corresponding coefficients. (a) shows 4 samples of the \(2^{nd}\) subject in the Extended Yale B database; (b) and (c) show the four corresponding coefficient vectors under two dictionaries, one learned with \(\ell _{1}\) regularisation and the other with \(\ell _{2}/\ell _{1}\) regularisation respectively.

Factor 2: Function of the Discriminative Term \(f(\varvec{A})\) . As described in Sect. 2.1, the term \(f(\varvec{A})\) is utilised in the objective function to guarantee the discrimination of coding vectors from different classes. In this section, we illustrate both visually and numerically the influence of the discriminative term \(f(\varvec{A})\) with an example from the AR database, as shown in Figs. 4 and 5.

To clearly show the discrimination of coding vectors between subjects in the AR database (100 subjects in total), we calculate a symmetric scatter matrix \(\varvec{S} \in \mathbb {R}^{100 \times 100}\), in which each element \(S_{ij}\) represents the similarity between the sparse codings \(\varvec{A_{i}}\) and \(\varvec{A_{j}}\) of the \(i^{th}\) and \(j^{th}\) subjects (\(i,j\in [1,100]\)):

$$\begin{aligned} S_{ij}=\sum _{p} \sum _{q} \left\| \varvec{a_{i,p}}\circ \varvec{a_{j,q}}\right\| _{1}, \end{aligned}$$
(23)

where \(\varvec{a_{i,p}}\) and \(\varvec{a_{j,q}}\) are the \(p^{th}\) column of \(\varvec{A_{i}}\) and the \(q^{th}\) column of \(\varvec{A_{j}}\) respectively. Two scatter matrices are calculated from the sparse codings of the same test samples under two dictionaries, one learned using the discriminative term and the other without it. For both scatter matrices, we normalise the largest element of each column or row to unity to permit comparison, and plot them in Fig. 4. The diagonal elements represent the similarity of intra-class sparse codings, while the off-diagonal elements show the between-subject similarity. We see that the diagonal elements of both figures are the largest, and that there is clearly more between-subject similarity in Fig. 4(a) than in Fig. 4(b). By summing the elements in the columns of the scatter matrix, we quantify a similarity index for each subject and plot it in Fig. 5. The lower the similarity index, the less overlap there is between the coefficients of this subject and those of the others, i.e., the better the discrimination of the coding coefficients. As shown in Fig. 5, the red curve, learned using the discrimination term, is lower than the blue one, learned without it, for all 100 subjects, which shows that learning the dictionary with the help of \(f(\varvec{A})\) decreases the coefficient overlap between different subjects. These visual and numerical results show that the dictionary learned with the \(f(\varvec{A})\) term significantly enhances the discrimination of the coefficients. We also use the Extended Yale B and AR face databases to illustrate how this term helps to improve classification performance: with the discrimination term \(f(\varvec{A})\), the recognition rate for the Extended Yale B database rises from 96.20 % to 98.50 %, and the recognition rate for the AR database increases from 95.90 % to 98.00 %. The experimental settings used to obtain these results are presented fully in Sect. 3.4.
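For reference, the scatter matrix of Eq. (23) can be computed as follows (a numpy sketch; the per-row/column normalisation used for plotting Fig. 4 is omitted):

```python
import numpy as np

def scatter_matrix(A_blocks):
    """Similarity scatter matrix of Eq. (23): S_ij sums the l1 norms of the
    Hadamard products between every column of A_i and every column of A_j.
    A_blocks is a list of K x n_i coefficient matrices, one per subject."""
    C = len(A_blocks)
    S = np.zeros((C, C))
    for i in range(C):
        for j in range(C):
            # sum_p sum_q ||a_{i,p} o a_{j,q}||_1 = sum_k (sum_p |a_{i,p,k}|)(sum_q |a_{j,q,k}|)
            S[i, j] = np.abs(A_blocks[i]).sum(axis=1) @ np.abs(A_blocks[j]).sum(axis=1)
    return S
```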

Fig. 4. Comparison of the scatter matrices calculated from the sparse codings of the same test samples under two different dictionaries. In (a), the dictionary is learned without the discrimination term; in (b), the dictionary is learned with it.

3.3 Object Classification

The Caltech 101 dataset is one of the benchmark datasets for object classification. It consists of 9144 images split between 101 distinct object classes, including animals and vehicles, as well as a background class. The samples within each class exhibit significant shape variability. In the following experiments, spatial pyramid features are used as the input to the classifier, as in [8, 9, 25]. Following [25], we vary the number of training samples per class from 10 to 30. The size of the dictionary in SDDL is \(K=510\), the same as the experimental setting in [9]. The experiments are carried out 10 times with differently chosen partitions. The average classification accuracy of the proposed method (SDDL) compared with other existing dictionary learning based methods is shown in Table 1. The regularisation parameters for the Caltech 101 dataset are \(w_{1}=0.2, w_{2}=10, w_{3}=0.05\). The DL-based methods perform better than SRC, which shows that better performance can be achieved by learning a discriminative dictionary. Our proposed method consistently outperforms the other DL-based methods, by at least 2.8 percentage points.

Fig. 5. Comparison of the similarity indices calculated from the sparse codings of the same test samples under two different dictionaries. The red line represents the similarity index for the dictionary learned using the discrimination term, while the blue line represents the similarity index without it. (Color figure online)

Table 1. Recognition rates (%) for object classification

3.4 Face Classification

The two benchmark face datasets are the Extended Yale B dataset and the AR dataset. The Extended Yale B dataset consists of 2414 frontal images of 38 subjects (about 64 images per subject) taken under different illumination conditions and facial expressions. For each class, we randomly select half of the images as the training set and the rest as the test set. As in the experimental setting of [6, 25], we crop each image to \(192 \times 168\) pixels, and then normalise and project it to a 504-dimensional vector using a random Gaussian matrix. The dictionary size for the Extended Yale B dataset is 570, which corresponds to an average of 15 atoms per subject. As discussed previously, there is no explicit correspondence between the dictionary atoms and the labels of the individuals at the training stage.

Similarly, the AR face dataset consists of over 4000 images of 126 subjects and is more challenging owing to greater variation, i.e., different illumination, expressions and facial occlusion (e.g., sunglasses, scarf). As in the experimental setting of [6, 25], we use a subset of the dataset containing 2600 images of 50 male and 50 female subjects. For each subject, we randomly select 20 images for training and 6 for testing. We crop each image to \(165 \times 120\) pixels, and then normalise and project it to a 540-dimensional vector using a random Gaussian matrix. The dictionary size for the AR dataset is 500, which corresponds to an average of 5 atoms per subject. The dictionary is shared by all subjects.

The experiments are carried out 10 times with differently chosen partitions. The average classification accuracy of the proposed method compared with other existing dictionary learning based methods is shown in Table 2. The regularisation parameters for the Extended Yale B dataset are \(w_{1}=0.04, w_{2}=2, w_{3}=0.005\), and for the AR face database are \(w_{1}=0.05, w_{2}=3, w_{3}=0.005\). The proposed SDDL method achieves an improvement of at least 1.7 and 2 percentage points over the next best scheme in classification accuracy for the Extended Yale B and AR datasets respectively.

Table 2. Recognition rates (%) for face classification

4 Conclusion

We incorporate structured sparsity into the dictionary learning process and propose a support discrimination dictionary learning (SDDL) method for image classification. In contrast to other methods, we use the sparse structure, i.e., the support, to measure the similarity between pairs of coefficients, rather than the Euclidean distance widely adopted in many dictionary learning approaches for classification. The discrimination capability of the proposed method is enhanced in two ways. First, a row-sparse regulariser is adopted so that a shared support structure for each class can be learned automatically. Second, we adopt a discriminative term that forces the coefficients of different classes to have minimal support overlap, achieved by minimisation of the \(\ell _0\) norm of the Hadamard product between any pair of coefficients in different classes. It is worth noting that our approach can automatically identify overlapped sub-dictionaries for different classes, where the size of each sub-dictionary is adjusted appropriately during the learning process to suit the training dataset. In this way, the proposed approach is scalable to classification tasks with a large number of classes. Extensive experimental results on object recognition and face recognition demonstrate that the proposed method generates more discriminative sparse coefficients and has superior classification performance to a number of state-of-the-art dictionary learning based methods.