1 Introduction

In the past decades, face recognition has received more and more attention due to its great application prospects in fields such as public safety [38], transportation [36], finance [20], and social media [29, 30]. Discriminative feature extraction is the very first key step of face recognition, and existing methods can be generally categorized as global or local. Global methods treat the image as a high-dimensional vector and extract features from it. Effective and efficient global methods often adopt subspace learning, such as PCA [19] and FLDA [4], which have achieved impressive results in face recognition applications. However, they are easily affected by regions with variations in illumination, expression, occlusion, etc. Compared with global methods, local methods have become a hot topic because of their robustness to local changes of the image; they usually build the feature description of a face image from extracted local low-level visual features such as Gabor [22], LBP [3], and SIFT [6].

In spite of these tremendous achievements, most global and local methods only work well when there are sufficient training samples for each subject. However, in many real-world applications such as identity card verification, passport verification at customs, law enforcement, surveillance, or access control, only one training sample per person is available. This is the so-called single sample per person (SSPP) problem [32], which has become one of the greatest challenges in face recognition. Many conventional global or local methods suffer a serious performance drop or fail to work when encountering the SSPP problem, mainly because it is difficult to distinguish the image changes caused by illumination, expression, occlusion, etc. from the essential differences between persons, which leads to a semantic gap between facial features and facial identity.

Recently, the excellent performance of Bag-of-Features (BoF) methods [7, 13, 21, 27], which represent an image as a histogram of visual words, has aroused wide interest, and they have been introduced into face recognition. BoF extracts middle-level semantic features to narrow the semantic gap between high-level semantics and low-level features. Motivated by this observation, we claim that BoF is also suitable for the SSPP problem. In this paper, we propose a multistage KNN collaborative coding based BoF (MKCC-BoF) method to address it. First, local descriptors are extracted from the single training face images, and a visual dictionary is obtained offline by clustering a large set of descriptors with K-means. Then, we design a multistage KNN collaborative coding scheme to project local features into the semantic space, which is much more efficient than the non-negative sparse coding algorithm most commonly used in face recognition. In the k-th stage, we use only the k nearest neighbors from the visual dictionary to compute the collaborative coefficients, which further improves the computational efficiency. At the last stage, we directly use hard vector quantization, setting the coefficient of the nearest neighbor to one and the others to zero. The coding results of all stages are added together as the final coding features. To describe the spatial information as well as reduce the feature dimension, the encoded features are then pooled on spatial pyramid cells by max-pooling, which generates a histogram of visual words to represent a face image. Finally, a linear kernel based SVM classifier is trained on the concatenated pooling results. Experimental results on three public face databases show that the proposed MKCC-BoF not only generalizes well to the SSPP problem but also has great robustness to expression, illumination, occlusion, and time variation.

The rest of this paper is organized as follows. We present a brief introduction to related work in the next section. Then in Section 3, we describe the proposed BoF based method in detail. Section 4 demonstrates experiments and results. Finally, we conclude in Section 5 by highlighting key points of our work.

2 Related work

Effectively extracting features from high-dimensional, complex, and changeable face images is the key step of face recognition. In the last two decades, subspace learning methods have been the mainstream in the field and have attracted much attention due to their effectiveness in feature extraction and representation. Principal component analysis (PCA) [19] and Fisher linear discriminant analysis (FLDA) [4] are two representative methods: the former finds a set of optimal orthogonal basis functions to reconstruct the original signal, while the latter finds a set of optimal linear transformations that minimize the within-class scatter and maximize the between-class scatter. However, both PCA and FLDA fail to reveal the essential data structures nonlinearly embedded in high-dimensional space. To overcome this limitation, a number of manifold learning methods (e.g., ISOMAP [33], LLE [28], LPP [16], and Laplacian Eigenmap [5]) were proposed under the assumption that the data lie on a low-dimensional manifold of the high-dimensional space.

Recently, the significance of feature extraction has been debated due to the excellent performance of sparse representation in face recognition. Wright et al. [39] proposed robust face recognition via sparse representation based classification (SRC), which codes the test sample as a sparse linear combination of all training samples by L1-norm minimization. Many extensions of SRC have since been proposed. Besides face recognition, sparse representation has also shown great robustness in various fields such as human pose recovery [17, 43] and web image reranking [42]. To reduce the complexity of SRC, Zhang et al. [44] proposed collaborative representation based classification (CRC), which uses the L2 norm instead of the L1 norm. However, both subspace learning methods and sparse representation methods suffer a serious performance drop, or even fail to work, when encountering the SSPP problem.

In order to address the SSPP problem, many methods have been developed during the last two decades. They can be generally classified into two categories: global methods and local methods. Global methods treat a whole image as a high-dimensional vector and usually utilize virtual samples or a generic training set to estimate intra-personal variation. For example, Gao et al. [14] utilized SVD to decompose each face image, and the obtained non-significant SVD basis images were used to approximately estimate the within-class scatter matrix of each person. Su et al. [31] proposed an adaptive generic learning (AGL) method to infer the discriminative information of the SSPP gallery set by using a generic training set. Recently, Deng et al. [12] proposed a novel generic learning method that maps the intra-class facial differences of the generic faces to zero vectors. They also proposed the extended sparse representation-based classifier (ESRC) [11] to make SRC feasible for the SSPP problem, which applies an auxiliary intra-class variant dictionary to represent possible variations between the training and testing images. Yang et al. [41] proposed to learn a sparse variation dictionary by using the relationship between the gallery set and an external generic set.

Global methods are easily affected by regions corrupted by variations in illumination, expression, and occlusion. Therefore, some local methods were proposed, which have been proven to be more robust against such variations [26]. For example, Chen et al. [9] proposed the BlockFLD method, which partitions each face image into a set of blocks, treats each block as a sample of the same class, and applies FLDA to the set of newly produced samples. Lu et al. [24] proposed a discriminative multi-manifold analysis (DMMA) method that learns discriminative features from image patches. Another way is to represent each patch with one feature vector; then well-known classification techniques, such as the K-nearest neighbor classifier (KNN), sparse representation based classification (SRC), and collaborative representation based classification (CRC), can be used to predict the label of each patch, as in [26], [39], and [45]. Liu et al. [23] also proposed to use the local structure relationship of images to further enhance the performance of PSRC [39] and PCRC [45].

Although local methods can significantly improve recognition rate and robustness, they still cannot distinguish the image changes caused by illumination, expression, occlusion, etc. from the essential differences between persons. In other words, they cannot cross the semantic gap caused by the SSPP problem. To fundamentally address the SSPP problem, we should eliminate the semantic gap as much as possible, and the direct way is to find features with semantic information. Fortunately, bag-of-features (BoF), with its excellent performance in image classification, has been introduced into face recognition in recent years, and it can be regarded as a kind of middle-level semantic feature. In [21], a robust face recognition algorithm based on block bag-of-words is proposed. Meng et al. [27] and the authors of [7] also build a bag-of-words model for face images, but they use the intensity image as the local low-level feature. In [40], multi-scale and multi-orientation Gabor transforms are first performed on the image. Recently, Cui et al. [10] proposed a face recognition algorithm based on a spatial face region description operator, which also uses the intensity image to describe each image patch; non-negative sparse coding is chosen to encode each local feature, and a metric learning algorithm is finally used to fuse the pooled features in each block of the image. Motivated by the success of BoF, we claim that BoF with semantic information is also suitable for the SSPP problem.

3 The proposed approach

An overview of our face recognition method using BoF with multistage KNN collaborative coding is shown in Fig. 1. It consists of four main steps: (1) image local feature extraction, (2) visual vocabulary construction, (3) local descriptor coding based on the multistage KNN collaborative coding scheme, and (4) feature pooling. The details of our algorithm are described as follows. Given a training set denoted as R = {(ri, yi)} (i = 1,⋯ , n), where yi is the class label of the i-th face image, each training image is densely partitioned into a set of patches, and the feature of each patch is extracted by the SIFT descriptor. After local feature extraction, the set of local features of all training images is denoted by X = {x1, x2,⋯ , xN} ∈ RD×N, where D is the dimension of each local feature and N is the total number of local features of the training images. As each local feature is only a subtle description of the facial image, a large number of them are very similar; when face images undergo local changes such as illumination, facial expression, and occlusion, the distance between similar local features increases. To improve the robustness and discriminability of each local feature, a coding algorithm is needed to map it from the low-dimensional low-level visual feature space to the high-dimensional middle-level semantic space, which requires a visual dictionary trained offline in advance. To this end, we randomly select a subset of local features, denoted Xs, from X and cluster Xs with the K-means algorithm. All the clusters form the visual vocabulary, and each cluster can be regarded as a visual word representing a specific local pattern shared by the descriptors in that cluster. The number of clusters, which determines the size of the vocabulary, can vary from hundreds to over tens of thousands.
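As a concrete illustration, the following Python sketch builds such a vocabulary offline. It is only a sketch of the step described above: extract_dense_sift is a hypothetical helper standing in for the dense SIFT extraction (the experiments use VLFeat), and the sample size, number of K-means restarts, and random seed are our assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(images, extract_dense_sift, K=1500, sample_size=100000, seed=0):
    """Cluster a random subset of local descriptors into K visual words."""
    rng = np.random.default_rng(seed)
    # Stack the dense SIFT descriptors of all training images: X in R^{N x D}.
    X = np.vstack([extract_dense_sift(img) for img in images])
    # Randomly subsample descriptors (the subset X_s) to keep K-means tractable.
    idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    kmeans = KMeans(n_clusters=K, n_init=4, random_state=seed).fit(X[idx])
    return kmeans.cluster_centers_   # V: (K, D), one visual word per row
```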

Fig. 1 Overview of our face recognition algorithm

Let Xr = [x1, x2,⋯ , xM] ∈ RD×M be the set of D-dimensional local descriptors extracted from a face image, and let V = [v1, v2,⋯ , vK] ∈ RD×K be the visual dictionary. Let ci ∈ RK denote the coding coefficient vector of xi. Many sparse coding methods have been proposed to obtain ci, but solving the L1 minimization is very time-consuming. To obtain the coding coefficient vector both effectively and efficiently, we propose the multistage KNN collaborative coding (MKCC) scheme, which utilizes the L2 norm instead of the L1 norm; an illustration of MKCC is shown in Fig. 2. To further reduce the computational burden, we first find the k nearest visual words of xi under the Euclidean metric, denoted Vk = [v1, v2,⋯ , vk] ∈ RD×k, and then use Vk to code xi by collaborative representation:

$$ c^{*}=\arg\min_{c}\|x_{i}-V_{k} c \|^{2}_{2}+\lambda\|c\|^{2}_{2} $$
(1)
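Since Eq. (1) is a ridge regression problem, it admits the closed-form solution c∗ = (VkTVk + λI)−1VkTxi, which is what makes MKCC far cheaper than iterative L1 solvers. The sketch below implements one coding stage under this formula; the default value of λ is our assumption.

```python
import numpy as np

def knn_collaborative_code(x, V, k, lam=1e-3):
    """One stage of KNN collaborative coding: solve Eq. (1) on the
    k nearest visual words and scatter the result into a K-vector."""
    # k nearest visual words of x under the Euclidean metric.
    nn = np.argsort(np.linalg.norm(V - x, axis=1))[:k]
    Vk = V[nn].T                                   # (D, k)
    # Closed-form ridge solution: c = (Vk^T Vk + lam*I)^{-1} Vk^T x.
    c = np.linalg.solve(Vk.T @ Vk + lam * np.eye(k), Vk.T @ x)
    code = np.zeros(V.shape[0])
    code[nn] = c
    return code
```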
Fig. 2 The illustration of our multistage KNN collaborative coding (MKCC) algorithm

After obtaining c∗, we embed it into a K × 1 vector \({c_{i}^{k}}\) whose k non-zero elements take the corresponding values of c∗. In the next stage, the neighborhood size becomes k − 1 and another K × 1 vector \(c_{i}^{k-1}\) is obtained in the same way. This procedure is repeated until k = 1. Note that collaborative representation cannot work when k = 1; here we directly adopt hard vector quantization (VQ), which sets the coefficient of the nearest neighbor to 1 and all other elements to 0. Finally, the coding ci of xi is calculated by

$$ c_{i}=\sum\limits_{m = 1}^{k} {c}^{m}_{i} $$
(2)
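Continuing the sketch above (and reusing its knn_collaborative_code helper), the full multistage procedure simply accumulates the per-stage codes and finishes with hard VQ; the default k = 5 mirrors the setting used later in the experiments.

```python
def mkcc_encode(x, V, k=5, lam=1e-3):
    """Multistage KNN collaborative coding (a sketch of Eq. (2))."""
    total = np.zeros(V.shape[0])
    for m in range(k, 1, -1):          # stages with k, k-1, ..., 2 neighbors
        total += knn_collaborative_code(x, V, m, lam)
    # Final stage (one neighbor): hard vector quantization.
    nearest = np.argmin(np.linalg.norm(V - x, axis=1))
    total[nearest] += 1.0
    return total
```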

After the coding step is completed, the image is still represented as a set of M coded vectors, so a traditional classifier cannot be applied directly; the coded vectors must be aggregated into a compact representation of the image content. Here, we utilize the spatial pyramid method for pooling, which partitions an image into 2^l × 2^l subregions at different scales. Let l = 0, 1,⋯ , L denote the level of the pyramid model, so the total number of levels is L + 1. An illustration of the spatial pyramid model (SPM) is shown in Fig. 3. Suppose there are Mp encoding vectors in the p-th subregion of the l-th level of the SPM; the element-wise maximum of the coding vectors in this region is calculated as follows:

$$ B_{lp}=\max\limits_{j = 1,2,\cdots,M_{p}} c_{j} $$
(3)
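A minimal sketch of this pooling step is given below; the patch-position bookkeeping and the default L = 4 (which yields the 1 × 1 through 16 × 16 grids used in Section 4) are our assumptions.

```python
import numpy as np

def spm_max_pool(codes, positions, image_size, L=4):
    """Max-pool coded vectors over spatial pyramid cells (Eq. (3)).

    codes: (M, K) coded vectors of one image; positions: (M, 2) patch
    centers as (x, y); image_size: (W, H). Returns the concatenation
    of the element-wise maxima over every 2^l x 2^l cell, l = 0..L.
    """
    W, H = image_size
    K = codes.shape[1]
    pooled = []
    for l in range(L + 1):
        n = 2 ** l
        # Cell index of each patch center at this pyramid level.
        cx = np.minimum((positions[:, 0] * n // W).astype(int), n - 1)
        cy = np.minimum((positions[:, 1] * n // H).astype(int), n - 1)
        cell_id = cy * n + cx
        for p in range(n * n):
            mask = cell_id == p
            pooled.append(codes[mask].max(axis=0) if mask.any() else np.zeros(K))
    return np.concatenate(pooled)
```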
Fig. 3 Illustration of spatial pyramid model

The pooled features of all subregions at all levels are concatenated as the final representation of the face image, denoted Bi. Classification based on this representation is difficult due to the various facial changes such as expression, illumination, and occlusion. A support vector machine [34] is therefore used to classify the images, since it has high generalization performance. If the data are linearly separable, the decision function of the optimal separating hyperplane is

$$ f(B_{j})=\operatorname{sgn}\left(\sum\limits_{i = 1}^{n} {y}_{i} \alpha_{i}({B}_{i}\cdot {B}_{j})+{b}^{*}\right) $$
(4)

where αi is the Lagrange multiplier associated with the i-th training image, Bi is the feature representation of the i-th training image, Bj is the feature representation of the j-th testing image, and b∗ is the bias of the classifier. However, the extracted features may not be linearly separable due to complex facial variations. In this case, the input vectors can be nonlinearly mapped to a high-dimensional feature space in which they are considered linearly separable. Since it is difficult to obtain the mapping function φ explicitly, a kernel function \(\mathcal {K}\) is utilized to compute φ(Bi)Tφ(Bj) as \(\mathcal {K}(B_{i},B_{j})\). The optimal decision function of the SVM with the kernel function is then

$$ f(B_{j})=\operatorname{sgn}\left(\sum\limits_{i = 1}^{n} {y}_{i}\alpha_{i}\mathcal{K}({B}_{i}, {B}_{j})+ {b}^{*}\right) $$
(5)

Popular kernel functions include the linear kernel, the Gaussian radial basis function (RBF), the polynomial kernel, and the sigmoid kernel. In this paper, we use LIBSVM [8] to train an SVM classifier with the linear kernel.
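For illustration, a minimal training and prediction sketch follows. It uses scikit-learn's SVC, which wraps LIBSVM; the toy feature dimensions, labels, and C value are placeholders rather than the paper's settings.

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-ins: in the real pipeline each row of B_train would be the
# concatenated SPM pooling vector B_i of one training face image.
rng = np.random.default_rng(0)
B_train, y_train = rng.random((100, 512)), rng.integers(0, 10, 100)
B_test = rng.random((20, 512))

clf = SVC(kernel="linear", C=1.0)   # C = 1.0 is an assumed hyperparameter
clf.fit(B_train, y_train)
predictions = clf.predict(B_test)
```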

4 Experimental results

In this section, we conduct experiments on the Extended Yale B [15], AR [25], and LFW [18] databases to evaluate our algorithm and compare it with several popular methods for the SSPP problem, including AGL [31], BlockFLD [9], PCRC [45], PSRC [39], ESRC [11], SVDL [41], LGR [46], and LRA*-GL [12]. Furthermore, we also compare with the commercial systems SeetaFace [2] and Face++ [1]. We use the gray-scale pixel values as the features for all the methods, and all face images are resized to 80 × 80 in all experiments. For the patch based methods, including BlockFLD, PCRC, PSRC, and LGR, the patch size is fixed at 11 × 11 and the distance between two patch centers is 4 pixels. For our method, SIFT features are extracted with the VLFeat library [35] at a single scale from densely located patches of the gray images; the patches are centered at each pixel with a fixed size of 8 × 8 pixels. The number of visual words is fixed to 1500, and the SPM hierarchically partitions each image into 1 × 1, 2 × 2, 4 × 4, 8 × 8, and 16 × 16 blocks on 5 levels. Moreover, we compare our MKCC scheme (k = 5) with the commonly used non-negative sparse coding (NSC). All experiments are conducted on a 2.4 GHz machine with a Xeon E5-2640v4 CPU and 32 GB RAM, and 10 Matlab workers are opened for parallel computation to improve efficiency.

4.1 Results on Extended Yale B database

We conduct experiments with the first 30 subjects of the Extended Yale B face database, which contains 38 subjects under 64 illumination conditions; the images of the remaining 8 subjects are used as the generic set for the generic learning methods. We use the image with the best illumination condition (0 degree azimuth and 0 degree elevation) of each subject for training and the images under the other illumination conditions for testing. Some sample images from the Extended Yale B database are shown in Fig. 4. Although the extreme lighting conditions make this a challenging task for most face recognition methods, the experimental results in Table 1 show that our method achieves favorable results and outperforms all the others. Notably, the recognition rate of our method is higher than those of the popular commercial systems SeetaFace and Face++.

Fig. 4 Sample images from the Extended Yale B database

Table 1 Recognition rate on Extended Yale B

To further compare our method with SeetaFace and Face++, we also evaluate the computing time for recognizing one image on the Extended Yale B database. For SeetaFace, we use its API to extract the feature from each face image and classify the testing image by computing the cosine distance. For Face++, we directly use its "compare API" to compute the similarity of two face images, because the "search API" of the trial version is limited to searching 5 face images; the testing image is then classified into the category with the highest similarity. Since Extended Yale B has 30 subjects to be recognized, the similarity must be computed 30 times, which costs almost 14.85 s per image. In contrast, SeetaFace is much faster, consuming only 0.197 s to recognize one image, but its performance is much lower than those of Face++ and our method. Generally speaking, our method achieves the best result with acceptable computing time. In addition, its computing time can be further reduced by decreasing the size of the visual dictionary. As described above, the size of the visual dictionary is K, the number of centers in K-means. The recognition rates and computing times under different K are shown in Table 2. The recognition rate changes little, and even becomes slightly higher, when K decreases from 1500 to 50. When K is 10, the recognition rate drops to 84.39%, which is still higher than that of many traditional methods for the SSPP problem. Moreover, the computing time decreases as K decreases.

Table 2 The impact of the visual dictionary size K

4.2 Results on AR database

The AR face database [25] contains over 4,000 face images of 126 subjects; for each subject, 26 pictures were taken under different facial expressions, lighting conditions, and occlusions in two sessions (separated by two weeks). In the experiments, a subset of 2,500 images from 50 males and 50 females is selected, some sample images from which are shown in Fig. 5.

Fig. 5 Sample images from the AR database

The first 40 male and first 40 female subjects are selected to construct the gallery and probe sets, and the other 20 subjects are used as the generic set for the generic learning methods. The single image of each subject with natural expression and illumination from session 1 is used for training, and the remaining images from both sessions are used for testing. Experimental results on the two sessions are shown in Tables 3 and 4, respectively. The proposed MKCC-BoF achieves the highest average accuracy on session 1 and the second highest on session 2. The classical non-negative sparse coding based BoF (NSC-BoF) method also achieves better results than the methods specially designed for the SSPP problem. Compared with NSC-BoF, the proposed MKCC-BoF obtains improvements of 0.2% and 1.77% on the two sessions, respectively; although the improvement is modest, the computational efficiency of MKCC-BoF is much higher. The experimental results also show that our method is robust to expression, illumination, disguise, and time variation. Although Face++ achieves the highest result on session 2, it imposes restrictions on the image size: when the images are resized to 80 × 80, many faces cannot be recognized, and its recognition rates under the expression and illumination variations of session 1 reach only 72.9% and 57.5%. This is because Face++ must first detect face key points before extracting face features, and when the image is too small, the key points cannot be located and the features cannot be extracted.

Table 3 Recognition rates (%) on AR database (session 1) for SSPP problem
Table 4 Recognition rates (%) on AR database (Session 2) for SSPP problem

4.3 Results on LFW database

The LFW database [18] was collected in an unconstrained environment and contains images of 5,749 individuals. In the experiments, we use LFW-a [37], the aligned version of LFW, from which 158 subjects with no fewer than 10 samples each are gathered. Some sample images are shown in Fig. 6.

Fig. 6 Sample images from the LFW database

The first 80 subjects are used for evaluation, and the remaining subjects are used as the generic set. For each subject, we randomly choose one image as the gallery sample and use nine images for testing; 10 such experiments are conducted, and the average recognition rates are reported. The experimental results listed in Table 5 show that MKCC-BoF and NSC-BoF still achieve the best results. Compared with the other methods for the SSPP problem, MKCC-BoF obtains nearly 10% improvement. Moreover, MKCC-BoF remains superior to NSC-BoF, which demonstrates the advantage of the proposed MKCC scheme once again.

Table 5 Recognition rate on LFW database

5 Conclusion

In this paper, we address the SSPP problem by eliminating the semantic gap between facial features and facial identity. Motivated by the success of BoF and the fact that BoF can extract middle-level semantic features, we propose a multistage KNN collaborative coding based BoF (MKCC-BoF) method. Unlike conventional non-negative sparse coding based BoF methods, it is much more computationally efficient because its coding step has a closed-form solution. Experimental results on three public face databases show that the proposed MKCC-BoF not only generalizes well to the SSPP problem but also has great robustness to expression, illumination, occlusion, and time variation.