1 Introduction

Face verification (FV), whose protocol is to classify whether a pair of face images belongs to the same person or not, is one of the most challenging face tasks due to variations in pose, age, occlusion and so on [1].

In recent years, state-of-the-art methods based on deep learning have achieved great success on FV [2,3,4,5]. One of the most important ingredients of this success is large-scale public facial datasets [6] with identity labels. However, these datasets are often contaminated by noise, since the images are crawled from the Internet. Some methods apply various measures [4, 7] to clean the images, while others generate multiple patches [8,9,10,11,12] or synthesized images [13]; both require considerable extra time and human effort. Since excessive noisy data usually depresses classification performance, in this paper we randomly select only part of the MsCeleb [6] training data, without any cleaning or alignment, in order to simplify the pre-processing stage.

Some methods obtain better performance [8,9,10,11] by concatenating features from multiple models, but this requires additional time to train the models and to evaluate high-dimensional features. Here, we propose to use both of the last two layers of a single model. Unlike concatenation, we take a weighted average of the two layers, without increasing the feature dimension. In effect, we train only one model but use it as two models by fusing two layers, which reduces time at both the training and test stages.

Besides, many other strategies are effective for better performance, such as carefully designed models [4, 7] and the center loss [14]. While joint Bayesian [15] has been highly successful as metric learning in many methods [9,10,11], we instead use Large Margin Nearest Neighbor (LMNN) [16] to improve accuracy. LMNN [16] was originally introduced to learn a matrix that keeps impostors at a large distance while constraining the k nearest neighbors to belong to the same class. In this work, we use LMNN to further improve the discriminative power of the features.

Despite its simple pipeline, our method achieves competitive results on Labeled Faces in the Wild (LFW) [17], Celebrities in Frontal-Profile in the Wild (CFP) [18] and the Cross-Age Celebrity Dataset (CACD) [19]. We summarize the contributions of this paper as follows:

  1. We simplify the pre-processing procedure by randomly selecting part of the training data. No other cleaning measures are used to carefully select label-correct images.

  2. We propose a method that trains only one model but uses the features of its last two layers by computing their weighted average, which improves performance with little extra time cost.

  3. We use LMNN as metric learning to strengthen the discriminative power of the deep features.

  4. The proposed method achieves strong results on all three datasets: LFW, CFP and CACD.

2 Related Work

Various methods exploit deep networks to achieve remarkable results in FV. We review the most critical aspects.

Data Preparation. Large-scale public facial image datasets, such as CASIA-Webface [20] and MsCeleb [6], have driven the development of face recognition [2,3,4,5]. In addition, data augmentation such as multiple patches [8,9,10,11,12] and synthesized images [13] makes systems more robust to face variations. Other techniques, such as 2D/3D alignment [2, 14], RGB/gray channels [8, 9] and rotated poses [13], can also contribute to performance.

Elaborately Designed Networks. Inspired by "very deep" networks, VGGNet is widely used for FV [2, 13]. Since then, many blocks have been added to networks [4, 7, 11], such as inception [21] and residual [22] modules. Besides, locally connected layers (LC) [8, 9, 14], L2 normalization [4] and the feature normalization of DeepVisage [7] have been designed.

Fusion of Models. Many methods show that model fusion provides an additional boost to feature representation. The DeepID series [8,9,10,11] trains at least 25 models on different facial patches that complement each other.

Metric Learning. Several methods apply metric learning after extracting features from a CNN. Principal component analysis (PCA) is often used [8,9,10,11] to learn a mapping matrix that reduces dimensionality. Joint Bayesian has proved to be an effective approach [8,9,10,11, 20]. Baidu [12] uses triplets, which shorten intra-class distances and enlarge inter-class distances.

Various Loss Functions. The softmax loss [2, 12] is widely used for classification and has been applied to feature embedding by removing the last classification layer. The contrastive loss [3, 9,10,11] and triplet loss [2, 5, 12] were introduced to enhance intra-class compactness. Later, the center loss [14] was proposed to learn a center for each class.

3 The Proposed Method

We use VGGNet [2] as our baseline and modify it with a convolution-branch block (CB), yielding CB-VGG. On top of this model, we compute a weighted average of the last two layers' features, followed by LMNN [16], before computing the cosine similarity as the verification score of an image pair. Next, we describe all the details.

3.1 Data Pre-processing

Our pre-processing is extremely simple, which saves a lot of time.

Training Data: We use MsCeleb [6] as our training data. Unlike other approaches [4, 7, 14] that clean the training data, we randomly select 29,731 identities, excluding those that appear in the target datasets. For simplicity, we use only the Normalized Pixel Difference (NPD) [23] detector, without any 2D [2, 7] or 3D alignment [3]. This choice follows [2], which found that aligning test images rather than training data brings the greatest benefit; however, [2] uses a very large number of images, and its test-time alignment step is time-consuming. Instead, we crop 4 corner patches and 1 center patch and randomly flip these images as data augmentation to improve the model.

Test Data: The process is similar to the training stage, except that LFW [17] and CACD [19] images are cropped 10 times (4 corners and 1 center, each with its horizontal flip). For CFP [18], we use only 1 crop without flipping because of the frontal-profile pairs.

3.2 CB-VGG Architecture

Baseline: Our baseline, VGGNet [2], comprises 11 sections, each containing one or more linear and non-linear blocks such as ReLU and max pooling. The first 8 sections use convolutions as the linear operator, the next 2 use fully connected (FC) layers, and a final FC layer is used only for classification.

CB-VGG Model: We make a small modification to the baseline. Similar to [14], the outputs of the 4th and 5th pooling units are concatenated as the input of the 1st FC layer. To match the sizes of the two layers, we apply an extra convolution after the 4th pooling layer; we name this the convolution-branch block (CB). It is critical for accurate face representation, because information is lost during successive down-sampling; in this way we exploit both global-abstract semantic information and high-resolution features. Moreover, we simply use the softmax loss instead of combining it with other losses [14] or normalization [4, 7]. The details of the model are given in Fig. 1.

Fig. 1. Structure of the CB-VGG model. IN is the input image. Conv is a convolution layer; P denotes padding, K the kernel size and N the number of outputs. Pool indicates a max pooling unit. CT is the concatenation layer, which concatenates the 4th pooling output (after one extra convolution) and the 5th pooling output. FC and FC Classifier indicate the fully connected layer and the classification layer, respectively. OUT is the softmax loss function.
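To make the block concrete, the following is a minimal sketch of the convolution-branch idea, written in PyTorch purely for illustration (the original model is implemented in Caffe); the channel and spatial sizes are illustrative assumptions for a \(160\times 160\) input.

```python
import torch
import torch.nn as nn

class ConvBranchHead(nn.Module):
    """Sketch of the CB block: an extra convolution brings pool4 to pool5's
    spatial size, then the two maps are concatenated (CT) and fed to FC6."""
    def __init__(self, channels=512, fc_dim=4096, spatial=5):
        super().__init__()
        # convolution-branch: stride 2 halves pool4's resolution to match pool5
        self.branch = nn.Conv2d(channels, channels, kernel_size=3,
                                stride=2, padding=1)
        self.fc6 = nn.Linear(2 * channels * spatial * spatial, fc_dim)

    def forward(self, pool4, pool5):
        b = self.branch(pool4)              # e.g. (N, 512, 10, 10) -> (N, 512, 5, 5)
        x = torch.cat([b, pool5], dim=1)    # CT layer: (N, 1024, 5, 5)
        return self.fc6(x.flatten(1))       # input to FC6 -> (N, 4096)

# Shape check with assumed pool4/pool5 map sizes for a 160x160 input:
head = ConvBranchHead()
p4, p5 = torch.randn(1, 512, 10, 10), torch.randn(1, 512, 5, 5)
print(head(p4, p5).shape)                   # torch.Size([1, 4096])
```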

3.3 Weighted Average Features

In almost all other methods, the deep features are taken only from the last FC layer. In contrast, we extract features from both of the last two FC layers. For an image x, we denote its feature descriptors from the 1st and 2nd FC layers as \(f_{1st}^x\) and \(f_{2nd}^x\). The final feature descriptor \(f_{final}^x\) is then the weighted average of these two features:

$$\begin{aligned} \mathbf{f}_{final}^{x} = \alpha \times \dfrac{1}{C}\sum _{i=1}^{C}\mathbf{f}_{1st}^{x_{i}} + \beta \times \dfrac{1}{C}\sum _{i=1}^{C}\mathbf{f}_{2nd}^{x_{i}}. \end{aligned}$$
(1)

\(\alpha \) and \(\beta \) are the weights of the 1st and 2nd FC layers' descriptors respectively, applied to every element. To distinguish the roles of the two layers, we constrain the weights with \(\beta = 1 - \alpha \). C is the number of crops, which is 1 for CFP and 10 for LFW and CACD.

Model fusion is usually effective: Sun et al. [9,10,11] concatenate features from 25 models to improve performance, and Baidu [12] uses 10 embedding models. In contrast, our method trains only one model but extracts features from two layers, which is comparable to training two models while saving a lot of time and boosting performance at the same time.
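A minimal sketch of Eq. (1), assuming the per-crop FC6 and FC7 descriptors have already been extracted; the random arrays below are placeholders for real features.

```python
import numpy as np

def weighted_average_feature(f1, f2, alpha=0.15):
    """Fuse the two layers' per-crop descriptors; beta = 1 - alpha (Sect. 3.3).
    f1, f2: (C, d) arrays of FC6 and FC7 features for the C crops of one image."""
    beta = 1.0 - alpha
    return alpha * f1.mean(axis=0) + beta * f2.mean(axis=0)

# Example: 10 crops of 4096-d FC6/FC7 features for one image (placeholders).
f_fc6 = np.random.randn(10, 4096)
f_fc7 = np.random.randn(10, 4096)
f_final = weighted_average_feature(f_fc6, f_fc7, alpha=0.15)  # shape (4096,)
```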

3.4 LMNN Metric Learning

Large Margin Nearest Neighbor (LMNN) [16] learns a Mahalanobis distance matrix that decreases the distance to each sample's k target neighbors while enlarging the distance to nearby points of other classes. In the FV task, some distances of positive pairs are larger than those of negative pairs, which harms threshold selection. With LMNN, we can pull faces of the same identity closer and push different identities further apart. The problem is formulated as follows: i and j form a genuine pair, while l belongs to a different identity; \(\xi _{ijl}\) is a slack variable, and M is the learned transformation matrix.

$$\begin{aligned} \min _{\mathbf{M}}\; \sum _{i,\,j\rightarrow C_{i}}(\mathbf{x}_{i}-\mathbf{x}_{j})^{T}\mathbf{M}(\mathbf{x}_{i}-\mathbf{x}_{j}) + \sum _{i,\,j\rightarrow C_{i},\,l\rightarrow C_{l}}\xi _{ijl}. \end{aligned}$$
(2)
$$\begin{aligned} \text {s.t.}\quad&\forall _{y_{i}=y_{j}\ne y_{l}}\; (\mathbf{x}_{i}-\mathbf{x}_{l})^{T}\mathbf{M}(\mathbf{x}_{i}-\mathbf{x}_{l}) - (\mathbf{x}_{i}-\mathbf{x}_{j})^{T}\mathbf{M}(\mathbf{x}_{i}-\mathbf{x}_{j})\ge 1 - \xi _{ijl},\\&\xi _{ijl}\ge 0,\quad \mathbf{M} \succeq 0. \end{aligned}$$
(3)

FaceNet [5] uses a triplet loss that is similar in spirit to LMNN, based on triplets consisting of a positive pair and a negative identity. However, generating triplets is not easy, and [5] uses 200 M images for training. Instead, we use LMNN to exploit the triplet idea while avoiding explicit triplet selection.
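A minimal sketch of this post-processing step, using the open-source metric-learn package as an assumed implementation (the constructor argument for the number of target neighbors is `n_neighbors` in recent releases; older releases called it `k`):

```python
import numpy as np
from metric_learn import LMNN

# Placeholders for extracted deep features: 60 identities x 10 images, 300-d
# (e.g. after PCA).  Real features would come from the fused FC6/FC7 layers.
X = np.random.randn(600, 300)
y = np.repeat(np.arange(60), 10)

lmnn = LMNN(n_neighbors=3)       # k = 3 target neighbors, as in Sect. 4.2
lmnn.fit(X, y)
X_t = lmnn.transform(X)          # x -> Lx, where M = L^T L

def cosine_score(a, b):
    """Verification score of a pair: cosine similarity in the learned space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

score = cosine_score(X_t[0], X_t[1])
```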

4 Experiments

In this section, we first introduce the details of the training stage. Then we extract features from the last two FC layers (FC6, FC7) and apply LMNN before computing the cosine similarity for FV. To verify the effectiveness of our method, we evaluate it on three different datasets. All experiments are carried out with Caffe [24].

4.1 Details of Training Stage

In this subsection, we describe the details of model training. We randomly select 2.29 M images of 29,731 identities from the MsCeleb dataset [6], excluding the test identities. For simplicity, no other cleaning or alignment strategies are used to select images. We train our model on 95% of the images (2.17 M) and use the remaining 5% (120 K) to monitor and validate the loss. Our CNN model uses only identity information to optimize the softmax loss; no normalization methods [4, 7] are used.

The baseline model and our CB-VGG model are trained from scratch with the same hyper-parameters. We use stochastic gradient descent (SGD) with momentum 0.9 and weight decay \(5e^{-4}\). Training starts with a learning rate of 0.01, divided by 10 every \(100\,K\) iterations, with a batch size of 120. Images are resized to \(170\times 170\) and then cropped to \(160\times 160\) patches at the 4 corners and the center. We also apply random horizontal flipping for data augmentation.
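The augmentation described above can be sketched as follows (a NumPy illustration, not the actual Caffe configuration):

```python
import numpy as np

def crop_positions(h, w, size=160):
    """Top-left corners of the four corner crops and the center crop."""
    return [(0, 0), (0, w - size), (h - size, 0), (h - size, w - size),
            ((h - size) // 2, (w - size) // 2)]

def random_crop_flip(img, size=160, rng=np.random):
    """Training-time augmentation: one random crop among the five, random flip.
    `img` is an HxWx3 array assumed already resized to 170x170."""
    h, w = img.shape[:2]
    y, x = crop_positions(h, w, size)[rng.randint(5)]
    patch = img[y:y + size, x:x + size]
    if rng.rand() < 0.5:
        patch = patch[:, ::-1]               # horizontal flip
    return patch
```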

4.2 Results of Test Datasets

Evaluation is performed on the widely used LFW dataset [17], the frontal-profile CFP dataset [18] and the cross-age CACD dataset [19]. As in the training stage, we use only NPD to detect faces, without any extra alignment. The weights \(\alpha \) and \(\beta \) for FC6/FC7 are 0.15 and 0.85 respectively, and the number of target neighbors k is set to 3 in LMNN.

LFW Dataset: LFW [17] is one of the most popular datasets, and performance on it is close to saturation. It consists of 13,233 face images of 5,749 identities. We evaluate our model following the most permissive protocol: unrestricted with labeled outside data. The FV task requires evaluation on 6,000 pairs split into 10 folds, each containing half genuine and half impostor pairs.
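For reference, a sketch of this evaluation procedure is given below (an assumed re-implementation of the standard LFW-style protocol: the best threshold is chosen on nine folds and accuracy is reported on the held-out fold; LFW itself ships fixed folds, while here folds are formed by simple splitting for illustration).

```python
import numpy as np

def tenfold_accuracy(scores, labels, n_folds=10):
    """Mean verification accuracy over folds with leave-one-fold-out thresholds."""
    scores = np.asarray(scores)
    labels = np.asarray(labels, dtype=bool)   # True for genuine pairs
    folds = np.array_split(np.arange(len(scores)), n_folds)
    accs = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(scores)), fold)
        # threshold that maximizes accuracy on the nine training folds
        cands = np.unique(scores[train])
        best = max(cands, key=lambda t: np.mean((scores[train] >= t) == labels[train]))
        accs.append(np.mean((scores[fold] >= best) == labels[fold]))
    return float(np.mean(accs))
```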

We compare the networks trained with the Baseline and CB-VGG architectures. For both FC6 and FC7, we extract features from the 10 patches separately. Table 1 shows that CB-VGG is slightly better than the Baseline. Moreover, with the weighted average method and LMNN metric learning, accuracy is further improved. This verifies the effectiveness of the weighted average of the last two layers, which can be regarded as a fusion of two models trained only once.

Table 1. Accuracy on LFW (%). Y = used, N = not used

We also compare our method with other state-of-the-art methods. The results are given in Table 2. Our method achieves a notably high accuracy (99.07\(\%\)) with relatively few training images and a simple pipeline.

CFP Dataset: CFP [18] is a large-pose face dataset composed of 10 frontal and 4 profile images for each of 500 individuals. Unlike LFW, CFP defines two FV experiments: frontal-frontal (FF) and frontal-profile (FP). Both contain 10 folds, each with 350 genuine pairs and 350 impostor pairs.

We again compare the Baseline and CB-VGG models. Here, however, features are extracted from both FC6 and FC7 using only the single center crop. PCA is applied to reduce the fused features from 4096 to 300 dimensions. Table 3 shows that CB-VGG with fused features and LMNN is more robust to pose variation.
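A sketch of this reduction step, assuming scikit-learn (the paper does not state its PCA implementation); the random matrix is a placeholder for real fused descriptors.

```python
import numpy as np
from sklearn.decomposition import PCA

fused = np.random.randn(1000, 4096)     # hypothetical fused FC6/FC7 features
pca = PCA(n_components=300)
fused_300 = pca.fit_transform(fused)    # project 4096-d -> 300-d
print(fused_300.shape)                  # (1000, 300)
```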

Following the protocol of CFP [18], we also report EER (Equal Error Rate) and AUC (Area Under the Curve) values averaged over the 10 splits. Table 4 and Fig. 2 present the results and the ROC curves of our method alongside the state-of-the-art methods.
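For completeness, EER and AUC can be computed from per-pair scores as sketched below (scikit-learn's ROC utilities are an assumed tool choice, not the paper's own code):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def eer_and_auc(scores, labels):
    """labels: 1 for genuine pairs, 0 for impostor pairs."""
    fpr, tpr, _ = roc_curve(labels, scores)
    idx = np.nanargmin(np.abs(fpr - (1.0 - tpr)))   # point where FPR ~ FNR
    eer = (fpr[idx] + (1.0 - tpr[idx])) / 2.0       # average the two near-equal rates
    return eer, auc(fpr, tpr)
```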

Table 2. Performance comparison with the state of the art on LFW (%). Y = used, N = not used
Table 3. Accuracy on CFP (%). Y = used, N = not used
Table 4. Performance comparison with the state of the art on CFP (%). ACC = Accuracy, EER = Equal Error Rate, AUC = Area Under the Curve

We observe that our method achieves results competitive with the best accuracy on both FF and FP. On FF, our result is slightly lower than FV-DCNN+pool5 [25] while remaining much simpler: FV-DCNN+pool5 [25] learns a Gaussian mixture model and performs Fisher vector encoding after extracting features, and additionally uses PCA, joint Bayesian and score fusion with pool5. On FP, we rank first among all methods except p-CNN [1]. p-CNN (pose-directed multi-task CNN) specifically tackles pose variation by separating all poses into several groups and jointly learning identity and PIE (pose, illumination and expression) for each group. Although p-CNN has a great advantage on pose, it performs worse than our method on FF.

Fig. 2. ROC curves on the CFP dataset. (a) Frontal-Frontal protocol; (b) Frontal-Profile protocol. We compare only deep learning methods and omit conventional methods.

CACD Dataset: CACD [19] is a large cross-age celebrity dataset consisting of 163,446 images of 2,000 celebrities, with ages ranging from 16 to 62. For the verification task, 2,000 positive and 2,000 negative image pairs are selected to form CACD-VS and divided into 10 folds.

Table 5. Accuracy on CACD (%). Y = used, N = not used

Similar to the previous process, we compare the baseline with our method and provide the results of other methods on CACD-VS. The results are shown in Tables 5 and 6.

MFM-CNN [28] and DeepVisage [7] are carefully designed: the former uses a maxout activation to separate noisy from informative signals, and the latter uses feature normalization so that all features contribute equally to the cost function. LF-CNNs [27] learn age-invariant features by adding a latent factor layer. Note that we do not use any complex blocks or fine-tuning to handle age variation.

Table 6. Performance comparison with the state of the art on CACD (%). Y = used.

In summary, we only use VGGNet with a modification that combines FC6 with FC7, plus LMNN metric learning. All training images are randomly selected without any cleaning. Experiments show that our method is robust to pose and age variation in the FV task: it handles both frontal and pose-variable faces, the weighted features of the last two layers learn mutually complementary information, and LMNN is a fast metric learning method that clearly improves discrimination.

4.3 Analysis and Discussion

To highlight the influence of each component, we further analyze (i) the quantity and augmentation of the training data; (ii) different weight settings; (iii) examples corrected by our methods. All investigations are conducted on the LFW dataset.

First, we study the influence of the quantity of training data. From the results in Table 7 we observe that: (i) more training images generally lead to better CNN performance, although the 100 K-identity setting lowers the results because of the larger amount of noisy images; (ii) multi-crop augmentation is clearly helpful, since it effectively adds more clean images per identity. In this paper, we do not use any selection or alignment for the training dataset; instead, multi-crop data augmentation helps us extract features invariant to location and pose.

Table 7. Analysis of the influence of training-set size and data augmentation. \(\sim \) denotes an approximate count obtained after detection.

Next, we analyze various weight combinations for the last two layers. Figure 3 shows that the choice of weights matters: since the FC7 layer extracts more discriminative information than FC6, its weight must be larger; otherwise, performance is dragged down by the weaker feature.

Fig. 3. Analysis of the influence of the weights of the last two layers. (Color figure online)

Finally, we present some examples corrected by our methods. Figure 4 shows four misclassified pairs: (a) two falsely rejected pairs and (b) two falsely accepted pairs. These pairs are misclassified by the CB-VGG model when using only the FC7 layer. With the weighted average features of the last two layers, the pairs marked with green rectangles are corrected; the simple weighted combination thus yields more discriminative features that are more robust to pose, illumination, occlusion and so on. With LMNN, even harder pairs are corrected, such as the pair marked with a red rectangle. This shows that our method indeed improves the feature representation.

Fig. 4. Misclassified image pairs of the LFW dataset under the CB-VGG model using only the FC7 layer. Pairs marked with green rectangles are corrected by the weighted average of FC6 and FC7 features; the pair marked with a red rectangle is further corrected by LMNN. (a) and (b) show falsely rejected and falsely accepted pairs, respectively.

5 Conclusions

In this paper, we use an extremely simple pre-processing procedure, without cleaning or alignment, and propose a single model named CB-VGG. To balance time and accuracy, we compute a simple weighted average of the last two FC layers instead of fusing several models, which can be seen as fusing two models while training only once. We then apply LMNN metric learning as post-processing. Combining these methods, we achieve competitive results that are relatively robust to pose and age variations. These results suggest that complex image-selection processes or multiple trained models may not be necessary to boost performance: a single model can be fully exploited to obtain good results while saving considerable time. In the future, we will explore whether cleaned images are truly important for performance.