Fusing Multiple Deep Features for Face Anti-spoofing

  • Yan Tang
  • Xing Wang
  • Xi Jia
  • Linlin Shen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10996)


With the growing deployment of face recognition systems in recent years, face anti-spoofing has become increasingly important, due to the increasing number of spoofing attacks via printed photos or replayed videos. Motivated by the powerful representation ability of deep learning, in this paper we propose to use CNNs (Convolutional Neural Networks) to learn multiple deep features from different cues of the face images for anti-spoofing. We integrate temporal features, color based features and patch based local features for spoof detection. We evaluate our approach extensively on publicly available databases, i.e. CASIA FASD, REPLAY-MOBILE and OULU-NPU. The experimental results show that our approach achieves much better performance than state-of-the-art methods: an EER (Equal Error Rate) of 2.22% on CASIA FASD, an ACER (Average Classification Error Rate) of 3.2% on OULU-NPU (protocol 1) and an ACER of 0.00% on the REPLAY-MOBILE database.


Keywords: Deep convolutional neural networks · Face anti-spoofing · Multiple features

1 Introduction

With the advancement of computer vision technologies, face recognition has been widely used in applications such as access control and login systems. As printed photos and replayed videos of a user can easily spoof a face recognition system, approaches capable of detecting such spoof attacks are in high demand.

To decide whether the face presented to the camera belongs to a live person or is a spoof attack, a number of approaches have been proposed in the literature. The main cues widely used are depth information, color texture information and motion information. As the majority of attacks use printed photos or replayed videos, depth information can be a useful cue, since live faces are 3D while spoof faces are 2D. Wang et al. [11] combined depth information and texture information for face anti-spoofing. They used LBP (Local Binary Pattern) features to represent the depth image captured by a Kinect and used a CNN (Convolutional Neural Network) to learn texture information from the RGB image. This method requires an extra depth camera, which is usually not available in many applications. Instead of using a depth sensor like [11], the work presented in [9] adopted a CNN to estimate the depth information and then fused it with the appearance information extracted from the face regions to distinguish between spoof and genuine faces. Besides depth information, color texture and motion information have also been widely applied for face liveness detection [10, 13, 14]. Boulkenafet et al. [10] used different color spaces (HSV and YCbCr) to explore the color texture information and extracted LBP features from each channel. The LBP features from all channels were concatenated and then fed into an SVM (Support Vector Machine) for classification; an EER of 3.2% was obtained on the CASIA dataset. In contrast to the color texture information extracted from static images, methods using motion cues try to exploit the temporal information of genuine faces. Feng et al. [13] utilized dense optical flow to capture motion and designed optical flow based face motion and scene motion features.
They also proposed a so-called shearlet-based image quality feature, and then fused all three features using a neural network for classification. Pan et al. [14] proposed a real-time liveness detection approach against photograph spoofing in face recognition by recognizing spontaneous eye blinks.

Motivated by the fact that CNNs can learn features with high discriminative ability, many recent methods [8, 15] use CNNs for face anti-spoofing. Yang et al. [8] trained a CNN with five convolutional layers and three fully-connected layers. Both single frames and multiple frames were input to the network to learn spatial features and spatial-temporal features, respectively, and an EER of 4.64% was reported on the CASIA dataset. Li et al. [15] fine-tuned the pre-trained VGG-face model and used the learned deep features to identify spoof attacks; an EER of 4.5% was achieved on the CASIA dataset.

While the existing methods explore various information for face anti-spoofing, most of them exploit only a single cue of the face. Although several methods [11, 13] did explore multiple cues of the face for anti-spoofing, they only adopted hand-crafted features. In this paper we propose to use CNNs to learn multiple deep features from different cues of the face, so as to integrate complementary information. Experiments show that our approach outperforms state-of-the-art approaches.

Below we detail the proposed method in Sect. 2. We then present the experimental results in Sect. 3, and draw conclusions in Sect. 4.

2 The Proposed Method

In this paper, we aim to exploit three types of information, i.e. the temporal information, the color information and the local information, for face anti-spoofing. As shown in Fig. 1, the face detector proposed in [2] is first applied to detect and crop the face from a given image. Then we use CNNs to learn three deep features from different cues of the face. Specifically, we learn the temporal feature from image sequences, the color based feature from different color spaces and the patch based local feature from local patches. Each CNN's learning process is supervised by a binary softmax classifier. Considering that the multiple features are complementary to one another, we further propose a strategy to integrate all of them: the class probabilities output by the softmax function of each CNN are concatenated into a class probability vector, which is then fed into an SVM for classification.
Fig. 1.

The proposed multiple deep feature method.

2.1 Multiple Deep Features

Temporal Feature.

Here we introduce a strategy to exploit the temporal information between image frames in the video sequence. Specifically, we first convert three color images at different temporal positions into three gray images, and then stack the gray images as a whole sample and feed the stacked volume into the CNN. Figure 2 shows an example volume stacked by three gray images.
Fig. 2.

A volume stacked by three gray images in the CASIA database.
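The frame-stacking step can be sketched in a few lines of numpy; the BT.601 grayscale weights are an assumption, since the paper does not specify which RGB-to-gray conversion it uses:

```python
import numpy as np

def stack_gray_volume(frames):
    """Convert three RGB frames (H, W, 3) to grayscale and stack them
    into one (H, W, 3) volume that replaces the RGB channels of a
    single-frame input."""
    grays = []
    for f in frames:
        f = f.astype(np.float32)
        # ITU-R BT.601 luminosity weights (an assumption; any standard
        # RGB-to-gray conversion would serve equally well).
        gray = 0.299 * f[..., 0] + 0.587 * f[..., 1] + 0.114 * f[..., 2]
        grays.append(gray)
    return np.stack(grays, axis=-1)

# Example: three 256 x 256 frames sampled at different temporal positions.
frames = [np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
          for _ in range(3)]
volume = stack_gray_volume(frames)
print(volume.shape)  # (256, 256, 3)
```

Because the stacked volume has the same shape as an RGB image, the same network architecture can consume it without modification.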

Color Based Feature.

It was demonstrated in [12] that the color information in the HSV and YCbCr color spaces is more discriminative than that in the RGB space for face anti-spoofing. However, [12] used hand-crafted features (i.e. LBP features) to encode this color information. Here we use CNNs to learn high-level color based features from the RGB, HSV and YCbCr color spaces, respectively. For the HSV (or YCbCr) color space, we first convert the RGB image into the HSV (or YCbCr) color space and then feed the converted image into the CNN for feature learning. Figure 3 shows an example face in the different color spaces, i.e. RGB, HSV and YCbCr.
Fig. 3.

Sample images of different color spaces i.e. RGB, HSV and YCbCr. The first row shows the genuine face images. The second row shows the warped print photo face images. The third row shows the cut print photo face images. The last row shows the replay video face images. All sample images are sampled from the CASIA database. (Color figure online)
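The color-space conversion step can be sketched in numpy; the BT.601 full-range coefficients below are an assumption, since the paper does not state which transform variant it uses (the HSV conversion is analogous, e.g. via OpenCV's cvtColor or Python's colorsys):

```python
import numpy as np

def rgb_to_ycbcr(img):
    """Convert an RGB image (H, W, 3) to YCbCr using the BT.601
    full-range transform (an assumed variant)."""
    img = img.astype(np.float32)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    y  = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return np.stack([y, cb, cr], axis=-1)

face = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
ycbcr = rgb_to_ycbcr(face)  # fed to the CNN in place of the RGB image
```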

Patch Based Local Feature.

While the temporal and color features are mainly learnt from the whole face image, important local information could be missed. In order to exploit the local information, we divide the face image into a number of patches of the same size for local feature representation. A set of ten patches of size 96 × 96 is randomly cropped from the training faces and then used to train the network. Figure 4 shows a set of patches cropped from an example face.
Fig. 4.

Samples of face patches in the CASIA database.
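The random cropping step can be sketched as follows; the fixed seed is only for reproducibility of the example:

```python
import numpy as np

def random_patches(face, n=10, size=96, rng=None):
    """Randomly crop n size x size patches from a face image (H, W, C)."""
    if rng is None:
        rng = np.random.default_rng(0)
    h, w = face.shape[:2]
    patches = []
    for _ in range(n):
        # Top-left corner sampled uniformly so the patch stays inside.
        y = rng.integers(0, h - size + 1)
        x = rng.integers(0, w - size + 1)
        patches.append(face[y:y + size, x:x + size])
    return np.stack(patches)

face = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
patches = random_patches(face)
print(patches.shape)  # (10, 96, 96, 3)
```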

2.2 Network Architecture

In our work, we employ the 18-layer residual network (ResNet) [16] as the CNN. However, the last 1000-unit softmax layer (originally designed to predict 1000 classes) is replaced by a 2-unit softmax layer, which assigns scores to the genuine and spoof classes. A brief illustration of the network architecture is shown in Fig. 5. The network consists of seventeen convolutional (conv) layers and one fully-connected (fc) layer. The orange, green, dark green and red rectangles represent the convolutional layer, max pooling layer, average pooling layer and fully-connected layer, respectively. The purple rectangle represents the BatchNorm (BN) and ReLU layers. As shown in Fig. 6, the light blue rectangle represents residual block 1 (RB1), and the dark blue rectangle represents residual block 2 (RB2).
Fig. 5.

The network architecture. (Color figure online)

Fig. 6.

Left: the residual block 1, Right: the residual block 2. (Color figure online)
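Swapping the 1000-unit head for a 2-unit one leaves the ResNet-18 trunk untouched: scoring reduces to one fully-connected layer and a softmax over the 512-d average-pooled feature. A minimal numpy sketch of that final step, where the weights are random placeholders rather than trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# 512-d average-pooled feature from the ResNet-18 trunk (placeholder).
feature = rng.standard_normal(512).astype(np.float32)

# New 2-unit head replacing the original 1000-way classifier.
W = (rng.standard_normal((2, 512)) * 0.01).astype(np.float32)
b = np.zeros(2, dtype=np.float32)

logits = W @ feature + b
probs = np.exp(logits - logits.max())  # numerically stable softmax
probs /= probs.sum()                   # scores for {genuine, spoof}
print(probs)                           # two probabilities summing to 1
```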

2.3 SVM Classification with Integration of Multiple Deep Features

Different features capture different characteristics of the face and are complementary to each other, so we perform the classification with an integration of all the features. As shown in Fig. 1, each CNN outputs the probability that the given face belongs to the genuine class or the spoof class. The class probabilities output by the softmax function of each CNN are concatenated into a class probability vector, which is then fed into an SVM for classification. Given a video with N frames, N class probability vectors can be generated, and the video is classified using the average of these class probability vectors.
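The fusion and frame-averaging steps can be sketched as follows; the five-network configuration is an illustrative assumption (one CNN each for the temporal, RGB, HSV, YCbCr and patch features), and the trained SVM that would consume the resulting vector (e.g. scikit-learn's SVC) is omitted:

```python
import numpy as np

def video_feature(per_frame_probs):
    """per_frame_probs: array of shape (N, K, 2) -- for each of N frames,
    the (genuine, spoof) softmax outputs of K CNNs.  Concatenate the K
    probability pairs into a 2K-dim vector per frame, then average over
    frames to obtain one vector per video."""
    n, k, _ = per_frame_probs.shape
    vectors = per_frame_probs.reshape(n, 2 * k)  # per-frame concatenation
    return vectors.mean(axis=0)                  # video-level average

# Example: 30 frames, 5 CNNs.
rng = np.random.default_rng(0)
raw = rng.random((30, 5, 2)).astype(np.float32)
probs = raw / raw.sum(axis=-1, keepdims=True)    # each pair sums to 1
v = video_feature(probs)
print(v.shape)  # (10,) -- this vector is what the SVM classifies
```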

3 Experiments

3.1 Datasets and Protocol

In this paper, the experiments were conducted on three databases, i.e. the CASIA FASD, OULU-NPU and REPLAY-MOBILE databases, whose details are summarized in Table 1.
Table 1.

The summary of three face spoof databases.



Database | Videos (live, spoof) | Acquisition device | Spoof type | Year of release
CASIA FASD | 600 | USB camera (640 × 480, 480 × 640); Sony NEX-5 camera (1920 × 1080) | Warped photo; cut photo; replay video | 2012
REPLAY-MOBILE | 1190 | Smartphone; tablet (720 × 1280) | Printed photo; mattescreen (displayed photo and video) | 2016
OULU-NPU | 5940 | Samsung Galaxy S6 (1920 × 1080); HTC Desire EYE; ASUS Zenfone Selfie; Sony XPERIA C5 Ultra Dual | Printed photo (Canon imagePRESS C6011 and PIXMA iX6550); replay video (19" Dell UltraSharp 1905FP and Macbook 13" laptop) | 2017


CASIA FASD Database.

The CASIA FASD (face anti-spoofing database) [4] contains 600 genuine and spoof videos of 50 subjects; 12 videos (3 genuine and 9 spoof) were captured for each subject. The database covers three imaging qualities and three types of attacks, i.e. the warped photo attack, the cut photo attack (an attacker hides behind the cut photo and blinks, or an intact photo is moved up and down behind the cut one) and the video attack. In [4], seven testing scenarios were designed, i.e. the three imaging qualities, the three fake face types and the overall scenario (all data are used). In our experiments, we use all the videos (the overall scenario). The training and the test set consist of 20 subjects (60 live videos and 180 attack videos) and 30 subjects (90 live videos and 270 attack videos), respectively. As shown in Fig. 7, we detected and aligned the faces in the videos with the MTCNN detector [2] and cropped them to 256 × 256. The same face alignment and cropping procedure was also applied to the following two databases.
Fig. 7.

Example images with different qualities of the CASIA FASD database. (a) genuine face image. (b) warped print photo attack. (c) cut print photo attack. (d) video attack.


REPLAY-MOBILE Database.

The REPLAY-MOBILE database [6] consists of 1190 videos of 40 subjects, i.e. 16 attack videos for each subject. This database covers five different mobile scenarios, with uniform or complex scene backgrounds and varying lighting conditions. Real client accesses were recorded under five different lighting conditions (controlled, adverse, direct, lateral and diffuse) [6]. In our experiments, we use 120 genuine videos and 192 spoof videos for training. The development set contains 160 real videos and 256 attack videos, and there are 110 live videos and 192 fake videos in the test set.

OULU-NPU Database.

The OULU-NPU database [5] consists of 5940 real access and attack videos of 55 subjects (15 female and 40 male), with 90 videos per client. The attacks include both print and video-replay attacks, which were produced using two different printers and two different display devices. The database was collected in three sessions with different illumination conditions using six different smartphones, and covers four kinds of attacks. We use the 4950 genuine and spoof videos in the public set [5] for testing. The database is divided into three disjoint subsets, i.e. a training set (20 users), a development set (15 users) and a testing set (20 users). In our experiments, two protocols were employed to evaluate the robustness of the proposed algorithm, i.e. protocol 1 for illumination variation and protocol 2 for presentation attack instrument (PAI) variation.

3.2 Performance Metrics

The FAR (False Acceptance Rate) [3] is the ratio of the number of false acceptances to the number of negative samples. The FRR (False Rejection Rate) is the ratio of the number of false rejections to the number of positive samples. The EER is the point on the ROC curve where the FAR equals the FRR. In our experiments, the results on CASIA-FASD are reported in EER. The results on the REPLAY-MOBILE and OULU-NPU databases are reported using the standardized ISO/IEC 30107-3 metrics [18], i.e. the APCER (Attack Presentation Classification Error Rate) and the BPCER (Bona Fide Presentation Classification Error Rate). The ACER (Average Classification Error Rate) is half of the sum of the APCER and the BPCER.
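These definitions translate directly into code; a minimal numpy sketch, where sweeping thresholds over the observed scores is an implementation choice not specified in the text:

```python
import numpy as np

def far_frr(scores_spoof, scores_live, thr):
    """FAR: fraction of spoof (negative) samples accepted at threshold thr;
    FRR: fraction of live (positive) samples rejected."""
    far = np.mean(scores_spoof >= thr)
    frr = np.mean(scores_live < thr)
    return far, frr

def eer(scores_spoof, scores_live):
    """Sweep candidate thresholds and return the error at the point
    where FAR and FRR are closest (the EER)."""
    thrs = np.unique(np.concatenate([scores_spoof, scores_live]))
    rates = [far_frr(scores_spoof, scores_live, t) for t in thrs]
    far, frr = min(rates, key=lambda r: abs(r[0] - r[1]))
    return (far + frr) / 2.0

def acer(apcer, bpcer):
    """ACER is the mean of APCER and BPCER at a fixed threshold."""
    return (apcer + bpcer) / 2.0
```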

3.3 Experimental Settings

As the number of samples in the publicly available datasets is very limited, a CNN can easily over-fit when trained from scratch. We therefore fine-tune the ResNet-18 [16] model pre-trained on the ImageNet database. The proposed framework is implemented using the Caffe toolbox [19]. The input image size is 256 × 256 and the network is trained with a mini-batch size of 64. During training, the learning rate is 0.0001, the weight decay is 0.0005 and the momentum is 0.9. These parameters are kept constant in all our experiments.
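In Caffe these hyper-parameters live in the solver definition; a minimal solver.prototxt sketch consistent with the stated settings (the net path and the fixed learning-rate policy are illustrative assumptions, and the batch size of 64 would be set in the data layer of the net definition rather than here):

```
net: "models/resnet18_antispoof/train_val.prototxt"  # hypothetical path
base_lr: 0.0001       # learning rate stated in the text
lr_policy: "fixed"    # parameters are kept constant in the experiments
momentum: 0.9
weight_decay: 0.0005  # the "decay rate" mentioned above
solver_mode: GPU
```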

3.4 Results

We first use the CASIA-FASD dataset to evaluate the performance of the different features, i.e. the temporal feature, the color based features in three different color spaces (RGB, HSV and YCbCr), the patch based local feature, and their fusion. Table 2 details the EERs of the different features on CASIA-FASD. It can be observed that when only a single feature is used, the patch based local feature achieves the best performance, i.e. an EER of 2.59%. After fusing all the features, the EER is further reduced to 2.22%, which validates the proposed multiple deep feature method. We then compare the proposed method with the state of the art in Table 3. As shown in Table 3, our approach achieves the lowest EER among all of the approaches.
Table 2.

Performance of different features on the CASIA-FASD database.

Feature type | EER (%)
Temporal feature | –
Color based feature (RGB) | –
Color based feature (HSV) | –
Color based feature (YCbCr) | –
Patch based local feature | 2.59
Fusion of all features | 2.22


Table 3.

Performance comparison with the state of the art on the CASIA-FASD database.


Method | EER (%)
DoG [4] | –
IDA + SVM [1] | –
Color texture [12] | –
Multi-cues integration + NN [13] | –
CNN [8] | 4.64
DPCNN [15] | 4.5
CSURF [17] | –
Patch and Depth [9] | –
Our method | 2.22


Tables 4 and 5 list the results of the proposed approach and the other methods on the REPLAY-MOBILE and OULU-NPU databases, respectively. On the REPLAY-MOBILE database, our approach performs much better than IQM and Gabor, i.e. no error was recorded. On the OULU-NPU dataset, the ACERs of the proposed method on protocols 1 and 2 are 3.2% and 2.4%, respectively, which are much better than those of CPqD and GRADIANT. Overall, the above experiments demonstrate the superiority of the proposed approach over the other methods.
Table 4.

Performance comparison with the state of the art on the REPLAY-MOBILE database.


Method | Test (%)
IQM [6] | –
Gabor [6] | –
Our method | 0.00 (ACER)




Table 5.

Performance comparison with the state of the art on the OULU-NPU database.



Protocol 1

Method | Dev (%) | Test (%)
Boulkenafet [5] | – | –
CPqD [7] | – | –
GRADIANT_extra [7] | – | –
Our method | – | 3.2 (ACER)

Protocol 2

Method | Dev (%) | Test (%)
Boulkenafet [5] | – | –
GRADIANT_extra [7] | – | –
Our method | – | 2.4 (ACER)





4 Conclusions

In this paper, we proposed to employ CNNs to learn discriminative multiple deep features from different cues of the face for face anti-spoofing. Because these multiple features are complementary to each other, we further presented a strategy to integrate them to boost the performance. We evaluated the proposed approach on three public databases, and the experimental results demonstrated that it can outperform the state of the art for face anti-spoofing. As for future work, we will conduct more cross-dataset experiments to investigate the generalization ability of the proposed method.



The work is supported by the Natural Science Foundation of China under Grants No. 61672357 and U1713214.


References

  1. Wen, D., Han, H., Jain, A.K.: Face spoof detection with image distortion analysis. IEEE Trans. Inf. Forensics Secur. 10(4), 746–761 (2015)
  2. Zhang, K., Zhang, Z., Li, Z., Qiao, Y.: Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett. 23(10), 1499–1503 (2016)
  3. Bengio, S., Mariéthoz, J.: A statistical significance test for person authentication. In: The Speaker and Language Recognition Workshop (Odyssey), pp. 237–244, Toledo (2004)
  4. Zhang, Z., Yan, J., Liu, S., Lei, Z., Yi, D., Li, S.Z.: A face antispoofing database with diverse attacks. In: IAPR International Conference on Biometrics, pp. 26–31 (2012)
  5. Boulkenafet, Z., Komulainen, J., Li, L., Feng, X., Hadid, A.: OULU-NPU: a mobile face presentation attack database with real-world variations. In: IEEE International Conference on Automatic Face and Gesture Recognition, pp. 612–618 (2017)
  6. Costa-Pazo, A., Bhattacharjee, S., Vazquez-Fernandez, E., Marcel, S.: The REPLAY-MOBILE face presentation-attack database. In: Biometrics Special Interest Group (2016)
  7. Boulkenafet, Z., Komulainen, J., Akhtar, Z., Benlamoudi, A., Samai, D., Bekhouche, S., et al.: A competition on generalized software-based face presentation attack detection in mobile scenarios. In: IEEE International Joint Conference on Biometrics (2017)
  8. Yang, J., Lei, Z., Li, S.Z.: Learn convolutional neural network for face anti-spoofing. Comput. Sci. 9218, 373–384 (2014)
  9. Atoum, Y., Liu, Y., Jourabloo, A., Liu, X.: Face anti-spoofing using patch and depth-based CNNs. In: IEEE International Joint Conference on Biometrics (2018)
  10. Boulkenafet, Z., Komulainen, J., Hadid, A.: Face spoofing detection using colour texture analysis. IEEE Trans. Inf. Forensics Secur. 11(8), 1818–1830 (2016)
  11. Wang, Y., Nian, F., Li, T., Meng, Z., Wang, K.: Robust face anti-spoofing with depth information. J. Vis. Commun. Image Represent. 49, 332–337 (2017)
  12. Boulkenafet, Z., Komulainen, J., Hadid, A.: Face anti-spoofing based on colour texture analysis. In: IEEE International Conference on Image Processing, pp. 2636–2640 (2015)
  13. Feng, L., Po, L.M., Li, Y., Xu, X., Yuan, F., Cheung, C.H., et al.: Integration of image quality and motion cues for face anti-spoofing. J. Vis. Commun. Image Represent. 38(2), 451–460 (2016)
  14. Pan, G., Sun, L., Wu, Z., Lao, S.: Eyeblink-based anti-spoofing in face recognition from a generic webcamera. In: IEEE International Conference on Computer Vision, pp. 1–8 (2007)
  15. Li, L., Feng, X., Boulkenafet, Z., Xia, Z., Li, M., Hadid, A.: An original face anti-spoofing approach using partial convolutional neural network. In: International Conference on Image Processing Theory, Tools and Applications, pp. 1–6. IEEE (2017)
  16. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp. 770–778 (2016)
  17. Boulkenafet, Z., Komulainen, J., Hadid, A.: Face antispoofing using speeded-up robust features and Fisher vector encoding. IEEE Signal Process. Lett. 24, 141–145 (2017)
  18. ISO/IEC JTC 1/SC 37 Biometrics: Information technology – Biometric presentation attack detection – Part 1: Framework. International Organization for Standardization (2016)
  19. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., et al.: Caffe: convolutional architecture for fast feature embedding. In: 22nd ACM International Conference on Multimedia, pp. 675–678. ACM (2014)

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Computer Vision Institute, School of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
