1 Introduction

Face recognition technology has been widely used in many intelligent systems due to its convenience and remarkable accuracy. However, face recognition systems are still vulnerable to presentation attacks (PAs), ranging from print and replay to 3D-mask attacks. Therefore, both academia and industry have recognized the critical role of face anti-spoofing (FAS) in securing face recognition systems.

In the past decade, plenty of hand-crafted feature based (Boulkenafet et al., 2015, 2017; Komulainen et al., 2013; Patel et al., 2016) and deep learning based (Qin et al., 2020; Liu et al., 2018; Yang et al., 2019; Atoum et al., 2017; Gan et al., 2017; George & Marcel, 2019) methods have been proposed for unimodal FAS. Despite satisfactory performance on seen attacks and environments, unimodal methods generalize poorly to emerging novel attacks and unseen deployment conditions. Thanks to advanced sensors with various modalities (e.g., RGB, Infrared (IR), Depth, Thermal) (George et al., 2019), multimodal methods facilitate FAS applications in high-security scenarios demanding low false acceptance errors (e.g., face payment and vault entrance guard).

Recently, due to their strong long-range and cross-modal representation capacity, vision transformer (ViT) (Dosovitskiy et al., 2021) based methods (Liu &, 2022; George & Marcel, 2020b) have been proposed to improve the robustness of FAS systems. However, these methods focus on directly finetuning ViTs (George & Marcel, 2020b) or modifying ViTs with complex and powerful modules (Liu &, 2022), and thus provide limited insight into the fundamental natures (e.g., modality-aware inputs, suitable multimodal pre-training, and efficient finetuning) of ViT in multimodal FAS. Despite mature exploration and findings (He et al., 2022; Bachmann et al., 2022; Xiao et al., 2021) on ViT in other computer vision communities [e.g., generic object classification (Chen et al., 2022a)], this knowledge might not transfer directly to multimodal FAS due to the task and modality gaps.

Compared with CNNs, ViT usually aggregates coarse intra-patch information at the very early stage and then propagates inter-patch global attentional features. In other words, it neglects the local detailed clues of each modality. According to prior evidence from MM-CDCN (Yu et al., 2020b), local fine-grained features from multiple levels benefit live/spoof clue representation in convolutional neural networks (CNNs) across different modalities. Whether local descriptors/features can improve ViT-based multimodal FAS systems is therefore worth exploring.

In comparison with CNNs, ViTs usually have far more parameters to train, and thus easily overfit on the FAS task with limited data amount and diversity. Existing works show that finetuning only the last classification head (George & Marcel, 2020b) or training extra lightweight adapters (Huang et al., 2022) can achieve better performance than full finetuning. However, all these observations are based on unimodal RGB inputs; it is unclear how different ViT-based transfer learning techniques perform on (1) other unimodal scenarios (IR or Depth modality); and (2) multimodal scenarios (e.g., RGB+IR+Depth). Moreover, the design of more efficient transfer learning modules for ViT-based multimodal FAS should be considered.

Existing multimodal FAS works usually finetune ImageNet pre-trained models, which might be sub-optimal due to the huge task (FAS vs. generic object classification) and modality (multimodal vs. unimodal) gaps. Meanwhile, considering the costly collection of large-scale annotated live/spoof data, self-supervised pre-training without labels (Muhammad et al., 2022) is a promising direction for model initialization in multimodal FAS. Although a few self-supervised pre-training methods [e.g., masked image modeling (MIM) (Bachmann et al., 2022; Chen et al., 2022b) and contrastive learning (Akbari et al., 2021)] have been developed for multimodal (e.g., vision-language) applications, there are still no self-supervised pre-trained models specially designed for multimodal FAS. Investigating the discrimination and generalization capacity of pre-trained models and designing advanced self-supervision strategies are crucial for ViT-based multimodal FAS.

Motivated by the discussions above, in this paper we rethink ViT-based multimodal FAS from three aspects, i.e., modality-aware inputs, suitable multimodal pre-training, and efficient finetuning. Besides the elaborate investigations, we also provide corresponding elegant solutions to (1) establish powerful inputs with local descriptors (Bhattacharjee & Roy, 2019; Dalal & Triggs, 2005) for the IR modality; (2) efficiently finetune multimodal ViTs via adaptive multimodal adapters; and (3) pre-train generalized multimodal models via a modality-asymmetric masked autoencoder. Our contributions include:

  • We are the first to investigate three key factors (i.e., inputs, pretraining, and finetuning) for ViT-based multimodal FAS. We find that (1) leveraging local feature descriptors benefits the ViT on the IR modality; (2) partial finetuning or adapters achieve reasonable performance for ViT-based multimodal FAS but are still far from satisfactory; and (3) masked autoencoder (He et al., 2022; Bachmann et al., 2022) pre-training cannot provide better finetuning performance than ImageNet pre-trained models.

  • We design the adaptive multimodal adapter (AMA) for ViT-based multimodal FAS, which efficiently aggregates local multimodal features while freezing the majority of ViT parameters.

  • We propose the modality-asymmetric masked autoencoder (M\(^{2}\)A\(^{2}\)E) for multimodal FAS self-supervised pre-training. Compared with modality-symmetric autoencoders (He et al., 2022; Bachmann et al., 2022), the proposed M\(^{2}\)A\(^{2}\)E is able to learn more intrinsic task-aware representations and is compatible with modality-agnostic downstream settings. To the best of our knowledge, this is the first attempt to design an MIM framework for generalized multimodal FAS.

  • Our proposed methods achieve state-of-the-art performance under most modality settings in both intra- and cross-dataset testings on the WMCA (George et al., 2019) and CASIA-SURF (Zhang et al., 2019b) datasets. Moreover, the proposed method achieves the most robust performance under various missing-modality settings.

2 Related Work

Multimodal face anti-spoofing   With multimodal inputs (e.g., RGB, IR, Depth, and Thermal), a few multimodal FAS works consider input-level (Nikisins et al., 2019; Liu et al., 2021; George & Marcel, 2020) and decision-level (Zhang et al., 2019a) fusions. Besides, mainstream FAS methods extract complementary multimodal features using feature-level fusion (Yu et al., 2020b; Zhang et al., 2019b; Liu et al., 2021; Wang et al., 2022a; Liu &, 2022; Li et al., 2021) strategies. As there is redundancy across multimodal features, direct feature concatenation (Yu et al., 2020b) easily results in high-dimensional features and overfitting. To alleviate this issue, Zhang et al. (2019b, 2020) propose a feature re-weighting mechanism to select the informative and discard the redundant channel features among the RGB, IR, and Depth modalities. Shen et al. (2019) design a Modal Feature Erasing operation to randomly drop partial-modal features to prevent modality-aware overfitting.

Despite performance improvement via fusion, multimodal FAS models suffer from vulnerability when partial modalities are missing. George and Marcel (2021) present a cross-modal focal loss to modulate the loss contribution of each modality, which helps the model learn complementary information among modalities and alleviates modality-missing issues. Furthermore, to fairly evaluate models' performance in modality-missing scenarios, two flexible-modal FAS benchmarks (Yu et al., 2023b, a) have been established and a few robust flexible-modal learning algorithms (Yu et al., 2023b, a; Liu et al., 2023; Liu &, 2022) have been presented.

Fig. 1

Framework of the ViT finetuning with adaptive multimodal adapters (AMA). The AMA and classification head are trainable while the linear projection and vanilla transformer blocks are fixed with the pre-trained parameters. 'MHSA', 'FFN', and 'GAP' are short for multi-head self-attention, feed-forward network, and global average pooling, respectively

Transformer for vision tasks   The transformer is proposed in Vaswani et al. (2017) to model sequential data in the field of NLP. ViT (Dosovitskiy et al., 2021) is then proposed by feeding a transformer with sequences of image patches for image classification. Considering the data-hungry characteristic of ViTs, directly training them from scratch would result in severe overfitting. On the one hand, fast transferring [e.g., adapter (Houlsby et al., 2019; Chen et al., 2022a; Jie &, 2022) and prompt (Zhou et al., 2022) tuning] while fixing most of the pre-trained models' parameters is usually efficient for downstream tasks. On the other hand, self-supervised masked image modeling (MIM) methods [e.g., BEiT (Bao et al., 2021) and MAE (He et al., 2022; Bachmann et al., 2022)] benefit representation learning, which improves the finetuning performance in downstream tasks.

Meanwhile, a few works introduce the vision transformer to FAS (Liu &, 2022; Ming et al., 2022; George & Marcel, 2020b; Wang et al., 2022b, c; Yu et al., 2021). On the one hand, ViT is adopted in the spatial domain (Ming et al., 2022; George & Marcel, 2020b; Wang et al., 2022b) to explore live/spoof relations among local patches. On the other hand, global temporal abnormity (Wang et al., 2022c) or physiological periodicity (Yu et al., 2021) features are extracted by applying ViT in the temporal domain. Recently, Liu and (2022) develop modality-agnostic transformer blocks to supplement liveness features for multimodal FAS. Despite convincing performance via ViTs modified with complex customized modal-disentangled and cross-modal attention modules (Liu &, 2022), there are still no works exploring the fundamental natures (e.g., modality-aware inputs, suitable multimodal pre-training, and efficient finetuning) of the vanilla ViT for multimodal FAS.

Masked Image Modeling   A few works have explored the potential of masked image modeling (MIM). BEiT (Bao et al., 2021) first proposes the MIM pre-training task and introduces its definition. In BEiT, the image is represented as discrete tokens, and these tokens are treated as the reconstruction targets of the masked patches. Most recently, MAE (He et al., 2022) and SimMIM (Xie et al., 2022) almost simultaneously obtain state-of-the-art results on computer vision tasks. They propose a pre-training paradigm based on MIM. Specifically, the image patches are randomly masked with a high probability, then the self-attention mechanism is used in the encoder to learn the relationships between patches. Finally, the masked patches are reconstructed in the decoder. Recently, Ma et al. (2022) apply MAE to pretrain unimodal FAS models and show excellent generalization capacities. Meanwhile, Bachmann et al. (2022) propose the first multimodal MAE for pretraining generic semantic models in a multimodal and multitask manner. Different from previous works (Ma et al., 2022; Bachmann et al., 2022), we are the first to explore MAE for the multimodal FAS task, and find that modality-asymmetric masked inputs are one of the keys to effective multimodal live/spoof clue modeling.

3 Methodology

To facilitate the exploration of the fundamental natures of ViT for multimodal FAS, here we adopt the simple, elegant, and unified ViT framework as the baseline. As illustrated in the left part (without 'AMA') of Fig. 1, the vanilla ViT consists of a patch tokenizer \(\textbf{E}_{\text {patch}}\) via linear projection, N transformer blocks \(\textbf{E}^{i}_{\text {trans}}\) (\( i=1,...,N\)), and a classification head \(\textbf{E}_{\text {head}}\). The unimodal (\(X_{\text {RGB}}\), \(X_{\text {IR}}\), \(X_{\text {Depth}}\)) or multimodal (\(X_{\text {RGB+IR}}\), \(X_{\text {RGB+Depth}}\), \(X_{\text {IR+Depth}}\), \(X_{\text {RGB+IR+Depth}}\)) inputs are passed through \(\textbf{E}_{\text {patch}}\) to generate the visual tokens \(T^\text {Vis}\), which are concatenated with the learnable class token \(T^\text {Cls}\) and added with position embeddings. Then all patch tokens \(T^\text {All}=[T^\text {Vis},T^\text {Cls}]\) are forwarded through \(\textbf{E}_{\text {trans}}\). Finally, \(T^{\text {Cls}}\) is sent to \(\textbf{E}_{\text {head}}\) for binary live/spoof classification.
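To make the baseline concrete, a minimal PyTorch sketch of this multimodal ViT pipeline is given below. The class name, the use of nn.TransformerEncoderLayer as a stand-in for the pre-trained ViT-Base blocks, and the modality-concatenated token layout are illustrative assumptions; in practice ImageNet pre-trained ViT-Base weights are loaded as described in Sect. 4.2.

```python
import torch
import torch.nn as nn

class MultimodalViTBaseline(nn.Module):
    """Illustrative vanilla multimodal ViT: E_patch -> E_trans -> E_head."""
    def __init__(self, img_size=224, patch=16, dim=768, depth=12, num_modalities=3):
        super().__init__()
        num_patches = (img_size // patch) ** 2 * num_modalities
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)   # E_patch (linear projection)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                   # T^Cls
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        self.blocks = nn.ModuleList([                                           # E_trans (stand-in blocks)
            nn.TransformerEncoderLayer(dim, nhead=12, dim_feedforward=dim * 4,
                                       batch_first=True, norm_first=True)
            for _ in range(depth)])
        self.head = nn.Linear(dim, 2)                                           # E_head (live/spoof)

    def forward(self, modalities):                 # list of [B, 3, H, W] tensors (e.g., RGB, IR, Depth)
        tokens = [self.patch_embed(x).flatten(2).transpose(1, 2) for x in modalities]
        vis = torch.cat(tokens, dim=1)                           # T^Vis
        cls = self.cls_token.expand(vis.size(0), -1, -1)         # T^Cls
        x = torch.cat([vis, cls], dim=1) + self.pos_embed        # T^All = [T^Vis, T^Cls]
        for blk in self.blocks:
            x = blk(x)
        return self.head(x[:, -1])                               # classify the class token

# e.g., logits = MultimodalViTBaseline()([torch.randn(2, 3, 224, 224) for _ in range(3)])
```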

We will first briefly introduce different local descriptor based inputs in Sect. 3.1, then introduce the efficient ViT finetuning with AMA in Sect. 3.2, and at last present the generalized multimodal pre-training via M\(^{2}\)A\(^{2}\)E in Sect. 3.3.

3.1 Local Descriptors for Multimodal ViT

Besides the raw multimodal inputs, we consider three local features and their compositions for the multimodal ViT. The motivation is that the vanilla ViT with raw inputs is able to model rich cross-patch semantic contexts but is sensitive to illumination and neglects local fine-grained spoof clues (Yu et al., 2023b). Explicitly leveraging local descriptors as inputs for different modalities might help the multimodal ViT mine more discriminative fine-grained spoof clues (Yu et al., 2020b, c, 2021b, 2020a) as well as illumination-robust live/spoof features (Li et al., 2021).

Fig. 2

Visualization of three classical local descriptors [i.e., LBP (Ojala et al., 2002), HOG (Dalal & Triggs, 2005), and PLGF (Bhattacharjee & Roy, 2019)] and their compositions

Local Binary Pattern (LBP)   LBP (Ojala et al., 2002) computes a binary pattern by thresholding the differences between a central pixel and its neighboring pixels. Fine-grained textures and illumination invariance make LBP robust for generalized FAS (Li &, 2019). For a center pixel \(I_{c}\) and a neighboring pixel \(I_{i}\) (\(i=1,2,...,p\)), LBP can be formalized as follows:

$$\text{LBP}=\sum_{i=1}^{p}F(I_{i}-I_{c})\times 2^{i-1}, \qquad F(I)=\begin{cases} 1, & I\ge 0,\\ 0, & \text{otherwise.} \end{cases}$$
(1)

Typical LBP maps are shown in the second column of Fig. 2.
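As a reference, a small NumPy sketch of the LBP map in Eq. (1) with p=8 neighbors on a 3×3 neighborhood (the setting in Sect. 4.2) is given below; zero padding at the image border is an assumption made only for illustration.

```python
import numpy as np

def lbp_map(gray, radius=1):
    """gray: 2D float array; returns the LBP code map (values in [0, 255] for p=8)."""
    h, w = gray.shape
    padded = np.pad(gray, radius, mode="constant")
    # the 8 neighbor offsets of the 3x3 neighborhood, enumerated clockwise
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros((h, w), dtype=np.int32)
    for i, (dy, dx) in enumerate(offsets):
        neighbor = padded[radius + dy:radius + dy + h, radius + dx:radius + dx + w]
        codes += ((neighbor - gray) >= 0).astype(np.int32) << i   # F(I_i - I_c) * 2^(i-1)
    return codes
```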

Histograms of oriented gradients (HOG)   HOG (Dalal & Triggs, 2005) describes the distribution of gradient orientations or edge directions within a local subregion. It is computed by first obtaining the gradient magnitude and orientation at each pixel; the gradients within each small local subregion are then accumulated into an orientation histogram with several bins, weighted by the gradient magnitudes. Due to its partial invariance to geometric and photometric changes, HOG features might be robust for illumination-sensitive modalities such as RGB and IR. The visualization results are shown in the third column of Fig. 2.
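A hedged example of extracting HOG with scikit-image under the parameters stated in Sect. 4.2 (9 orientations, 8×8 pixels per cell, 2×2 cells per block) is shown below; the exact extraction pipeline used in the paper is not specified, so this is only an illustrative stand-in.

```python
import numpy as np
from skimage.feature import hog

gray = np.random.rand(224, 224)      # stand-in for a gray-scale face crop
features, hog_image = hog(
    gray,
    orientations=9,                  # 9 orientation bins
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
    visualize=True,                  # also return the HOG map used as an input channel
)
```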

Pattern of local gravitational force (PLGF)   Inspired by the Law of Universal Gravitation, PLGF (Bhattacharjee & Roy, 2019) describes image interest regions via the local gravitational force magnitude, which helps reduce the impact of illumination/noise variation while preserving edge-based low-level clues. It can be formulated as:

$$\text{PLGF}=\arctan\left(\sqrt{\left(\frac{I*M_{x}}{I}\right)^{2}+\left(\frac{I*M_{y}}{I}\right)^{2}}\right),$$
$$M_{x}(m,n)=\begin{cases}\frac{\cos(\arctan(m/n))}{m^{2}+n^{2}}, & (m^{2}+n^{2})>0,\\ 0, & \text{otherwise,}\end{cases}\qquad M_{y}(m,n)=\begin{cases}\frac{\sin(\arctan(m/n))}{m^{2}+n^{2}}, & (m^{2}+n^{2})>0,\\ 0, & \text{otherwise,}\end{cases}$$
(2)

where I is the raw image, \(M_{x}\) and \(M_{y}\) are two filter masks for the gravitational force calculation, m and n are indexes denoting the relative position to the center, and \(*\) is the convolution operation sliding over all pixels. Visualizations of PLGF maps are shown in the fourth column of Fig. 2.
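An illustrative NumPy/SciPy sketch of Eq. (2) with the 5×5 masks from Sect. 4.2 is given below. Since \(\arctan(m/n)\) is undefined at n=0, arctan2(m, n) is used as a practical stand-in, and a small epsilon guards the division by I; both choices are assumptions rather than details specified by the original descriptor.

```python
import numpy as np
from scipy.ndimage import convolve

def plgf_map(image, mask_size=5, eps=1e-6):
    r = mask_size // 2
    m, n = np.mgrid[-r:r + 1, -r:r + 1].astype(np.float64)   # relative positions to the center
    dist2 = m ** 2 + n ** 2
    angle = np.arctan2(m, n)                                  # stand-in for arctan(m/n)
    with np.errstate(divide="ignore", invalid="ignore"):
        mx = np.where(dist2 > 0, np.cos(angle) / dist2, 0.0)  # M_x
        my = np.where(dist2 > 0, np.sin(angle) / dist2, 0.0)  # M_y
    gx = convolve(image, mx) / (image + eps)                  # (I * M_x) / I
    gy = convolve(image, my) / (image + eps)                  # (I * M_y) / I
    return np.arctan(np.sqrt(gx ** 2 + gy ** 2))
```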

Composition   Considering the complementary characteristics of the raw image and local descriptors, we also study compositions of these features via input-level concatenation. For example, 'GRAY_HOG_PLGF' denotes a three-channel input (raw gray-scale channel + HOG + PLGF), which is visualized in the last column of Fig. 2.
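A short sketch of how such a composition can be assembled is given below; the per-channel min-max normalization is an assumption for illustration.

```python
import numpy as np

def compose_gray_hog_plgf(gray, hog_image, plgf_image):
    """Stack the raw gray channel, HOG map, and PLGF map as a 3-channel IR input."""
    channels = [gray, hog_image, plgf_image]
    norm = [(c - c.min()) / (c.max() - c.min() + 1e-8) for c in channels]
    return np.stack(norm, axis=-1)    # [H, W, 3]
```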

3.2 Adaptive Multimodal Adapter

Recent studies have verified that introducing adapters (Huang et al., 2022) with fully connected (FC) layers can improve FAS performance when training data is inadequate. However, the FC-based adapter focuses on intra-token feature refinement but neglects (1) contextual features from neighboring local tokens; and (2) multimodal features from cross-modal tokens. To tackle these issues, we extend the convolutional adapter (ConvAdapter) (Jie &, 2022) into a multimodal version for multimodal FAS.

As illustrated in Fig. 1, instead of directly finetuning the transformer blocks \(\textbf{E}_{\text {trans}}\), we fix all the pre-trained parameters of \(\textbf{E}_{\text {patch}}\) and \(\textbf{E}_{\text {trans}}\) while training only the adaptive multimodal adapters (AMA) and \(\textbf{E}_{\text {head}}\). An AMA module consists of four parts: (1) a 1\(\times \)1 convolution with GELU \(\Theta _{\downarrow }\) for dimension reduction from the original channels D to a hidden dimension \(D'\); (2) a 3\(\times \)3 2D convolution \(\Theta _{\text {2D}}\) mapping channels \(D'\times K\) to \(D'\) for multimodal local feature aggregation, where K denotes the number of modalities; (3) an adaptive modality weight (\(w_{1},...,w_{K}\)) generator that cascades global average pooling (GAP), a 1\(\times \)1 convolution \(\Theta _{\text {Ada}}\) projecting channels from \(D' \times K\) to K, and the Sigmoid function \(\sigma \); and (4) a 1\(\times \)1 convolution with GELU \(\Theta _{\uparrow }\) for dimension expansion back to D. As features from different modalities are already spatially aligned, we restore the 2D structure for each modality after the channel squeezing. Similarly, the 2D structure is flattened back into 1D tokens before the channel expansion. The AMA can be formulated as

$$\begin{aligned} T^\text{Vis} &=\text{Concat}[\Theta_{\downarrow}(T^\text{Vis}_{\text{RGB}}),\Theta_{\downarrow}(T^\text{Vis}_{\text{IR}}),\Theta_{\downarrow}(T^\text{Vis}_{\text{Depth}})],\\ w_{\text{RGB}},w_{\text{IR}},w_{\text{Depth}} &= \sigma(\Theta_{\text{Ada}}(\text{GAP}(T^\text{Vis}))), \\ T^\text{Vis} &=\Theta_{\text{2D}}(T^\text{Vis}), \\ T^\text{Vis} &=\text{Concat}[w_{\text{RGB}} \cdot T^\text{Vis},\, w_{\text{IR}} \cdot T^\text{Vis},\, w_{\text{Depth}} \cdot T^\text{Vis}], \\ \text{AMA} &=\text{Concat}[\Theta_{\uparrow}(\Theta_{\downarrow}(T^\text{Cls})),\Theta_{\uparrow}(T^\text{Vis})]. \end{aligned}$$
(3)

Here we show an example for K=3 (i.e., RGB+IR+Depth) in Eq. (3), and AMA is flexible for arbitrary modality combinations (e.g., RGB+IR). Note that AMA is equivalent to the vanilla ConvAdapter (Jie &, 2022) in the unimodal setting where K=1.
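A minimal PyTorch sketch of the AMA module in Eq. (3) is given below. The class name, tensor layout (class token prepended, patch tokens grouped modality by modality), and the handling of the class-token branch are assumptions for illustration; D=768 and D'=64 follow Sect. 4.2. Following Fig. 1, the adapter output is added residually to the block features along the MHSA and FFN branches, which is omitted here.

```python
import torch
import torch.nn as nn

class AMAdapter(nn.Module):
    def __init__(self, dim=768, hidden=64, num_modalities=3):
        super().__init__()
        self.K, self.hidden = num_modalities, hidden
        self.down = nn.Sequential(nn.Conv2d(dim, hidden, 1), nn.GELU())          # dimension reduction D -> D'
        self.conv2d = nn.Conv2d(hidden * num_modalities, hidden, 3, padding=1)   # multimodal aggregation D'K -> D'
        self.ada = nn.Conv2d(hidden * num_modalities, num_modalities, 1)         # adaptive weights D'K -> K
        self.up = nn.Sequential(nn.Conv2d(hidden, dim, 1), nn.GELU())            # dimension expansion D' -> D

    def forward(self, tokens, grid_size):
        # tokens: [B, 1 + K*H*W, D]; class token first, patch tokens grouped per modality (assumed layout)
        B, _, D = tokens.shape
        H, W = grid_size
        cls, vis = tokens[:, :1], tokens[:, 1:]
        # restore the spatially aligned 2D structure per modality: [B*K, D, H, W]
        maps = vis.reshape(B, self.K, H * W, D).permute(0, 1, 3, 2).reshape(B * self.K, D, H, W)
        squeezed = self.down(maps).reshape(B, self.K * self.hidden, H, W)         # channel squeeze
        # adaptive modality weights via GAP + 1x1 conv + sigmoid: [B, K, 1, 1]
        w = torch.sigmoid(self.ada(squeezed.mean(dim=(2, 3), keepdim=True)))
        shared = self.conv2d(squeezed)                                            # [B, D', H, W]
        # re-weight the shared local features for each modality, then expand back to D
        per_mod = [self.up(w[:, k:k + 1] * shared) for k in range(self.K)]
        vis_out = torch.cat([m.flatten(2).transpose(1, 2) for m in per_mod], dim=1)
        cls_out = self.up(self.down(cls.transpose(1, 2).unsqueeze(-1))).flatten(2).transpose(1, 2)
        return torch.cat([cls_out, vis_out], dim=1)                               # [B, 1 + K*H*W, D]

# e.g., out = AMAdapter()(torch.randn(2, 1 + 3 * 14 * 14, 768), (14, 14))
```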

3.3 Modality-Asymmetric Masked Autoencoder

Existing multimodal FAS works usually finetune ImageNet pre-trained models, which might be sub-optimal due to the huge task and modality gaps. Meanwhile, considering the costly collection of large-scale annotated live/spoof data, self-supervised pre-training without labels (Muhammad et al., 2022) is a promising direction for model initialization in multimodal FAS. Here we propose the modality-asymmetric masked autoencoder (M\(^{2}\)A\(^{2}\)E) for multimodal FAS self-supervised pre-training.

As shown in Fig. 3, given a multimodal face sample (\(X_{\text {RGB}},X_{\text {IR}},X_{\text {Depth}}\)), M\(^{2}\)A\(^{2}\)E randomly selects a unimodal input \(X_{i}\) (\(i\in \{\text {RGB},\text {IR},\text {Depth}\}\)) among all modalities. Then the random sampling strategy (He et al., 2022) is used to mask out p percent of the visual tokens in \(X_{i}\). Only the unmasked visible tokens are forwarded through the ViT encoder, and both visible and masked tokens are fed into unshared ViT decoders. In terms of the reconstruction target, given a masked input \(X_{i}\) of the i-th modality, M\(^{2}\)A\(^{2}\)E aims to predict the pixel values with a mean squared error (MSE) loss for (1) each masked patch of \(X_{i}\), and (2) the whole input images of the other modalities \(X_{j}\) (\(j\ne i\); \(j\in \{\text {RGB},\text {IR},\text {Depth}\}\)). The motivation behind M\(^{2}\)A\(^{2}\)E is that, with the multimodal reconstruction target, the self-supervised pre-trained ViTs are able to model (1) task-aware contextual semantics (e.g., moiré patterns and color distortion) via masked patch prediction; and (2) intrinsic physical features (e.g., 2D attacks without facial depth) via cross-modality translation.
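A minimal sketch of one M\(^{2}\)A\(^{2}\)E pre-training step is given below under stated assumptions: `encoder` is a ViT that accepts only the visible tokens (as in MAE), `decoders` is a dict of unshared per-modality ViT decoders whose call signature is hypothetical, and `patchify` converts images into flattened patches. Only the random unimodal selection, the 40% masking ratio, and the MSE multimodal reconstruction targets follow the description above.

```python
import random
import torch
import torch.nn.functional as F

def patchify(x, patch=16):
    # [B, C, H, W] -> [B, L, patch*patch*C] flattened non-overlapping patches
    B, C, H, W = x.shape
    x = x.reshape(B, C, H // patch, patch, W // patch, patch)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(B, (H // patch) * (W // patch), -1)

def m2a2e_step(batch, encoder, decoders, mask_ratio=0.4):
    # batch: dict {'RGB': x, 'IR': x, 'Depth': x}, each [B, C, H, W]
    modalities = list(batch.keys())
    src = random.choice(modalities)                      # randomly pick ONE input modality
    tokens = patchify(batch[src])                        # [B, L, patch_dim]
    B, L, _ = tokens.shape

    # random masking as in MAE: keep (1 - mask_ratio) of the tokens
    keep = int(L * (1 - mask_ratio))
    ids = torch.rand(B, L).argsort(dim=1)
    visible = tokens[torch.arange(B)[:, None], ids[:, :keep]]    # visible tokens only
    latent = encoder(visible)                            # encode visible tokens (hypothetical API)

    mask = torch.zeros(B, L).scatter_(1, ids[:, keep:], 1.0).bool()
    loss = 0.0
    for m in modalities:
        pred = decoders[m](latent, ids)                  # unshared decoder per target modality (hypothetical API)
        target = patchify(batch[m])
        if m == src:
            loss = loss + F.mse_loss(pred[mask], target[mask])   # masked patches of the input modality
        else:
            loss = loss + F.mse_loss(pred, target)               # whole image of the other modalities
    return loss
```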

Relation to modality-symmetric autoencoders (He et al., 2022; Bachmann et al., 2022)   Compared with the vanilla MAE (He et al., 2022), M\(^{2}\)A\(^{2}\)E adopts the same masking strategy in the unimodal ViT encoder but targets multimodal reconstruction with multiple unshared ViT decoders. Besides, M\(^{2}\)A\(^{2}\)E is similar to the multimodal MAE (Bachmann et al., 2022) only in the special case where partial tokens from a single modality are visible while all tokens from the other modalities are masked (Fig. 3).

Fig. 3

The framework of the modality-asymmetric masked autoencoder (M\(^{2}\)A\(^{2}\)E). Different from previous multimodal MAE (Bachmann et al., 2022) masking all modalities as inputs, our M\(^{2}\)A\(^{2}\)E randomly selects unimodal masked input for multimodal reconstruction

4 Experimental Evaluation

4.1 Datasets and Performance Metrics

Three commonly used multimodal FAS datasets are used for the experiments: WMCA (George et al., 2019), CASIA-SURF (Zhang et al., 2019b), and CASIA-SURF CeFA (CeFA) (Liu et al., 2021a). WMCA contains a wide variety of 2D and 3D PAs with four modalities and introduces two protocols: the 'seen' protocol, which emulates the seen-attack scenario, and the 'unseen' protocol, which evaluates generalization to an unseen attack. CASIA-SURF consists of 1000 subjects with 21,000 videos, each sample has three modalities, and an official intra-testing protocol is provided. CeFA is the largest multimodal FAS dataset, covering 3 ethnicities, 3 modalities, 1607 subjects, and 34,200 videos. We conduct intra- and cross-dataset testings on the WMCA and CASIA-SURF datasets, and leave the large-scale CeFA for self-supervised pre-training.

Fig. 4

The training and inference pipeline of M\(^{2}\)A\(^{2}\)E and AMA modules. At Stage 1, M\(^{2}\)A\(^{2}\)E is used for multimodal self-supervised pre-training, and the ViT-based encoder and decoders are trainable. At Stage 2, the pre-trained encoder is fixed while only the AMA modules and the classification head are trainable when finetuning

In terms of evaluation metrics, the Attack Presentation Classification Error Rate (APCER), Bonafide Presentation Classification Error Rate (BPCER), and ACER (International Organization for Standardization, 2016) are used. The ACER on the testing set is determined by the Equal Error Rate (EER) threshold on the dev set for CASIA-SURF, and by the BPCER = 1% threshold for WMCA. True Positive Rate (TPR)@False Positive Rate (FPR) = 10\(^{-4}\) (Zhang et al., 2019b) is also provided for CASIA-SURF. For cross-testing experiments, the Half Total Error Rate (HTER) is adopted.
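For completeness, a minimal sketch of these metrics is given below (assuming score arrays where higher values indicate bonafide, and binary labels); the helper names are hypothetical, and the threshold selection mirrors the protocol description (EER on the dev set for CASIA-SURF, BPCER = 1% for WMCA).

```python
import numpy as np

def apcer_bpcer_acer(scores, labels, thr):
    """labels: 1 = bonafide, 0 = attack; scores: higher = more likely bonafide."""
    attack, bona = scores[labels == 0], scores[labels == 1]
    apcer = np.mean(attack >= thr)          # attacks accepted as bonafide
    bpcer = np.mean(bona < thr)             # bonafide rejected as attacks
    return apcer, bpcer, (apcer + bpcer) / 2

def eer_threshold(dev_scores, dev_labels):
    """Threshold where APCER and BPCER are closest on the dev set."""
    thrs = np.unique(dev_scores)
    gaps = [abs(np.mean(dev_scores[dev_labels == 0] >= t) -
                np.mean(dev_scores[dev_labels == 1] < t)) for t in thrs]
    return thrs[int(np.argmin(gaps))]
```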

4.2 Implementation Details

We crop the face frames using the MTCNN (Zhang et al., 2016) face detector. The local descriptors are extracted from gray-scale images with: (1) \(3\times 3\) neighbors for LBP (Ojala et al., 2002); (2) 9 orientations, \(8\times 8\) pixels per cell, and \(2\times 2\) cells per block for HOG (Dalal & Triggs, 2005); and (3) a mask size of 5 for PLGF (Bhattacharjee & Roy, 2019). The composition input 'GRAY_HOG_PLGF' is adopted in the unimodal and multimodal experiments for the IR modality, while the raw inputs are utilized for the RGB and Depth modalities. ViT-Base (Dosovitskiy et al., 2021) supervised by the binary cross-entropy loss is used as the default architecture. For direct finetuning, only the last transformer block and the classification head are trainable. For AMA and ConvAdapter (Jie &, 2022) finetuning, the original and hidden channels are D=768 and \(D'\)=64, respectively. For M\(^{2}\)A\(^{2}\)E, the mask ratio p=40% is used, while the decoder depth and width are 4 and 512, respectively. The pipeline of self-supervised pre-training with M\(^{2}\)A\(^{2}\)E and fine-tuning/inference with AMA modules is illustrated in Fig. 4.

The experiments are implemented with PyTorch on one NVIDIA A100 GPU. For the self-supervised pre-training on CeFA with RGB+IR+Depth modalities, we use the AdamW (Loshchilov &, 2017) optimizer with learning rate (lr) 1.5e−4, weight decay (wd) 0.05, and batch size 64 at the training stage. ImageNet pre-trained weights are used to initialize our encoder. We train the M\(^{2}\)A\(^{2}\)E for 400 epochs, warming up for the first 40 epochs and then performing cosine decay. For supervised unimodal and multimodal experiments on WMCA and CASIA-SURF, we use the Adam optimizer with a fixed lr = 2e−4, wd = 5e−3, and batch size 16 at the training stage. We finetune models for a maximum of 30 epochs based on the ImageNet or M\(^{2}\)A\(^{2}\)E pre-trained weights.
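A hedged sketch of the two optimization setups above is shown below: AdamW with a 40-epoch linear warmup followed by cosine decay for M\(^{2}\)A\(^{2}\)E pre-training, and Adam with a fixed learning rate for supervised finetuning where only the AMA modules and classification head stay trainable. The parameter-name matching used for freezing is an assumption for illustration.

```python
import math
import torch

def pretrain_optimizer(model, base_lr=1.5e-4, wd=0.05, epochs=400, warmup=40):
    opt = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=wd)
    def lr_lambda(epoch):                       # linear warmup, then cosine decay
        if epoch < warmup:
            return (epoch + 1) / warmup
        progress = (epoch - warmup) / max(1, epochs - warmup)
        return 0.5 * (1 + math.cos(math.pi * progress))
    return opt, torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)

def finetune_optimizer(model, lr=2e-4, wd=5e-3):
    # freeze everything except the AMA modules and the classification head
    for name, p in model.named_parameters():
        p.requires_grad = ("ama" in name.lower()) or ("head" in name.lower())
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr, weight_decay=wd)
```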

4.3 Results of Unimodal and Multimodal FAS

Intra testing on WMCA    The unimodal and multimodal results of protocols 'seen' and 'unseen' on WMCA (George et al., 2019) are shown in Table 1. On the one hand, compared with the direct finetuning results of 'ViT', ViT+AMA/ConvAdapter achieves significantly lower ACER in all modality settings and in both the 'seen' and 'unseen' protocols. This indicates that the proposed AMA efficiently leverages unimodal/multimodal local inductive cues to boost the original ViT's global contextual features. On the other hand, when replacing the ImageNet pre-trained ViT with the self-supervised M\(^{2}\)A\(^{2}\)E pre-trained on CeFA, the generalization for unseen attack detection improves obviously for the 'IR', 'Depth', 'RGB+IR', 'IR+Depth', and 'RGB+IR+Depth' modalities, indicating its excellent transferability to downstream modality-agnostic tasks. It is surprising to find in the last block that the proposed methods with RGB+IR+Depth modalities perform even better than 'MC-CNN' (George et al., 2019) with four modalities in both the 'seen' and 'unseen' protocols. Although 'MA-ViT' (Liu &, 2022), with its complex and specialized modules, outperforms the proposed methods with RGB+Depth modalities in the 'unseen' protocol by \(-\)2.01% ACER, the proposed AMA and M\(^{2}\)A\(^{2}\)E could potentially be plugged into 'MA-ViT' for further performance improvement.

Table 1 ACER(%) results of protocols ‘seen’ and ‘unseen’ on WMCA
Table 2 The results on CASIA-SURF
Table 3 The HTER (%) values from the cross-testing between WMCA and CASIA-SURF datasets
Table 4 Results with missing modality ratio \(\alpha \) = 70% on WMCA
Table 5 Results with missing modality ratio \(\alpha \)=70% on CASIA-SURF

Intra testing on CASIA-SURF    For CASIA-SURF, we compare with three well-known multimodal methods, 'SEF' (Zhang et al., 2019b), 'MS-SEF' (Zhang et al., 2020), and 'MA-ViT' (Liu &, 2022). From Table 2, we observe that 'ViT' usually performs worse than 'MS-SEF' in multimodal settings due to its limited modality fusion ability. When equipped with AMA and M\(^{2}\)A\(^{2}\)E-based self-supervised pre-training, 'ViT+AMA+M\(^{2}\)A\(^{2}\)E' outperforms MS-SEF by a large margin in most modality settings ('RGB', 'Depth', 'RGB+IR', 'RGB+Depth', 'RGB+IR+Depth') in terms of ACER. Thanks to the powerful multimodal representation capacity of the M\(^{2}\)A\(^{2}\)E pre-trained model, the proposed method surpasses the dedicated 'MA-ViT' with 'RGB+IR+Depth' modalities.

Cross-dataset testings    To evaluate unimodal and multimodal generalization, we conduct cross-testing experiments between models trained on CASIA-SURF and on WMCA with Protocol 'seen'. We also introduce 'MM-CDCN' (Yu et al., 2020b) and 'MA-ViT' (Liu &, 2022) as baselines. Table 3 lists the HTER of all methods trained on one dataset and tested on the other. From these results, the proposed 'ViT+AMA+M\(^{2}\)A\(^{2}\)E' outperforms 'MM-CDCN' in most modality settings and 'MA-ViT' with 'RGB+IR+Depth' on both cross-testing protocols, indicating that the learned multimodal features are robust to sensors, resolutions, and attack types. Specifically, directly finetuning ImageNet pre-trained ViTs (see results of 'ViT') usually generalizes worse than 'MM-CDCN' in multimodal settings. When assembled with AMA and M\(^{2}\)A\(^{2}\)E, the HTER is further reduced by 9.25%/5.38%/10.41%/10.59% and 4.96%/4.13%/8.64%/4.38% for 'RGB+IR'/'RGB+Depth'/'IR+Depth'/'RGB+IR+Depth' when tested on CASIA-SURF and WMCA, respectively.

Fig. 5

Quantitative intra-testing results on WMCA with different modality-missing ratios under four flexible-modal settings

Fig. 6

Quantitative intra-testing results on CASIA-SURF with different modality-missing ratios under four flexible-modal settings

4.4 Results of Flexible-Modal FAS

Following the four flexible-modal sub-protocols (i.e., (1) RGB-D when missing Depth; (2) RGB-IR when missing IR; (3) RGB-D-IR when missing overlapped Depth and IR; and (4) RGB-D-IR when missing Depth and IR with limited complete RGB-D-IR data or limited overlapped D-IR data) in Yu et al. (2023a), we show results with missing modality ratio \(\alpha \)=70% on the WMCA and CASIA-SURF datasets for our proposed methods and five baseline methods, including the vanilla multimodal ViT (ViT) (Dosovitskiy et al., 2021), visual prompt tuning (Prompt) (Jia et al., 2022), cross-modal focal loss (CMFL) (George & Marcel, 2021), cross attention (CrossAtten) (Yu et al., 2023b), and VP-FAS (Yu et al., 2023a).

Intra testing on WMCA    As shown in Table 4, the proposed 'ViT+AMA' consistently improves over the state-of-the-art 'VP-FAS' by a convincing margin (at least a 0.8% ACER decrease) in all scenarios, indicating that the adaptive multimodal adapter, without entire-model finetuning, is able to represent rich modality-aware live/spoof cues, while the adaptive modality weighting mechanism benefits modality-complementary relation modeling and thus alleviates missing-modality issues. Besides, the results of 'ViT+AMA+M\(^{2}\)A\(^{2}\)E' show that pretraining with M\(^{2}\)A\(^{2}\)E further reduces ACER by at least 0.6% in all missing-modality scenarios. This indicates that modality-asymmetric masked modeling on large-scale unlabeled task-related data enhances cross-modal live/spoof intrinsic representation and thus mitigates severe missing-modality degradation.

Fig. 7

Results of direct finetuning on different transformer blocks

Fig. 8

Impacts of inputs with local feature descriptors (e.g., LBP, HOG, PLGF) for ViT using direct finetuning and ConvAdapter strategies on a RGB, b IR, and c depth modalities

Intra testing on CASIA-SURF    Similar to the results on WMCA, Table 5 shows that the proposed AMA outperforms 'Prompt', 'CMFL', 'CrossAtten', and 'VP-FAS' by large margins on the CASIA-SURF dataset under all flexible-modal settings, showing the effectiveness of adaptive multimodal parameter-efficient tuning. Different from the other advanced multimodal learning strategies (i.e., 'Prompt', 'CMFL', 'CrossAtten', and 'VP-FAS'), which suffer poor performance under the flexible RGB-IR setting due to the noisy IR imaging data in CASIA-SURF, the proposed AMA is robust even in this challenging scenario because of its reliable adaptive modality re-weighting mechanism. Furthermore, the results of 'ViT+AMA+M\(^{2}\)A\(^{2}\)E' clearly show that pretraining with M\(^{2}\)A\(^{2}\)E enhances the robustness of multimodal feature representation, especially in the two flexible RGB-D-IR settings (more than a 2% ACER decrease).

Robustness to different \(\alpha \)    We also evaluate models' robustness against missing modalities on WMCA and CASIA-SURF in the four flexible-modal settings with different modality-missing ratios \(\alpha \). As shown in Fig. 5, on WMCA the proposed 'ViT+AMA' and 'ViT+AMA+M\(^{2}\)A\(^{2}\)E' are more robust to larger \(\alpha \) than the other baseline methods. Meanwhile, the performance gaps between the proposed and previous methods are clear in the two flexible RGB-D-IR settings, indicating the efficacy of AMA and M\(^{2}\)A\(^{2}\)E for robust multimodal FAS feature representation. Moreover, we find similar conclusions on CASIA-SURF from Fig. 6, where the proposed 'ViT+AMA' and 'ViT+AMA+M\(^{2}\)A\(^{2}\)E' are less sensitive to missing-modality scenarios. One highlight in Fig. 6c, d is that, on CASIA-SURF, the performance gaps among previous methods in the two flexible RGB-D-IR settings are usually small due to the low-quality complete-modality data in CASIA-SURF, while the proposed 'ViT+AMA+M\(^{2}\)A\(^{2}\)E' outperforms them by convincing margins.

Fig. 9

Impacts of bimodal inputs with local feature descriptors (e.g., LBP, HOG, PLGF) for ViT using direct finetuning and adaptive multimodal adapter (AMA) strategies on a RGB+IR; b RGB+Depth; and c IR+Depth modalities. ‘Raw+HOG’ indicates ‘Raw’ input and ‘HOG’ input are used for the first and second modalities, respectively

Fig. 10

Impacts of trimodal inputs with local feature descriptors for ViT using direct finetuning and AMA strategies on RGB+IR+Depth modalities. ‘Raw+HOG+Raw’ indicates ‘Raw’ input, ‘HOG’ input and ‘Raw’ input are respectively used for three modalities

4.5 Ablation Study

We provide the results of ablation studies for inputs with local descriptors and AMA on ‘seen’ protocol of WMCA to verify models’ discrimination. We then study M\(^{2}\)A\(^{2}\)E on ‘unseen’ protocol of WMCA and cross testing from WMCA to CASIA-SURF to verify models’ generalization capacity.

Ablation on direct finetuning   Here we discuss five finetuning strategies for the ImageNet pre-trained ViT-Base: finetuning (1) all transformer blocks and the classification head; (2) the last 8 transformer blocks and the classification head; (3) the last 4 transformer blocks and the classification head; (4) the last transformer block and the classification head; and (5) only the classification head. The results on WMCA with Protocol 'seen' are shown in Fig. 7. The composition input 'GRAY_HOG_PLGF' is used for the IR modality. It is clear that finetuning all transformer blocks performs the worst due to overfitting caused by the huge number of trainable parameters. In contrast, finetuning the last transformer block or the last 8 transformer blocks achieves stable and reasonable performance. We adopt finetuning the last transformer block and the classification head as the default direct finetuning setting due to its efficiency.
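As an illustration of the default strategy (4), the snippet below freezes an ImageNet pre-trained ViT-Base except for its last transformer block and classification head; the use of timm and its attribute names is an assumption about the implementation, which the paper does not specify.

```python
import timm

model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=2)
for p in model.parameters():
    p.requires_grad = False
for p in model.blocks[-1].parameters():     # unfreeze the last transformer block
    p.requires_grad = True
for p in model.head.parameters():           # unfreeze the classification head
    p.requires_grad = True
```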

Impact of unimodal inputs with local descriptors   In the default setting of ViT inputs, the composition input 'GRAY_HOG_PLGF' is adopted for the IR modality, while the raw inputs are utilized for the RGB and Depth modalities. In this ablation, we consider three local descriptors ('LBP' (Ojala et al., 2002), 'HOG' (Dalal & Triggs, 2005), 'PLGF' (Bhattacharjee & Roy, 2019)) and their compositions ('HOG_PLGF', 'LBP_HOG_PLGF', 'GRAY_HOG_PLGF'). It can be seen from Fig. 8 that the 'LBP' input usually performs worse than the other features for all three modalities. In contrast, the 'PLGF' input achieves reasonable performance (even better than the raw input for the IR modality via direct finetuning). It is clear that raw inputs are good enough for all modalities via ConvAdapter. One highlight is that the composition input 'GRAY_HOG_PLGF' performs the best for the IR modality via both direct finetuning and ConvAdapter, indicating the importance of local detailed and illumination-invariant cues in IR feature representation.

Impact of multimodal inputs with local descriptors   In the default setting of ViT inputs, the composition input 'GRAY_HOG_PLGF' is adopted for the IR modality, while the raw inputs are utilized for the RGB and Depth modalities. Besides the unimodal results with local descriptors above, we also present detailed bimodal and trimodal results with local descriptors on WMCA with Protocol 'seen' in Figs. 9 and 10, respectively. Two observations can be made: (1) in the multimodal settings (RGB+IR, IR+Depth, and RGB+IR+Depth), the composition input 'GRAY_HOG_PLGF' for the IR modality with raw inputs for the other modalities performs the best via both direct finetuning and AMA; and (2) when leveraging local descriptor inputs for the IR modality via direct finetuning, the performance on 'IR+Depth' and 'RGB+IR+Depth' is significantly improved compared with using all raw inputs. Both indicate the importance of local detailed and illumination-invariant cues in IR feature representation (Fig. 11).

Fig. 11

Ablation of the adapter types in transformer blocks

Fig. 12

Detailed architectures of five possible adapter types for efficient multimodal learning, including a FC-based ‘vanilla adapter’ (Huang et al., 2022); b independent-modal ‘ConvAdapter’ (Jie &, 2022); c ‘multimodal ConvAdapter’ with \(\Theta _{\text {2D}}\) mapping channels \(D' \times K\) to \(D'\); d ‘multimodal ConvAdapter (huge)’ with \(\Theta _{\text {2D}}\) mapping channels \(D'\times K\) to \(D'\times K\); and e adaptive multimodal ConvAdapter (‘AMA’)

Impact of adapter types   Here we discuss five possible adapter types for efficient multimodal learning: the FC-based 'vanilla adapter' (Huang et al., 2022), the independent-modal 'ConvAdapter' (Jie &, 2022), the 'multimodal ConvAdapter' with \(\Theta _{\text {2D}}\) mapping channels \(D'\times K\) to \(D'\), the 'multimodal ConvAdapter (huge)' with \(\Theta _{\text {2D}}\) mapping channels \(D'\times K\) to \(D'\times K\), and the adaptive multimodal ConvAdapter ('AMA'). As shown in Fig. 12a, the FC-based 'vanilla adapter' (Huang et al., 2022) is the most efficient one but ignores the modality interaction as well as the local spatial context. In contrast, the independent-modal 'ConvAdapter' (Jie &, 2022) in Fig. 12b can benefit from the local spatial context. The 'multimodal ConvAdapter' and the 'multimodal ConvAdapter (huge)' are illustrated in Fig. 12c, d, respectively; both can capture the multimodal spatial context. The former captures the modality-shared features and then copies and expands them back to all modalities, which is more efficient. The latter directly generates interacted features for each modality. Compared with the 'multimodal ConvAdapter', the proposed adaptive multimodal ConvAdapter ('AMA') can dynamically adjust the modality-shared features for each modality with a negligible parameter increase.

As shown in Fig. 11, the ConvAdapter-based modules perform significantly better than the vanilla adapter in multimodal settings, indicating that local inductive biases benefit ViT-based FAS. Moreover, compared with 'ConvAdapter', the 'multimodal ConvAdapter' reduces ACER by more than 0.5% in all multimodal settings via aggregating multimodal local features. In contrast, we see no performance improvement from the 'multimodal ConvAdapter (huge)'. In other words, directly learning high-dimensional (\(D'\times K\)) convolutional features for all K modalities results in serious overfitting. Compared with the 'multimodal ConvAdapter', AMA enhances the diversity of features for different modalities via adaptively weighting the shared low-dimensional (\(D'\)) convolutional features, which decreases ACER by 0.52%, 0.31%, 1.07%, and 0.07% for 'RGB+Depth', 'RGB+IR', 'IR+Depth', and 'RGB+IR+Depth', respectively.

Fig. 13

Ablation of the hidden dimensions in AMA

Fig. 14

Ablation of the AMA positions in transformer blocks

Impact of dimension and position of AMA   Here we study the hidden dimension \(D'\) in AMA and the impact of AMA positions in the transformer blocks. It can be seen from Fig. 13 that, despite being more lightweight, lower dimensions (16 and 32) cannot achieve satisfactory performance due to weak representation capacity. The best performance is achieved with \(D'=64\) in all multimodal settings. In terms of AMA positions, it is interesting to find from Fig. 14 that plugging AMA alongside the FFN performs better than alongside the MHSA in multimodal settings. This might be because the multimodal local features complement the limited point-wise receptive field of the FFN. Besides, it is reasonable that applying AMA on MHSA+FFN performs the best.

Impact of mask ratio in M\(^{2}\)A\(^{2}\)E   Figure 15a illustrates the generalization of the M\(^{2}\)A\(^{2}\)E pre-trained ViT when finetuning on the 'unseen' protocol of WMCA and cross testing from CASIA-SURF to WMCA. Different from the conclusions of (He et al., 2022; Bachmann et al., 2022), which use very large mask ratios (e.g., 75 and 83%), we find that mask ratios ranging from 30 to 50% are suitable for multimodal FAS, and the best generalization performance on the two testing protocols is achieved when the mask ratio equals 40%. In other words, extremely high mask ratios (e.g., 70–90%) might force the model to learn overly semantic features while ignoring some useful low-/mid-level live/spoof cues.

Fig. 15

Ablation of the a mask ratio; b self-supervision training epochs; and c decoder depth in M\(^{2}\)A\(^{2}\)E

Impact of training epochs and decoder depth in M\(^{2}\)A\(^{2}\)E   We also investigate how the training epochs and decoder depth influence M\(^{2}\)A\(^{2}\)E. As shown in Fig. 15b, c, training M\(^{2}\)A\(^{2}\)E for 400 epochs with a decoder of 4 transformer blocks generalizes the best. More training iterations and deeper decoders are not always helpful due to severe overfitting on the reconstruction targets.

Comparison between multimodal MAE (Bachmann et al., 2022) and M\(^{2}\)A\(^{2}\)E   We also compare M\(^{2}\)A\(^{2}\)E with the symmetric multimodal MAE (Bachmann et al., 2022) when finetuning on all downstream modality settings. It can be seen from Fig. 16 that, with the more challenging reconstruction target (from masked unimodal inputs to multimodal prediction), M\(^{2}\)A\(^{2}\)E outperforms the best settings of the multimodal MAE (Bachmann et al., 2022) on most modalities ('RGB', 'IR', 'RGB+IR', 'RGB+Depth', 'RGB+IR+Depth'), indicating its excellent downstream modality-agnostic capacity.

Fig. 16

Results of the multimodal MAE (Bachmann et al., 2022) and our M\(^{2}\)A\(^{2}\)E

Fig. 17

Visualization of the reconstructed multimodal faces from M\(^{2}\)A\(^{2}\)E with unimodal masked inputs on CeFA

4.6 Visualization and Discussion

Parameter efficiency of AMA   The proposed AMA is parameter-efficient, requiring only 7.25% and 8.8% of the parameters to adapt pre-trained ViT-Base models in the two- and three-modality settings, respectively, thus avoiding finetuning heavy transformers.
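The reported ratios can be measured, in principle, by counting the trainable parameters (AMA plus head) against all parameters of the AMA-equipped ViT, as in the sketch below; `model` is assumed to be the finetuning model with a frozen backbone.

```python
def trainable_ratio(model):
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return 100.0 * trainable / total     # e.g., roughly 8.8% in the three-modality setting
```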

Visualization of M\(^{2}\)A\(^{2}\)E   We also conduct experiments to visualize the reconstruction results of the proposed M\(^{2}\)A\(^{2}\)E. As illustrated in Fig. 17, we find that even with a unimodal (40%) masked input, the reconstructions of all three modalities still keep some important live/spoof clues (e.g., convincing facial geometric depth predicted from the masked RGB input for live samples but equivocal depth information for spoof ones), which benefits the self-supervised pre-training of generalized task-aware feature representations.

Comparison between self-supervised learning (SSL) and supervised methods   Here we show the results of SSL with the multimodal MAE and M\(^{2}\)A\(^{2}\)E using linear probing. Specifically, we pretrain models on the training set of WMCA without using labels. The labeled data from WMCA are then used to tune only the linear classification head. The ACER of SSL-MAE/SSL-M\(^{2}\)A\(^{2}\)E/supervised ViT with RGB+IR+Depth on the 'seen' and 'unseen' protocols are 4.08%/3.35%/2.52% and 8.6 ± 11.85%/7.72 ± 10.37%/6.46 ± 8.12%, respectively. This indicates the superiority of M\(^{2}\)A\(^{2}\)E over MAE, despite the performance gaps between SSL and supervised methods.

Fig. 18

Ablation of local descriptors at the feature level

Fig. 19

Ablation on various ViT structures

Exploration of local descriptors at the feature level   Besides leveraging local descriptors at the input level, we also evaluate the results of merging the philosophy of local gradient difference clues into the feature-level adapters [e.g., AMA with CDC (Yu et al., 2020c) and with DCDCA (Cai et al., 2023)]. We can find from Fig. 18 that AMA-CDC (with hyperparameter \(\theta \) = 0.7) and AMA-DCDCA perform better than or on par with the vanilla AMA in most multimodal settings. Compared with multimodal settings including the Depth modality, AMA-CDC and AMA-DCDCA improve significantly on settings including the IR modality with the 'GRAY_HOG_PLGF' input, indicating the potential of combining local descriptors for the multimodal ViT at both the input level and the feature level.

Discriminative capacity of M\(^{2}\)A\(^{2}\)E   Although M\(^{2}\)A\(^{2}\)E learns generalized task-aware contextual semantics and intrinsic physical feature representations via cross-modality translation, its discriminative capacity is slightly limited compared with models supervisedly pre-trained on the large-scale categories of ImageNet. This can be observed from the results of 'ViT+AMA (Ours)' and 'ViT+AMA+M\(^{2}\)A\(^{2}\)E (Ours)' for modalities RGB+IR, RGB+Depth, IR+Depth, and RGB+IR+Depth under the 'seen' protocol in Table 1. The reason might be that M\(^{2}\)A\(^{2}\)E operates at the instance level and is weak in mining contrastive and discriminative live/spoof clues across multiple instances. One possible solution is to introduce self-supervised contrastive learning (Cao et al., 2022) and design distillation-based strategies with ImageNet pre-trained teacher models to improve the discriminative capacity of M\(^{2}\)A\(^{2}\)E models.

Generalization on various ViT structures   Here we show the results on the 'seen' protocol of WMCA with modalities RGB+IR, RGB+Depth, IR+Depth, and RGB+IR+Depth using two more efficient backbones, which are edge-friendly and practical for real-world deployment. The results are shown in Fig. 19. We can find that MobileViT-S (Mehta &, 2021) and SwinTransformer-S (Liu et al., 2021d) with AMA respectively achieve \(-2.08\)%/\(-3.41\)%/\(-3.26\)%/\(-1.47\)% and \(-1.99\)%/\(-3.11\)%/\(-3.05\)%/\(-1.58\)% ACER for the RGB+Depth/RGB+IR/IR+Depth/RGB+IR+Depth modality settings, indicating the superiority and generalization of AMA on different ViT-based models.

Fairness of the evaluation protocols   Please note that only 'ViT+AMA+M\(^{2}\)A\(^{2}\)E' needs self-supervised pretraining on CeFA. In other words, all results of 'ViT+AMA' are fairly compared with other methods without CeFA. The reasons we use CeFA for self-supervised pretraining are that (1) self-supervised pretraining is usually conducted on large-scale unlabeled task-aware datasets; and (2) using CeFA for pretraining avoids data leakage on WMCA and CASIA-SURF. We also show the results of intra-dataset testings on the WMCA and CASIA-SURF datasets when self-supervised pretraining is performed on WMCA and CASIA-SURF, respectively. In this case, no extra datasets (e.g., CeFA) are introduced. 'ViT+AMA+M\(^{2}\)A\(^{2}\)E' with 'RGB+IR+Depth' modalities achieves 1.16%, 3.43 ± 6.52%, and 0.56% ACER for the 'seen' protocol of WMCA, the 'unseen' protocol of WMCA, and CASIA-SURF, respectively.

5 Conclusions and Future Work

In this paper, we investigate three key factors (i.e., inputs, pretraining, and finetuning) for ViT-based multimodal FAS. We propose to combine local descriptors for IR inputs, and design the modality-asymmetric masked autoencoder and the adaptive multimodal adapter for efficient self-supervised pre-training and supervised finetuning of multimodal FAS models. We note that the study of ViT-based multimodal FAS is still at an early stage. Future directions include: (1) besides the inputs, integrating local descriptors into transformer blocks (Yu et al., 2022) or adapters is promising for ViT-based multimodal FAS; (2) besides generalization, the discriminative capacity of M\(^{2}\)A\(^{2}\)E pre-trained models should be improved.