
1 Introduction

Recently, unconstrained face recognition (FR) has been rigorously researched in the computer vision community [1, 2]. In its early days, the single-image setting was used for FR evaluations, e.g., the Labeled Faces in the Wild (LFW) verification task [3]. The explosion of visual media pushed the research into the next phase, where video face verification attracted much attention, such as the YouTube Faces (YTF) dataset [4]. Since LFW and YTF have a well-known frontal pose selection bias, unconstrained FR is still considered an unsolved problem [5, 6]. In addition, open-set face identification is considerably more challenging than the verification tasks popularized by the LFW and YTF datasets [7, 8].

The IARPA Janus Benchmark A (IJB-A) [9] provides a more practical unconstrained face verification and identification benchmark. It takes a set (containing orderless images and/or videos with extreme head rotations, complex expressions and illuminations) as the smallest unit of representation. The set of a subject can be sampled from the mugshot history of a criminal, lifetime enrollment images for identity documents, different checkpoints, or the trajectory of a face in a video. This setting is closer to real-world biometric scenarios [10]. Capturing human faces from multiple views, background environments and camera parameters does result in large inner-set variations, but it also incorporates complementary information that can lead to higher accuracy in practical applications [11].

Fig. 1. Illustration of the image set-based 1:1 face verification (left) and open-set 1:N identification (right) using the typical aggregation method.

A commonly adopted strategy to aggregate the identity information of each image is average/max pooling [12,13,14,15]. Since images vary in quality, neural network-based assessment modules have been introduced to independently assign a weight to each frame [11, 16]. By doing so, frontal and clear faces are favored by these models. However, this may result in redundancy and sacrifice the diversity within a set. As shown in Fig. 2, inferior frontal images are given relatively high weights in a set, sometimes as high as the weight given to the most discriminative one. Little additional information can be extracted from a blurry version of the same pose, while valuable profile information is almost ignored by the system. We argue that the desired weighting decision should depend on the other images within a set.

Fig. 2. Typical examples from the test sets of (a) YTF and (b) IJB-A showing the weights of images calculated by the previous method NAN [16] and the proposed DAC.

Instead, we propose to formulate the attention scheme as a Markov Decision Process (MDP) and resort to actor-critic reinforcement learning (RL) to harness model learning. The dependency-aware attention control (DAC) module learns a policy to decide the importance of each image step-by-step with the observation of the other images in a set. In this way, we adaptively aggregate the feature vectors into a highly compact representation inside the convex hull spanned by them. It not only explicitly learns to advocate high-quality images while repelling low-quality ones, but also considers the inner-set dependency to reduce redundancy while retaining the benefit of diversity.

Moreover, it is always challenging for a single set-level invariant feature to incorporate all of the potential information under varying poses, illumination conditions, resolutions, etc. Some approaches aggregate the image-level pair-wise similarity scores of two compared sets to fully use all images [17,18,19,20]. Given n as the average number of images in a set, this implies \(\mathcal {O}(n^2)\) computational complexity per match operation and \(\mathcal {O}(n)\) space complexity per set, which is undesirable. More recently, [21,22,23] were proposed to trade off speed against accuracy when processing paired video inputs using value-based Q-learning methods. These configurations focus on verification and do not scale well to large-scale identification tasks [16]. Conventionally, the feature extraction of the probe and gallery samples are independent processes [8].

We notice that pose is the primary challenge in the IJB-A dataset and real applications [7, 24, 25], and there is a prior that the structures of frontal and profile faces are significantly different. Therefore, we simply utilize a pose-guided representation (PGR) scheme with stochastic routing to model the inter-set dependency. It well balances the computation cost and information utilization.

Considering the above factors, we propose to fully exploit both the inner- and inter-set relationships for unified set-based face verification and identification. (1) To the best of our knowledge, this is the first effort to introduce deep actor-critic RL into a visual recognition problem. (2) The DAC can potentially be a general solution to incorporate rich correlation cues among orderless images. Its coefficients can be trained in a normal recognition training task given only set-level identity annotations, without the need for extra supervision signals. (3) To further improve sample-efficiency, trust region-based experience replay is introduced to speed up the training and achieve a stronger convergence property. (4) The PGR scheme well balances the computation cost and information utilization at the extremes of pose variation using prior domain knowledge of the human face. (5) The module-based feature-level aggregation also inherits the advantages of conventional pooling strategies, e.g., taking a varying number of inputs as well as offering time and memory efficiency.

We show that our method leads to state-of-the-art accuracy on the IJB-A dataset and also generalizes well to several video-based face recognition tasks, e.g., YTF and Celebrity-1000.

2 Related Work

Image Set/Video-Based Face Recognition has been actively studied in recent years [2]. The multi-image setting in template-based datasets is similar to the multiple frames in the video-based recognition task. However, the temporal structure within a set is usually disordered, and the inner/inter-set variations are more challenging [9]. We do not cover methods that exploit temporal dynamics here. There are two kinds of conventional solutions, i.e., manifold-based and image-based methods. In the first category, each set/video is usually modeled as a manifold, and the similarity or distance is measured at the manifold level. In previous works, the affine hull, SPD model, Grassmann manifolds, n-order statistics and hyperplane similarity have been proposed to describe the manifolds [26,27,28,29,30]. In these methods, images are considered of equal importance, and they usually cannot handle the large appearance variations of the unconstrained FR task. In the second category, the pairwise similarities between probe and gallery images are exploited for verification [17, 18, 20, 31, 32]. The quadratic number of comparisons makes them scale poorly to identification tasks. Yang et al. [16] propose an attention model that aggregates a set of features into a single representation with an independent quality assessment module for each feature. Reference [33] further up-sampled the aggregated feature to an image, which was then fed to an image-based FR network. However, the weighting decision for an image does not take the other images into account, as discussed in Sect. 1. Since the RNNs frequently used in video tasks [22, 34, 35] are not suited to orderless image sets, in this work we consider the dependency within a set of features in a different way, using deep reinforcement learning to suggest the attention for each feature.

Reinforcement Learning (RL) trains an agent to interact (by trial and error) with a dynamic environment with the objective of maximizing its accumulated reward. Recently, deep RL with convolutional neural networks (CNNs) achieved human-level performance in Atari games [36]. The CNN is an ideal function approximator for the infinite state space [37]. There are two main streams for solving RL problems: methods based on value functions and methods based on policy gradients. The first category, e.g., Q-learning, is the common solution for discrete action tasks [36]. The second category can be efficient for continuous action spaces [38, 39]. There is also a hybrid actor-critic approach in which the parameterized policy is called an actor and the learned value function is called a critic [40, 41]. As it is essentially a policy gradient method, it can also be used for continuous action spaces [42].

Besides, policy-based and actor-critic methods have faster convergence characteristics than value-based methods [43], but they usually suffer from low sample-efficiency, high variance and convergence to local optima, since they typically learn via on-policy algorithms [44, 45]. Even the Asynchronous Advantage Actor-Critic [40, 41] requires new samples to be collected for each gradient step on the policy. This quickly becomes extravagantly expensive, as the number of gradient steps needed to learn an effective policy increases with task complexity. Off-policy learning instead aims to reuse past experiences. This is not directly feasible with conventional policy gradient formulations, despite being relatively straightforward for value-based methods [37]. Hence, in this paper, we focus on combining the stability of actor-critic methods with the efficiency of off-policy RL, capitalizing on recent advances in deep RL [40], especially off-policy algorithms [46, 47].

In addition to its traditional applications in robotics and control, RL has recently been successfully applied to a few visual recognition tasks. Mnih et al. [48] introduce the recurrent attention model to focus on selected regions or locations of an image for digit detection and classification. This idea is extended to identity alignment by iteratively removing irrelevant pixels from each image [49]. Value-based Q-learning methods have been used for object tracking [50] and for computationally efficient video verification by dropping inefficient probe-gallery pairs [22] or stopping the comparison after receiving sufficient pairs [21, 23]. However, this inevitably results in information loss from the unused pairs and is only applicable to verification. Little progress has been made in policy gradient/actor-critic RL for visual recognition.

3 Proposed Methods

The flow chart of our framework is illustrated in Fig. 3. It takes a set of face images as input and processes them with two major modules to output a single (w/o PGR) or three (with PGR) feature vectors as its representation for recognition. We adopt a modern CNN module to embed an image into a latent space, which largely reduces the computation cost and offers a practicable state space for RL. Then we cascade the DAC, which works as an attention model that reads all feature vectors and linearly combines them with adaptive weighting at the feature level. Following the memory attention mechanism described in [16, 34, 35], the features are treated as the memory and the feature weighting is cast as a memory addressing procedure. These two modules can be trained in a one-by-one or end-to-end manner. We choose the first option, which allows our system to benefit from the sufficient training data of image-based FR datasets. The PGR scheme can further utilize prior knowledge of the human face to address a set with large pose variations.

3.1 Inner-Set Dependency Control

In the set-based recognition task, we are given M sets/videos \(\left\{ ({\mathcal {X}}^m, y^m)\right\} _{m=1}^{M}\), where \({\mathcal {X}}^m\) is an image set/video sequence with a varying number of images \(T^m\) (i.e., \({\mathcal {X}}^m=\left\{ {{x}_{1}^{m}},{{x}_{2}^{m}},\cdots ,{{x}_{T^m}^{m}}\right\} \), with \({{x}_{t}^{m}}\) the t-th image in the set) and \(y^m\) is the corresponding set-level identity label. We feed each image \({{x}_{t}^{m}}\) to our model, and its corresponding feature representation \({{f}_{t}^{m}}\) is extracted using our neural embedding network. Here, we adopt GoogLeNet [51] with Batch Normalization [52] to produce a 128-dimensional feature as our encoding of each image. With a relatively simple architecture, GoogLeNet has shown superior performance on several FR benchmarks, and it can easily be replaced by other advanced CNNs for better performance. In the rest of the paper, we simply refer to our neural embedding network as the CNN, and omit the upper index (identity) where appropriate for better readability.

Fig. 3. Our network architecture for image set-based face recognition.

Since the features are deterministically computed from the images, they also inherit and display large variations. Simply discarding some of them using a hard attention scheme may result in the loss of too much information in a set [16, 22]. Our attention control can be seen as a reinforcement learning task of finding the optimal weights of soft attention, which define how much each feature is focused on by the memory attention mechanism. Moreover, the principles of handling a varying number of images without temporal information and of having trainable parameters through standard recognition training are fully considered.

Our solution is to formulate inner-set dependency modeling as an MDP. At each time step t, the agent receives a state \({s_t}\) in a state space \(\mathcal {S}\) and chooses an action \({a_t}\) from an action space \(\mathcal {A}\), following a policy \(\pi (a_t\mid s_t)\), which is the behavior of the agent. The action then determines the next state \(s_{t+1}\) or termination, and the agent receives a reward \({r_t(s_t,a_t)}\in \mathcal {R} \subseteq \mathbb {R}\) from the environment. The goal is to find an optimal policy \(\pi ^*\) that maximizes the expected discounted total return \(R_t=\sum _{i\ge 0}^{T}\gamma ^{i}{r_{t+i}(s_{t+i},a_{t+i})}\), where \(\gamma \in [0,1)\) is the discount factor that trades off the importance of immediate and future rewards [37].

In the context of image set-based face recognition, we define the actions, i.e., \(\left\{ a^{1},a^{2},\cdots ,a^{T}\right\} \), as the weights of the feature representations \({{\left\{ f_i\right\} }_{i=1}^{T}}\). The weights of soft attention \({{\left\{ a_i\right\} }_{i=1}^{T}}\) are initialized to 1 and are updated step-by-step. The state \(s_t\) is related to the \(t-1\) already weighted features and the \(T-(t-1)\) to-be-weighted features. In contrast to image-level dependency modeling, the compact embeddings largely shrink the state space and make our RL training feasible. In practice, \(s_t\) is the concatenation of \(f_t\) and the aggregation of the remaining features with their updated weights at time step t, as given in Eq. (1). Termination means that all of the images in the set have been traversed.

$$\begin{aligned} {s_t}= \left\{ \frac{(\sum _{i=1}^T {a_i}{f_i})-{f_t}}{(\sum _{i=1}^T {a_i})-1} \right\} ~\text {Concatenate}~ \left\{ {f_t}\right\} \end{aligned}$$
(1)
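For concreteness, the following NumPy sketch shows how the state in Eq. (1) can be assembled; the names (feats, a, t) are illustrative and not part of our released implementation.

```python
import numpy as np

def build_state(feats, a, t):
    """Assemble s_t as in Eq. (1): the weighted average of the remaining
    features (with their current weights) concatenated with f_t.

    feats : (T, D) array of per-image embeddings f_1..f_T
    a     : (T,) array of current attention weights (initialized to 1)
    t     : index of the feature being graded at this step
    """
    weighted_sum = (a[:, None] * feats).sum(axis=0) - feats[t]  # sum_i a_i f_i - f_t
    weight_total = a.sum() - 1.0                                # sum_i a_i - 1 (a_t is still 1)
    context = weighted_sum / max(weight_total, 1e-8)            # aggregation of the other features
    return np.concatenate([context, feats[t]])                  # s_t in R^{2D}
```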

We define the global reward for RL by the overall recognition performance of the aggregated embedding, which drives the RL network optimization. In practice, we add a few fully connected layers h on top of the DAC, followed by a softmax, to compute the cross-entropy loss \(L_m = -\log \left( e^{o_{y^m}} / \sum _j e^{o_j} \right) \) used for the reward at this time step. We use \(o_j\) to denote the j-th element of the vector of class scores o. \(g(\cdot )\) is the weighted-average aggregation function, and h maps the aggregated feature with the updated weights, \(g({\mathcal {X}}^m\mid s_{t})\), to o. The reward is defined as follows:

$$\begin{aligned} g({\mathcal {X}^m}\mid s_{t},\text {CNN})=\sum _{i=1}^{T^m} \frac{{a_i}{f_i}^m}{\sum {a_i}}~~(\text {with updated}~a_i~\text {at step}~t) \end{aligned}$$
(2)
$$\begin{aligned} {r_t}&= \left\{ L_m[h(g({\mathcal {X}}^m\mid s_{t}))]-L_m[h(g({\mathcal {X}}^m\mid s_{t+1}))]\right\} +\lambda \text {max}[0,({1-a_t})] \end{aligned}$$
(3)

where the hinge loss term serves as a regularizer that encourages redundancy elimination and is balanced by \(\lambda \); it also helps stabilize training. The aggregation operation essentially selects a point inside the convex hull spanned by all feature vectors [26].
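A minimal PyTorch sketch of the reward in Eqs. (2)-(3) is shown below; `classifier` stands for the fully connected layers h, `lam` for the balance weight \(\lambda \), and all names are illustrative assumptions rather than the exact training code.

```python
import torch
import torch.nn.functional as F

def aggregate(feats, a):
    """Eq. (2): weighted average of the set features with the current weights."""
    return (a.unsqueeze(1) * feats).sum(dim=0) / a.sum()

def step_reward(feats, a_prev, a_new, t, classifier, label, lam=0.1):
    """Eq. (3): reward = decrease of the set-level cross-entropy loss after
    updating a_t, plus a hinge regularizer max(0, 1 - a_t).
    `label` is the set-level identity as a long tensor."""
    def set_loss(a):
        logits = classifier(aggregate(feats, a))             # h(g(X | s))
        return F.cross_entropy(logits.unsqueeze(0), label.view(1))
    with torch.no_grad():
        r = set_loss(a_prev) - set_loss(a_new)               # L_m(s_t) - L_m(s_{t+1})
        r = r + lam * torch.clamp(1.0 - a_new[t], min=0.0)   # lambda * max(0, 1 - a_t)
    return r
```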

Considering that the action space here is a continuous space \(\mathcal {A}\subseteq \mathbb {R}^+\), value-based RL (e.g., Q-learning) cannot tackle this task. We adapt an actor-critic network to directly grade each feature conditioned on the observation of the other features. In a policy-based method, the training objective is to find a parametrized policy \({\pi }_\theta (a_t\mid s_t)\) that maximizes the expected reward \(J(\theta )\) over all possible aggregation trajectories given a starting state. Following the Policy Gradient Theorem [43], the gradient of the parameters given the objective function has the form:

$$\begin{aligned} {\nabla }_\theta J(\theta )=\mathbb {E}[{\nabla }_\theta \text {log} {\pi }_\theta (a_t\mid s_t)({Q}(s_t,a_t)-b(s_t))] \end{aligned}$$
(4)

where \({Q}(s_t,a_t)=\mathbb {E}[R_t\mid s_t,a_t]\) is the state-action value function, in which the initial action \(a_t\) is provided when calculating the expected return starting from state \(s_t\). A baseline function \(b(s_t)\) is typically subtracted to reduce the variance without changing the estimated gradient [44, 53]. A natural candidate for this baseline is the state-only value function \({V}(s_t)=\mathbb {E}[R_t\mid s_t]\), which is similar to \(Q(s_t,a_t)\) except that \(a_t\) is not given. The advantage function is defined as \({A}(s_t,a_t)={{Q}(s_t,a_t)}-{{V}(s_t)}\) [37]. Equation (4) then becomes:

$$\begin{aligned} {\nabla }_\theta J(\theta )=\mathbb {E}[{\nabla }_\theta \text {log} {\pi }_\theta (a_t\mid s_t){A}(s_t,a_t)] \end{aligned}$$
(5)

This can be viewed as a special case of the actor-critic model, where \({\pi }_\theta (a_t\mid s_t)\) is the actor and \({A}(s_t,a_t)\) is the critic. To reduce the number of required parameters, the parameterized temporal difference (TD) error \({\delta _\omega }={r_t}+{\gamma }V_{\omega }(s_{t+1})-V_{\omega }(s_{t})\) can be used to approximate the advantage function [45]. We use two different symbols, \(\theta \) and \(\omega \), for the actor and critic functions, but most of their parameters are shared in a main-stream neural network, which is then split into two branches for the policy and value predictions, respectively.
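The shared actor-critic network and the resulting on-policy update can be sketched as follows; the layer sizes, the Gaussian parameterization of the continuous weight action, and all names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared trunk with two heads: the actor predicts the (continuous) weight
    action a_t, the critic predicts the state value V(s_t)."""
    def __init__(self, state_dim=256, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, 1)                 # mean of a Gaussian policy
        self.log_std = nn.Parameter(torch.zeros(1))    # state-independent std
        self.value = nn.Linear(hidden, 1)

    def forward(self, s):
        h = self.trunk(s)
        return self.mu(h), self.log_std.exp(), self.value(h)

def a2c_loss(model, s_t, a_t, r_t, s_next, gamma=0.9):
    """On-policy A2C step: the TD error delta_omega approximates the advantage
    in Eq. (5); the value head is regressed towards the one-step target."""
    mu, std, v_t = model(s_t)
    _, _, v_next = model(s_next)
    td_error = (r_t + gamma * v_next - v_t).detach()
    log_prob = torch.distributions.Normal(mu, std).log_prob(a_t)
    policy_loss = -(log_prob * td_error).mean()                       # Eq. (5)
    value_loss = (r_t + gamma * v_next.detach() - v_t).pow(2).mean()  # critic regression
    return policy_loss + 0.5 * value_loss
```

In practice, the sampled action would additionally be squashed (e.g., by a softplus) to keep the weight positive.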

3.2 Off-Policy Actor-Critic with Experience Replay

On-policy RL methods update the model with the samples collected via the current policy. Experience replay (ER) can be used to improve sample-efficiency [54]: experiences are randomly sampled from a replay pool \(\mathcal {P} \), which also improves training stability by reducing data correlation. Since these past experiences were collected from different policies, the use of ER leads to off-policy updates.
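A replay pool of this kind can be sketched as below; storing the behavior-policy probability with each transition is what later allows the importance-sampling correction (the class and field names are illustrative).

```python
import random
from collections import deque

class ReplayPool:
    """Fixed-size pool of past trajectories. Each transition keeps the
    behavior-policy probability mu(a_t | s_t) so that the importance-sampling
    ratio rho_t can be recomputed under the current policy."""
    def __init__(self, capacity=10000):
        self.pool = deque(maxlen=capacity)

    def add(self, trajectory):
        # trajectory: list of (s_t, a_t, r_t, mu_prob_t) tuples for one image set
        self.pool.append(trajectory)

    def sample(self, batch_size):
        return random.sample(self.pool, min(batch_size, len(self.pool)))
```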

When training models with RL, \(\varepsilon \)-greedy action selection is often used to trade off exploitation against exploration, whereby a random action is chosen with probability \(\varepsilon \) and the top-ranking action is selected otherwise. The policy used to generate a training weight is referred to as the behavior policy \(\mu \), in contrast to the policy to be optimized, which is called the target policy \(\pi \).

The basic advantage actor-critic (A2C) training algorithm described in Sect. 3.1 is on-policy, as it assumes the actions are drawn from the same policy as the target to be optimized (i.e., \(\mu =\pi \)). In off-policy learning, however, the current policy \(\pi \) is updated with samples generated from old behavior policies \(\mu \). Therefore, an importance sampling (IS) ratio \(\rho _t=\pi (a_t\mid s_t)/\mu (a_t\mid s_t)\) is used to rescale each sampled reward and correct the sampling bias at time step t [55]. For A2C, the off-policy gradient for the parametrized state-only value function \(V_\omega \) thus has the form:

$$\begin{aligned} \varDelta \omega ^{\text {off}}=\sum _{t=1}^{T}(\bar{R}_t-\hat{V}_\omega (s_t))\nabla _\omega \hat{V}_\omega (s_t)\prod _{i=1}^t {\rho _i} \end{aligned}$$
(6)

where \(\bar{R}_t\) is the off-policy Monte-Carlo return [56]:

$$\begin{aligned} \bar{R}_t=r_t+\gamma r_{t+1}\prod _{i=1}^{1} {\rho _{t+i}}+\cdots +\gamma ^{T-t} r_{T}\prod _{i=1}^{T-t} {\rho _{t+i}} \end{aligned}$$
(7)

Likewise, the updated gradient for policy \(\pi _\theta \) is:

$$\begin{aligned} \varDelta \theta ^{\text {off}}=\sum _{t=1}^{T}\rho _{t} {\nabla }_\theta \text {log} {\pi }_\theta (a_t\mid s_t) \hat{\delta }_\omega \end{aligned}$$
(8)

where \(\hat{\delta }_\omega =r_t+\gamma \hat{V}_\omega (s_{t+1})-\hat{V}_\omega (s_{t})\) is the TD error using the estimated value \(\hat{V}_\omega \).
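The importance-sampling corrected return of Eq. (7) can be computed from a stored trajectory as in the sketch below (a direct, unoptimized transcription with illustrative variable names); the gradients in Eqs. (6) and (8) then weight the value and policy updates by the accumulated ratios.

```python
import numpy as np

def off_policy_returns(rewards, rhos, gamma=0.9):
    """Eq. (7): IS-corrected Monte-Carlo return for every step of a stored
    trajectory; rhos[i] = pi(a_i | s_i) / mu(a_i | s_i)."""
    T = len(rewards)
    returns = np.zeros(T)
    for t in range(T):
        ret, corr = 0.0, 1.0
        for i in range(t, T):
            if i > t:
                corr *= rhos[i]                 # product of ratios rho_{t+1} .. rho_i
            ret += (gamma ** (i - t)) * corr * rewards[i]
        returns[t] = ret
    return returns
```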

Here, we introduce a modified Trust Region Policy Optimization method [46, 47]. In addition to maximizing the cumulative reward \(J(\theta )\), the optimization is also subject to a Kullback-Leibler (KL) divergence limit between the updated policy \(\theta \) and an average policy \(\theta _a\) to ensure safe updates. This average policy represents a running average of past policies, \(\theta _a\leftarrow \alpha \theta _a+(1-\alpha )\theta \) with weight \(\alpha \), and constrains the updated policy from deviating too far from it. Thus, given the off-policy policy gradient \(\varDelta \theta ^{\text {off}}\) in Eq. (8), the modified policy gradient with trust region z is calculated as follows:

$$\begin{aligned} \begin{aligned} \underset{z}{\text {minimize}}&~~\frac{1}{2}\Vert \varDelta \theta ^{\text {off}}-z\Vert _2^2,\\ \text {subject to:}&~~\nabla _\theta D_{KL}[\pi _{\theta _a}(s_t)\Vert \pi _{\theta }(s_t)]^{\text {T}} ~z \le \xi \end{aligned} \end{aligned}$$
(9)

where \(\pi \) is the policy parametrized by \(\theta \) or \(\theta _a\), and \( \xi \) controls the magnitude of the KL constraint. Since the constraint is linear, a closed-form solution to this quadratic programming problem can be derived using the KKT conditions. Setting \(k=\nabla _\theta D_{KL}[\pi _{\theta _a}(s_t)\Vert \pi _{\theta }(s_t)]\), we get:

$$\begin{aligned} z_{tr}^*=\varDelta \theta ^{\text {off}}-\max \left\{ \frac{k^{\text {T}}\varDelta \theta ^{\text {off}}-\xi }{\Vert k\Vert _2^2},0\right\} k \end{aligned}$$
(10)

This direction is also shown to be closely related to the natural gradient [57, 58]. These enhancements speed up and stabilize the training of our A2C network.
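The closed-form projection of Eq. (10) amounts to a few vector operations, as in the sketch below (g_off, k and xi are illustrative names for \(\varDelta \theta ^{\text {off}}\), the KL gradient and the constraint bound):

```python
import numpy as np

def trust_region_step(g_off, k, xi=1.0):
    """Eq. (10): project the off-policy policy gradient g_off so that its
    component along k (the KL gradient) satisfies the linearized constraint
    k^T z <= xi."""
    k_dot_g = float(np.dot(k, g_off))
    scale = max((k_dot_g - xi) / (float(np.dot(k, k)) + 1e-8), 0.0)
    return g_off - scale * k
```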

3.3 Pose-Guided Inter-set Dependency Model

To model the inter-set dependency without paired inputs, we propose a pose-guided stochastic routing scheme. This divide-and-conquer idea originated in [59], which constructs several face detectors, each in charge of one view. Given a set of face images, we extract its general feature aggregation \(\mathcal {F}_0\), as well as the aggregation of the frontal face features \(\mathcal {F}_1\) and the profile face features \(\mathcal {F}_2\). \(\mathcal {F}_1\) and \(\mathcal {F}_2\) are the weighted averages of the features from the near-frontal face images (\({\le } 30^{\circ }\)) and profile face images (\({>}30^{\circ }\)), respectively, in which the attention is assigned with the observation of the full set. We use PIFA [60] to estimate the yaw angle. The sums of weights of the frontal and profile features, \(p_1\) and \(p_2\), reflect the quality of each pose group. Considering the mirror transforms used in data augmentation and the symmetry of human faces, we do not distinguish between the right and left halves of the face. With PGR, the distance d between two sets of samples is computed as:

$$\begin{aligned} d= \frac{1}{2}S(\mathcal {F}^1_0,\mathcal {F}^2_0)+\frac{1}{2}\sum _{i=1}^{2} {\sum _{j=1}^2 {S(\mathcal {F}^1_i,\mathcal {F}^2_j)p^1_ip^2_j}} \end{aligned}$$
(11)

where S is the L2 distance function measuring the distance between two feature vectors. We treat the generic features and pose-specific features equally and fuse them for evaluation. The number of distance evaluations is decreased to \(\mathcal {O}(5n)\). This achieves promising verification performance while requiring far fewer comparisons than conventional image-level similarity measurements. The scheme can also readily be applied to other variations.
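A minimal sketch of the set-to-set distance in Eq. (11) is given below; the list layout of the three aggregated features and the handling of the pose-group weights are assumptions made for illustration.

```python
import numpy as np

def pgr_distance(F1, p1, F2, p2):
    """Eq. (11): distance between two sets under the pose-guided representation.

    F1, F2 : lists [F_0, F_1, F_2] of the aggregated generic / frontal / profile
             features (each a 1-D array)
    p1, p2 : (p_1, p_2) pose-group weights of the two sets
    """
    def l2(a, b):
        return float(np.linalg.norm(a - b))
    d = 0.5 * l2(F1[0], F2[0])                               # generic-feature term
    for i in (1, 2):
        for j in (1, 2):
            d += 0.5 * l2(F1[i], F2[j]) * p1[i - 1] * p2[j - 1]
    return d
```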

Fig. 4. Illustration of the pose-guided representation scheme.

4 Numerical Experiments

We evaluated the performance of the proposed method on three set/video-based FR datasets: IJB-A [9], YTF [4], and Celebrity-1000 [61]. To utilize the millions of available still images, we train our CNN embedding module separately. As in [16], 3M face images of 50K identities are detected with JDA [62] and aligned using the LBF [63] method for our GoogLeNet training. This part is fixed when we train the DAC module on each set/video face dataset. Benefiting from the highly compact 128-d feature representation and the simple neural network of the DAC, the training time of our DAC(off) on the IJB-A dataset with a single Xeon E5 v4 CPU is about 3 h, and the average testing time per set pair for verification is 62 ms. We use a Titan Xp for the CNN processing.

As baseline methods, \(\hbox {CNN} + \hbox {Mean L2}\) measures the average L2 distance over all image pairs of two sets, while \(\hbox {CNN}+\hbox {AvePool}\) uses average pooling along each feature dimension for aggregation. The previous work NAN [16] uses the same CNN structure as our framework but adopts a neural network module that independently assesses the quality of each image; therefore, NAN can also be regarded as a baseline. We refer to the vanilla A2C as DAC(on), and use DAC(off) for the actor-critic with the trust region-based experience replay scheme. DAC(off) + PGR is the combination of DAC(off) and PGR.

4.1 Results on IJB-A Dataset

IJB-A [9] is a face verification and identification dataset containing images captured in unconstrained environments with wide variations of pose and imaging conditions. There are 500 identities with a total of 25,813 images (5,397 still images and 20,412 video frames sampled from 2,042 videos). A set of images of a particular identity is called a template, and each template can be a mixture of still images and sampled video frames. The number of images (or frames) in a template ranges from 1 to 190, with approximately 11.4 images and 4.2 videos per subject on average. The dataset provides a ground-truth bounding box and 3 landmarks for each face. There are 10 training and testing splits; each split contains 333 training and 167 testing identities.

Fig. 5. Average ROC (left) and CMC (right) curves of the proposed method and its baselines on the IJB-A dataset over 10 splits.

We compare the proposed framework with existing methods on both face verification and identification following the standard evaluation protocol of the IJB-A dataset. Metrics for the 1:1 compare task are evaluated using the receiver operating characteristic (ROC) curves in Fig. 5(a). We also report the true accept rate (TAR) vs. false accept rate (FAR) in Table 1. For the 1:N search task, the performance is evaluated in terms of the cumulative match characteristic (CMC) curve shown in Fig. 5(b). The CMC is an information retrieval metric that plots identification rates at different ranks; the rank-k identification rate is defined as the percentage of probe searches whose gallery match is returned within the top-k matches. The true positive identification rate (TPIR) vs. false positive identification rate (FPIR) as well as the rank-1 accuracy are also reported in Table 1.
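For reference, the rank-k identification rate underlying the CMC curve can be computed as in the following sketch (a generic metric implementation, not the official IJB-A evaluation code):

```python
import numpy as np

def rank_k_rate(dist_matrix, probe_ids, gallery_ids, k=1):
    """Fraction of probes whose true gallery identity appears among the k
    nearest gallery templates. dist_matrix: (num_probes, num_gallery)."""
    gallery_ids = np.asarray(gallery_ids)
    hits = 0
    for p, row in enumerate(dist_matrix):
        top_k = np.argsort(row)[:k]
        if probe_ids[p] in gallery_ids[top_k]:
            hits += 1
    return hits / len(probe_ids)
```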

Table 1. Performance evaluation on the IJB-A dataset. For verification, the true accept rate (TAR) vs. false accept rate (FAR) is reported. For identification, the true positive identification rate (TPIR) vs. false positive identification rate (FPIR) and the rank-1 accuracy are presented.

These results show that both the verification and the identification performance are largely improved compared to our baseline methods. The RL network has learned to be robust to low-quality and redundant images. DAC(on) outperforms the previous approaches at most of the operating points, showing that our representation is more discriminative than the weighted features of [11, 16], which do not consider the inner-set dependency. Experience replay further helps stabilize training, and state-of-the-art performance is achieved. Combining the off-policy DAC and the pose-guided representation scheme also contributes to the final results in an efficient way.

4.2 Results on YouTube Face Dataset

The YouTube Faces (YTF) dataset [4] is a widely used video face verification dataset, which contains 3,425 videos of 1,595 different subjects. The dataset includes many challenging videos with amateur photography, occlusions, problematic lighting, pose variations and motion blur. The length of the face videos varies from 48 to 6,070 frames, with an average of 181.3 frames. In our experiments, we follow the standard verification protocol as in [16, 22, 33] and test our method on unconstrained face 1:1 verification with the given 5,000 video pairs. These pairs are equally divided into 10 splits, each with around 250 intra-personal pairs and 250 inter-personal pairs.

Table 2 presents the results of our DAC and previous methods. The DAC outperforms all previous state-of-the-art methods under the setting of not fine-tuning the feature embedding module on YTF. Since this dataset has a frontal face bias [6] and its face variations are relatively small, as shown in Fig. 2, we did not use the pose-guided representation scheme. The video sequences are clearly redundant, and considering the inner-video relationship does contribute to the improvement over [16]. The comparable performance with temporal representation-based methods suggests the DAC could be a potential substitute for RNNs in some specific areas; the RNN itself is computationally expensive and sometimes difficult to train [67]. We directly model the dependency at the feature level, which is faster than the temporal representation of original images [22] and more effective than the adversarial face generation-based method [33].

Table 2. Comparison of the average verification accuracy with recent state-of-the-art results on the YTF dataset. \(\dag \) fine-tuned the CNN model on YTF.

It also indicates that the DAC achieves very competitive performance without highly engineered CNN models. Note that FaceNet [18] and NAN [16] also use GoogLeNet-style structures; the DAC outperforms them in both verification accuracy and standard deviation. The Deep FR, TBE-CNN and TR methods additionally fine-tune their CNN models on the YTF dataset, and residual convolutional networks are used in TR. Given our module-based structure, these advanced CNNs can easily be added to the DAC to boost its performance. We see that the DAC generalizes well to video-based face verification datasets.

4.3 Results on Celebrity-1000 Dataset

We then test our method on the Celebrity-1000 dataset [61], which is designed for the unconstrained video-based face identification problem. It contains 2.4M frames from 159,726 face videos (about 15 frames per sequence) of 1,000 subjects. It is released with two standard evaluation protocols: open-set and closed-set. We follow the standard 1:N identification setting as in [61] and report results for both protocols.

For the closed-set protocol, we use the softmax outputs of the reward network and take the subject with the maximum score as the result. Since the baseline methods do not have a multi-class prediction unit, we simply compare the L2 distance as in [16]. We present the results in Table 3 and show the CMC curves in Fig. 6(a). With the help of end-to-end learning and a large volume of training data for the CNN model, deep learning methods outperform [12, 61] by a large margin. The state-of-the-art is achieved by the DAC, and the experience replay again yields improvements over the baselines.

Fig. 6. The CMC curves of different methods on Celebrity-1000. (a) Closed-set tests on 1000 subjects; (b) open-set tests on 800 subjects.

Table 3. Identification performance (rank-1 accuracies) on the Celebrity-1000 dataset for the closed-set tests (left) and open-set tests (right).

For the open-set testing, we take multiple image sequences of each gallery subject to extract a highly compact feature representation as in NAN [16]. The open-set identification is then performed by comparing the L2 distance between the aggregated probe and gallery representations. Figure 6(b) and Table 3 show the results of the different methods in our experiments. Our proposed methods outperform the previous methods again, which clearly shows that the DAC is effective and robust.

5 Conclusions

We have introduced actor-critic RL to the visual recognition problem. We cast inner-set dependency modeling as an MDP and train a DAC agent to exercise attention control over each image at each step. The PGR scheme well balances the computation cost and information utilization. Although we only explore their ability in set/video-based face recognition tasks, we believe this is a general and practicable methodology that could easily be applied to other problems, such as Re-ID, action recognition and event detection.