
1 Introduction

Analysis of the human face has been an important task in computer vision because it plays a major role in soft biometrics and human-computer interaction [7, 33]. Facial behavior is known to aid perception of identity [32, 34]. In particular, facial dynamics play a crucial role in improving the accuracy of facial trait estimation tasks such as age estimation and gender classification [6, 9].

With the recent progress of deep learning, convolutional neural networks (CNNs) have shown outstanding performance in many fields of computer vision. Several research efforts have been devoted to developing spatio-temporal feature representations for various applications such as action recognition [13, 21, 23, 40] and activity parsing [26, 42]. In [23], a long short-term memory (LSTM) network was designed on top of CNN features to encode dynamics in video. The LSTM network is a variant of the recurrent neural network (RNN) designed to capture long-term temporal information in sequential data [19]. By using the LSTM, the temporal correlation of CNN features was effectively encoded.

Recently, a few research efforts have been made on facial dynamic feature encoding for facial analysis [6, 9, 24, 25]. It is generally known that the dynamic features of local regions are valuable for facial trait estimation [6, 9]. Moreover, the motion of a facial local region during a facial expression is usually related to the motions of other facial regions [39, 43]. However, to the best of our knowledge, there are no studies that utilize the relations between facial motions and interpret which relations between local dynamics are important for facial trait estimation.

In this paper, a novel deep network is proposed for interpreting the relations between local dynamics in facial trait estimation. To interpret the relations between facial local dynamics, the proposed deep network consists of a facial local dynamic feature encoding network and a facial dynamics interpreter network. The facial dynamics interpreter network encodes the importance of each relation for estimating facial traits. The main contributions of this study are summarized in the following three aspects:

  1.

    We propose a novel deep network which estimates facial traits by using relations between facial local dynamics of a smile expression.

  2.

    The proposed deep network is designed to be able to interpret the relations between local dynamics in facial trait estimation. For that purpose, the relational importance is devised. The relational importance is encoded from the relational features of facial local dynamics and is used to interpret which relations are important in facial trait estimation.

  3.

    To validate the effectiveness of the proposed method, comparative experiments were conducted on two facial trait estimation problems (i.e., age estimation and gender classification). In the proposed method, facial traits are estimated by combining the relational features based on the relational importance. By exploiting the relational features and considering the importance of relations, the proposed method estimates facial traits more accurately than state-of-the-art methods.

2 Related Work

Age Estimation and Gender Classification. A lot of research effort has been devoted to the development of automatic age estimation and gender classification techniques from face images [2, 4, 16, 22, 28, 29, 38, 41]. Recently, deep learning methods have shown notable potential in various face analysis tasks. One of the main focuses of these methods is to design a deep network structure suitable for the specific task. Parkhi et al. [31] reported a VGG-style CNN learned from large-scale static face images. Deep learning based age estimation and gender classification methods have been reported, but they were mostly designed for static face images [22, 27, 28, 41].

Facial Dynamic Analysis. The temporal dynamics of the face have largely been ignored in both age estimation and gender classification. Recent studies have reported that facial dynamics could be an important cue for facial trait estimation [6, 8,9,10, 17]. With aging, the face loses muscle tone and underlying fat tissue, which creates wrinkles, sunken eyes, and crow’s feet around the eyes [9]. Aging also affects facial dynamics along with appearance: as a person gets older, the elastic fibers of the face fray. Therefore, facial dynamic features of local facial regions are important cues for age estimation. Cognitive-psychological studies [1, 5, 18, 36] have reported evidence for gender dimorphism in human expression. Females express emotions more frequently than males, whereas males tend to show restricted emotions and to be unwilling to self-disclose intimate feelings [6]. In [6], Dantcheva et al. used dynamic descriptors extracted from facial landmarks for gender classification. However, there are no studies that learn relations of dynamic features for facial trait estimation.

Relational Network. In this paper, we propose a novel deep learning architecture for analyzing relations of facial dynamic features in facial trait estimation. A relational network was previously reported for visual question answering (VQA) [35]. In [35], the authors defined an object as a neuron on the feature map obtained from a CNN and designed a neural network for relational reasoning. However, it was designed for image-based VQA. In this paper, the proposed method automatically encodes the importance of relations by considering locational information on the face. By utilizing the importance of relations, the proposed method can interpret the relations between facial dynamics in facial trait estimation.

3 Proposed Facial Dynamics Interpreter Network

The overall structure of the proposed facial dynamics interpreter network is shown in Fig. 1. The aim of the proposed method is to interpret the important relations between local dynamics when estimating facial traits from an expression sequence. The proposed method largely consists of the facial local dynamic feature encoding network, the facial dynamics interpreter network, and the interpretation of important relations between facial local dynamics. The details are described in the following subsections.

Fig. 1. Overall structure of the proposed facial dynamics interpreter network.

3.1 Facial Local Dynamic Feature Encoding Network

Given a face sequence, appearance features are computed by a CNN on each frame. For appearance feature extraction, we employ the VGG-face network [31], which is trained on large-scale face images. The pre-trained VGG-face model is used to obtain off-the-shelf CNN features in this study. Given these CNN features, the proposed facial dynamics interpreter network is investigated. The output of a convolutional layer in the VGG-face network is used as the feature map of the facial appearance representation.

Based on the feature map, the face is divided into \(N_0\) local regions. The locations of the local regions were determined so that the relations of local dynamics could be interpreted on semantically meaningful facial local regions (i.e., left eye, forehead, right eye, left cheek, nose, right cheek, left mouth side, mouth, and right mouth side in this study). Note that each face sequence is automatically aligned based on landmark detection [3]. Let \(\mathbf {x}_i^t\) denote the local appearance features of the i-th facial local part at the t-th time step. To encode local dynamic features, an LSTM network with a fully-connected layer is devised on top of the local appearance features \(\mathbf {X}_i=\left\{ \mathbf {x}_i^1,\ldots ,\mathbf {x}_i^t,\ldots ,\mathbf {x}_i^T\right\} \) as follows:

$$\begin{aligned} \begin{aligned} \mathbf {d}_i=f_{\phi _{D}}(\mathbf {X}_i), \end{aligned} \end{aligned}$$
(1)

where \(\mathbf {d}_i\) denotes the facial local dynamic feature of the i-th local part and \(f_{\phi _{D}}\) is a function with learnable parameters \(\phi _{D}\). \(f_{\phi _{D}}\) consists of the fully-connected layer and the LSTM layers as shown in Fig. 1. T denotes the length of the face sequence. The LSTM network can deal with sequences of different lengths. Various dynamics-related features, including variation of appearance, amplitude, speed, and acceleration, can be encoded from the sequence of local appearance features. The detailed configuration of the network used in the experiments is presented in Sect. 4.1.
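
As a concrete illustration, the following PyTorch-style sketch shows one possible implementation of the local dynamic feature encoder \(f_{\phi _{D}}\) of Eq. (1): a fully-connected embedding followed by stacked LSTM layers applied to the appearance-feature sequence of one facial region, with the last hidden state taken as \(\mathbf {d}_i\). The layer sizes follow Sect. 4.1, while the class and variable names are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

class LocalDynamicEncoder(nn.Module):
    """Sketch of f_phi_D: FC layer + stacked LSTMs over one facial local region.

    Assumes the local appearance feature x_i^t is a flattened 2x2x512 patch
    of the VGG-face feature map (2*2*512 = 2048 dims), as in Sect. 4.1.
    """
    def __init__(self, in_dim=2 * 2 * 512, hidden_dim=1024, num_layers=2):
        super().__init__()
        self.fc = nn.Linear(in_dim, hidden_dim)        # fully-connected embedding
        self.lstm = nn.LSTM(hidden_dim, hidden_dim,    # stacked LSTM (2 layers)
                            num_layers=num_layers, batch_first=True)

    def forward(self, x_seq):
        # x_seq: (batch, T, in_dim) local appearance features X_i over T frames
        h = torch.relu(self.fc(x_seq))                 # frame-wise embedding
        out, _ = self.lstm(h)                          # temporal encoding
        d_i = out[:, -1, :]                            # last hidden state as d_i
        return d_i
```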

3.2 Facial Dynamics Interpreter Network

We extract object features (i.e., facial local dynamic features and locational features) for pairs of objects. The locational features are defined as the central position of the object (i.e., the facial local region). To provide the facial dynamics interpreter network with the location information of the objects, the local dynamic features and the locational features are concatenated and defined as the object features \(\mathbf {o}_i\). The object feature can be written as

$$\begin{aligned} \begin{aligned} \mathbf {o}_i=[\mathbf {d}_i,p_i,q_i], \end{aligned} \end{aligned}$$
(2)

where \([p_i,q_i]\) denotes the normalized central position of the i-th object.
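
Continuing the sketch above, the object feature of Eq. (2) can be formed by concatenating the dynamic feature with a normalized region center. The 3 \(\times \) 3 grid layout of the nine regions assumed below is an illustrative assumption based on Sect. 4.1.

```python
import torch

def build_object_feature(d_i, row, col, grid=3):
    """Concatenate the local dynamic feature with a normalized (p, q) center.

    Assumes the 9 facial regions lie on a 3x3 grid; (row, col) index the region.
    """
    p = (col + 0.5) / grid   # normalized horizontal center p_i
    q = (row + 0.5) / grid   # normalized vertical center q_i
    pos = torch.tensor([[p, q]], dtype=d_i.dtype, device=d_i.device)
    pos = pos.expand(d_i.size(0), 2)
    return torch.cat([d_i, pos], dim=1)   # o_i = [d_i, p_i, q_i]
```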

The design philosophy of the proposed facial dynamics interpreter network is to construct a neural network whose functional form captures the core relations for facial trait estimation. The importance of a relation can differ for each pair of object features. The proposed facial dynamics interpreter network is designed to encode this relational importance, which can then be used to interpret the relations between local dynamics in facial trait estimation.

Let \(\lambda _{i,j}\) denote the relational importance between the i-th and j-th object features. The relational feature, which represents the latent relation between two objects for facial trait estimation, can be written as

$$\begin{aligned} \begin{aligned} \mathbf {r}_{i,j}=g_{\phi _{R}}(\mathbf {s}_{i,j}), \end{aligned} \end{aligned}$$
(3)

where \(g_{\phi _{R}}\) is a function with learnable parameters \(\phi _{R}\). \(\mathbf {s}_{i,j}=(\mathbf {o}_i,\mathbf {o}_j)\) is the relation pair of the i-th and j-th facial local parts. \(\mathbf {S}=\left\{ \mathbf {s}_{1,2},\cdots ,\mathbf {s}_{i,j},\cdots ,\mathbf {s}_{(N_0-1),N_0}\right\} \) is the set of relation pairs, where \(N_0\) denotes the number of objects in the face. \(\mathbf {o}_i\) and \(\mathbf {o}_j\) denote the i-th and j-th object features, respectively. The relational importance \(\lambda _{i,j}\) for the relation between two object features (\(\mathbf {o}_i\), \(\mathbf {o}_j\)) is encoded as:

$$\begin{aligned} \begin{aligned} \lambda _{i,j}=h_{\phi _{I}}(\mathbf {r}_{i,j}), \end{aligned} \end{aligned}$$
(4)

where \(h_{\phi _{I}}\) is a function with learnable parameters \(\phi _{I}\). In this paper, \(h_{\phi _{I}}\) is defined with \(\phi _{I}=\left\{ \left( \mathbf {W}_{1,2},\mathbf {b}_{1,2}\right) ,\cdots ,\left( \mathbf {W}_{(N_0-1),N_0},\mathbf {b}_{(N_0-1),N_0}\right) \right\} \) as follows:

$$\begin{aligned} \begin{aligned} h_{\phi _{I}}\left( \mathbf {r}_{i,j}\right) =\frac{\exp \left( \mathbf {W}_{i,j}\mathbf {r}_{i,j}+\mathbf {b}_{i,j}\right) }{\sum _{i,j}\exp \left( \mathbf {W}_{i,j}\mathbf {r}_{i,j}+\mathbf {b}_{i,j}\right) }. \end{aligned} \end{aligned}$$
(5)

The aggregated relational features \(\mathbf {f}_{agg}\) are represented by

$$\begin{aligned} \begin{aligned} \mathbf {f}_{agg}=\sum _{i,j}\lambda _{i,j}\mathbf {r}_{i,j}. \end{aligned} \end{aligned}$$
(6)

Algorithm 1. Pseudocode for calculating the relational importance of \(N_I\) objects (see Sect. 3.3).

Finally, the facial trait estimation can be performed with

$$\begin{aligned} \begin{aligned} \mathbf {y}=k_{\phi _{E}}(\mathbf {f}_{agg}), \end{aligned} \end{aligned}$$
(7)

where \(\mathbf {y}\) denotes the estimated result and \(k_{\phi _{E}}\) is a function with parameters \(\phi _{E}\). \(k_{\phi _{E}}\) and \(g_{\phi _{R}}\) are implemented with fully-connected layers.
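
Putting Eqs. (3)-(7) together, a minimal PyTorch-style sketch of the interpreter stage could look as follows: the object features are paired, \(g_{\phi _{R}}\) produces relational features, \(h_{\phi _{I}}\) assigns a softmax-normalized importance with per-pair weights as in Eq. (5), and \(k_{\phi _{E}}\) estimates the trait from the aggregated feature. Layer sizes follow Sect. 4.1 (dropout placement and batch normalization are simplified), and all names are illustrative.

```python
import itertools
import torch
import torch.nn as nn

class DynamicsInterpreter(nn.Module):
    """Sketch of the facial dynamics interpreter stage (Eqs. 3-7)."""
    def __init__(self, obj_dim=1024 + 2, rel_dim=4096, num_objects=9, out_dim=1):
        super().__init__()
        self.pairs = list(itertools.combinations(range(num_objects), 2))
        self.g = nn.Sequential(                      # g_phi_R (Eq. 3)
            nn.Linear(2 * obj_dim, rel_dim), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(rel_dim, rel_dim), nn.ReLU(), nn.Dropout(0.5))
        # h_phi_I: one weight vector and bias per pair (W_ij, b_ij) (Eqs. 4-5)
        self.W = nn.Parameter(torch.randn(len(self.pairs), rel_dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(len(self.pairs)))
        self.k = nn.Sequential(                      # k_phi_E (Eq. 7)
            nn.Linear(rel_dim, 2048), nn.ReLU(),
            nn.Linear(2048, 1024), nn.ReLU(),
            nn.Linear(1024, out_dim))

    def forward(self, objects):
        # objects: (batch, num_objects, obj_dim) object features o_i = [d_i, p_i, q_i]
        s = torch.stack([torch.cat([objects[:, i], objects[:, j]], dim=-1)
                         for i, j in self.pairs], dim=1)    # relation pairs s_ij
        r = self.g(s)                                       # relational features r_ij
        scores = (r * self.W).sum(dim=-1) + self.b          # W_ij r_ij + b_ij
        lam = torch.softmax(scores, dim=1)                  # relational importance
        f_agg = (lam.unsqueeze(-1) * r).sum(dim=1)          # Eq. (6)
        y = self.k(f_agg)                                   # Eq. (7)
        return y, lam
```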

3.3 Interpretation on Important Relations Between Facial Local Dynamics

The proposed method is useful for interpreting the relations in facial trait estimation. The relational importance calculated in Eq. (4) is utilized to interpret the relations of facial local dynamics. Note that a high relational importance value means that the relational feature of the corresponding facial local parts is important for estimating facial traits. The pseudocode for calculating the relational importance of \(N_I\) objects is given in Algorithm 1. By analyzing the relational importance, the important relations for estimating facial traits can be explained. In Sects. 4.2 and 4.3, we discuss the important relations for age estimation and gender classification, respectively.
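
To illustrate how such an interpretation can be read off, the snippet below ranks object pairs by the importance values produced by the sketch in Sect. 3.2 and returns the top relations. The region names correspond to the nine local parts of Sect. 3.1; the ranking procedure is only one example of how \(\lambda _{i,j}\) may be inspected.

```python
import torch

REGIONS = ["left eye", "forehead", "right eye", "left cheek", "nose",
           "right cheek", "left mouth side", "mouth", "right mouth side"]

def top_relations(lam, pairs, k=5):
    """Return the k object pairs with the highest (batch-averaged) importance."""
    mean_lam = lam.mean(dim=0)                     # average over test sequences
    order = torch.argsort(mean_lam, descending=True)[:k]
    return [(REGIONS[pairs[idx][0]], REGIONS[pairs[idx][1]], float(mean_lam[idx]))
            for idx in order.tolist()]

# Example usage with the earlier sketch:
# y, lam = interpreter(object_features)
# print(top_relations(lam, interpreter.pairs))
```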

4 Experiments

4.1 Experimental Settings

Database. To evaluate the effectiveness of the proposed facial dynamics interpreter network, comparative experiments were conducted. For generalization purposes, we verified the proposed method on both age estimation and gender classification tasks; age and gender are known as representative facial traits [28]. The public UvA-NEMO Smile database was used for both tasks [10, 11]. The UvA-NEMO Smile database is known as the largest smile database [12]. It consists of 1,240 smile videos collected from 400 subjects, of which 185 are female and the remaining 215 are male. The ages of the subjects range from 8 to 76 years. For evaluating age estimation performance, we used the experimental protocol defined in [9,10,11]. A 10-fold cross-validation scheme was used to calculate the performance of the proposed method, and the folds were divided such that there was no subject overlap [9,10,11]. Each time, an independent test fold was held out and used only for calculating the performance; the remaining 9 folds were used to train the deep network and optimize hyper-parameters. To evaluate gender classification performance, we followed the experimental protocol used in [6].

Evaluation Metric. For age estimation, the mean absolute error (MAE) [41] was used for evaluation. The MAE measures the error between the predicted age and the ground-truth age, and was computed as follows:

$$\begin{aligned} \begin{aligned} \epsilon =\frac{{\sum _{n=1}^{N_{test}}||\mathbf {\hat{y}}_n-\mathbf {y}_n^*||}_1}{N_{test}}, \end{aligned} \end{aligned}$$
(8)

where \(\mathbf {\hat{y}}_n\) and \(\mathbf {y}_n^*\) denote the predicted age and ground-truth age of the n-th test sample, respectively, and \(N_{test}\) denotes the number of test samples. For gender classification, classification accuracy was used for evaluation. We report the MAE and classification accuracy averaged over all test folds.
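
For reference, Eq. (8) corresponds to the following simple computation (a sketch with illustrative array names):

```python
import numpy as np

def mean_absolute_error(y_pred, y_true):
    """MAE of Eq. (8): average absolute difference between predicted and true ages."""
    y_pred, y_true = np.asarray(y_pred, dtype=float), np.asarray(y_true, dtype=float)
    return np.abs(y_pred - y_true).mean()

# e.g. mean_absolute_error([25.3, 41.0], [27, 38]) -> 2.35
```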

Implementation Details. The face images used in the experiments were automatically aligned based on the two eye locations detected by facial landmark detection [3]. The face images were cropped and resized to 96 \(\times \) 96 pixels. For the appearance representation, the first 10 convolutional layers and 4 max-pooling layers of the VGG-face network were used. As a result, a 6 \(\times \) 6 \(\times \) 512 feature map was obtained from each face image. Each facial local region was defined on the feature map with a size of 2 \(\times \) 2 \(\times \) 512; in other words, there were 9 objects in the face sequence (\(N_0=9\)). A fully-connected layer with 1024 units and stacked LSTM layers were used for \(f_{\phi _{D}}\); we stacked two LSTMs, each with 1024 memory cells. Two fully-connected layers with 4096 units each (with dropout [37]) and ReLU [30] were used for \(g_{\phi _{R}}\). \(h_{\phi _{I}}\) was implemented by a fully-connected layer followed by a softmax function. Two fully-connected layers with 2048 and 1024 units (with dropout, ReLU, and batch normalization [20]) and one fully-connected output layer (1 neuron for age estimation and 2 neurons for gender classification) were used for \(k_{\phi _{E}}\). The mean squared error loss was used for training the deep network for age estimation, and the cross-entropy loss was used for gender classification.
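
Under the configuration above, the nine local regions can be obtained from the 6 \(\times \) 6 \(\times \) 512 feature map as non-overlapping 2 \(\times \) 2 \(\times \) 512 blocks on a 3 \(\times \) 3 grid. The following sketch shows this slicing; the row-major ordering of regions is our own assumption.

```python
import torch

def split_into_local_regions(feature_map):
    """Split a (512, 6, 6) feature map into 9 flattened 2x2x512 local regions.

    Returns a tensor of shape (9, 2048), ordered row by row over the 3x3 grid.
    """
    c, h, w = feature_map.shape            # expected (512, 6, 6)
    regions = []
    for row in range(0, h, 2):
        for col in range(0, w, 2):
            patch = feature_map[:, row:row + 2, col:col + 2]   # (512, 2, 2)
            regions.append(patch.reshape(-1))                  # flatten to 2048
    return torch.stack(regions)            # (9, 2048)
```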

4.2 Age Estimation

Interpreting Relations Between Facial Local Dynamics in Age Estimation. To understand the mechanism of the proposed facial dynamics interpreter network in age estimation, the relational importance calculated from each sequence was analyzed. Figure 2 shows the important relations, i.e., the pairs with high relational importance values. We show how the important regions differ across ages by presenting the important relations for each age group. Ages were divided into five groups (8–12, 13–19, 20–36, 37–65, and 66+) according to [15]. To interpret the important relations of each age group, the relational importance values encoded from the test set were averaged within each group. Four groups are visualized with example face images (no subject in the [8–12] age group gave permission for their face images to be shown). As shown in the figure, when estimating the [66+] age group, the relation between the two eye regions was important. This relation can capture discriminative dynamic features related to crow’s feet and sunken eyes, which are important factors for estimating the ages of older people. In addition, when considering three objects, the relation among the left eye, right eye, and left cheek had the highest relational importance in the [66+] age group. The relational importance showed a tendency toward symmetry; for example, the relation among the left eye, right eye, and right cheek was within the top-5 relational importance values among the 84 three-object relations in the [66+] age group. Although relations of action units (AUs) for determining specific facial expressions have been reported [14], relations of motions for estimating age or classifying gender had not been investigated. In this study, the facial dynamics interpreter network was designed to interpret the relation of motions in facial trait estimation. It was found that the relation of dynamic features related to AU 2 and AU 6 was heavily used by the deep network for estimating ages in the [66+] range.

Fig. 2. Example of facial dynamics interpretation in age estimation. The most important relations are visualized with yellow boxes: the relation between 2 objects (top) and the relation among 3 objects (bottom). (a) age group [13–19], (b) age group [20–36], (c) age group [37–66], (d) age group 66+.

Fig. 3. Perturbing the local dynamic features by replacing them with the local dynamic features of a subject from another age group (e.g., an older subject).

In addition, to verify the effect of important relations, we perturbed the dynamic features as shown in Fig. 3. For a sequence of a 17-year-old subject, we replaced the local dynamic features of the left cheek region with those of a 73-year-old subject. Note that the cheek formed important pairs for estimating the [13–19] age group, as shown in Fig. 2(a). With this perturbation, the absolute error changed from 0.41 to 2.38. In the same way, we replaced the dynamic features of two other regions (left eye and right eye) one by one. These two regions formed relatively less important relations, and the perturbations yielded absolute errors of 1.40 and 1.81 (left eye and right eye, respectively). The increase in absolute error was smaller than for the perturbation of the left cheek. This shows that, in the [13–19] age group, relations with the left cheek were more important for estimating age than relations with the eyes.

Fig. 4. Perturbing local dynamic features by replacing them with a zero vector.

For the same sequence, the facial dynamics interpreter network without the relational importance was also analyzed. Without the relational importance, perturbing the local dynamic feature of the left cheek increased the absolute error of the estimated age from 1.20 to 7.45. When perturbing the left eye and the right eye, the absolute errors were 1.87 and 4.21, respectively. The increase in absolute error was much larger when perturbing the left cheek, and the increase was larger overall when the facial dynamics interpreter network did not use the relational importance. In other words, the facial dynamics interpreter network with the relational importance was more robust to feature contamination because it adaptively encoded the relational importance from the relational features as in Eq. (4).

Table 1. Mean absolute error (MAE) measured after perturbing local dynamic features at different locations for subjects in the age group of [37–66].

To statistically analyze the effect of contaminated features on the proposed facial dynamics interpreter network, we also evaluated the MAE when perturbing the dynamic features of each facial local part with a zero vector, as shown in Fig. 4. For the 402 videos collected from subjects in the [37–66] age group of the UvA-NEMO database, the MAE was calculated as shown in Table 1. As shown in the table, perturbing the most important facial region (i.e., the right cheek for the [37–66] age group) influenced the accuracy of age estimation more than perturbing less important parts (i.e., the left eye, forehead, and right eye for the [37–66] age group). The difference in MAE between perturbing the important part and the less important parts was statistically significant (p < 0.05).
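
A sketch of this zero-vector perturbation analysis, built on the earlier sketches, is given below. The data handling (folds, feature extraction) of Sect. 4.1 is omitted, and the interface of `model` and `dataset` is an illustrative assumption.

```python
import torch

def mae_with_zeroed_region(model, dataset, region_idx):
    """MAE (Eq. 8) when the dynamic feature of one facial region is zeroed out.

    `model(objects)` is assumed to return (predicted_age, importance), and each
    dataset item is assumed to provide precomputed object features plus the true age.
    """
    errors = []
    with torch.no_grad():
        for objects, age in dataset:                # objects: (1, 9, obj_dim)
            perturbed = objects.clone()
            perturbed[:, region_idx, :-2] = 0.0     # zero d_i, keep (p_i, q_i)
            y, _ = model(perturbed)
            errors.append(abs(float(y) - float(age)))
    return sum(errors) / len(errors)

# e.g. compare mae_with_zeroed_region(model, fold, region_idx=5)   # right cheek
#      against the unperturbed MAE to gauge that region's influence
```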

Table 2. Mean absolute error (MAE) of age estimation on the UvA-NEMO smile database for analyzing the effectiveness of locational features and relational importance. L.F. and R.I. denote locational features and relational importance, respectively.
Table 3. Mean absolute error (MAE) on the UvA-NEMO smile database compared with other methods.
Fig. 5. Examples of the proposed method on age estimation. For visualization purposes, face sequences are displayed at 5 frames per second.

Assessment of Facial Dynamics Interpreter Network for Age Estimation. We evaluated the effectiveness of the facial dynamics interpreter network. First, the effects of the relational importance and the locational features were validated for age estimation. Table 2 shows the MAE of the facial dynamics interpreter network with and without the locational features and the relational importance. To verify the effectiveness of the relational features, we also compared an aggregation of local dynamic features using regional importance, in which facial local dynamic features were aggregated with regional importance in an unsupervised way. As shown in the table, using the relational features improved the accuracy of age estimation. Moreover, the locational features further improved the performance by letting the network know the location of the object pairs. The locational features of the objects were meaningful because the objects of the face sequence were automatically aligned by facial landmark detection. By utilizing both the relational importance and the locational features, the proposed facial dynamics interpreter network achieved the lowest MAE of 3.87 over the whole test set. This is mainly because the importance of relations differs for age estimation; by considering the importance of the relational features, the accuracy of age estimation was improved. Moreover, we further analyzed the MAE of age estimation according to the spontaneity of the smile expression. The MAE of the facial dynamics interpreter network was slightly lower for posed smiles (p > 0.05).

To assess the effectiveness of the proposed dynamics interpreter network (with locational features and relational importance), the MAE of the proposed method was compared with state-of-the-art methods (see Table 3). The VLBP [17], displacement [10], BIF [16], BIF with dynamics [9], IEF [2], IEF with dynamics [9], and a holistic dynamic approach were compared. In the holistic dynamic approach, appearance features were extracted by the same VGG-face network used in the proposed method, and the dynamic features were encoded with the LSTM network on the holistic appearance features without dividing the face into local parts. It was compared because it is a widely used architecture for spatio-temporal encoding [13, 24, 25]. As shown in the table, the proposed method achieved the lowest MAE. The MAE of the proposed facial dynamics interpreter network was lower than that of IEF + Dynamics, and the difference was statistically significant (p < 0.05). This is mainly attributed to the fact that the proposed method encodes latent relational features from object features (facial local dynamic features and locational features) and effectively combines the relational features based on the relational importance. Examples of age estimation by the proposed method and the holistic dynamic approach are shown in Fig. 5.

Fig. 6. Example of interpreting important relations between facial dynamics in gender classification. The most important relations among 3 objects are visualized with yellow boxes for recognizing males (a) and females (b).

4.3 Gender Classification

Interpreting Relations Between Facial Local Dynamics in Gender Classification. To interpret the important relations in gender classification, the relational importance values encoded from each sequence were analyzed. Figure 6 shows the important relations, i.e., those with high relational importance values when classifying gender from a face sequence. As shown in the figure, the relation among the forehead, nose, and mouth side was important for deciding on males. Note that there was again a tendency toward symmetry in the relational importance: for determining males, the relation among forehead, nose, and right mouth side and the relation among forehead, nose, and left mouth side were the top-2 important relations among the 84 three-object relations. For females, the relation among the forehead, nose, and cheek was important. This could be related to the observation that females express emotions more frequently than males, whereas males tend to show restricted emotions. In other words, females tend to make bigger smiles than males by using the muscles of the cheek regions. Therefore, relations between the cheek and other face parts were important for recognizing females.

Table 4. Accuracy of gender classification on the UvA-NEMO smile database for analyzing the effectiveness of the locational features and relational importance. L.F. and R.I. denote locational features and relational importance, respectively.
Table 5. Accuracy of gender classification on the UvA-NEMO smile database compared with other methods.

Assessment of Facial Dynamics Interpreter Network for Gender Classification. We also evaluated the effectiveness of the proposed facial dynamics interpreter network for gender classification. First, the classification accuracies of the facial dynamics interpreter network with and without the relational importance and locational features are summarized in Table 4. For comparison, the aggregation of local dynamic features using regional importance was also evaluated. The proposed facial dynamics interpreter network achieved the highest accuracy when using both the locational features and the relational importance, showing that both components are also important for gender classification.

Table 5 shows the classification accuracy of the proposed facial dynamics interpreter network compared with other methods on the UvA-NEMO database. Two appearance-based approaches, named “how-old.net” and “commercial off-the-shelf (COTS)”, were combined with a hand-crafted dynamic approach for gender classification [6]. How-old.net is a website (http://how-old.net/) launched by Microsoft for online age and gender recognition. COTS is a commercial face detection and recognition software package which includes gender classification. The dynamic approach calculated the facial local regions’ dynamic descriptors such as amplitude, speed, and acceleration, as described in [6]. In the holistic dynamic approach, appearance features were extracted by the same VGG-face network used in the proposed method, and the dynamic features were encoded on the holistic appearance features. An image-based method [27] was also compared to validate the effectiveness of utilizing facial dynamics in gender classification. The accuracies of how-old.net + dynamics and COTS + dynamics are taken directly from [6], while the accuracies of the image-based CNN and the holistic dynamic approach were calculated in this study. By exploiting the relations between local dynamic features, the proposed method achieved the highest accuracy among the compared methods. The performance difference between the holistic approach and the proposed method was statistically significant (p < 0.05).

5 Conclusions

According to cognitive-psychological studies, facial dynamics can provide crucial cues for face analysis, and the motion of a facial local region during a facial expression is known to be related to the motions of other facial regions. In this paper, a novel deep learning approach was proposed that interprets the relations between facial local dynamics in facial trait estimation from smile expressions. Facial traits were estimated by combining relational features of facial local dynamics based on the relational importance. Comparative experiments verified the effectiveness of the proposed method for facial trait estimation, and the important relations between facial dynamics were interpreted by the proposed method for both gender classification and age estimation. The proposed method estimated facial traits (age and gender) more accurately than state-of-the-art methods. We will attempt to extend the proposed method to other facial dynamic analyses such as spontaneity analysis [11] and video-based facial expression recognition [24].