
1 Introduction and Related Work

It is not possible to judge the personality of a person from a mere glimpse of the face, but people nevertheless attribute apparent personality traits to a newly encountered face in a stereotypical way, and with remarkable consistency [1]. In this work, we tackle the problem of predicting apparent personality using the data and protocol of the ChaLearn Looking at People 2016 First Impression Challenge [2].

It is not surprising that emotional expressions influence the attribution of personality traits. A smiling person, for example, is more likely to be perceived as trustworthy and friendly. Todorov et al. convincingly argued that rapid, unreflective trait inferences from faces can influence consequential decisions [3]; this is why people do not typically use frowning or angry pictures in their resumés. The context of the image can also affect the perception of the face. In our proposed approach, we estimate emotional facial expressions, as well as cues from the context of the face, to predict first impressions.

Before describing our approach, we provide a brief literature review on automatic personality trait recognition. In the past, various approaches have been used for recognizing apparent personality traits from different modalities such as audio [4, 5], text [6–8] and visual information [9, 10]. As in other recognition problems, multimodal systems have also been investigated to improve the robustness of prediction [11–14]. These works aim to estimate personality traits from a given input. In psychology, personality is often assessed with a “Big Five” questionnaire that measures Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism (OCEAN) [15]. Apparent personality is also frequently assessed along these five dimensions.

In their work, Borkenau and Liebler used Brunswik’s lens model and categorized the particular cues that may communicate a certain personality [16]. They included a large number of indicators, such as overall impression variables (e.g. estimated age, masculinity, attractiveness), acoustic variables (e.g. softness of voice, pleasantness, clarity), static visual variables (e.g. appearance, make-up, garments, thin lips, hair style, facial expression), and dynamic visual variables (e.g. movement speed, hand movements, walking style). To assess personality trait attributions, they measured “validity,” which indicates the correlation between self-ratings of personality and ratings by strangers or acquaintances. Brunswik’s lens model looks at cues used for perceived traits, and links some of these cues to actual traits by assessing their ecological validity [17]. It is a useful conceptualization, also used in approaches to personality computing [18].

According to the literature, faces are a rich source of cues for apparent personality attribution, related to stereotype judgments. For an automatic analysis system, the first steps of a visual face analysis pipeline are face detection [19, 20] and facial landmark localization [21–23]. Face alignment (or registration) is an important step, as all further processing depends on its accuracy. Recent deep neural network approaches are known to be more resistant to registration errors.

Face alignment is followed by visual feature extraction, which can include image-level appearance descriptors such as Local Binary Patterns (LBP) [24], Histogram of Oriented Gradients (HOG) [25], Scale-invariant Feature Transform (SIFT) [26], video-level descriptors such as Local Gabor Binary Patterns from Three Orthogonal Planes (LGBP-TOP) [27] and Local Phase Quantization (LPQ)-TOP [28], or geometric information [9, 10].

Deep learning based approaches have achieved state-of-the-art results in human behavior analysis. These approaches, when trained with large datasets, can provide representations that are very robust to variations exhibited in the data. Deep learning has been successfully applied to many tasks related to computer vision such as object recognition [29, 30], face recognition [31], emotion recognition [32] and age estimation [33–37]. Moreover, deep representations of images are often usable for multiple tasks, enabling transfer learning from pre-trained models. The disadvantages are the relatively high computational requirements for training such systems, the large amount of training data required, and the relative difficulty of extending them temporally to video processing.

In recent approaches to personality impressions classification, Support Vector Machines (SVM) [38] have been widely used [5, 8, 12, 14]. More recently, a learning approach called Extreme Learning Machines (ELM), which is similar to SVMs but provides faster learning schemes, has become popular [39]. The name ELM is debated in the literature because of the method’s strong resemblance to earlier approaches; we use it in this work for convenience. The approach has been shown to provide good performance in a number of applications, including face recognition [40, 41], emotion recognition [42, 43], and smile detection [44].

Given the success of deep learning approaches and the speed of ELM, we propose to use a fusion of deep face and scene features, followed by regularized regression with a kernel ELM classifier. The main contribution of this work is the effective combination of emotion related and ambient features that are efficiently extracted from pre-trained/fine-tuned Deep Convolutional Neural Network (DCNN) models. Our method is illustrated in a simplified flowchart in Fig. 1.

Fig. 1. Flowchart of the proposed method.

The remainder of this paper is organized as follows. In the next section we provide background and details on the methodology. Then in Sect. 3, we present the experimental results, followed by implementation details in Sect. 4. Finally, Sect. 5 concludes with future directions.

2 Methodology

Our proposed approach evaluates a short video clip that contains a single person, and outputs an estimate of apparent personality traits in the five dimensions mentioned earlier. In this section, we describe the three main steps of our pipeline, namely, face alignment, feature extraction, and modeling.

2.1 Face Alignment

For detecting and aligning faces in the videos, we use Xiong and de la Torre’s Supervised Descent Method (SDM), also known as IntraFace [21]. This approach locates 49 landmarks on the face. After the landmarks are located, we estimate the roll angle of the face from the eye corner locations and rotate the image to rectify the face. We then add a margin of 20% of the interocular distance around the outer landmarks to compute a loose bounding box, from which we crop the facial image. The cropped face is resized to \(64\times 64\) pixels and registered as a new frame. Frames from a sample input video and the corresponding aligned face images are shown in Fig. 2.
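
A minimal MATLAB sketch of this rectification and cropping step is given below; it is not the exact implementation. The eye-corner landmark indices, variable names, and rotation sign conventions are illustrative assumptions that depend on the landmark ordering and coordinate system of the detector.

```matlab
% Hedged sketch of roll rectification and loose cropping.
% 'frame' is an RGB video frame, 'pts' a 49x2 [x y] landmark matrix from
% SDM/IntraFace; the eye-corner indices below are assumptions.
leftEye  = pts(20, :);                       % outer left eye corner (assumed index)
rightEye = pts(29, :);                       % outer right eye corner (assumed index)

% Roll angle from the eye corners; the sign may need flipping depending on
% whether y grows downwards in the chosen coordinate system.
roll      = atan2d(rightEye(2) - leftEye(2), rightEye(1) - leftEye(1));
rectified = imrotate(frame, roll, 'bilinear', 'crop');   % rotate about image centre

% Rotate the landmarks by the same angle about the image centre
% (the point-rotation direction must match imrotate's convention).
c = [size(frame, 2), size(frame, 1)] / 2;
R = [cosd(roll) -sind(roll); sind(roll) cosd(roll)];
ptsRot = bsxfun(@plus, bsxfun(@minus, pts, c) * R', c);

% Loose bounding box: margin of 20% of the interocular distance.
margin = 0.2 * norm(rightEye - leftEye);
tl = min(ptsRot, [], 1) - margin;            % top-left corner [x y]
br = max(ptsRot, [], 1) + margin;            % bottom-right corner [x y]
face = imcrop(rectified, [tl, br - tl]);     % [xmin ymin width height]
face = imresize(face, [64 64]);
```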

Fig. 2. Face alignment example.

2.2 Feature Extraction

We extract facial features that are summarized over the entire video segment, and scene features from the first image of each video. The assumption is that videos do not span multiple shots.

Face Features: After aligning the faces, we extract image-level deep features from a network that is trained for facial emotion recognition. The training of this network is explained in more detail in Sect. 2.3. For comparison, we also extract features from the original VGG-Face network that was trained for face recognition [31]. For both networks, we use the response of the 33\(^{rd}\) layer of the 37-layer architecture, which is the lowest-level 4096-dimensional descriptor.
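
As an illustration, extracting such a descriptor with MatConvNet's simplenn wrapper can be sketched as follows. The model file name, the preprocessing field names, and the layer-index bookkeeping are assumptions and may differ across MatConvNet and model releases.

```matlab
% Hedged sketch of extracting the layer-33 response with MatConvNet
% (run vl_setupnn beforehand; 'vgg-face.mat' is an assumed model file name).
net = load('vgg-face.mat');
if isfield(net, 'net'), net = net.net; end    % some releases wrap the model

im = single(face);                            % aligned face image
sz = net.meta.normalization.imageSize(1:2);   % field names vary by release
im = imresize(im, sz);
avg = net.meta.normalization.averageImage;
if numel(avg) == 3, avg = reshape(avg, 1, 1, 3); end
im = bsxfun(@minus, im, avg);                 % mean subtraction

res  = vl_simplenn(net, im);                  % forward pass
feat = squeeze(res(34).x);                    % res(k+1).x is the output of layer k
feat = feat(:)';                              % 4096-dimensional row vector
```

The scene features described below follow the same pattern with the VGG-19 model, reading the response of its 39th layer (i.e. `res(40).x` under this indexing convention) for the full frame.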

We compare the deep features with traditional appearance descriptors and with geometric information, which has been shown to be effective in emotion recognition [45]. We report the cross-validation accuracy of each approach in Sect. 3.2.

Video Features: After extracting frame-level features from each registered face, we summarize the videos by computing functional statistics of each dimension over time. The functionals include mean, standard deviation, offset, slope, and curvature. Offset and slope are calculated from the first order polynomial fit to each feature contour, while curvature is the leading coefficient of the second order polynomial. An empirical comparison of the individual functionals is given in Sect. 3.2.
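
A minimal sketch of these functionals is shown below; the variable names are illustrative, and `F` is assumed to hold the frame-level features of one video.

```matlab
% F is a T x d matrix of frame-level features for one video (T frames).
% Functionals are computed per dimension; 't' is a normalized time axis.
[T, d] = size(F);
t = linspace(0, 1, T)';

stats = zeros(5, d);                 % [mean; std; offset; slope; curvature]
stats(1, :) = mean(F, 1);
stats(2, :) = std(F, 0, 1);
for j = 1:d
    p1 = polyfit(t, F(:, j), 1);     % first-order fit: [slope, offset]
    p2 = polyfit(t, F(:, j), 2);     % second-order fit: leading coef. = curvature
    stats(3, j) = p1(2);             % offset (intercept)
    stats(4, j) = p1(1);             % slope
    stats(5, j) = p2(1);             % curvature
end
videoFeat = stats(:)';               % 5*d-dimensional video descriptor
```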

Scene Features: In order to use ambient information in the images to our advantage, we extract features using the VGG-19 network [30], which is trained for an object recognition task on the ILSVRC 2012 dataset. Similar to the face features, we use the 4096-dimensional response of the 39\(^{th}\) layer of the 43-layer architecture. This yields a description of the overall image, containing both the face and the scene, which we combine with the face features using feature-level fusion.

2.3 CNN Fine Tuning

We start with the VGG-Face network [31], changing the final layer (originally a 2622-dimensional recognition layer) to a 7-dimensional emotion recognition layer whose weights are initialized randomly. We fine-tune this network with the softmax loss function using around 30,000 training images from the FER-2013 dataset [46]. We use an initial learning rate of 0.0001, a momentum of 0.9, and a batch size of 64. We train the model for 5 epochs and show the validation set performance for each epoch in Fig. 3.
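
The layer replacement can be sketched in MatConvNet's simplenn format as follows; the initialization scale and the exact layer bookkeeping are assumptions, not the paper's exact code.

```matlab
% Hedged sketch: repurposing the final VGG-Face prediction layer for the
% seven FER-2013 emotion classes (simplenn format; details are assumptions).
net = load('vgg-face.mat');
if isfield(net, 'net'), net = net.net; end

net.layers(end) = [];                           % drop the original softmax layer
f = 0.01;                                       % random-init scale (assumption)
net.layers{end} = struct( ...                   % replace the 2622-way layer
    'type',    'conv', ...
    'weights', {{f * randn(1, 1, 4096, 7, 'single'), zeros(1, 7, 'single')}}, ...
    'stride',  1, 'pad', 0);
net.layers{end+1} = struct('type', 'softmaxloss');

% Training (not shown) would then run SGD with learning rate 1e-4,
% momentum 0.9, batch size 64, for 5 epochs, e.g. via MatConvNet's
% example cnn_train routine.
```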

Fig. 3. Fine-tuning the VGG-Face network on the FER-2013 public test set. The plot on the left shows the softmax loss, whereas the plot on the right shows the top-1 and top-2 classification errors.

2.4 Regression with Kernel ELM

In order to model personality traits from visual features, we used kernel ELM, due to the learning speed and accuracy of the algorithm. In the following paragraphs, we briefly explain the learning strategy of ELM.

ELM proposes a simple and robust learning algorithm for single-hidden-layer feedforward networks. The input-to-hidden weights and the hidden layer biases are initialized randomly to obtain the output of the hidden layer. The output weights are then calculated by a simple generalized inverse operation on the hidden layer output matrix.

ELM tries to find the mapping between the hidden node output matrix \(\mathbf {H} \in \mathbb {R}^{N \times h}\) and the label vector \(\mathbf {T} \in \mathbb {R}^{N \times 1}\) where N and h denote the number of samples and the hidden neurons, respectively. The set of output weights \(\mathbf {\beta } \in \mathbb {R}^{h \times 1}\) is calculated by the least squares solution of the set of linear equations \(\mathbf {H} \mathbf {\beta }=\mathbf {T}\), as:

$$\begin{aligned} \mathbf {\beta } = \mathbf {H}^{\dagger }\mathbf {T}, \end{aligned}$$
(1)

where \(\mathbf {H}^{\dagger }\) denotes the Moore-Penrose generalized inverse [47] that minimizes the \(L_2\) norms of \(||\mathbf {H}\mathbf {\beta }-\mathbf {T}||\) and \(||\mathbf {\beta }||\) simultaneously.
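
For concreteness, the basic ELM solution of Eq. (1) can be sketched as follows; the sigmoid activation and the hidden layer size are illustrative choices, not values from the paper.

```matlab
% X is an N x d feature matrix, T an N x 1 target vector.
[N, d] = size(X);                                % N samples, d-dimensional features
h = 100;                                         % hidden layer size (assumption)
W = randn(d, h);  b = randn(1, h);               % random input weights and biases
H = 1 ./ (1 + exp(-(X * W + repmat(b, N, 1))));  % hidden layer output (sigmoid)
beta = pinv(H) * T;                              % Eq. (1): Moore-Penrose solution
```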

To increase the robustness and the generalization capability of ELM, a regularization coefficient C is included in the optimization procedure. Therefore, given a kernel \(\mathbf {K}\), the set of weights is learned as follows:

$$\begin{aligned} \mathbf {\beta } = (\frac{\mathbf {I}}{C}+\mathbf {K})^{-1} \mathbf {T}. \end{aligned}$$
(2)

In order to prevent parameter overfitting, we use the linear kernel \(\mathbf {K}(x,y) = x^Ty\), where x and y are the original feature vectors after min-max normalization of each dimension over the training samples. With this approach, the only parameter of our model is the regularization coefficient C, which we optimize with 5-fold subject-independent cross-validation on the training set. In Sect. 3.2, we report the average score over the folds with the selected parameter.
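
A minimal sketch of this training and prediction procedure is given below; the variable names are illustrative, and C would be chosen by the cross-validation described above.

```matlab
% Xtr (N x d) and Xte (M x d) are training/test features; Ttr is N x 1.
% Min-max normalization per dimension, using training-set statistics only.
mn = min(Xtr, [], 1);
rg = max(Xtr, [], 1) - mn;  rg(rg == 0) = 1;   % guard against constant dimensions
XtrN = bsxfun(@rdivide, bsxfun(@minus, Xtr, mn), rg);
XteN = bsxfun(@rdivide, bsxfun(@minus, Xte, mn), rg);

C = 1;                                         % regularization coefficient (tuned by CV)
K = XtrN * XtrN';                              % linear kernel Gram matrix (N x N)
beta  = (eye(size(K, 1)) / C + K) \ Ttr;       % Eq. (2)
Tpred = (XteN * XtrN') * beta;                 % predictions for the test samples
```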

3 Experiments

3.1 Challenge and Corpus

The “ChaLearn LAP Apparent Personality Analysis: First Impressions” challenge consists of 10,000 clips collected from 5,563 YouTube videos, where the poses are more or less frontal, but the resolution, lighting and background conditions are not controlled, hence providing a dataset with in-the-wild conditions. Each clip in the training set is labeled for the Big Five personality traits. Basic statistics of the dataset partitions are provided in Table 1. The detailed information on the challenge and corpus can be found in [2].

Table 1. Dataset summary

Performance Evaluation: The performance score in this challenge is the Mean Absolute Error subtracted from 1, which is formulated as follows:

$$\begin{aligned} 1-\frac{1}{N}\sum _{i=1}^{N} |\hat{y}_i-y_i|, \end{aligned}$$
(3)

where N is the number of samples, \(\hat{y}\) is the predicted label and y is the true label (\(0\le y \le 1\)). This score is then averaged over the five traits, so the final score varies between 0 (worst case) and 1 (best case).
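
Computed per trait and then averaged, the score of Eq. (3) amounts to the following (assuming N x 5 matrices of ground-truth and predicted labels; the variable names are illustrative).

```matlab
% yTrue and yPred are N x 5 matrices with values in [0, 1], one column per trait.
perTrait   = 1 - mean(abs(yPred - yTrue), 1);  % 1 - MAE for each of the five traits
finalScore = mean(perTrait);                   % average over the five traits
```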

3.2 Experimental Results

In this section, we report the regression performance of various visual descriptors. Tables 2 and 3 summarize the performances of the different systems with 5-fold subject-independent cross-validation on the training set.

We first look at the performance of individual functionals, which are described in Sect. 2.2. As can be seen in Table 2, the combination of mean, standard deviation, and offset features works well, and the mean by itself is the most informative functional.

Table 2. Functional statistics with deep face features.

We then evaluate individual feature sets of different dimensionalities: geometric features (GEO), LPQ-TOP, LBP-TOP, and different deep neural network features. Table 3 summarizes the results and gives the dimensionality of each feature set. We observe that features from the deep face model fine-tuned on the FER emotion corpus provide higher performance than both the original deep features and the hand-crafted visual features. Combining these features with ambient (scene) information further improves the prediction performance.

Table 3. Regression performance with various visual descriptors

The best fusion system (ID 9 in Table 3) gives a test set mean accuracy of 0.9094, which ranks fifth in the official competition. Comparing this test set performance with the other competitors’ accuracies (see Table 4), we observe that performances are generally around 0.90–0.91. The top accuracy is 0.9130, and the top six teams all score above 0.9.

Table 4. Final ranking on the test set

We show the estimations of our system during cross validation in Figs. 4 and 5. The results in Fig. 4 show how precisely our system can estimate the personality traits under various imaging conditions. Figure 5 shows that examples with labels very close to 0 or 1 tend to have higher error, which might be due to the approximately normal distribution of training labels with mean values around 0.5.

Fig. 4. Six examples from the training set where our approach produced good estimations of the traits. For each example, the first column shows the ground truth (True) and the second column shows the estimation of the model (Pred.)

Fig. 5. Examples from the training set where our approach produced poor estimations of the traits. For each example, the first column shows the ground truth (True) and the second column shows the estimation of the model (Pred.)

4 Implementation Details

The whole system is implemented in MATLAB R2015b on a 64-bit Windows 10 PC with 32 GB RAM and an Intel i7-6700 CPU. For fine-tuning and feature extraction with CNNs, the MatConvNet library [48] has been used with GPU parallelization, using an NVidia GeForce GTX 970 GPU. Time spent on important parts of the pipeline is summarized in Table 5.

Table 5. Time requirement for each step of the pipeline

5 Conclusions

In this paper, we proposed to use transfer learning to estimate apparent personality traits from first impressions. We use deep convolutional neural networks (DCNNs) that were originally trained for other tasks, such as face, object, and emotion recognition, and employ their features directly, thereby showing the feasibility of deep transfer learning for this task.

By combining two sets of DCNN features that carry facial expression and ambient information, we achieve better results than each of these feature sets alone, as well as than other hand-crafted visual features. In this work, we did not make use of the audio modality, which was shown to be beneficial in earlier works; audio-based and multimodal analyses constitute our future work. Video modeling is carried out using simple statistical functionals, an approach that is fast and has been shown to be accurate. In future work, a wider set of functionals will be investigated.