Abstract
First impressions influence the behavior of people towards a newly encountered person or a human-like agent. Apart from the physical characteristics of the encountered face, the emotional expressions displayed on it, as well as ambient information, affect these impressions. In this work, we propose an approach to predict the first impressions people will have of a given video depicting a face within a context. We employ pre-trained Deep Convolutional Neural Networks to extract facial expressions, as well as ambient information. After video modeling, visual features that represent facial expression and scene are combined and fed to a Kernel Extreme Learning Machine regressor. The proposed system is evaluated on the ChaLearn Challenge Dataset on First Impression Recognition, where the prediction target is the “Big Five” personality trait labels for each video. Our system achieved an accuracy of 90.94 % on the sequestered test set, 0.36 percentage points below the top system in the competition.
1 Introduction and Related Work
It is not possible to judge the personality of a person by a mere glimpse of the face, but people attribute apparent personality traits for a face they newly encounter, in a stereotypical way, and with remarkable consistency [1]. In this work, we tackle the problem of predicting the apparent personality using the data and protocol from the ChaLearn Looking at People 2016 First Impression Challenge [2].
It is not surprising that emotional expressions influence the attribution of personality traits. A smiling person is more likely to be perceived as trustworthy and friendly. Todorov et al. convincingly argued that rapid, unreflective trait inferences from faces can influence consequential decisions [3]. This is why people do not typically use frowning or angry pictures in their resumés. The context of the image can also affect the perception of the face. In our proposed approach, we estimate emotional facial expressions, as well as cues from the context of the face, to predict first impressions.
Before describing the followed approach, we provide a brief literature review on automatic personality trait recognition. In the past, various approaches have been used for recognizing apparent personality traits from different modalities such as audio [4, 5], text [6–8] and visual information [9, 10]. As in other recognition problems, multimodal systems are also investigated to improve robustness of prediction [11–14]. These works aim to estimate personality traits from given input. In psychology, personality is often assessed by running a “Big Five” questionnaire that measures Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism (OCEAN) [15]. Apparent personality is also frequently assessed in these five dimensions.
In their work, Borkenau and Liebler used Brunswik’s lens model and categorized the particular cues that may communicate a certain personality [16]. They included a large number of indicators such as overall impression variables (e.g. estimated age, masculinity, attractiveness), acoustic variables (e.g. softness of voice, pleasantness, clarity), static visual variables (e.g. appearance, make-up, garments, thin lips, hair style, facial expression), and dynamic visual variables (e.g. movement speed, hand movements, walking style). In order to assess the personality trait attributions, they measured “validity,” which indicates the correlation between self-ratings of personality and ratings by strangers or acquaintances. Brunswik’s lens model looks at cues used for perceived traits, and links some of these cues to actual traits by assessing their ecological validity [17]. It is a useful conceptualization, also used in approaches to personality computing [18].
According to the literature, faces are a rich source of cues for apparent personality attribution, related to stereotype judgments. For an automatic analysis system, the first steps of a visual face analysis pipeline are face detection [19, 20] and facial landmark localization [21–23]. Face alignment (or registration) is an important step, as all further processing depends on its accuracy. Recent deep neural network approaches are known to be more resistant to registration errors.
Face alignment is followed by visual feature extraction, which can include image-level appearance descriptors such as Local Binary Patterns (LBP) [24], Histogram of Oriented Gradients (HOG) [25], Scale-invariant Feature Transform (SIFT) [26], video-level descriptors such as Local Gabor Binary Patterns from Three Orthogonal Planes (LGBP-TOP) [27] and Local Phase Quantization (LPQ)-TOP [28], or geometric information [9, 10].
Deep learning based approaches have achieved state-of-the-art results in human behavior analysis. These approaches, when trained with large datasets, can provide representations that are very robust to variations exhibited in the data. Deep learning has been successfully applied to many tasks related to computer vision such as object recognition [29, 30], face recognition [31], emotion recognition [32] and age estimation [33–37]. Moreover, deep representations of images are often usable for multiple tasks, enabling transfer learning from pre-trained models. The disadvantages are the relatively high computational requirements for training such systems, the large amount of training data required, and (relatively) poor temporal extension to video processing.
In recent approaches to personality impressions classification, Support Vector Machines (SVM) [38] have been widely used [5, 8, 12, 14]. Recently, a learning approach called Extreme Learning Machines (ELM), which is similar to SVMs but provides faster learning schemes, has become popular [39]. The use of ELM’s name is debated in the literature, because of its strong resemblance to earlier methods. We continue to use it in this work for convenience. The approach has been shown to provide good performance in a number of applications including face recognition [40, 41], emotion recognition [42, 43], and smile detection [44].
Given the success of deep learning approaches and the speed of ELM, we propose to use a fusion of deep face and scene features, followed by regularized regression with a kernel ELM regressor. The main contribution of this work is the effective combination of emotion related and ambient features that are efficiently extracted from pre-trained/fine-tuned Deep Convolutional Neural Network (DCNN) models. Our method is illustrated in a simplified flowchart in Fig. 1.
The remainder of this paper is organized as follows. In the next section we provide background and details on the methodology. Then in Sect. 3, we present the experimental results, followed by implementation details in Sect. 4. Finally, Sect. 5 concludes with future directions.
2 Methodology
Our proposed approach evaluates a short video clip that contains a single person, and outputs an estimate of apparent personality traits in the five dimensions mentioned earlier. In this section, we describe the three main steps of our pipeline, namely, face alignment, feature extraction, and modeling.
2.1 Face Alignment
For detecting and aligning faces from the videos, we use Xiong and de la Torre’s Supervised Descent Method (SDM), also known as IntraFace [21]. This approach locates 49 landmarks on the face. After the landmarks are located, we estimate the roll angle of the face from the eye corner locations and rotate the image to rectify the face. We then add a margin of 20 % interocular distance around the outer landmarks to compute a loose bounding box from which we crop facial images. After the face is cropped, it is resized to \(64\times 64\) pixels, and registered as a new frame. Frames from a sample input video and the corresponding aligned face images are shown in Fig. 2.
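The geometric part of the alignment step can be sketched as follows. This is a minimal illustration, not the authors' MATLAB implementation: the function names are hypothetical, the eye positions are assumed to be given as single (x, y) points derived from the eye corner landmarks, and the landmark detector itself (SDM/IntraFace) and the image rotation, cropping, and resizing are left out.

```python
import math

def roll_angle_degrees(left_eye, right_eye):
    """Angle of the line through the two eye points, in degrees (image coordinates)."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))

def rotate_points(points, angle_degrees, center):
    """Rotate (x, y) points by -angle around center, so the eye line becomes horizontal."""
    theta = math.radians(-angle_degrees)
    c, s = math.cos(theta), math.sin(theta)
    cx, cy = center
    return [
        (cx + c * (x - cx) - s * (y - cy), cy + s * (x - cx) + c * (y - cy))
        for x, y in points
    ]

def loose_bounding_box(landmarks, left_eye, right_eye, margin_ratio=0.20):
    """Box around the outer landmarks, padded by margin_ratio * interocular distance.

    Returns (x_min, y_min, x_max, y_max). The 20 % default follows the text;
    how the margin is applied per side is an assumption of this sketch.
    """
    interocular = math.hypot(right_eye[0] - left_eye[0], right_eye[1] - left_eye[1])
    margin = margin_ratio * interocular
    xs = [p[0] for p in landmarks]
    ys = [p[1] for p in landmarks]
    return (min(xs) - margin, min(ys) - margin, max(xs) + margin, max(ys) + margin)
```

In such a sketch, the landmarks would first be rotated with `rotate_points` using the angle from `roll_angle_degrees`, and the box would then be computed on the rectified landmarks before cropping and resizing to \(64\times 64\) pixels.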
2.2 Feature Extraction
We extract facial features that are summarized over an entire video segment, and scene features from the first image of each video. The assumption is that videos do not stretch over multiple shots.
Face Features: After aligning the faces, we extract image-level deep features from a network that is trained for facial emotion recognition. The training of this network is explained in more detail in Sect. 2.3. For comparison, we also extract features from the original VGG-Face network that was trained for face recognition [31]. For both networks, we use the response of the 33\(^{rd}\) layer of the 37-layer architecture, which is the lowest-level 4096-dimensional descriptor.
We compare deep features with traditional appearance descriptors and geometric information that have been shown to be effective in emotion recognition [45]. We report the cross validation accuracy of each approach in Sect. 3.2.
Video Features: After extracting frame-level features from each registered face, we summarize the videos by computing functional statistics of each dimension over time. The functionals include mean, standard deviation, offset, slope, and curvature. Offset and slope are calculated from the first order polynomial fit to each feature contour, while curvature is the leading coefficient of the second order polynomial. An empirical comparison of the individual functionals is given in Sect. 3.2.
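The functionals described above can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions: the frame index is used as the time axis, the population standard deviation is used, and the function names `video_functionals` and `stack_functionals` are hypothetical. At least three frames are needed for the second order fit.

```python
import numpy as np

def video_functionals(frames):
    """Summarize a (T, D) array of frame-level features over time.

    Returns a dict of (D,) arrays, one per functional.
    """
    frames = np.asarray(frames, dtype=float)
    t = np.arange(frames.shape[0], dtype=float)  # assumption: time = frame index
    # First order polynomial fit per dimension: coefficients are [slope, offset].
    slope, offset = np.polyfit(t, frames, 1)
    # Second order fit: the leading coefficient is taken as the curvature.
    curvature = np.polyfit(t, frames, 2)[0]
    return {
        "mean": frames.mean(axis=0),
        "std": frames.std(axis=0),
        "offset": offset,
        "slope": slope,
        "curvature": curvature,
    }

def stack_functionals(frames, names=("mean", "std", "offset")):
    """Concatenate the selected functionals into one video-level descriptor."""
    stats = video_functionals(frames)
    return np.concatenate([stats[name] for name in names])
```

With D-dimensional frame features and k selected functionals, `stack_functionals` yields a k·D-dimensional video descriptor.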
Scene Features: In order to use ambient information in the images to our advantage, we extract features using the VGG-19 network [30], which is trained for an object recognition task on the ILSVRC 2012 dataset. Similar to face features, we use the 4096-dimensional feature from the 39\(^{th}\) layer of the 43-layer architecture, hence we obtain a description of the overall image that contains both the face and the scene, which we combine with face features using feature-level fusion.
2.3 CNN Fine Tuning
We start with the VGG-Face network [31], changing the final layer (originally a 2622-dimensional recognition layer), to a 7-dimensional emotion recognition layer, where the weights are initialized randomly. We fine-tune this network with the softmax loss function using around 30,000 training images from the FER-2013 dataset [46]. We choose an initial learning rate of 0.0001, a momentum of 0.9 and a batch size of 64. We train the model for 5 epochs, and we show the validation set performance for each epoch in Fig. 3.
2.4 Regression with Kernel ELM
In order to model personality traits from visual features, we used kernel ELM, due to the learning speed and accuracy of the algorithm. In the following paragraphs, we briefly explain the learning strategy of ELM.
ELM proposes a simple and robust learning algorithm for single-hidden layer feedforward networks. The input layer’s bias and weights are initialized randomly to obtain the output of the second (hidden) layer. The bias and weights of the second layer are calculated by a simple generalized inverse operation of the hidden layer output matrix.
ELM tries to find the mapping between the hidden node output matrix \(\mathbf {H} \in \mathbb {R}^{N \times h}\) and the label vector \(\mathbf {T} \in \mathbb {R}^{N \times 1}\) where N and h denote the number of samples and the hidden neurons, respectively. The set of output weights \(\mathbf {\beta } \in \mathbb {R}^{h \times 1}\) is calculated by the least squares solution of the set of linear equations \(\mathbf {H} \mathbf {\beta }=\mathbf {T}\), as:
\[\mathbf {\beta } = \mathbf {H}^{\dagger } \mathbf {T},\]
where \(\mathbf {H}^{\dagger }\) denotes the Moore-Penrose generalized inverse [47] that minimizes the \(L_2\) norms of \(||\mathbf {H}\mathbf {\beta }-\mathbf {T}||\) and \(||\mathbf {\beta }||\) simultaneously.
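The basic ELM training step, with random hidden-layer parameters and output weights obtained through the Moore-Penrose generalized inverse, can be sketched as follows. The function names, the Gaussian initialization, and the tanh activation are assumptions made for illustration; the text does not specify them.

```python
import numpy as np

def train_basic_elm(X, T, n_hidden, seed=0):
    """Fit a single-hidden-layer ELM: random input weights, least-squares output weights."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))  # random input weights (assumed Gaussian)
    b = rng.standard_normal(n_hidden)                # random hidden biases
    H = np.tanh(X @ W + b)                           # hidden node output matrix, shape (N, h)
    beta = np.linalg.pinv(H) @ T                     # minimum-norm least-squares solution
    return W, b, beta

def predict_basic_elm(X, W, b, beta):
    """Apply the same random hidden layer, then the learned output weights."""
    return np.tanh(X @ W + b) @ beta
```

When the number of hidden neurons exceeds the number of training samples, this solution typically interpolates the training labels, which motivates the regularization discussed next.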
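To increase the robustness and the generalization capability of ELM, a regularization coefficient \(C\) is included in the optimization procedure. Therefore, given a kernel \(\mathbf {K}\), the set of weights is learned as follows:
\[\mathbf {\beta } = \left( \frac{\mathbf {I}}{C} + \mathbf {K}\right) ^{-1} \mathbf {T},\]
where \(\mathbf {I}\) denotes the identity matrix.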
In order to prevent parameter overfitting, we use the linear kernel \(\mathbf {K}(x,y) = x^Ty\), where x and y are the original feature vectors after min-max normalization of each dimension among the training samples. With this approach, the only parameter of our model is the regularization coefficient C, which we optimize with a 5-fold subject independent cross-validation on the training set. In Sect. 3.2, we report the average score of each fold with the selected parameter.
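A linear-kernel ELM regressor with min-max normalization can be sketched as follows, assuming the standard regularized kernel ELM solution \(\mathbf {\beta } = (\mathbf {I}/C + \mathbf {K})^{-1}\mathbf {T}\) [39]. The function names are hypothetical, and the handling of constant feature dimensions is an assumption of this sketch. Normalization statistics are computed on the training samples only and reused for test samples.

```python
import numpy as np

def minmax_fit(X_train):
    """Per-dimension minimum and range, computed on the training samples only."""
    lo = X_train.min(axis=0)
    hi = X_train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # assumption: constant dimensions are left unscaled
    return lo, span

def minmax_apply(X, lo, span):
    """Scale features with the training-set statistics."""
    return (X - lo) / span

def kernel_elm_fit(X_train, T, C):
    """Solve (I / C + K) beta = T with the linear kernel K = X X^T."""
    K = X_train @ X_train.T
    n = K.shape[0]
    return np.linalg.solve(np.eye(n) / C + K, T)

def kernel_elm_predict(X_test, X_train, beta):
    """Predict as k(x, X_train) beta for each test sample x."""
    return (X_test @ X_train.T) @ beta
```

In this formulation the linear system has one row per training sample, so its size depends on the number of training videos rather than on the feature dimensionality. Selecting C would then amount to repeating `kernel_elm_fit` over a grid of values within each cross-validation fold.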
3 Experiments
3.1 Challenge and Corpus
The “ChaLearn LAP Apparent Personality Analysis: First Impressions” challenge consists of 10,000 clips collected from 5,563 YouTube videos, where the poses are more or less frontal, but the resolution, lighting and background conditions are not controlled, hence providing a dataset with in-the-wild conditions. Each clip in the training set is labeled for the Big Five personality traits. Basic statistics of the dataset partitions are provided in Table 1. The detailed information on the challenge and corpus can be found in [2].
Performance Evaluation: The performance score in this challenge is the Mean Absolute Error subtracted from 1, which is formulated as follows:
\[1 - \frac{1}{N}\sum _{i=1}^{N} \left| \hat{y}_i - y_i\right| ,\]
where N is the number of samples, \(\hat{y}\) is the predicted label and y is the true label (\(0\le y \le 1\)). This score is then averaged over five tasks. This means the final score varies between 0 (worst case) and 1 (best case).
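The score can be computed as in the sketch below. The function names are hypothetical, and the per-trait inputs are assumed to be equal-length sequences of values in [0, 1].

```python
def one_minus_mae(y_true, y_pred):
    """1 minus the mean absolute error for a single trait."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    total = sum(abs(p - t) for t, p in zip(y_true, y_pred))
    return 1.0 - total / len(y_true)

def challenge_score(true_by_trait, pred_by_trait):
    """Average of the per-trait scores; both arguments map trait name -> list of values."""
    scores = [one_minus_mae(true_by_trait[k], pred_by_trait[k]) for k in true_by_trait]
    return sum(scores) / len(scores)
```

For example, predictions 0.4 and 0.7 against true labels 0.5 and 0.5 give absolute errors 0.1 and 0.2, hence a score of 1 - 0.15 = 0.85 for that trait.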
3.2 Experimental Results
In this section, we report the regression performance of various visual descriptors. Tables 2 and 3 summarize the performances of the different systems with 5-fold subject-independent cross-validation on the training set.
We first look at the performance of individual functionals, which are described in Sect. 2.2. As can be seen in Table 2, the combination of mean, standard deviation, and offset features works well, and the mean by itself is the most informative functional.
We evaluate a set of features with different dimensionalities individually. Geometric features (GEO), LPQ-TOP, LBP-TOP, and different deep neural network features were individually tested. Table 3 summarizes the results, and gives the dimensionality of each selected feature set. We observe that features from the deep face model fine tuned on the FER emotion corpus provide higher performances compared to both original deep features and hand-crafted visual features. Combining these features with ambient (scene) information further improves the prediction performance.
The best fusion system (ID 9 in Table 3) gives a test set mean accuracy of 0.9094, which ranks fifth in the official competition. Considering the obtained test set performance in comparison to other competitors’ accuracies (see Table 4), we observe that the performances are around 0.90–0.91 in general. The top accuracy is 0.9130, while the top six teams’ accuracies are higher than 0.9.
We show the estimations of our system during cross validation in Figs. 4 and 5. The results in Fig. 4 show how precisely our system can estimate the personality traits under various imaging conditions. Figure 5 shows that examples with labels very close to 0 or 1 tend to have higher error, which might be due to the approximately normal distribution of training labels with mean values around 0.5.
4 Implementation Details
The whole system is implemented in MATLAB R2015b on a 64-bit Windows 10 PC with 32 GB RAM and an Intel i7-6700 CPU. For fine-tuning and feature extraction with CNNs, the MatConvNet library [48] has been used with GPU parallelization, using an NVidia GeForce GTX 970 GPU. Time spent on important parts of the pipeline is summarized in Table 5.
5 Conclusions
In this paper, we proposed to use transfer learning in order to estimate the personality trait perceptions during first impressions. We use deep convolutional neural networks (DCNN) that are originally trained for other tasks such as face, object, and emotion recognition, and we employ their features directly. Hence, we show the feasibility of deep transfer learning for this task.
Combining two sets of DCNN features that carry facial expression and ambient information, we achieve better results compared to each of these approaches, as well as compared to other hand-crafted visual features. In this work, we did not make use of the audio modality, which was shown to be beneficial in earlier works. Audio-based and multimodal analyses constitute our future work. Video modeling is carried out using simple statistical functionals, an approach that is fast and shown to be accurate. In future work, a wider set of functionals will be investigated.
References
Cuddy, A.J., Fiske, S.T., Glick, P.: Warmth and competence as universal dimensions of social perception: the stereotype content model and the bias map. Adv. Exp. Soc. Psychol. 40, 61–149 (2008)
Lopez, V.P., Chen, B., Places, A., Oliu, M., Corneanu, C., Baro, X., Escalante, H.J., Guyon, I., Escalera, S.: ChaLearn LAP 2016: first round challenge on first impressions - dataset and results. In: ChaLearn Looking at People Workshop on Apparent Personality Analysis, ECCV Workshop Proceedings (2016)
Todorov, A., Mandisodza, A.N., Goren, A., Hall, C.C.: Inferences of competence from faces predict election outcomes. Science 308(5728), 1623–1626 (2005)
Valente, F., Kim, S., Motlicek, P.: Annotation and recognition of personality traits in spoken conversations from the AMI meetings corpus. In: INTERSPEECH, pp. 1183–1186 (2012)
Madzlan, N., Han, J., Bonin, F., Campbell, N.: Towards automatic recognition of attitudes: prosodic analysis of video blogs. In: Speech Prosody, Dublin, Ireland, pp. 91–94 (2014)
Alam, F., Stepanov, E.A., Riccardi, G.: Personality traits recognition on social network-Facebook. In: WCPR (ICWSM-2013), Cambridge, MA, USA (2013)
Nowson, S., Gill, A.J.: Look! who’s talking? Projection of extraversion across different social contexts. In: Proceedings of the 2014 ACM Multimedia Workshop on Computational Personality Recognition, pp. 23–26. ACM (2014)
Gievska, S., Koroveshovski, K.: The impact of affective verbal content on predicting personality impressions in Youtube videos. In: Proceedings of the 2014 ACM Multimedia Workshop on Computational Personality Recognition, pp. 19–22. ACM (2014)
Fernando, T., et al.: Persons’ personality traits recognition using machine learning algorithms and image processing techniques. Adv. Comput. Sci.: Int. J. 5(1), 40–44 (2016)
Qin, R., Gao, W., Xu, H., Hu, Z.: Modern physiognomy: an investigation on predicting personality traits and intelligence from the human face. arXiv preprint arXiv:1604.07499 (2016)
Sarkar, C., Bhatia, S., Agarwal, A., Li, J.: Feature analysis for computational personality recognition using Youtube personality data set. In: Proceedings of the 2014 ACM Multimedia Workshop on Computational Personality Recognition, pp. 11–14. ACM (2014)
Alam, F., Riccardi, G.: Predicting personality traits using multimodal information. In: Proceedings of the 2014 ACM Multimedia Workshop on Computational Personality Recognition, pp. 15–18. ACM (2014)
Farnadi, G., Sushmita, S., Sitaraman, G., Ton, N., De Cock, M., Davalos, S.: A multivariate regression approach to personality impression recognition of vloggers. In: Proceedings of the 2014 ACM Multimedia Workshop on Computational Personality Recognition, pp. 1–6. ACM (2014)
Sidorov, M., Ultes, S., Schmitt, A.: Automatic recognition of personality traits: a multimodal approach. In: Proceedings of the 2014 Workshop on Mapping Personality Traits Challenge, pp. 11–15. ACM (2014)
Gosling, S.D., Rentfrow, P.J., Swann, W.B.: A very brief measure of the big-five personality domains. J. Res. Pers. 37(6), 504–528 (2003)
Borkenau, P., Liebler, A.: Trait inferences: sources of validity at zero acquaintance. J. Pers. Soc. Psychol. 62(4), 645 (1992)
Zebrowitz, L.A., Collins, M.A.: Accurate social perception at zero acquaintance: the affordances of a gibsonian approach. Pers. Soc. Psychol. Rev. 1(3), 204–223 (1997)
Vinciarelli, A., Mohammadi, G.: A survey of personality computing. IEEE Trans. Affect. Comput. 5(3), 273–291 (2014)
Viola, P., Jones, M.J.: Robust real-time face detection. Int. J. Comput. Vis. 57(2), 137–154 (2004)
Mathias, M., Benenson, R., Pedersoli, M., Gool, L.: Face detection without bells and whistles. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8692, pp. 720–735. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10593-2_47
Xiong, X., De la Torre, F.: Supervised descent method and its application to face alignment. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 532–539 (2013)
Ren, S., Cao, X., Wei, Y., Sun, J.: Face alignment at 3000 FPS via regressing local binary features. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1685–1692 (2014)
Xiong, X., De la Torre, F.: Global supervised descent method. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2664–2673 (2015)
Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 24(7), 971–987 (2002)
Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 886–893. IEEE (2005)
Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
Almaev, T.R., Valstar, M.F.: Local Gabor binary patterns from three orthogonal planes for automatic facial expression recognition. In: Humaine Association Conference on Affective Computing and Intelligent Interaction, pp. 356–361. IEEE (2013)
Jiang, B., Valstar, M.F., Pantic, M.: Action unit detection using sparse appearance descriptors in space-time video volumes. In: 2011 IEEE International Conference on Automatic Face and Gesture Recognition and Workshops (FG 2011), pp. 314–321. IEEE (2011)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp. 1097–1105 (2012)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
Parkhi, O.M., Vedaldi, A., Zisserman, A.: Deep face recognition. In: British Machine Vision Conference (2015)
Kim, B.K., Lee, H., Roh, J., Lee, S.Y.: Hierarchical committee of deep CNNs with exponentially-weighted decision fusion for static facial expression recognition. In: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pp. 427–434. ACM (2015)
Rothe, R., Timofte, R., Gool, L.: Dex: deep expectation of apparent age from a single image. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 10–15 (2015)
Liu, X., Li, S., Kan, M., Zhang, J., Wu, S., Liu, W., Han, H., Shan, S., Chen, X.: Agenet: deeply learned regressor and classifier for robust apparent age estimation. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 16–24 (2015)
Zhu, Y., Li, Y., Mu, G., Guo, G.: A study on apparent age estimation. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 25–31 (2015)
Escalera, S., Torres, M., Martinez, B., Baro, X., Jair Escalante, H., Guyon, I., Tzimiropoulos, G., Corneou, C., Oliu, M., Bagheri, M.A., Valstar, M.: Chalearn looking at people and faces of the world: Face analysis workshop and challenge 2016. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 1–8, June 2016
Gürpınar, F., Kaya, H., Dibeklioğlu, H., Salah, A.A.: Kernel ELM and CNN based facial age estimation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Las Vegas, Nevada, USA, pp. 80–86, June 2016
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
Huang, G.B., Zhou, H., Ding, X., Zhang, R.: Extreme learning machine for regression and multiclass classification. IEEE Trans. Syst. Man Cybern. Part B: Cybern. 42(2), 513–529 (2012)
Zong, W., Huang, G.B.: Face recognition based on extreme learning machine. Neurocomputing 74(16), 2541–2551 (2011)
Mohammed, A.A., Minhas, R., Wu, Q.J., Sid-Ahmed, M.A.: Human face recognition based on multidimensional PCA and extreme learning machine. Pattern Recogn. 44(10), 2588–2597 (2011)
Utama, P., Ajie, H., et al.: A framework of human emotion recognition using extreme learning machine. In: 2014 International Conference of Advanced Informatics: Concept, Theory and Application (ICAICTA), pp. 315–320. IEEE (2014)
Kaya, H., Karpov, A.A., Salah, A.A.: Robust acoustic emotion recognition based on cascaded normalization and extreme learning machines. In: Cheng, L., Liu, Q., Ronzhin, A. (eds.) ISNN 2016. LNCS, vol. 9719, pp. 115–123. Springer, Heidelberg (2016). doi:10.1007/978-3-319-40663-3_14
An, L., Yang, S., Bhanu, B.: Efficient smile detection by extreme learning machine. Neurocomputing 149, 354–363 (2015)
Kaya, H., Gürpınar, F., Afshar, S., Salah, A.A.: Contrasting and combining least squares based learners for emotion recognition in the wild. In: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pp. 459–466. ACM (2015)
Goodfellow, I.J., et al.: Challenges in representation learning: a report on three machine learning contests. In: Lee, M., Hirose, A., Hou, Z.-G., Kil, R.M. (eds.) ICONIP 2013. LNCS, vol. 8228, pp. 117–124. Springer, Heidelberg (2013). doi:10.1007/978-3-642-42051-1_16
Rao, C.R., Mitra, S.K.: Generalized Inverse of Matrices and Its Applications, vol. 7. Wiley, New York (1971)
Vedaldi, A., Lenc, K.: MatConvNet - convolutional neural networks for MATLAB. (2015)
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Gürpınar, F., Kaya, H., Salah, A.A. (2016). Combining Deep Facial and Ambient Features for First Impression Estimation. In: Hua, G., Jégou, H. (eds) Computer Vision – ECCV 2016 Workshops. ECCV 2016. Lecture Notes in Computer Science(), vol 9915. Springer, Cham. https://doi.org/10.1007/978-3-319-49409-8_30
DOI: https://doi.org/10.1007/978-3-319-49409-8_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49408-1
Online ISBN: 978-3-319-49409-8
eBook Packages: Computer Science (R0)