1 Introduction

Humans use different modalities such as speech, voice, facial expression and gesture to communicate emotions and intentions. Among these modalities, facial expression is a strong cue to the inner emotion or attitude of a person, irrespective of ethnicity or mother tongue. The authors of [28] found that inconsistent facial expressions combined with statement analysis annotations could correctly classify 90 % of the cases in videos as to whether the participants lied or told the truth. Therefore, expression recognition can play an important role in deception detection in criminology. Facial expression analysis also has applications in the medical field (e.g., patient monitoring systems [39] and mental state analysis of psychiatric patients [16]). In addition, there is a recent trend of analyzing facial expressions in marketing and advertisement [29]. In market analysis, the emotion displayed through facial expressions is used as a cue to how much a customer likes a product. Facial expression recognition also has important applications in human-computer interaction, where an intelligent computer system can recognize the emotional facial expression of the user and act accordingly [17]. For these and many similar reasons, facial expression recognition has become an important research area in the machine vision and pattern recognition fields.

A number of factors associated with facial images/videos make facial expression recognition a challenging task. These factors include complex non-rigid movement of facial components, different skin-tones and illumination conditions, and different scales and positions of faces in images/videos. Moreover, different emotional expressions such as anger and disgust or anger and sadness may have a similar appearance [3, 55]. These aspects make facial expression recognition complicated. We propose a unique combination of optical flow (OF) and image gradient (IG) as the feature for expression recognition. This feature is robust to changes in skin-tone. Using this combination of OF and IG in the form of a histogram in the bag-of-words setting makes the proposed feature independent of the scale of the face in the video. Since our method tracks the movement of pixels, it is also capable of capturing the complex non-rigid movement of facial components. Thus, the proposed method overcomes the above mentioned hindrances (such as non-rigid movement of facial components and different skin-tones and scales) associated with facial expression recognition and gives better results than state-of-the-art works.

Ekman and Friesen [14] systematically studied the universality of six basic emotional expressions: anger, disgust, fear, happiness, sadness and surprise. In recent times, a number of works on facial expression recognition have appeared in the machine vision and pattern recognition literature. Different features such as Gabor, active appearance models, local binary patterns (LBP), the local phase quantizer (LPQ), OF and IG [13, 8, 12, 35, 38, 54] have been tested and significant progress has been made in this area. Still, given the problematic factors discussed above, there is considerable scope for improvement in this field. Mainly two types of features are used in the literature for the recognition of facial expressions: appearance based and geometry based. Appearance based features consider changes in texture, which can be due to the appearance or disappearance of wrinkles, bulges and furrows, or to changes in the shapes of facial components such as the eyes and mouth. The authors of [1] and [38] used active appearance and LBP features respectively as appearance based features. Discriminative learning of Gaussian mixture models was explored in [41] for multi-view facial expression recognition. Geometry based features [19, 20, 42, 44] consider the distances between specific pairs of points on the face, the relative velocities of those points, the shapes of the facial components, etc. Over time, researchers have argued for or against the superiority of appearance based features over geometry based features. While [54] showed that appearance based features are more powerful than geometry based features, [44] and [20] showed that geometry based features are similar or better at expression recognition. In this paper, we extract features from the appearance change of the face due to expressions and avoid the use of geometry based facial features.

Several studies [11, 21, 37] showed that dynamic expression descriptors extracted from videos give better classification accuracy than static descriptors extracted from images. By exploiting the temporal information in videos, [37] and [11] could increase the overall recognition accuracy by 8.3 % and 3 % respectively compared to their static counterparts. The dynamic descriptor based on local binary patterns on three orthogonal planes (LBP-TOP) [55] outperformed the features available until then (e.g., Gabor [54]). The authors of [21], in a similar fashion to LBP-TOP, devised LPQ-TOP (local phase quantizer on three orthogonal planes) and claimed that it gives better accuracy than LBP-TOP for action unit recognition. An action unit (AU) is the minimum observable expression produced by one or more facial muscles [42]. Since the temporal information in the video helps achieve good expression recognition accuracy, we extract a dynamic descriptor containing temporal information from videos. We show that the proposed combination of OF and IG outperforms both LBP-TOP and LPQ-TOP in terms of accuracy as well as time and space complexity.

While expressing emotions, facial appearance goes through different expressive states. Image based expression recognition methods consider only one state of the entire emotion signal while disregarding all the other states [11]. Dynamic transitions between different states often give vital insight into the emotional content of the expression [37]. Therefore, analyzing the entire sequence of images is necessary, but most of the works [19, 38] in the literature that recognize emotion from a single image report recognition accuracy only on images displaying the apex of the expression. Image based expression recognition techniques can only use spatial information regarding the shape and the appearance of the face, while video based expression recognition techniques can utilize spatial as well as temporal information. Image based models [19] use the locations of landmark points or their relative distances in the spatial domain as geometric features, while video based models can also use the movement patterns (in the temporal domain) of the landmark points [40]. While LBP [38] or LPQ [12] are used for capturing 2D micro-patterns in the image, LBP-TOP [55] or LPQ-TOP [21] are used for capturing 3D micro-patterns from the video volume. In manifold based expression recognition approaches, an image is represented as a point in a low-dimensional manifold [50] while a video is represented as a path connecting multiple points [9].

While analyzing video may help in better understanding the emotion behind the expression, constructing a video based expression recognition model has several issues. For recognition models that extract features from every image in the video, the dimension of the features becomes very high, resulting in higher time and space complexity. Moreover, there may be in-plane and out-of-plane movement of the head in the video. In the current paper we propose a novel technique to efficiently recognize emotions from videos while addressing the above mentioned issues to some extent. Now we briefly review some video based state-of-the-art emotional expression recognition techniques.

Recently, researchers have been trying to capture the expression dynamics in a low dimensional manifold embedded in the original data-space. A number of techniques have been used to project the sequence of images into the manifold. Shan et al. [37] proposed to use supervised locality preserving projection to derive a generalized low dimensional expression manifold for representing emotional facial expressions in image sequences. Then they formulated a Bayesian temporal model of the manifold for expression recognition. They used LBP as the feature. Buenaposada et al. [9] classified six emotional facial expressions in a low dimensional manifold in the space of facial deformations. They used a nearest neighbor technique to find the probability that an image comes from a given facial expression, and then recursively applied a Bayesian procedure to sequentially combine these probabilities into a posterior probability. Rudovic et al. [34] also modeled the topology of continuous expression data in a low dimensional manifold. The topology was then incorporated into the Laplacian shared-parameter hidden conditional ordinal random field (LSH-CORF) framework for dynamic ordinal regression by constraining H-CORF parameters to lie on the ordinal manifold. They tested both locality preserving projection and supervised ordinal locality preserving projection for building the low dimensional manifold.

In addition to low dimensional manifolds, other approaches have been taken for facial expression recognition from videos. Bejani et al. [7] encoded the motion history (the where and when) of the expressive face video in an integrated time motion image (ITMI) and a quantized image matrix (QIM). An interesting approach towards the representation of facial expressions in videos can be seen in the work of Wang et al. [49], who treated a facial expression as a complex activity consisting of temporally overlapping or sequential primitive events. They proposed an interval temporal Bayesian network to capture the complex spatio-temporal relations among the primitive facial events. They constructed one model for each emotional expression, in the form of a graph where each node represented an event corresponding to a facial feature point and each edge represented the temporal dependence between events. Li et al. [25] proposed a unified framework based on a dynamic Bayesian network that simultaneously performs facial feature-point tracking and recognizes AUs and facial emotional expressions. In this framework, not only does facial feature-point tracking help expression recognition, but expression recognition also gives feedback to feature-point tracking. Chew et al. [11] proposed a sparse representation based classifier for the recognition of seven basic expressions from the CK+ database. They represented the test video as a linear combination of the training video sequences, arranged as a dictionary of videos. The combination coefficients were derived by solving an l1-norm minimization problem. To facilitate the minimization, they projected the vectors representing the videos to a low dimensional space. We compare our facial expression recognition model with the state-of-the-art methods in Section 5.1 and show that the proposed model works efficiently compared to the competing methods.

We use a unique combination of OF and IG as the appearance based feature. In the literature, OF and IG have been used separately for expression recognition. The authors of [35] tracked a number of points on the face using the Lucas-Kanade OF algorithm. They added the inter-frame displacement vectors to obtain the global displacement vector. For selecting the initial points they took two approaches. In the first approach they placed a rectangular grid on the face and tracked the vertices of the grid; they called it dense flow tracking (DFT). In the second approach they tracked 15 specific points on the face; they called it feature point tracking (FPT). The global displacement vectors were used as features to recognize expressions using a support vector machine (SVM). The authors of [12] used the pyramid histogram of oriented gradients (PHOG) and the local phase quantizer (LPQ) as features for recognizing expressions from facial videos. The facial motion was detected using OF in [51]. Boughara et al. [8] extracted 8 gradient maps, each representing the norm of the gradient of the image in a particular direction. In our approach, we combine the advantages of OF and IG and exploit the local motion information contained in the video for recognizing facial emotional expressions. We exploit the OF and IG based feature to represent the emotional expression as a specific combination of local motion patterns. A bag-of-words based approach is utilized for this purpose. Other works in the literature have also utilized bag-of-words based approaches for facial expression recognition. Chew et al. [11] represented the test video as a linear combination of the training video sequences arranged as a dictionary of videos; the combination coefficients were derived by solving an l1-norm minimization problem. A bag-of-distance model was used to represent expressions in [19]. Local distances between the landmark points were calculated following Delaunay triangulation; some distances between two adjacent regions, which they call holistic distances, were also calculated. The main contribution of our work in the bag-of-words setting is to propose an adaptive learning technique for the key-words. In other bag-of-words based approaches [11, 19] the key-words remain rigid, i.e., once selected, they cannot adapt to the incoming video sequences. Our proposed adaptive learning technique enables the key-words to adapt to the incoming video sequences and thus better represent unseen videos. We compare the proposed approach with those of [35], [11] and [19] and show that the combined effect of OF and IG along with the adaptive learning process increases the expression recognition accuracy, as demonstrated in the results section. The two main contributions of our approach are listed below.

  • We propose a unique combination of optical flow and image gradient as feature (motion-descriptor) to represent the local motion information of some objectively selected facial regions.

  • We propose a novel adaptive learning technique for adapting the key-words (key-motion-descriptors) to the training ambiance in the bag-of-words setting.

A brief overview of our method is given next.

2 Overview

Specific motion patterns generated by the movement of facial muscles represent emotional expressions [13]. Therefore, to identify facial expressions, we need to capture the movement patterns of facial components. For this, we propose a novel motion descriptor (MD) which is inspired by the pose descriptor presented in [30]. For expression recognition we adopt a bag-of-words based approach. We find the OF between two consecutive frames of a video displaying a facial expression and weight it by the magnitude of the IG. Then we find the distribution of this IG weighted OF along different angular directions for different regions of the face; this distribution constitutes the MD. Different values of these MDs represent different local motion patterns of various facial regions in the temporal dimension. The entire video, displaying a particular expression, can be thought of as consisting of a specific combination of these MDs. Therefore, these MDs can be thought of as words and the videos as documents in the bag-of-words setting. As similar emotional expressions may follow similar motion patterns, there is redundancy in the current bag of words. So, we use a kd-tree based data condensation technique [31]. After data condensation, we obtain a set of unique MDs from each expression class; we call them key-MDs. The set of all the key-MDs from all the training videos displaying all the basic expressions constitutes the initial wordbook. The entire process of finding the initial wordbook from the training video sequences displaying the expression happiness is schematically depicted in Fig. 1.

Fig. 1
figure 1

Schematic diagram depicting the process of wordbook preparation. For ease of visualization, a frame transition in video is represented by the last frame of that transition. The blue wide arrows relate the frame transitions to the corresponding MDs. The narrow straight arrows represent the branches of the kd-tree. The root node (of the kd-tree) consisting of a number of MDs is divided into two child nodes. In this depiction, the right child is not further divided as the variance of the MDs in that node is less than a predefined variance. The left child is further divided. The red wide arrows show which MDs from each leaf node of the kd-tree are chosen as key-MDs to form the wordbook

For recognition, MDs from an unseen video are mapped to key-MDs. For better mapping, we introduce adaptive learning that iteratively modifies the key-MDs such that a key-MD represents a wider range of MDs. This is similar to stemming, where affixes of words are removed. The pool of modified key-MDs (stemmed words) over all the expressions constitutes the final dictionary. We construct a histogram representing the frequency of each of these unique, stemmed key-MDs in each video sequence. We call these histograms expression descriptors (EDs) and use them as features for classifying each of the test video sequences into one of the six basic expressions. Using different classifiers such as the support vector machine (SVM) and neural networks, we compare the proposed method with state-of-the-art works and obtain better results with improved time and space complexity.

The rest of the paper is organized as follows: Section 3 gives a detailed description of the formation process of the MDs followed by the formation of wordbooks for each expression. Section 4 describes how EDs are extracted for each sequence. This section also discusses how the key-MDs in the master wordbook are adapted (i.e., stemmed) to their environment using the proposed adaptive learning technique to form the final dictionary. Section 5 presents the experimental results. The details of the data sets used in the experiments and a comparison of the proposed method with state-of-the-art approaches are also reported in Section 5. The paper concludes in Section 6.

3 Formation of MD and wordbook

Since motion patterns of specific facial regions are representatives of facial expressions, we wish to capture these motion patterns. The facial regions that participate in producing facial expressions include the eyes, eyebrows, nose, lips, nasolabial furrow, wrinkles, etc. The edges of these regions are expected to have high gradient values in the image. We find the optical flow (OF) between every two consecutive frames. At each pixel location, the OF between the two consecutive frames is weighted by the magnitude of the image gradient (IG) of the previous frame. In this way, we get motion information mainly from the facial regions we are interested in. Using only optical flow information could introduce unnecessary, even detrimental features, since OF approximates the movement of pixels more reliably at locations with higher gradient magnitude. Next, we measure the distribution of this weighted optical flow over different angular directions to get the motion descriptor (MD).

3.1 Preparation of MD

Let \(\overrightarrow {O}\) represent the OF field corresponding to the frame transition from frame F k to F k+1 and let \(\overrightarrow {G}\) represent the IG field of the frame F k . We weight \(\overrightarrow {O}\) by the magnitude of \(\overrightarrow {G}\) at the corresponding pixel location to get \(\overrightarrow {V}\). Let \(\overrightarrow {\textbf {v}}(x,y)\), \(\overrightarrow {\textbf {g}}(x,y)\) and \(\overrightarrow {\textbf {o}}(x,y)\) represent the elements (vectors) of \(\overrightarrow {V}\), \(\overrightarrow {G}\) and \(\overrightarrow {O}\) respectively at the (x,y)th pixel location. Therefore, \(\overrightarrow {\textbf {v}}(x,y)=\parallel \overrightarrow {\textbf {g}}(x,y)\parallel *\overrightarrow {\textbf {o}}(x,y)\) where ∥z∥ represents the magnitude (Euclidean norm) of vector z and ∗ represents multiplication. The advantage of calculating \(\overrightarrow {V}\) in this way is explained in Fig. 2. Figure 2a shows the optical flow field in the window encompassing the lips and the nasolabial furrow regions for the expression disgust. Notice the presence of OF on the upper lip, the nasolabial furrow and the cheek regions. Movement of these regions produces the expression disgust. Figure 2b shows the image gradient (IG) within the same window. Notice the IG fields on the lip edges, the nasolabial furrow and around the nostrils. In Fig. 2c, the OF field is multiplied by the magnitude of the IG. Therefore, only the regions (upper lip edge and nasolabial furrow) with significant magnitudes of IG and OF are captured. The OF and IG fields of the other regions of the face are diminished due to the low magnitudes of the gradient and optical flow respectively. Thus, gradient magnitude weighted optical flow captures the motion of only the facial regions where approximation of movement using OF is better and the movements are relevant to facial expressions.
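To make the weighting step concrete, the following is a minimal Python sketch of computing \(\overrightarrow {V}\) for one frame transition. The choice of OpenCV's Farnebäck dense optical flow and of numpy central differences for the gradient are our own assumptions for illustration; the paper does not prescribe specific OF or gradient implementations here.

```python
# Sketch: IG-weighted optical flow V = ||G|| * O for one frame transition.
# Assumptions: Farneback dense OF (OpenCV) and numpy central differences for G.
import cv2
import numpy as np

def weighted_flow(frame_k, frame_k1):
    """frame_k, frame_k1: consecutive 8-bit grayscale frames; returns V of shape (rows, cols, 2)."""
    # Dense optical flow O between F_k and F_{k+1} (Farneback parameters are illustrative).
    flow = cv2.calcOpticalFlowFarneback(frame_k, frame_k1, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Image gradient G of the previous frame F_k and its magnitude ||G||.
    gy, gx = np.gradient(frame_k.astype(np.float64))
    grad_mag = np.sqrt(gx ** 2 + gy ** 2)
    # Weight each flow vector by the gradient magnitude at the same pixel.
    return flow * grad_mag[..., np.newaxis]
```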

Fig. 2
figure 2

a Optical flow field, b gradient field and c optical flow field multiplied by the magnitude of the gradient imposed on the image window displaying lips and nasolabial furrow region displaying expression disgust

Next, to get the motion pattern, we find the distribution of the weighted OF field in different angular directions quantized into L bins of a histogram. Let the facial image be represented by I. At any pixel location (x,y) of I, \(\overrightarrow {V}\) has two components \(v_{x}\) and \(v_{y}\). Let \(\theta (x,y) = \tan ^{-1}\frac {v_{y}}{v_{x}}\), \(m(x,y) = \parallel \overrightarrow {\textbf {v}}(x,y)\parallel \) and H={h(1),h(2),...,h(L)}. We find the histogram H using the following equation.

$$ h(i) =\sum\limits_{x,y}\left\{\begin{array}{l} m(x,y), ~\text{where,} i=\lfloor(\theta(x,y)*L)/360\rfloor+1,\\ 0, ~\text{otherwise.} \end{array}\right. $$
(1)

We divide the entire frame into l equal blocks to capture local information. For each of the l blocks, the histogram of weighted optical flow is constructed as described in (1). So, we get a total of L×l bins for each frame transition. Subtle movement in some regions (e.g., around the eyes) may be as important as large movement in some other regions (e.g., around the lips). Therefore, the histogram corresponding to each block of a given frame transition is normalized separately. We concatenate the histogram bins corresponding to all the blocks in a given frame transition to prepare the motion descriptor (MD) of that frame transition.
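As an illustration of how a single MD is assembled from \(\overrightarrow {V}\), the sketch below assumes L=8 orientation bins and an 8×8 grid of blocks (the values used later in the paper) and reuses the weighted_flow helper from the previous sketch; the small constant added before normalization is only to avoid division by zero.

```python
# Sketch: block-wise orientation histograms of the weighted flow V (equation (1)),
# concatenated into one motion descriptor (MD) per frame transition.
import numpy as np

def motion_descriptor(V, L=8, blocks_per_side=8):
    """V: (rows, cols, 2) IG-weighted flow field; returns an MD of length L * blocks_per_side**2."""
    rows, cols, _ = V.shape
    vx, vy = V[..., 0], V[..., 1]
    theta = np.degrees(np.arctan2(vy, vx)) % 360.0              # orientation in [0, 360)
    mag = np.sqrt(vx ** 2 + vy ** 2)                            # m(x, y)
    bins = np.minimum((theta * L / 360.0).astype(int), L - 1)   # 0-based bin index per pixel
    bh, bw = rows // blocks_per_side, cols // blocks_per_side
    md = []
    for i in range(blocks_per_side):
        for j in range(blocks_per_side):
            blk = np.s_[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            h = np.bincount(bins[blk].ravel(), weights=mag[blk].ravel(), minlength=L)
            md.append(h / (h.sum() + 1e-12))                    # per-block normalization
    return np.concatenate(md)
```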

Further, we notice that the MD may not be able to distinguish between two different expressions if the size of each block is not chosen correctly. One example is shown in Fig. 3. Figure 3a and d show two completely different expressions, surprise and anger. Figure 3b and e show the \(\overrightarrow {V}\) field of the block encompassing the lip region of the two frames in Fig. 3a and d respectively. The directions of \(\overrightarrow {V}\) at corresponding pixel locations in Fig. 3b and e are different. The \(\overrightarrow {V}\) with a direction of 55° on the right part of the upper lip is responsible for opening up the lips, whereas \(\overrightarrow {V}\) with a direction of 55° on the right part of the lower lip is responsible for closing the lips. Similarly, \(\overrightarrow {V}\) with a direction of 115° on the left part of the upper lip is responsible for opening up the lips, whereas \(\overrightarrow {V}\) with a direction of 115° on the left part of the lower lip is responsible for closing the lips. Therefore, both the blocks in Fig. 3b and e, which encompass both the upper and lower lips, contribute to similar histograms as shown in Fig. 3c and f respectively. To properly represent the two different expressions (here surprise and anger), the two histograms should have been significantly different. To overcome this problem, we have to choose a proper block size, that is, the number of blocks per frame should be such that no single facial component such as the eyes or lips falls entirely within one block. We have experimented with different numbers of blocks and the above mentioned condition is satisfied for 64 equal sized blocks, taking into account the inter-person variation of facial structure. The generation of the MD considering 8×8=64 facial blocks is depicted in Fig. 4.

Fig. 3
figure 3

a Expression surprise, b formation of the histogram for the 8th block (in row major order) of a, c histogram of b. d Expression anger, e formation of the histogram for 8th block (in row major order) of d, f histogram of e

Fig. 4
figure 4

a: Partitioning a frame into 64 blocks, b: corresponding histograms of weighted optical flow for each of the 64 blocks, c: corresponding MD

As we find a separate histogram for each local region (block), the resultant MD captures the motion information of spatially local regions of the face. Also, since the MD captures the motion pattern between two consecutive frames, it represents temporally local motion information of the video. A video displaying a particular expression can be thought of as consisting of a specific combination of these MDs. Therefore the MDs can be compared to words and the videos to documents in the bag-of-words model. So after extracting the words (MDs) from the documents (videos), our next objective is to find the unique key-words (key-MDs) and make a wordbook. This step is described next.

3.2 Preparation of wordbook

It is expected that similar expressions will have similar patterns of motion of facial muscles. As the MD captures these motion patterns, MDs constructed from similar expressions are expected to be similar. Our aim now is to pick only the unique MDs which are representatives of similar types of local motion patterns. So we use the maxdiff kd-tree based data condensation technique [31]. Each of the MDs is a data point in (L×l) dimensional space. Following the maxdiff kd-tree technique, we place all the MDs extracted from all the videos displaying one type of basic expression at the root node of the maxdiff kd-tree. The dimension for which the gap between two consecutive (sorted) data values is maximum is chosen as the pivot dimension. The midpoint of those two consecutive values along the pivot dimension is chosen as the pivot value. All the data points for which the value of the pivot dimension is less than the pivot value form the left child, and the rest are placed in the right child. We continue this process for each of the nodes of the kd-tree until the variance of the MDs in the leaf node becomes less than a predefined value ε. Each leaf node of the kd-tree consists of a group of similar MDs. We need to find the unique representatives of these groups. So, as the representative of a group, we choose the MD that is nearest to the maximum number of MDs in that group. If there is a tie, we keep all the tied MDs. Next, we mathematically present the process of selection of this representative MD.

Let a leaf node have ρ MDs and let each MD be represented by P. We define the importance count β(i), i=1,...,ρ (the importance of the ith MD in representing the leaf node), as follows:

$$\begin{array}{@{}rcl@{}} \beta(i)=\sum\limits_{j=1}^{\rho}\left\{\begin{array}{l} 1\text{, if }i=\arg\min_{k} d(\textit{P}_{j},\textit{P}_{k}),\\ \qquad\quad~~\forall k \neq j, k=1,...,\rho\text{,} \\ 0\text{, otherwise.} \end{array}\right. \end{array} $$
(2)

Any distance metric that can represent the distance between two frequency distributions can be used as d(.,.) in (2). The Euclidean distance does not take into account the correlations between the dimensions of the data, and the Bhattacharyya distance does not obey the triangle inequality and hence is not a metric. We choose the Hellinger distance. The Hellinger distance H(P,Q) between two frequency distributions P and Q over the domain Γ is defined as follows.

$$ H(P,Q)=\sqrt{1-\text{BC}(P,Q)}, \text{where,} \text{BC}(P,Q)=\sum\limits_{z\in {\Gamma}}\sqrt{P(z)Q(z)}. $$
(3)

The jth MD is chosen as the representative of the group, where \(j=\arg \max _{i}\beta (i)\), i=1,...,ρ. We call these representative MDs key-MDs. We construct one kd-tree per basic expression. Let us denote a basic expression by \(E_{i}\). Since we deal with six basic expressions in this paper, i=1,2,...,6. We define \(C_{i}\) as the wordbook containing all the MDs that are chosen as representatives of the leaf nodes of the maxdiff kd-tree corresponding to expression type \(E_{i}\). Thus, we have \(C_{1},C_{2},...,C_{6}\), a total of six wordbooks corresponding to the six basic expressions. The process of wordbook preparation, starting from deriving MDs from each frame transition, is illustrated in Fig. 1. We concatenate the key-MDs from the wordbooks \(C_{i}\), i=1,...,6 into one master wordbook \(C^{\prime }\) with \(N^{\prime }\) key-MDs. Therefore, \(N^{\prime }={\sum }_{i=1}^{6} {| C_{i} |}\) where |.| denotes the cardinality of a set. In the next section, we derive the expression descriptor that represents the distribution of key-MDs in a video.
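Before moving on, the condensation and representative-selection steps can be sketched as follows. The Hellinger distance and the importance count implement (3) and (2) directly; measuring the spread of a leaf node as the sum of per-dimension variances, stopping at single-point nodes, and breaking ties by the first index (the paper keeps all tied MDs) are our own simplifying assumptions.

```python
# Sketch: maxdiff kd-tree condensation of MDs and key-MD selection (equations (2)-(3)).
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two frequency distributions (equation (3))."""
    return np.sqrt(max(0.0, 1.0 - np.sum(np.sqrt(p * q))))

def representative(mds):
    """MD that is the nearest neighbour of the largest number of MDs in the group (equation (2))."""
    n = len(mds)
    if n == 1:
        return mds[0]
    beta = np.zeros(n, dtype=int)
    for j in range(n):
        dists = [hellinger(mds[j], mds[k]) if k != j else np.inf for k in range(n)]
        beta[int(np.argmin(dists))] += 1                 # vote for the MD nearest to MD j
    return mds[int(np.argmax(beta))]                     # ties broken by the first index here

def condense(mds, eps):
    """Split on the dimension with the largest gap between consecutive sorted values
    until the node variance falls below eps; return one key-MD per leaf."""
    mds = np.asarray(mds, dtype=float)
    if len(mds) <= 1 or mds.var(axis=0).sum() < eps:
        return [representative(list(mds))]
    srt = np.sort(mds, axis=0)
    gaps = np.diff(srt, axis=0)
    row, dim = np.unravel_index(np.argmax(gaps), gaps.shape)   # pivot dimension
    pivot = (srt[row, dim] + srt[row + 1, dim]) / 2.0          # pivot value (midpoint)
    left, right = mds[mds[:, dim] < pivot], mds[mds[:, dim] >= pivot]
    if len(left) == 0 or len(right) == 0:                      # degenerate split: stop
        return [representative(list(mds))]
    return condense(left, eps) + condense(right, eps)
```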

4 Deriving expression descriptors

In this section we derive the expression descriptor (ED) that gives the frequency distribution of the key-MDs in the video sequence. As the number of key-MDs in the master wordbook is \(N^{\prime }\), the ED has \(N^{\prime }\) bins. The straightforward (discrete) calculation of this expression descriptor (\(ED_{D}\)) is given next. Let (f−1) be the number of MDs in a given sequence (a sequence with f frames has f−1 frame transitions, each yielding one MD). Let the kth MD in a video sequence be represented by \(Q_{k}\) and the jth key-MD in \(C^{\prime }\) be represented by \(P_{j}\). The ith element of \(ED_{D}\) is calculated as follows, where \(i=1,2, ..., N^{\prime }\).

$$ \textit{ED}_{D}(i)=\frac{1}{f-1}\sum\limits_{k=1}^{f-1}\left\{\begin{array}{l} 1 ~\text{, if } i=\arg\min_{j}d(Q_{k},P_{j}),\\ \qquad\qquad\qquad\forall j=1,...,N^{\prime}\text{,}\\ 0 ~\text{, otherwise,} \end{array}\right. $$
(4)

where \(d(Q_{k},P_{j})\) denotes a suitable distance metric (the Hellinger distance in our implementation). The procedure for expression descriptor preparation after the generation of the master wordbook \(C^{\prime }\) is illustrated in Fig. 5. Discretely matching an MD to the nearest key-MD as done in (4) has one drawback. Some of the MDs in a given video sequence may not correspond to any of the key-MDs in \(C^{\prime }\). Such ambiguous MDs add significant confusion to the ED calculated using (4). To overcome this problem we can use either the plausibility model (\(ED_{P}\)) [46] or the kernel model (\(ED_{K}\)) [46] to generate the ED. We calculate the ith element of these two models of EDs using the following equations, where \(i=1,2, ...,N^{\prime }\).

$$\begin{array}{@{}rcl@{}} \textit{ED}_{P}(i)=\frac{1}{f-1}\sum\limits_{k=1}^{f-1}\left\{\begin{array}{l} K_{\sigma}(d(Q_{k},P_{j})) \text{, where }\\ i=\arg\min_{j} d(Q_{k},P_{j}),\\ \qquad\qquad\forall j=1,...,N^{\prime}\\ 0\text{, otherwise,} \end{array}\right. \end{array} $$
(5)
$$\begin{array}{@{}rcl@{}} \textit{ED}_{K}(i)=\frac{1}{f-1}\sum\limits_{k=1}^{f-1} K_{\sigma}(d(Q_{k},P_{i})). \end{array} $$
(6)
Fig. 5
figure 5

The schematic diagram of expression descriptor preparation. For ease of visualization, in the master wordbook, each key-MD is indicated by the last frame of the frame transition that the key-MD represents. The arrows indicate which frame transition (of the test video sequence) is mapped to which key-MD.

In the above two equations, \(K_{\sigma }(z)=\exp (-z^{2}/\sigma ^{2})\), where σ is a suitably selected standard deviation value. In these models, an MD that is closer (more similar) to a key-MD contributes more to the ED, while an MD that is dissimilar to all the key-MDs influences the ED less.
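For concreteness, the three ED models of (4)-(6) can be computed as in the sketch below, reusing the hellinger helper from the earlier sketch; the value of σ is a tunable assumption.

```python
# Sketch: the discrete, plausibility and kernel ED models (equations (4)-(6)).
import numpy as np

def expression_descriptors(mds, key_mds, sigma):
    """mds: (f-1, L*l) MDs of one video; key_mds: (N', L*l) master wordbook."""
    mds, key_mds = np.asarray(mds, dtype=float), np.asarray(key_mds, dtype=float)
    n_mds, n_keys = len(mds), len(key_mds)
    D = np.array([[hellinger(q, p) for p in key_mds] for q in mds])   # pairwise d(Q_k, P_j)
    K = np.exp(-(D ** 2) / sigma ** 2)                                # K_sigma(d(Q_k, P_j))
    nearest = D.argmin(axis=1)                                        # nearest key-MD per MD
    ed_d = np.bincount(nearest, minlength=n_keys) / n_mds                          # eq. (4)
    ed_p = np.bincount(nearest, weights=K[np.arange(n_mds), nearest],
                       minlength=n_keys) / n_mds                                   # eq. (5)
    ed_k = K.sum(axis=0) / n_mds                                                   # eq. (6)
    return ed_d, ed_p, ed_k
```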

Further, we analyze the pattern of the data in the ED feature space. An example of \(ED_{K}\) for 264 video sequences, displayed one after another, is shown in Fig. 6a. The isolated-dot-like structure in Fig. 6a indicates the following. Some MDs of certain video sequences found an almost exact match to some of the key-MDs using (6), resulting in high values (represented by bright magenta in Fig. 6a) for those key-MDs in the corresponding \(ED_{K}\). The rest of the MDs could not find any match with the key-MDs, resulting in the isolated-dot structure of the ED space in Fig. 6a. As many of the MDs of the video sequences could not find a good match among the key-MDs, we can say that the key-MDs have not generalized well.

Fig. 6
figure 6

Heat-maps of the ED K feature space a before and b after 30 iterations of adaptation of the key-MDs following Algorithm 1 for the data from the CK + database. The numbers at the bottom of the heat-map represent the indices of the data samples. Each data sample represents one video sequence. Data samples with indices 1 to 40, 41 to 95, 96 to 118, 119 to 175, 176 to 200 and 201 to 264 have the expression labels anger, disgust, fear, happiness, sadness and surprise respectively. The numbers at the left of the heat-map represent the indices of the key-MDs in the master wordbook \(C^{\prime }\). The key-MDs with indices 1 to 14, 15 to 27, 28 to 39, 40 to 53, 54 to 65 and 66 to 80 come from the wordbooks corresponding to each of the six expressions respectively. The color at any location, say (x,y), represents the value of the key-MD number x for the ED K representation of the datum number y. The colormaps are indicated by the adjoining colorbars to the right of the heat-maps

The ground truth provided with the Cohn-Kanade facial expression data set [22] lists different action unit (AU) combinations leading to the six basic emotional expressions. To describe facial expressions in minute detail, Ekman and Friesen [15] developed the facial action coding system (FACS), where 44 AUs are defined. A careful study reveals that two or more basic expressions share some similar motion patterns. For example, all possible AU combinations for anger and most of the AU combinations for fear contain both AU4 (eyebrows drawn medially and down) and AU5 (eyes widened). Some of the AU combinations for sadness also contain AU4. Similarly, some forms of anger and disgust also share some common AUs (e.g., AU10 which is ‘upper lip raised’ and AU17 which is ‘chin raised’). So these emotional expressions share similar characteristics. The smooth cyan regions (representing low values) in the off-diagonal area of Fig. 6a do not reflect this phenomenon; had it been captured, the off-diagonal regions would also have some high values, because different expressions share some common local motion patterns and therefore some common key-MDs. Moreover, the smooth pattern of the off-diagonal regions of Fig. 6a indicates that the key-MDs chosen from the wordbook representing a particular expression, say \(E_{i}\), are very rigid and represent the exact motion pattern corresponding to that expression only. Thus, a test MD representing a motion pattern that is similar to some key-MD but not an exact match does not find a good representative and does not contribute much to the formation of the ED. To deal with this problem, we let the key-MDs in the master wordbook \(C^{\prime }\) adapt to their environment, i.e., generalize to similar MDs. This concept is illustrated next.

4.1 Adaptive learning of key-MDs

For illustration purposes, let us assume that the MDs are two-dimensional. Let four MDs (A, B, C and D) have key-MD \(i_{1}\) as their nearest key-MD, and let key-MD \(i_{1}\) be nearest to MD A, as shown in Fig. 7a. So \(ED_{K}(i_{1})\) and \(ED_{P}(i_{1})\) receive the highest contribution from MD A compared to MDs B, C and D. That is, MDs B, C and D do not get good representation in the master wordbook. Therefore, we should allow key-MD \(i_{1}\) to move towards each of the MDs A, B, C and D by a fraction of the corresponding distance until a predefined stable state is achieved. A state is called stable when the key-MD does not change its position by more than a predefined amount in successive iterations. After stabilization, the position of the key-MD would be somewhere like the one in Fig. 7b. A consequence is that none of the MDs A, B, C and D gets a high \(ED_{K}(i_{1})\) or \(ED_{P}(i_{1})\) value, but key-MD \(i_{1}\) acts as a good representative of all four MDs. The algorithm for adapting the key-MDs of the master wordbook is given in Algorithm 1. The result of applying Algorithm 1 is that the isolated high-value dots on the diagonal of Fig. 6a are smoothed out, as a large number of MDs now have representatives in the master wordbook. The \(ED_{K}\) feature space after adaptation is shown in Fig. 6b. It should be noted that in addition to the high values (represented by bright magenta) along the diagonal of Fig. 6b, some off-diagonal regions also have high values. These off-diagonal magenta regions indicate that two or more expressions share some common local (spatio-temporal) motion patterns; for example, anger & disgust and anger & sadness do. Therefore, some MDs in a video sequence displaying, say, anger may choose a key-MD corresponding to disgust and vice-versa in the formation of the ED. This phenomenon is rightly captured by the adapted master wordbook and is depicted in Fig. 6b. On the other hand, the MDs representing the local motion patterns of surprise are almost unique and do not share characteristics with other expressions; this is reflected by the relatively brighter cyan rows between indices 66 and 80 in Fig. 6b. Thus, intuitively we can say that the adaptive learning of the key-MDs enables the EDs to better represent the expressions as compared to the non-adapted ones. The convergence of the adaptive learning Algorithm 1 is discussed in Appendix A. Before that, we present the experimental results showing that the adaptive learning of the key-MDs increases the facial expression recognition accuracy.

Fig. 7
figure 7

Schematic diagram representing a: Positions of MDs A, B, C, D and key-MD i 1 before adaptation of key-MDs and b: the positions of the same MDs and key-MD after adaptation of the key-MD

Algorithm 1
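The pseudocode of Algorithm 1 is not reproduced here; the following is a minimal sketch of the adaptation described above. The step fraction η, the stability threshold τ, the iteration cap and the batch update towards the mean of the assigned MDs (as an approximation of moving towards each assigned MD in turn) are our own assumptions.

```python
# Sketch: adaptive learning of the key-MDs (in the spirit of Algorithm 1).
import numpy as np

def adapt_key_mds(key_mds, training_mds, eta=0.1, tau=1e-3, max_iter=30):
    """Move each key-MD a fraction eta towards the training MDs assigned to it, until stable."""
    key_mds = np.array(key_mds, dtype=float)
    training_mds = np.asarray(training_mds, dtype=float)
    for _ in range(max_iter):
        # Assign every training MD to its nearest key-MD (Hellinger distance, as before).
        D = np.array([[hellinger(q, p) for p in key_mds] for q in training_mds])
        nearest = D.argmin(axis=1)
        max_shift = 0.0
        for i in range(len(key_mds)):
            assigned = training_mds[nearest == i]
            if len(assigned) == 0:
                continue
            # Move key-MD i a fraction eta of the way towards its assigned MDs.
            new_key = key_mds[i] + eta * (assigned.mean(axis=0) - key_mds[i])
            max_shift = max(max_shift, float(np.linalg.norm(new_key - key_mds[i])))
            key_mds[i] = new_key
        if max_shift < tau:          # stable state: no key-MD moved by more than tau
            break
    return key_mds
```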

5 Results and discussions

To test the efficiency of the proposed method we have used three databases: extended Cohn-Kanade (CK+) [22, 27], MUG [4] and MMI [33, 45]. From these databases we select all the videos displaying the six basic expressions; one video displays one expression. The CK+ database videos start with a near-neutral face, and the intensity of the expression increases gradually to the maximum in the last frame. In the MUG and MMI databases, the videos start with a near-neutral expression; the intensity increases gradually to the maximum and then gradually decreases back to near-neutral. We manually select the frame with the highest intensity expression in the videos from the MUG and MMI databases, and the frames from the first up to the highest intensity frame are used in our experiments.

The positions and sizes of the faces in the image frames are not the same. We use the Viola-Jones face detection algorithm [47] to find the bounding box of the face in the image. We find the bounding boxes of the faces in all the frames of a video sequence and take the union of the regions within all the bounding boxes for the video sequence. This union bounding box contains the whole face in all the frames in that sequence; thus, all the cropped images containing the face are of the same size for a video sequence. Video sequences in which the face is not detected properly are discarded. After applying the Viola-Jones algorithm, 264, 79 and 54 sequences, each displaying one of the basic expressions, are selected from the CK+, MUG and MMI databases respectively. Henceforth, by frame we refer to the corresponding cropped image. The proposed ED feature is not dependent on the size/scale of the face and is extracted after dividing each face into a fixed number of blocks. Thus, no other preprocessing such as eye alignment (as in [23, 55]) or normalization of the face to a fixed size is performed. There are variations in illumination conditions and skin-tone among the sequences even within the same database (Fig. 8). Since a combination of OF and IG is used for extracting the ED feature, the proposed method is robust to variations in illumination conditions and skin-tone. Slight rotation of the face is also present in some sequences (Fig. 9). Quantizing the IG weighted OF orientations into a judiciously chosen number L of bins for extracting the MD allows for slight rotation of the head. We experiment with different values of L and get the best results for L=8. Quantizing the orientations into finer bins (L>8) does not allow for face rotation, while quantizing into coarser bins (L<8) may not differentiate between motions of facial components in different directions. Therefore, we choose L=8. In our approach, the faces need not be marked with fiducials (as in [1]). The proposed method is also person invariant, as we use OF based features and discard the appearance that contains the person’s identity. That means the proposed system need not be trained with faces of the test subject. The classification results are given in Section 5.1.
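Before that, the face cropping step described above can be sketched as follows. The Haar cascade file name and the detection parameters are assumptions (any frontal-face Viola-Jones cascade serves); a sequence is discarded when the face is missed in any frame, mirroring the selection described above.

```python
# Sketch: crop all frames of a sequence to the union of Viola-Jones face bounding boxes.
import cv2

def crop_sequence(frames):
    """frames: list of grayscale frames of one video; returns cropped frames or None."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    x0 = y0 = x1 = y1 = None
    for f in frames:
        faces = cascade.detectMultiScale(f, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            return None                                       # discard: face not detected
        x, y, w, h = max(faces, key=lambda b: b[2] * b[3])    # keep the largest detection
        x0 = x if x0 is None else min(x0, x)
        y0 = y if y0 is None else min(y0, y)
        x1 = x + w if x1 is None else max(x1, x + w)
        y1 = y + h if y1 is None else max(y1, y + h)
    return [f[y0:y1, x0:x1] for f in frames]                  # union box, same size per sequence
```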

Fig. 8
figure 8

Examples of skin color variations in CK + database

Fig. 9
figure 9

Some examples of rotation of face from MMI database. The first and the third images show positions of heads in the first frames of two video sequences. The second and the fourth images show the positions of the heads in the frames with highest intensity expression of the corresponding videos

5.1 Results

To show the contribution of the adaptation (described in Algorithm 1) of the key-MDs of the master wordbook, we test the proposed feature (expression descriptor) on the CK+ database, both before and after the adaptation. SVM has been successfully used as a classifier in other works on facial expression recognition [23, 35, 38, 55]. Therefore, for classification, we use SVM with polynomial and RBF kernels, and we get the best result for SVM with a linear kernel (a polynomial kernel of order 1). SVM is a two-class classifier and we need to recognize a total of six expressions. So we train \(^{6}C_{2}=15\) classifiers, each classifying one pair of the six expressions. A voting system is used for the final classification. This means that each test datum (video sequence) goes through all the classifiers and the class with the maximum number of votes is designated as the observed class for the test datum. If the observed class is the same as the actual class of the test datum, then that test datum is said to be classified accurately. We report a 10-fold cross-validation result. The accuracy is calculated as the percentage of correctly classified data averaged over all the 10 folds. The performance of the different models (4), (5) and (6) of the ED (for SVM with a linear kernel and 10-fold cross-validation) before and after adaptation of the master wordbook is listed in Table 1. Overall, the superiority of the kernel ED model is clear from Table 1. The facial expression recognition accuracy has also increased due to the adaptation of the key-MDs for all three models of the ED. This increase in recognition accuracy shows the success of the proposed adaptation process. Henceforth, in our experiments, we use the kernel ED model after adaptation of the key-MDs.
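A sketch of this classification protocol is given below. scikit-learn's SVC implements the one-vs-one pairwise scheme (and hence the \(^{6}C_{2}=15\) classifiers with voting) internally, so the sketch delegates that construction to the library; this is an implementation convenience, not the paper's original code.

```python
# Sketch: 10-fold cross-validation of a linear SVM over the adapted EDs,
# with one-vs-one pairwise classifiers and majority voting handled by SVC.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def evaluate(eds, labels):
    """eds: (n_videos, N') adapted kernel EDs; labels: expression index per video."""
    clf = SVC(kernel="linear", decision_function_shape="ovo")
    scores = cross_val_score(clf, np.asarray(eds), np.asarray(labels), cv=10)
    return scores.mean() * 100.0        # accuracy (%) averaged over the 10 folds
```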

Table 1 Performance of different ED models before and after the adaptation of the master wordbook. The highest accuracies are shown in bold

We compare the performance of the proposed adapted kernel ED model with two state-of-the-art features widely used in the literature for facial expression recognition in videos: local binary patterns from three orthogonal planes (LBP-TOP) and the local phase quantizer from three orthogonal planes (LPQ-TOP). As the classifier, we use SVM with polynomial and RBF kernels. Figure 10 shows how the performance of the SVM classifiers changes with their parameters under 10-fold cross-validation for the proposed feature (\(ED_{K}\)), LBP-TOP and LPQ-TOP.

Fig. 10
figure 10

Change of classification percentage with the increasing order of a polynomial kernel used in SVM classifier and with the increasing value of b standard deviation of RBF kernel used in SVM classifier. Database used is CK +

The classification percentage here means the percentage of correct classification of the data (all six expressions) in the validation set averaged over all 10 folds. We get the best result for the proposed feature for SVM with a polynomial kernel when the order of the polynomial is 1 (classification accuracy 94.2 %) and for SVM with an RBF kernel when the RBF (Gaussian) standard deviation parameter σ is 12.6 (classification accuracy 79.07 %). For LBP-TOP and LPQ-TOP the best results are obtained for SVM with a linear kernel, the classification accuracies being 93.11 % and 94.36 % respectively. For SVM with an RBF kernel, the classification accuracy could not cross 24.70 % for LBP-TOP and 16.67 % for LPQ-TOP. From the graphs of Fig. 10, it can be seen that all the features under consideration, i.e., ED, LBP-TOP and LPQ-TOP, give their best results for SVM with a linear kernel. Therefore, for the next set of experiments on all three features, we use the linear kernel for the SVM classifier. For the RBF kernel of SVM, the performances of both LBP-TOP and LPQ-TOP are below 30 % whereas the performance of ED is consistently above 75 %. Each dimension in our expression descriptor represents one key-MD. For six types of expressions, there are six groups of key-MDs as separate sets of dimensions. Data belonging to a particular expression class contribute higher frequency counts to the corresponding set of key-MDs (dimensions) and lower counts to the other key-MDs. This results in better data separability with the linear kernel of SVM.

Since the expression recognition here is a multiclass problem, we have also subjected the adapted kernel ED model of the proposed feature, LBP-TOP and LPQ-TOP to a six-class RBFNN (radial basis function neural network). For a fair comparison of the proposed ED feature with LBP-TOP and LPQ-TOP, we present the set of bar-graphs in Fig. 11. These bar-graphs compare the classification accuracy produced by the three types of features under consideration for the two classifiers, SVM with a linear kernel and the six-class RBFNN. From this figure, the performances of the proposed \(ED_{K}\), LBP-TOP and LPQ-TOP are comparable for the CK+ and MUG databases when the classifier used is SVM with a linear kernel. The performance of LBP-TOP decreases considerably for the MMI database. LPQ-TOP gives a slightly better result for the CK+ database and a considerably better result for the MMI database compared to LBP-TOP. In contrast, the proposed \(ED_{K}\) feature gives better results for all three databases and both classifiers, as seen in the last column of each bar-graph. Even for the MMI database, where neither LBP-TOP nor LPQ-TOP gives an average accuracy of more than 60 %, the proposed \(ED_{K}\) gives an average accuracy greater than 72 % for the linear SVM classifier. For the RBFNN classifier, the recognition accuracies of both LBP-TOP and LPQ-TOP decrease for all three databases, but the proposed \(ED_{K}\) feature maintains high recognition accuracy. Therefore, the proposed \(ED_{K}\) feature can be considered a better alternative for facial emotional expression classification.

Fig. 11
figure 11

Performance comparison of the proposed adapted ED K feature with LBP-TOP and LPQ-TOP. Columns 1, 2 and 3 compare the performances of the three features on CK + , MUG and MMI databases respectively. Row 1 and 2 compare the performances for two different classifiers, SVM with linear kernel and six class RBFNN respectively

Table 2 compares the classification accuracy of the proposed adapted \(ED_{K}\) with that of the other related state-of-the-art methods. The experiments are done on the CK+ database. The first set of seven methods recognizes expressions from images displaying the apex state. The next set of 10 methods in Table 2, including the proposed \(ED_{K}\), recognizes basic emotional expressions from videos. Varied combinations of features and training models are used in these methods. Some use shape based features [19, 25, 34] and some use appearance based features [7, 11, 37, 38, 50]. Some of the methods formulate the expression recognition problem in a bag-of-words framework [11, 19], while some represent the expression features in a low dimensional manifold [37, 50]. To boost the recognition performance, some methods [24, 53] even select the important features after extraction, following well-known feature selection methods.

Table 2 Performance comparison of the proposed approach with state-of-the-art methods that recognize basic emotional expressions from dynamic videos or static images from CK/CK + dataset

Table 2 shows that the proposed adapted \(ED_{K}\) stands firmly in competition with the other methods and gives the highest average accuracy over all six emotions. The proposed ED emerges as the clear winner for anger and surprise. For disgust, happiness and sadness as well, our ED gives good recognition accuracy (94.6 %, 98.2 % and 92.3 % respectively). A careful look at the listed results shows that, for the other methods, the classification performance for anger and fear is lower than for the other expressions. Our approach maintains good performance for these two expressions as well.

The method of [35] and the proposed method both utilize OF based features. Therefore, we further analyze and compare the results of [35] with ours. Dense flow tracking (DFT) is seen to give slightly better expression recognition accuracy than feature point tracking (FPT), as DFT features contain motion information from a relatively large number of facial points. But instead of tracking a set of hand-crafted landmark points, our method diminishes the OF values at unnecessary locations on the face and retains only the motion information at the objectively selected important locations by multiplying the OF by the magnitude of the IG. This unique combination of OF and IG thus better represents the facial component motion due to the expression. Moreover, our method represents the whole video sequence as a specific combination of local (in the spatio-temporal scale) motion patterns in the ED. In contrast, the method of [35] considers global motion information, so local motion information at the temporal scale is lost. Owing to this better temporal representation, our method gives better expression recognition performance, as seen in Table 2.

While wordbook generation in other approaches uses clustering [19], our approach uses the maxdiff kd-tree, a better way of condensing the wordbook. Clustering techniques such as k-means suffer from initialization issues, and the true number of clusters must be specified in advance. The maxdiff kd-tree based approach avoids these issues. In addition, we are free to select multiple prototypes from the same cluster, especially in the case of clusters with larger variances. Other approaches such as [11] and [19] used wordbook vectors, but these wordbooks are rigid, i.e., the key-words cannot generalize well to unseen data. In contrast, in the approach proposed in this paper, the wordbook can keep learning, i.e., adapt itself to the incoming video sequences. The adapted key-words better represent the unseen words in the test sequence. Thus, the proposed approach results in better expression recognition accuracy compared to the other wordbook based approaches.

Consistency of expression descriptor performance

The \(ED_{K}\) is also subjected to a permutation-based p-value test with 1000 random arrangements of the data indices in order to judge the consistency of the feature’s performance, as suggested in [32]. We have tested on the CK+ dataset. The test revealed a significantly low p-value of 0.000994, suggesting that the feature is significant at a 99 % confidence level (i.e., an α value of 0.01) and that the reported accuracy cannot be attributed to some random pattern that the classifier may have identified in the training data.
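Assuming the test of [32] corresponds to a standard label-permutation test, it can be reproduced roughly as follows; scikit-learn's permutation_test_score computes the p-value as (number of permuted scores at least as good as the observed score + 1)/(number of permutations + 1).

```python
# Sketch: permutation-based significance test of the classification accuracy.
from sklearn.svm import SVC
from sklearn.model_selection import permutation_test_score

def permutation_p_value(eds, labels, n_permutations=1000):
    clf = SVC(kernel="linear")
    score, perm_scores, p_value = permutation_test_score(
        clf, eds, labels, cv=10, n_permutations=n_permutations)
    return score, p_value
```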

Time and space complexity

We have analyzed and compared the time complexity of the proposed feature with that of LBP-TOP [55]. In the method proposed in this paper, each frame transition is considered exactly once for calculating the MD corresponding to that frame transition. Let the OF calculation time (in seconds) for each pixel position be a. OF field calculation for a video sequence having (f−1) frame transitions and r×c pixels per frame requires (f−1)×r×c×a seconds. Let the time required for finding the appropriate bin in (1) for each pixel be b seconds. So the total time required for deriving all MDs in a video sequence is (f−1)×r×c×(a + b). The proposed method has one more step, deriving expression descriptors. The number of key-MDs in \(C^{\prime }\) is \(N^{\prime }\). In (6), if comparing one MD with one key-MD requires e seconds, then the \(ED_{K}\) preparation time is \((f-1)\times N^{\prime }\times e\) seconds for each video sequence. Typically \(N^{\prime }\) is very small; the kd-tree condensation technique helps to keep \(N^{\prime }\) low, and in our implementation \(N^{\prime }=80\). So the total feature extraction time for the test data is (f−1)×r×c×(a + b)+(f−1)×80×e. In [55], LBP is calculated on three orthogonal planes for each pixel. Let us denote the time (in seconds) taken to derive the LBP for each pixel by \(a^{\prime }\). Then the time taken for calculating LBP-TOP is \(3\times a^{\prime }\) seconds per pixel. Therefore, the time taken to derive LBP-TOP for each sequence with f frames of size r×c is \(3\times f\times r\times c\times a^{\prime }\) seconds. Let the time taken for finding the appropriate bin for each LBP in each plane be \(b^{\prime }\). So the total time required to find the histogram in [55] is \(3\times f\times r\times c\times b^{\prime }\), and the total time taken in extracting the features according to the method proposed in [55] is \(3\times f\times r\times c\times (a^{\prime }+b^{\prime })\). We assume that (a + b) and \((a^{\prime }+b^{\prime })\) are comparable and that (f−1)×80×e is quite small compared to (f−1)×r×c×(a + b). Therefore, the time required by the method of [55] is more than 3 times that of the method proposed in this paper. On a computer with a 3.07 GHz Intel Xeon processor, 24 GB RAM and a 64-bit operating system, the kernel ED feature extraction time for a test video sequence is 0.9 seconds, whereas the LBP-TOP feature extraction time for the same test video on the same machine is 8.7 seconds. The prototypes are implemented in MATLAB R2008a.

In our implementation, the proposed \(ED_{K}\) model has only 80 dimensions, compared to 9×8×3×(58+1)=12744 dimensions for LBP-TOP and 4×2×3×256=6144 dimensions for LPQ-TOP, considering 9×8 blocks for LBP-TOP and 4×2 blocks for LPQ-TOP as in [55] and [21] respectively. So, the proposed model results in considerable savings in time and space during classification. For an SVM classifier with a linear kernel, the training time of the \(ED_{K}\) based method is 2.42 seconds while that of the LBP-TOP based method is 4.37 seconds. After the features of the test data are extracted and the training phase of the classifier (SVM with a linear kernel) is over, classification of one test video sequence with the \(ED_{K}\) based method requires 0.06 seconds on average, whereas the LBP-TOP based method requires 0.13 seconds. Therefore, the proposed method offers considerable savings in time for both training and testing.

5.2 Discussions

Figure 11 shows that the proposed adapted \(ED_{K}\) feature produces good recognition accuracy for the CK+ database, while the accuracy decreases for the MUG and MMI databases. We find that the recognition accuracy for the MUG database decreases relatively for the expressions fear, sadness and surprise when the classifier used is SVM with a linear kernel. For the RBFNN, the \(ED_{K}\) produces 100 % recognition for all the expressions in the MUG database except anger and surprise. We examined the original expressive images from the MUG database for possible reasons for misclassification. We found that the existence of wrinkles between the eyebrows in both anger (Fig. 12a) and sadness (Fig. 12b) may be a contributing factor for confusion. Similarly, the widened eyes, showing prominent white sclera above the iris, in both fear (Fig. 12c) and surprise (Fig. 12d) may be another contributing factor for confusion between the two emotional expressions. It is difficult even for a human observer to distinguish between the anger (Fig. 12a) and sadness (Fig. 12b) examples. Figure 13 gives some misclassified examples from the MMI database. The motion of the eyebrows, lips and chin may be contributing factors for misclassification of the fear (Fig. 13c) and sadness (Fig. 13d) examples.

Fig. 12
figure 12

Examples showing different factors contributing to confusion in expression perception from MUG database. a and b vertical wrinkles between eyebrows in expression anger and sadness respectively. c and d raising of upper eyelid displaying white sclera above iris in expressions fear and surprise respectively. We display only a part of the face instead of the whole face to retain the privacy of the subjects

Fig. 13
figure 13

Misclassified examples from MMI database. First image: sadness, second image: disgust, third image: fear and fourth image: sadness

The proposed method has some advantages over existing methods in the literature. We do not need to perform complex operations like tracking the eyes in subsequent frames [43], segmenting the lips or eyes [5], correcting illumination [52], or normalizing the face to a constant size [6, 26, 36]. For calculating the motion of the facial features, we have weighted the optical flow field by the magnitude of the gradient at each pixel location. We generally get high gradient at the edges of facial components such as the eyes and lips. Therefore, the optical flow (OF) at pixels with high gray-level gradient (at the edges of facial components) contributes most to the calculation of the expression descriptors. Moreover, as we normalize the histogram for the calculation of the MD, the relative OF values of different pixels matter rather than the absolute values. Therefore, the optical flow scheme yields good results in the proposed method. We divide each frame into a fixed number of blocks irrespective of the size of the frame; therefore, we do not need to resize the frames to any fixed template. Dynamic texture descriptors such as LBP-TOP and LPQ-TOP give good expression recognition accuracy, but due to their high dimensionality (12744 and 6144 dimensions respectively) they are time and space consuming. Compared to these features, the proposed \(ED_{K}\) feature is low dimensional (80 dimensions in our implementation) and thus incurs low time and space complexity.

We believe that the efficiency of a set of features depends on its class representation capability rather than on the number of dimensions of the feature set. Many of the dimensions (alternatively called features) of a high-dimensional feature set may be correlated with each other; the presence of only one of many correlated features (dimensions) is enough for efficient class representation. Further, many of the dimensions of a high-dimensional feature set may be counter-productive [10]. We believe that our approach of selecting unique features (motion descriptors) from a group of features using the kd-tree based data condensation technique described in Section 3.2 helps in retaining only a small number of features out of many similar ones. We also believe that the process of ‘adapting features to the training set’ presented in Section 4.1 contributes to better recognition accuracy. Some works [24, 53] on facial expression recognition select only the features important for expression recognition from an initial set of features. The selected set of features improves not just the space and time complexity but also the recognition accuracy.

While extracting LBP-TOP or LPQ-TOP features, micro-patterns in only three orthogonal planes are considered, whereas in our approach the movement of every pixel in all directions is considered. Thus, our approach better captures the spatio-temporal local movement patterns of the face. LBP-TOP quantizes all the possible micro-patterns into 58 uniform types and considers all the non-uniform patterns as one more type. This results in 59 (×3 for the three orthogonal planes) LBP dimensions per block. So, for 9×8 blocks we have a total of 9×8×59×3=12744 dimensions for LBP-TOP and similarly 6144 dimensions for LPQ-TOP to represent one video sequence. While both LBP-TOP and LPQ-TOP represent the histogram of micro-patterns created by the expression per block of the video volume, our approach to treating the video is different. In the current paper, we represent the face video as a combination of pre-learned spatio-temporal local motion patterns named key-MDs. These local motion patterns can be thought of as expression units that combine with one another to produce the emotional expression. Each key-MD forms a dimension in the final expression descriptor (ED). So, each dimension of our proposed ED descriptor is specifically designed to represent motion patterns pertaining to expressions, unlike LBP based descriptors whose dimensions represent general micro-patterns. Therefore, the ED features are expected to better capture emotional expression-specific signatures. In our implementation, the number of key-MDs is 80, so the dimension of the proposed feature is 80. Had we considered the motion pattern per block of the video volume (like LBP-TOP or LPQ-TOP), we would have got 8×8×8=512 dimensions (we consider 8×8 blocks with L=8 orientation bins, as mentioned in Section 3.1). Experimental results (Tables 1 and 2, and Fig. 11) show that the set of 80 dimensions produces better results than both LBP-TOP and LPQ-TOP for facial emotional expression recognition.

Overall, the proposed method is easier to implement and provides a better alternative considering recognition accuracy and time and space complexity.

6 Conclusions

A novel approach for the recognition of emotional facial expressions is proposed. A unique combination of optical flow and gradient magnitude is used to represent regional motion information for each frame transition. The entire model is cast in a bag-of-words setting. An adaptive learning approach for the key-words is used to improve the model representing different expressions. Experimental results show that the adaptation process significantly increases the regional motion representation capability of the key-words and thus increases the efficiency of the proposed expression recognition model. Extensive experiments on three databases show the consistent efficiency of the proposed approach. This simple, real-time yet highly accurate method can be seen as an excellent alternative among the existing methods for facial expression recognition. We are now extending this approach to the recognition of spontaneous facial expressions.