1 Introduction

Major neurocognitive disorder (NCD), as introduced by the American Psychiatric Association (APA) and known previously as dementia, is a decline in mental ability, threatening the independence of a large fraction of the elderly population. Alzheimer’s disease (AD) is the most common form of major NCD, associated with loss of short-term memory, problems with language, disorientation and decline of other intellectual abilities, severely affecting daily life (Footnote 1). Worldwide, currently 35 million people have been diagnosed with major neurocognitive disorder, associated with costs of 530 billion Euros in 2010 (Footnote 2), with an increasing tendency (Footnote 3). While there is no curative treatment, musical therapies have been proposed as efficient therapeutic means, acting as a powerful catalyst for precipitating memories, as shown in a number of studies [22, 33, 36]. Specifically, mnemotherapy can help elicit autobiographical memories by promoting positive emotional memories [2]. This and other therapies can improve the quality of life of AD patients [1, 23]. However, the assessment of such therapies requires comprehensive manual observation by experienced clinicians [10, 32]. Towards overcoming this limitation, computer vision based methods can offer objective assessment by analyzing affect and expression behaviors, which are directly related to the effectiveness of therapies.

While expression recognition has attracted significant research attention [19, 28, 38], facial behavior analysis from naturalistic videos, associated with illumination changes, partial occlusions, pose variation, as well as low-intensity expressions, poses challenges for existing methods. In addition, while many areas of computer vision have experienced significant advancements with deep neural networks, the analysis of facial dynamics has only recently benefited from deep convolutional networks [14, 26, 34, 43].

Naturally, the accuracy of facial dynamics classification depends on the features, as well as the architecture used for assessment. Given the plethora of existing algorithms, exploring different types of features and architectures is necessary to devise a robust solution.

Motivated by the above, in this work we explore and compare computer vision methods, originally introduced in the context of action recognition, in our challenging setting, namely the assessment of naturalistic facial dynamics. Specifically, we consider (a) the 3D Convolutional Neural Network (C3D) [35], (b) Very Deep Two-Stream Convolutional Networks (with VGG-16 and ResNet-152 [30, 42]) and (c) Improved Dense Trajectories (iDT) [39], as well as combinations and variations thereof. Given a video sequence, we firstly detect the face and proceed to extract features pertaining to the respective method. The obtained feature set is then classified into one of four facial dynamics categories, namely neutral, smiling, talking and singing. The automatic detection of these facial dynamics indicates the involvement of the patients during mnemotherapy and hence can support the assessment of a therapy session. Specifically, given that AD patients in later stages of the disease often suffer from apathy, the (frequent) occurrence of smiling, talking and singing indicates that the therapy is effective. Experiments are conducted on a challenging, unconstrained medical dataset containing 322 video sequences of 16 AD patients, including continuous pose changes, occlusions, camera movements and artifacts, as well as illumination changes. In addition, the dataset depicts naturalistic facial dynamics of predominantly elderly subjects, which vary in (generally less pronounced) intensity and occur jointly (e.g., simultaneous talking and smiling). Moreover, we observe a high level of inter- and intra-person variability (e.g., expressive and apathetic AD patients). We note that, despite these challenges, it is imperative to work with such data, as it is representative of the (vast amount of) current video documentation of medical doctors, which requires automated analysis.

We note that we tested existing methods for expression recognition, such as smile detectors (Footnote 4), [4], on the ADP dataset, as well as facial-landmark based expression recognition algorithms, without success, since already the initial face detection step failed throughout.

2 Related Work

Existing approaches for the analysis of facial dynamics are inspired by cognitive, psychological and neuroscientific findings. The most frequent way to describe facial dynamics is based on the Facial Action Coding System (FACS) proposed by Ekman et al. [7], representing movements of facial muscles in terms of different action units (AUs). Hence, classical methods analyze sequences of images containing the neutral face and the expression apex [19]. More recent methods involve linear deterministic and probabilistic methodologies, including general or special Linear Dynamical Systems (LDS), as well as various extensions of deterministic Slow Feature Analysis (SFA) [43]. In addition, HMMs [27] have been used to capture the temporal segments of facial behaviour.

More recently, learning facial features in a supervised or unsupervised manner using deep neural networks has attracted considerable attention. We proceed to review notable work, analyzing both images and video sequences.

Recognizing Facial Dynamics in Images. Liu et al. [18] propose a Boosted Deep Belief Network, integrating three separate training stages for expression recognition in images. Han and Meng [11] present incremental boosting of a CNN for AU recognition. Zhao et al. [44] combine region learning, as well as multiple label learning to detect AUs.

Recognizing Facial Dynamics in Video Sequences. When analyzing facial behavior in videos, many works focus on spatio-temporal feature extraction. Jung et al. [14] use two separate neural networks to extract temporal appearance features, as well as temporal geometric features for expression recognition. Zafeiriou et al. [43] propose a slow-feature auto-encoder for both supervised and semi-supervised learning of facial behavioural dynamics. Hasani and Mahoor [12] combine a 3D Inception-ResNet with a Long Short-Term Memory (LSTM) network in order to extract both spatial and temporal features from videos. Li et al. [17] combine VGG with ROIs and an LSTM towards the detection of AUs. Most recently, combining Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) has allowed for learning a powerful latent representation utilized for facial behavior analysis in audiences [26].

Previous and recent computer vision work related to healthcare has focused on, among others, the assessment of: cognitive health in smart home environments [5], daily activities [15], AD symptoms [25], depression [6, 45], assistive technologies [16], as well as pain [24].

The rest of the paper is organized as follows. Section 3 describes the methods we compare. Section 4 introduces our dataset, assembled for the purpose of medical patient recording. Section 5 presents experiments validating the effectiveness of the evaluated methods in assessing facial dynamics. Finally, Sect. 5.3 discusses our observations and Sect. 6 concludes the paper.

3 Evaluated Methods

Firstly, we apply face detection, based on which we crop the faces and proceed to extract facial features using Improved Dense Trajectories, as well as two deep neural network models. The latter have been pre-trained on the large-scale human action dataset UCF101 [31]. We retain the pre-trained weights of the neural network models and use them to extract features from our dataset. Finally, we employ a Support Vector Machine (SVM) to classify video sequences into four facial dynamics: smiling, talking, neutral and singing.
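The overall pipeline can be sketched as follows. This is a minimal illustration in Python, not the exact implementation: detect_faces and extract_features are placeholder callables standing in for the components detailed in Sects. 3.1-3.4.

```python
import numpy as np
from sklearn.svm import SVC

CLASSES = ["neutral", "smiling", "talking", "singing"]

def train_pipeline(videos, labels, detect_faces, extract_features):
    """Illustrative end-to-end sketch: face detection, feature extraction, SVM.

    `detect_faces` and `extract_features` are placeholders for the components
    described in Sects. 3.1-3.4 (Doppia, C3D, Two-Stream ConvNets or iDT).
    """
    feats = [extract_features(detect_faces(v)) for v in videos]
    clf = SVC(kernel="rbf")          # multi-class SVM (one-vs-one by default)
    clf.fit(np.vstack(feats), labels)
    return clf
```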

3.1 Face Detection

There exists a large number of face detection algorithms, based on various features and implementations. We compared a number of pre-trained algorithms, including VGG [9], OpenCV [37] and Doppia [20], on our ADP dataset (see Sect. 4). The latter performed best and was hence included in the pre-processing step.

3.2 3D ConvNets

The 3D ConvNet (C3D) has a simple architecture (see Fig. 1) and has achieved an accuracy of 85.2% on the UCF101 dataset. We adapt the C3D architecture and extract spatio-temporal features towards the categorization of AD patients’ facial dynamics.

Fig. 1.
figure 1

C3D based facial dynamics detection: for each video sequence, faces are detected and the face sequences are passed into a pre-trained C3D network to extract a 4096-dim feature vector per video. Finally, an SVM classifier is trained to predict the final classification result. We have blurred the faces of the subject in this figure in order to preserve the patient’s privacy.

The original C3D network has 8 convolutional layers, 5 max-pooling layers, 2 fully connected layers and a softmax loss layer. For the convolutional layers 1 to 5, the numbers of convolutional kernels are 64, 128, 256, 256 and 256, respectively. Since all kernels are 3-dimensional, an additional parameter d indicates the kernel temporal depth. Tran et al. [35] report \(d = 3\) as the best setting among all their experiments, which we also use in the present work. In the C3D network, all convolutional kernels are \(3 \times 3 \times 3\) with stride 1 in both the spatial and temporal directions, preserving the size of the signal. Such an architecture allows for preserving the temporal information between neighboring frames in a video clip. With the exception of the first pooling layer, all pooling layers are max-pooling layers with kernel size \(2 \times 2 \times 2\) and stride 2, so that the size of the feature map is reduced by a factor of 8 with respect to its input (see Fig. 2). The kernel size of the first pooling layer is \(1 \times 2 \times 2\), which ensures that early merging of the temporal signal is avoided. Since in our context we only need features from the FC6 activation layer, we remove the FC7 and the final softmax layer from the original model.
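As an illustration of the described kernel and pooling configuration, the following PyTorch sketch builds the first two convolution/pooling blocks. It is a hedged toy example under the stated kernel sizes, not the pre-trained C3D model itself.

```python
import torch
import torch.nn as nn

# Illustrative sketch (not the full pre-trained C3D): the first two
# convolution/pooling blocks, showing the 3x3x3 kernels with stride 1 and
# the asymmetric first pooling layer (1x2x2) that avoids early temporal merging.
c3d_front = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=(3, 3, 3), stride=1, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2)),   # pool1: spatial only
    nn.Conv3d(64, 128, kernel_size=(3, 3, 3), stride=1, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2)),   # pool2: halves T, H, W
)

# A 16-frame clip of 112x112 face crops: (batch, channels, frames, height, width).
clip = torch.randn(1, 3, 16, 112, 112)
print(c3d_front(clip).shape)  # torch.Size([1, 128, 8, 28, 28])
```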

We note that the C3D network was pre-trained on the UCF101 dataset, which contains 13320 videos from 101 action categories, including single and multi-person actions. Specifically, each video is divided into clips of 16 frames, with an 8-frame overlap between consecutive clips, using a sliding window. All these video clips serve as input to the C3D network. The C3D feature of a single video sequence is computed as the average of the FC6 activations over all its clips, followed by L2-normalization.
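A minimal sketch of this clip-level feature aggregation, assuming a placeholder extract_fc6 function standing in for the truncated C3D network:

```python
import numpy as np

def video_feature(frames, extract_fc6, clip_len=16, stride=8):
    """Average per-clip FC6 activations over sliding-window clips, then L2-normalize.

    `frames` is an array of shape (T, H, W, 3); `extract_fc6` is a placeholder
    for the truncated C3D network returning a 4096-dim vector per clip.
    """
    clips = [frames[s:s + clip_len]
             for s in range(0, len(frames) - clip_len + 1, stride)]
    feats = np.stack([extract_fc6(c) for c in clips])  # (num_clips, 4096)
    mean = feats.mean(axis=0)
    return mean / (np.linalg.norm(mean) + 1e-12)       # L2-normalization
```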

Fig. 2.
figure 2

3D convolutional kernel and 3D max-pooling kernel: in each convolutional layer the kernel size is \(3 \times 3 \times 3\), and in each max-pooling layer except the first one the kernel size is \(2 \times 2 \times 2\). This 3-dimensional design preserves both spatial and temporal information. We have blurred the faces of the subject in this figure in order to preserve the patient’s privacy.

3.3 Very Deep Two-Stream ConvNets

The second method which we explore is the Two-Stream ConvNet [30] (see Fig. 3a), which has reportedly achieved 88% accuracy on the UCF101 action recognition dataset. It extracts features based on RGB frames, as well as on optical flow of a video sequence. As reported by Wang et al. [42] and Feichtenhofer et al. [8], a successor network, namely the Very Deep Two-Stream ConvNets, outperforms the original Two-Stream ConvNets (by 3% on UCF101).

Two-Stream ConvNets incorporate a spatial ConvNet, accepting as input a single frame of dimension \(224 \times 224 \times 3\), as well as a separate stream, a temporal ConvNet, accepting as input stacked optical flow fields of dimension \(224 \times 224 \times 20\). Specifically, the optical flow field is composed of the horizontal and vertical components \(D_{x}\) and \(D_{y}\). The \(D_{x}\) and \(D_{y}\) components of 10 frames are stacked together and fed into the temporal ConvNet. Hence, while the first stream is based on RGB features, the second stream is based on the complementary motion between video frames, resulting in an increased accuracy over each of the individual streams.

We test two variations of the Very Deep Two-Stream ConvNets, the first one using VGG-16 in both streams, the second one ResNet-152 in both streams. We note that for both VGG-16 and ResNet-152, we remove the last fully connected layer and apply an L2-normalization step to the activations.
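The truncation and normalization step can be sketched as follows. Note that torchvision's ImageNet ResNet-152 is used here only as a stand-in, since the weights in our experiments stem from UCF101 pre-training.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Stand-in for the spatial stream: torchvision's ImageNet ResNet-152
# (the paper uses UCF101 pre-trained weights). Newer torchvision versions
# use the weights= argument instead of pretrained=True.
resnet = models.resnet152(pretrained=True)
feature_net = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop final FC layer
feature_net.eval()

with torch.no_grad():
    frame = torch.randn(1, 3, 224, 224)            # one RGB face crop
    feat = feature_net(frame).flatten(1)           # (1, 2048) pooled activations
    feat = F.normalize(feat, p=2, dim=1)           # L2-normalization
```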

Fig. 3.
figure 3

(a) While the spatial ConvNet accepts a single RGB frame as input, the temporal ConvNet’s input consists of \(D_{x}\) and \(D_{y}\) of 10 consecutive frames, namely 20 input channels. Both described inputs are fed into the respective stream of the Two-Stream ConvNets. In this work we use two variations of Very Deep Two-Stream ConvNets, incorporating VGG-16 [29] and ResNet-152 [13] in both streams, respectively. (b) The optical flow of each frame has two components, namely \(D_{x}\) and \(D_{y}\). For each of 10 consecutive frames we stack \(D_{y}\) after \(D_{x}\), forming an input volume of 20 channels.

Input Configuration: Given a video sequence of T frames, we extract N RGB frames (spatial ConvNet) and N optical flow field volumes (temporal ConvNet). The sampling step for the spatial ConvNet is \(\left\lfloor \frac{T - 1}{N - 1} \right\rfloor \). Since we stack the dense optical flow of 10 sequential frames to form a 20-channel input volume (see Fig. 3b, both horizontal and vertical components of 10 frames), the sampling step for the temporal ConvNet is \(\left\lfloor \frac{T - 10 + 1}{N} \right\rfloor \). For each optical flow volume I, we have \( I_{2t} = D_{x}^{(t)},\ I_{2t + 1} = D_{y}^{(t)},\ t \in \{0,\dots ,9\}\).
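The stacking of the flow components into a 20-channel volume, following the indexing above, can be sketched as:

```python
import numpy as np

def stack_flow(flow_x, flow_y):
    """Interleave D_x and D_y of 10 consecutive frames into a 20-channel volume.

    `flow_x`, `flow_y`: lists of 10 arrays of shape (224, 224).
    Returns an array of shape (224, 224, 20) with I[..., 2t] = D_x of frame t
    and I[..., 2t+1] = D_y of frame t, for t = 0..9.
    """
    volume = np.empty(flow_x[0].shape + (2 * len(flow_x),), dtype=np.float32)
    for t in range(len(flow_x)):
        volume[..., 2 * t] = flow_x[t]
        volume[..., 2 * t + 1] = flow_y[t]
    return volume
```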

The pre-trained spatial and temporal ConvNets extract two respective feature vectors, which, concatenated, serve as input for the classifier described below.

3.4 Improved Dense Trajectories

Despite the prevalence of DNNs, iDT, as introduced by Wang et al. [40], constitutes one of the best hand-crafted feature-based approaches. We employ iDT for its good coverage of foreground motion and its high performance in action recognition (competitive with DNNs). In addition, it is complementary to DNNs and hence a fusion of iDT and DNNs has been shown to provide improved accuracy. iDT extracts local spatio-temporal video trajectories by densely sampling feature points on multiple spatial scales and subsequently tracking the detected points using dense optical flow. We extract dense trajectories and proceed to extract local spatio-temporal video volumes around the detected trajectories. We extract 5 types of features aligned with the trajectories to characterize shape (point shifts), appearance (Histograms of Oriented Gradients (HOG)) and motion (Histograms of Optical Flow (HOF) and Motion Boundary Histograms (MBHx, MBHy)). We encode the iDT features with bag-of-features (BOF) in order to represent a video sequence using the extracted motion trajectories and their corresponding descriptors.
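The BOF encoding step can be sketched as follows. This is an illustrative example with scikit-learn; the iDT descriptors themselves are produced by the implementation of Wang et al. [40], and the codebook size of 256 is an illustrative choice, not necessarily the setting used in our experiments.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def fit_codebook(train_descriptors, n_words=256):
    """Learn a visual codebook on iDT descriptors sampled from training videos."""
    return MiniBatchKMeans(n_clusters=n_words, random_state=0).fit(train_descriptors)

def bof_encode(codebook, video_descriptors):
    """Normalized bag-of-features histogram for one video's iDT descriptors
    (HOG / HOF / MBH / trajectory shape), shape (M, D) -> (n_words,)."""
    words = codebook.predict(video_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(np.float32)
    return hist / (hist.sum() + 1e-12)
```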

3.5 Classifier

For classification, we train a multi-class SVM classifier for each tested method. We combine grid-search and cross-validation in order to obtain the best parameters.
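A minimal sketch of this parameter selection with scikit-learn; the parameter grid shown is illustrative, not the exact grid used in our experiments.

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Illustrative parameter grid; X are the per-video feature vectors, y the labels.
param_grid = [
    {"kernel": ["linear"], "C": [1, 5, 25]},
    {"kernel": ["rbf"], "C": [1, 5, 25], "gamma": [0.5, 2, 8]},
]
search = GridSearchCV(SVC(), param_grid,
                      cv=StratifiedKFold(n_splits=10), scoring="accuracy")
# search.fit(X, y); search.best_params_ then yields the selected kernel, C and gamma.
```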

4 Dataset

For this study, we created the Alzheimer’s disease patients (ADP) dataset, comprising 322 video sequences of 16 female patients, with 5 or more takes of each facial dynamics class. The length of the video sequences ranges from 1.44 s to 33.08 s. All videos were recorded at 25 fps, with a resolution of \(576 \times 720\). Two patients with aphasia contributed only video sequences of neutral, smiling and singing. Interestingly, while these two patients were not able to speak, they performed singing-like facial movements, which we labeled as singing. For this study we manually segmented and annotated the data, which was challenging due to the high intra- and inter-class variability of patients, as well as of facial dynamics. The data was annotated by two researchers (one working in the area of computer vision, the other in clinical experiments), with an overlap of \({>}85\%\). We note that facial dynamics appeared jointly (e.g., singing and smiling), due to the unintrusive nature of the setting. Such overlaps were currently not considered in the annotation; classes were annotated as mutually exclusive.

The patients participated in individual mnemotherapy sessions, located in a small auditorium. Videos of these sessions were acquired with a Sony Handycam DCR-SR 32 camcorder, placed to the side of patient and clinician, capturing predominantly non-frontal and highly unconstrained videos of the patient.

We identified the most frequently occurring facial dynamics as neutral, smiling, talking and singing/singing-like movements. We note that even in the “neutral” class there are still facial movements (e.g., blinking), as well as hand or head movements.

5 Experimental Results

5.1 Implementation Details

We conduct our experiments on a single GTX Titan X GPU for both face detection and feature extraction, with emphasis on the use of C3D Network and Two-Stream ConvNets.

Face detection is performed using Doppia [20]. Due to the challenging dataset, including among others variations of illumination, patient pose, as well as camera movements, faces are not detected in some video frames (constituting false negatives). In such cases, we remove the affected frames. When undetected faces prevail, we exclude the respective video sequences from the analysis.

To compute optical flow, we follow the work of Wang et al. [41] and use the TVL1 algorithm implemented in OpenCV. In our experiments, we set \(N=25\) for both RGB and optical flow, following the works [41, 42]. In the Two-Stream networks, each detected face (RGB frame) is rescaled to \(224 \times 224 \times 3\) and the optical flow is rescaled to \(224 \times 224 \times 20\). For C3D we rescale each detected face to \(112 \times 112 \times 3\).
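A sketch of the TVL1 flow computation with OpenCV; note that the factory function name differs between OpenCV builds, hence the fallback below.

```python
import cv2

# TVL1 optical flow between two consecutive face crops. Depending on the
# OpenCV build, the factory lives either in the contrib `optflow` module (4.x)
# or in the main module (3.x).
try:
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()
except AttributeError:
    tvl1 = cv2.DualTVL1OpticalFlow_create()

def flow_components(prev_bgr, next_bgr, size=(224, 224)):
    """Return the horizontal and vertical flow components D_x, D_y."""
    prev = cv2.cvtColor(cv2.resize(prev_bgr, size), cv2.COLOR_BGR2GRAY)
    nxt = cv2.cvtColor(cv2.resize(next_bgr, size), cv2.COLOR_BGR2GRAY)
    flow = tvl1.calc(prev, nxt, None)        # (224, 224, 2)
    return flow[..., 0], flow[..., 1]        # D_x, D_y
```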

For classification we use the scikit-learn library [21]. We employ a stratified 10-fold cross-validation scheme, which preserves the ratio of samples per category in the training and test sets (see Table 2). The dataset is divided into 10 folds; 9 folds are used for training and the remaining fold is used for testing. This is repeated 10 times and the reported results are the average thereof. We note that video sequences in the test set are not present in the training set. Per split, we compute the classification accuracy, and we report the mean accuracy (MA) over all 10 splits.

We test the SVM classifier with linear and radial basis function (RBF) kernels. Our experiments show that, e.g., for Two-Stream (ResNet-152) the RBF kernel performs best (with C = 25, gamma = 2).
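A sketch of the described evaluation protocol, where X denotes the per-video feature vectors and y the facial dynamics labels; the RBF parameters correspond to those reported above for Two-Stream (ResNet-152).

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def mean_accuracy(X, y):
    """Mean accuracy (MA) over a stratified 10-fold split, as described above.

    X: (n_videos, d) feature matrix, y: array of facial dynamics labels.
    """
    clf = SVC(kernel="rbf", C=25, gamma=2)
    scores = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=10),
                             scoring="accuracy")
    return scores.mean(), scores          # MA and per-split accuracies
```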

5.2 Results

In Table 1 the performance of the facial dynamics classification is presented as mean accuracy (MA) for C3D, the Two-Stream ConvNets (based on VGG-16 and ResNet-152), as well as separately for each spatial and temporal net. We observe that while the spatial and temporal nets of the Two-Stream ConvNets (ResNet-152) substantially outperform their VGG-16 counterparts, the overall Two-Stream ConvNets (VGG-16) and Two-Stream ConvNets (ResNet-152) perform comparably well. The best performance was obtained by the ResNet-152 based Two-Stream ConvNets, namely \(MA=76.40\%\), marginally outperforming the VGG-16 based Two-Stream ConvNets (by \(0.4\%\)) and substantially outperforming the C3D network (by \(9\%\)). In addition, we show the performance of C3D and the Two-Stream ConvNets fused with iDT. When fusing iDT with the other algorithms, the classification rate consistently increases, up to \(79.5\%\) for Two-Stream (ResNet-152).

Table 1. Classification accuracies of C3D, Very Deep Two-Stream ConvNets, iDT, as well as fusion thereof on the presented ADP-dataset. We report the Mean Accuracy (MA) associated to the compared methods. Abbreviations used: SN...Spatial Net, TN...Temporal Net.

The classification rates for each split of the 10-fold cross-validation are reported in Table 2.

Fig. 4.
figure 4

Confusion matrix for categorized facial dynamics of Two-Stream ResNet-152 + iDT (best performing method) on the ADP-Dataset.

In Fig. 4 we present the overall confusion matrix associated with the best performing algorithm, Two-Stream ConvNets (ResNet-152) + iDT. The related results indicate that the highest confusion rates are observed between the dynamics smiling, singing and talking. This may be explained by the co-occurrence of facial dynamics, as well as by the generally low-intensity facial dynamics exhibited by the elderly patients. In some cases, the categories neutral and smiling have been confused. In Table 3 we show the accuracy for each facial dynamics category associated with Fig. 4. We see that the facial dynamic with the highest classification rate is neutral, which is intuitive due to the discriminative low motion. Smiling, on the other hand, is classified with the largest error, which might be due to the low-intensity expressions exhibited by the elderly patients (see Fig. 5). We note that annotation was performed utilizing both audio and video.
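For illustration, such a confusion matrix can be accumulated over the cross-validation folds as follows; this is a sketch, not our exact evaluation script.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

LABELS = ["neutral", "smiling", "talking", "singing"]

def cv_confusion(X, y):
    """Accumulate a 4x4 confusion matrix over the 10 stratified folds.

    X: (n_videos, d) feature matrix, y: array of label strings from LABELS.
    """
    cm = np.zeros((len(LABELS), len(LABELS)), dtype=int)
    for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X, y):
        clf = SVC(kernel="rbf", C=25, gamma=2).fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        cm += confusion_matrix(y[test_idx], pred, labels=LABELS)
    return cm
```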

Table 2. Classification accuracy of Two-Stream ConvNets (ResNet-152) + iDT on the ADP dataset. The numbers in parentheses indicate the number of “neutral”, “smiling”, “talking” and “singing” samples in each split.
Table 3. Mean Accuracy (%) of Two-Stream ConvNets (ResNet-152) + iDT on the ADP dataset. Assessment by category.

In a similar healthcare setting, an algorithm distinguishing between similar facial expressions and activities, based on spatio-temporal Dense Trajectories and improved Fisher Vectors, has been proposed [3], which we outperform by up to \(16.1\%\) (Table 3).

Fig. 5.
figure 5

Example images of two subjects from the ADP-dataset. From left to right we depict the classes “neutral”, “talk”, “smile” and “sing”. Low-intensity expressions exhibited by elderly patients impede correct classification in some cases.

5.3 Observations and Future Work

In this section we summarize the main findings of this research.

  • Based on our experiments with iDT and the DNN architectures, we observe that DNNs contribute substantially to obtaining very promising classification rates, despite the small size of our dataset. We note that while the presented results significantly outperform previous methods based on handcrafted features (i.e., [3]), fusing handcrafted features (e.g., iDT) with DNN-based approaches consistently increases accuracy further.

  • The methods adapted from action recognition generalize well to the classification of facial dynamics. We observe this through the good classification rates, as well as through the facts that (a) the temporal ConvNet performs better than the spatial ConvNet (cf. [8, 30]), (b) fusion with iDT consistently improves the performance of the DNNs (cf. [8, 42]), and (c) the Two-Stream ConvNets outperform C3D (cf. [8]).

  • Due to the limited size of our dataset, an end-to-end training from scratch of a DNN architecture is not feasible. Large action recognition datasets such as the UCF101 human action dataset offer suitable training alternatives for DNN-architectures.

  • The accuracy of facial dynamics classification depends on the features, as well as architecture used for assessment. Given the myriad of existing algorithms, exploring different types of features and architectures will be necessary to devise a robust solution.

  • The high inter- and intra-subject variance, as well as the variance of facial dynamics, contribute to the remaining error rates. Further challenges include the low intensity of facial dynamics exhibited by elderly patients, the unconstrained setting, allowing facial dynamics to occur jointly, as well as ambiguous human annotation.

However, more work is necessary in this regard. Future work will involve fine-tuning of the involved methods. In addition, we intend to explore personalized facial dynamics assessment, where we will train algorithms on video sequences of each patient individually. Finally, we will design specific neural network models for facial dynamics assessment, placing emphasis on a single end-to-end model incorporating face detection, facial feature extraction, as well as classification.

6 Conclusions

In this work we have compared three methods for the assessment of facial dynamics exhibited by AD patients in mnemotherapy. The three tested methods, Improved Dense Trajectories, 3D ConvNets and Two-Stream ConvNets, were adapted from action recognition. Despite being pre-trained on an action recognition dataset, the methods have generalized very well to facial dynamics. Experiments conducted on an assembled dataset of Alzheimer’s disease patients have resulted in a classification rate of up to 79.5% for the fusion of Two-Stream ConvNets (ResNet-152) and iDT.