
4.1 Background

The first study on kinship verification from facial images was conducted in [7]. In that work, the authors extracted local features, such as skin color, gray value, histogram of gradient, and facial structure information, from facial images and selected a subset of them for kinship verification. Since this seminal work, a growing number of kinship verification methods have been proposed in the literature [4, 7, 8, 11, 13, 16, 18, 20, 22, 23, 24, 25]. These methods can be mainly categorized into two classes: feature-based [4, 7, 8, 24, 25] and model-based [9, 13, 20, 22]. Methods in the first class extract discriminative feature descriptors to represent kin-related information. Representative features include skin color [7], histogram of gradient [7, 17, 24], Gabor wavelet [5, 17, 20, 25], gradient orientation pyramid [25], local binary pattern [13], scale-invariant feature transform [13, 17, 22], salient part [8, 19], self-similarity [11], and dynamic features combined with spatiotemporal appearance descriptors [4]. Methods in the second class learn discriminative models to verify kin relationships from face pairs. Typical models include subspace learning [20], metric learning [13, 22], transfer learning [20], multiple kernel learning [25], and graph-based fusion [9].

Most existing kinship verification methods determine human kin relationships from still face images. Due to the large variations of human faces, a single still image may not be discriminative enough to verify a kin relationship. Compared with a single image, a face video provides more information to describe the appearance of a human face: it can capture the face of the person of interest under different poses, expressions, and illuminations. Moreover, face videos can be easily captured in real applications, because surveillance cameras are widely installed in public areas. Hence, it is desirable to employ face videos to determine the kin relations of persons. However, it is also challenging to exploit the discriminative information of face videos, because intra-class variations are usually larger within a face video than within a single still image.

In this chapter, we investigate the problem of video-based kinship verification via human face analysis. Specifically, we make two contributions to video-based kinship verification. On one hand, we present a new video face dataset called Kinship Face Videos in the Wild (KFVW), which was captured under wild conditions for the study of video-based kinship verification, together with a standard benchmark. On the other hand, we employ our benchmark to evaluate and compare the performance of several state-of-the-art metric learning-based kinship verification methods. Experimental results are presented to demonstrate the efficacy of our proposed dataset and the effectiveness of existing metric learning methods for video-based kinship verification. Finally, we also test the human ability to verify kinship from facial videos, and the experimental results show that metric learning-based computational methods are not yet as good as human observers.

Table 4.1 Comparison of existing facial datasets for kinship verification.
Fig. 4.1

© Reprinted from Ref. [21], with permission from Elsevier

Sampled video frames of our KFVW dataset. Each row lists three face images of a video. From top to bottom are Father–Son (F–S), Father–Daughter (F–D), Mother–Son (M–S) and Mother–Daughter (M–D) kin relationships, respectively.

4.2 Data Sets

In the past few years, several facial datasets have been released to advance research on kinship verification, e.g., CornellKin [7], UB KinFace [20], IIITD Kinship [11], Family101 [6], KinFaceW-I [13], and KinFaceW-II [13]. Table 4.1 provides a summary of existing facial datasets for kinship verification. However, these datasets consist only of still face images, in which each subject usually has a single face image. Due to the large variations of human faces, a single still image may not be discriminative enough to verify a kin relationship. To address these shortcomings, we collected a new video face dataset called Kinship Face Videos in the Wild (KFVW) for the study of video-based kinship verification. Compared with a still image, a face video provides more information to describe the appearance of a human face, because it can easily capture the face of the person of interest under different poses, expressions, and illuminations.

The KFVW dataset was collected from TV shows on the Web. We collected 418 pairs of face videos in total, and each video contains about 100–500 frames with large variations in pose, lighting, background, occlusion, expression, makeup, age, etc. The average size of a video frame is about \(900 \times 500\) pixels. There are four kinship relation types in the KFVW dataset: Father–Son (F–S), Father–Daughter (F–D), Mother–Son (M–S), and Mother–Daughter (M–D), with 107, 101, 100, and 110 pairs of kinship face videos, respectively. Figure 4.1 shows several examples from our KFVW dataset for each kinship relation. We can see that the KFVW dataset depicts the faces of the persons of interest under different poses, expressions, backgrounds, and illuminations, so that it provides rich information to describe the appearance of a human face.

4.3 Evaluation

In this section, we evaluated several state-of-the-art metric learning methods for video-based kinship verification on the KFVW dataset, and provided some baseline results on this dataset.

Fig. 4.2

© Reprinted from Ref. [21], with permission from Elsevier

Cropped face images of our KFVW dataset. Each row lists three face images of a video. From top to bottom are Father–Son (F–S), Father–Daughter (F–D), Mother–Son (M–S) and Mother–Daughter (M–D) kin relationships, respectively.

4.3.1 Experimental Settings

For each video, we first detected the face region of interest in each frame and then resized and cropped each face region to \(64 \times 64\) pixels. Figure 4.2 shows the detected faces of several videos. In our experiments, if a video contains more than 100 frames, we randomly selected 100 frames from it. All cropped face images were converted to grayscale, and we extracted local binary patterns (LBP) [1] from these images. Specifically, we divided each cropped face image of a video into \(8 \times 8\) non-overlapping blocks, where the size of each block is \(8 \times 8\) pixels, and then extracted a 59-bin uniform-pattern LBP histogram for each block and concatenated the histograms of all blocks to form a 3776-dimensional feature vector. To obtain the feature representation of a cropped face video, we averaged the feature vectors of all frames within the video to form a mean feature vector. Then, principal component analysis (PCA) was employed to reduce the dimensionality of each vector to 100 dimensions.
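The per-video LBP representation described above can be sketched in a few lines of NumPy. This is an illustrative reimplementation under our own helper names (`lbp_code_map`, `lbp_image`, `video_feature`), not the authors' original code:

```python
import numpy as np

def lbp_code_map():
    # Map each of the 256 raw 8-bit LBP codes to one of 59 bins: the 58
    # "uniform" patterns (at most two 0/1 transitions around the circle)
    # get individual bins, and all non-uniform codes share bin 58.
    table = np.full(256, 58, dtype=np.int64)
    nxt = 0
    for code in range(256):
        bits = [(code >> i) & 1 for i in range(8)]
        transitions = sum(bits[i] != bits[(i + 1) % 8] for i in range(8))
        if transitions <= 2:
            table[code] = nxt
            nxt += 1
    return table

def lbp_image(gray):
    # Raw 8-neighbor LBP codes for the interior pixels of a 2-D array.
    c = gray[1:-1, 1:-1]
    neighbors = [gray[:-2, :-2], gray[:-2, 1:-1], gray[:-2, 2:],
                 gray[1:-1, 2:], gray[2:, 2:], gray[2:, 1:-1],
                 gray[2:, :-2], gray[1:-1, :-2]]
    code = np.zeros(c.shape, dtype=np.int64)
    for bit, n in enumerate(neighbors):
        code |= (n >= c).astype(np.int64) << bit
    return code

def video_feature(frames, table, block=8):
    # frames: iterable of 64x64 grayscale arrays. Returns the mean of the
    # per-frame 3776-dim vectors (8x8 grid of blocks x 59-bin histograms).
    feats = []
    for f in frames:
        codes = table[lbp_image(np.asarray(f, dtype=np.float64))]
        codes = np.pad(codes, 1, mode='edge')  # back to 64x64 for the grid
        hists = [np.bincount(codes[r:r + block, c:c + block].ravel(),
                             minlength=59)
                 for r in range(0, 64, block) for c in range(0, 64, block)]
        feats.append(np.concatenate(hists).astype(np.float64))
    return np.mean(feats, axis=0)
```

For a \(64 \times 64\) frame, the interior LBP codes are padded back to \(64 \times 64\) so that the \(8 \times 8\) block grid yields \(64 \times 59 = 3776\) histogram bins, matching the dimensionality quoted above.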

Table 4.2 The EER (%) and AUC (%) of several metric learning methods using LBP feature on the KFVW dataset.

In this benchmark, we used all positive pairs for each kinship relation and generated the same number of negative pairs. A positive pair (or true pair) means that there is a kin relationship between a pair of face videos, whereas a negative pair (or false pair) means that there is none. Specifically, a negative pair consists of two videos: one was randomly selected from the parents' set, and the other was randomly selected from the children's set such that the child is not the true child of the selected parent. For each kinship relation, we randomly took 80% of the video pairs for model training and the remaining 20% for testing. We repeated this procedure 10 times and recorded the Receiver Operating Characteristic (ROC) curve for performance evaluation, from which two measures, the Equal Error Rate (EER) and the Area Under the ROC Curve (AUC), were computed to report the performance of the various metric learning methods for video-based kinship verification. Note that a smaller EER and a larger AUC indicate better performance.
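Given verification scores for the test pairs (e.g., negative distances, so that a larger score means "more likely kin"), the two measures can be computed as in the following minimal NumPy sketch; the helper names are our own, not the authors' evaluation code:

```python
import numpy as np

def roc_points(scores, labels):
    # Sweep the decision threshold over the scores, sorted in decreasing
    # order; labels: 1 = kin (positive) pair, 0 = non-kin (negative) pair.
    order = np.argsort(-np.asarray(scores, dtype=float))
    labels = np.asarray(labels)[order]
    tpr = np.cumsum(labels) / labels.sum()
    fpr = np.cumsum(1 - labels) / (1 - labels).sum()
    return np.concatenate([[0.0], fpr]), np.concatenate([[0.0], tpr])

def eer_and_auc(scores, labels):
    fpr, tpr = roc_points(scores, labels)
    auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)  # trapezoidal rule
    # EER: the operating point where the false-accept rate equals the
    # miss rate (1 - TPR).
    i = np.argmin(np.abs(fpr - (1.0 - tpr)))
    eer = (fpr[i] + 1.0 - tpr[i]) / 2
    return float(eer), float(auc)
```

A perfectly separating scorer gives EER = 0 and AUC = 1; random scoring gives roughly EER = 0.5 and AUC = 0.5.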

Fig. 4.3

© Reprinted from Ref. [21], with permission from Elsevier

ROC curves of several metric learning methods using the LBP feature on our KFVW dataset for the Father–Son kinship relation.

Fig. 4.4

© Reprinted from Ref. [21], with permission from Elsevier

ROC curves of several metric learning methods using the LBP feature on our KFVW dataset for the Father–Daughter kinship relation.

4.3.2 Results and Analysis

This subsection presents the results and analysis of the different methods on the KFVW dataset for video-based kinship verification.

Fig. 4.5

© Reprinted from Ref. [21], with permission from Elsevier

ROC curves of several metric learning methods using the LBP feature on our KFVW dataset for the Mother–Son kinship relation.

Fig. 4.6

© Reprinted from Ref. [21], with permission from Elsevier

ROC curves of several metric learning methods using the LBP feature on our KFVW dataset for the Mother–Daughter kinship relation.

4.3.2.1 Comparison of Different Metric Learning Methods

We first evaluated several metric learning methods using LBP features for video-based kinship verification and provide baseline results on the KFVW dataset. The baseline methods include Euclidean, ITML [3], SILD [10], KISSME [12], and CSML [14]. The Euclidean method computes the similarity/dissimilarity between a pair of face videos by the Euclidean distance in the original feature space. Each metric learning method first learns a distance metric from the training data and then employs the learned metric to calculate the distance between a pair of videos from the testing data. Table 4.2 shows the EER (%) and AUC (%) of these metric learning methods using the LBP feature on the KFVW dataset. From this table, we see that (1) CSML obtains the best performance in terms of the mean EER and mean AUC, and also achieves the best EER and AUC on the F–S and M–S subsets; (2) ITML shows the best performance on the M–D subset; (3) SILD obtains the best EER and AUC on the F–D subset; (4) all metric learning-based methods, i.e., ITML, SILD, KISSME, and CSML, outperform the Euclidean method in terms of EER and AUC; (5) most methods achieve their best performance on the F–S subset compared with the other three subsets; and (6) the best EER is still about 38.5%, which indicates that video-based kinship verification on the KFVW dataset is extremely challenging. Moreover, Figs. 4.3, 4.4, 4.5 and 4.6 plot the ROC curves of several metric learning methods using the LBP feature on the KFVW dataset for the four types of kinship relations.
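To make the model-based pipeline concrete, here is a minimal sketch in the spirit of KISSME [12]: a Mahalanobis-like metric is formed from the inverse covariances of positive- and negative-pair feature differences. This is a simplified NumPy illustration with our own function names, not the authors' implementation:

```python
import numpy as np

def kissme_metric(pos_diffs, neg_diffs, eps=1e-6):
    # pos_diffs / neg_diffs: (n, d) arrays of feature differences x - y
    # over positive (kin) and negative (non-kin) training pairs.
    d = pos_diffs.shape[1]
    cov_pos = pos_diffs.T @ pos_diffs / len(pos_diffs) + eps * np.eye(d)
    cov_neg = neg_diffs.T @ neg_diffs / len(neg_diffs) + eps * np.eye(d)
    M = np.linalg.inv(cov_pos) - np.linalg.inv(cov_neg)
    # Clip negative eigenvalues so M induces a valid (pseudo-)metric.
    w, V = np.linalg.eigh(M)
    return (V * np.clip(w, 0.0, None)) @ V.T

def learned_distance(M, x, y):
    # Squared distance under the learned metric M.
    diff = np.asarray(x) - np.asarray(y)
    return float(diff @ M @ diff)
```

A test pair is then classified as kin if its learned distance falls below a threshold chosen on the training split (e.g., at the EER operating point).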

Table 4.3 The EER (%) and AUC (%) of several metric learning methods using HOG feature on the KFVW dataset.

4.3.2.2 Comparison of Different Feature Descriptors

We also evaluated several state-of-the-art metric learning methods using different feature descriptors. To this end, we extracted the histogram of oriented gradients (HOG) [2] at two different scales for each cropped face image. Specifically, we first divided each image into \(16 \times 16\) non-overlapping blocks, where the size of each block is \(4 \times 4\) pixels. Then, we divided each image into \(8 \times 8\) non-overlapping blocks, where the size of each block is \(8 \times 8\) pixels. Subsequently, we extracted a 9-dimensional HOG feature for each block and concatenated the HOG features of all blocks to form a 2880-dimensional feature vector. Following the same procedure as for LBP, for each cropped face video we averaged the feature vectors of all frames within the video to yield a mean feature vector as the final feature representation. Then, PCA was employed to reduce the dimensionality of each vector to 100 dimensions.
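The two-scale HOG described above can be sketched as follows. This is a simplified, unnormalized variant in NumPy with our own helper names; the original HOG descriptor [2] additionally applies block normalization, which we omit for brevity:

```python
import numpy as np

def hog_blocks(gray, block, bins=9):
    # Simplified unsigned-gradient HOG: one 9-bin orientation histogram
    # per block, weighted by gradient magnitude (no block normalization).
    gy, gx = np.gradient(gray.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation
    idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    h, w = gray.shape
    hists = [np.bincount(idx[r:r + block, c:c + block].ravel(),
                         weights=mag[r:r + block, c:c + block].ravel(),
                         minlength=bins)
             for r in range(0, h, block) for c in range(0, w, block)]
    return np.concatenate(hists)

def two_scale_hog(gray):
    # 16x16 grid of 4x4-pixel blocks (2304 dims) plus an 8x8 grid of
    # 8x8-pixel blocks (576 dims) -> 2880-dimensional descriptor.
    return np.concatenate([hog_blocks(gray, 4), hog_blocks(gray, 8)])
```

The \(16 \times 16\) grid of \(4 \times 4\)-pixel blocks contributes \(256 \times 9 = 2304\) dimensions and the \(8 \times 8\) grid contributes \(64 \times 9 = 576\), giving the 2880-dimensional vector quoted above.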

Fig. 4.7

© Reprinted from Ref. [21], with permission from Elsevier

ROC curves of several metric learning methods using the HOG feature on our KFVW dataset for the Father–Son kinship relation.

Fig. 4.8

© Reprinted from Ref. [21], with permission from Elsevier

ROC curves of several metric learning methods using the HOG feature on our KFVW dataset for the Father–Daughter kinship relation.

Fig. 4.9

© Reprinted from Ref. [21], with permission from Elsevier

ROC curves of several metric learning methods using the HOG feature on our KFVW dataset for the Mother–Son kinship relation.

Fig. 4.10

© Reprinted from Ref. [21], with permission from Elsevier

ROC curves of several metric learning methods using the HOG feature on our KFVW dataset for the Mother–Daughter kinship relation.

Table 4.3 reports the EER (%) and AUC (%) of several metric learning methods using the HOG feature on the KFVW dataset, and Figs. 4.7, 4.8, 4.9 and 4.10 show the ROC curves of these methods using the HOG feature. From this table, we see that (1) SILD achieves the best performance in terms of the mean EER and mean AUC, and also obtains the best EER on the F–D and M–S subsets; and (2) KISSME obtains the best AUC on the F–D and M–S subsets. By comparing Tables 4.2 and 4.3, we see that metric learning methods using the LBP feature outperform the same methods using the HOG feature in terms of the mean EER and mean AUC. The reason may be that the LBP feature captures local texture characteristics of face images, which are more useful for video-based kinship verification than the gradient characteristics extracted by the HOG feature.

4.3.2.3 Parameter Analysis

We investigated how the dimension of the LBP feature affects the performance of these state-of-the-art metric learning methods. Figures 4.11, 4.12, 4.13 and 4.14 show the EER (%) and AUC (%) of the ITML, SILD, KISSME, and CSML methods versus the dimension of the LBP feature on the KFVW dataset for the four types of kin relationships, respectively. From these figures, we see that (1) the ITML and CSML methods show relatively stable AUC on the four subsets (i.e., F–S, F–D, M–S, and M–D) as the dimension of the LBP feature increases from 10 to 100; and (2) the SILD and KISSME methods achieve their best AUC at a dimension of 30, and their AUC then gradually decreases as the dimension increases from 30 to 100. Therefore, we reported the EER and AUC of these metric learning methods at a dimension of 30 on the four subsets for fair comparison.
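The dimension sweep described above amounts to fitting PCA once on the training features and truncating to each target dimension. A minimal sketch follows, with our own helper names and hypothetical variables `train_feats`/`test_feats` standing in for the LBP feature matrices:

```python
import numpy as np

def fit_pca(X):
    # X: (n, d) training feature matrix. Returns the mean and the
    # principal axes sorted by decreasing explained variance.
    mu = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, vt

def project(X, mu, vt, dim):
    # Keep only the leading `dim` principal components.
    return (X - mu) @ vt[:dim].T

# Hypothetical sweep over the LBP feature dimension, as in the experiments:
# mu, vt = fit_pca(train_feats)
# for dim in range(10, 101, 10):
#     train_low = project(train_feats, mu, vt, dim)
#     test_low = project(test_feats, mu, vt, dim)
#     ...train the metric learning method and record EER/AUC...
```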

4.3.2.4 Computational Cost

We conducted experiments on a standard Windows machine (Intel i5-3470 CPU @ 3.20 GHz, 32 GB RAM) with MATLAB code. Given a face video, detecting the face region of interest in a frame takes about 0.9 s, and extracting the LBP feature of a \(64 \times 64\) cropped face image takes about 0.02 s. In model training, the training times of the ITML, SILD, KISSME, and CSML methods are around 9.6, 0.6, 0.7, and 6.5 s per kin relationship, respectively. In testing, the matching time of these methods is about 0.02 s per pair of face videos (excluding the time for face detection and feature extraction).

Fig. 4.11

© Reprinted from Ref. [21], with permission from Elsevier

The EER (%) and AUC (%) of the ITML method using the LBP feature on the KFVW dataset.

Fig. 4.12

© Reprinted from Ref. [21], with permission from Elsevier

The EER (%) and AUC (%) of the SILD method using the LBP feature on the KFVW dataset.

Fig. 4.13

© Reprinted from Ref. [21], with permission from Elsevier

The EER (%) and AUC (%) of the KISSME method using the LBP feature on the KFVW dataset.

Fig. 4.14

© Reprinted from Ref. [21], with permission from Elsevier

The EER (%) and AUC (%) of the CSML method using the LBP feature on the KFVW dataset.

4.3.2.5 Human Observers for Kinship Verification

As another baseline, we also evaluated the human ability to verify kin relationships from face videos on the KFVW dataset. For each kinship relation, we randomly chose 20 positive pairs and 20 negative pairs of face videos, and displayed these video pairs to ten volunteers who decided whether or not a kin relationship exists. The volunteers consisted of five male and five female students aged 18 to 25 years, none of whom had received any training in verifying kin relationships from face videos. We designed two tests (i.e., Test A and Test B) to examine the human ability to verify kin relationships from face videos. In Test A, the cropped face videos were provided to the volunteers, who made their decisions based on the detected face regions of size \(64 \times 64\) pixels. In Test B, the original face videos were presented, so the volunteers could make their decisions by exploiting multiple cues in the whole images, e.g., skin color, hair, race, background, etc. Table 4.4 lists the mean verification accuracy (%) of human observers on video-based kinship verification for the different types of kin relationships on the KFVW dataset. We see that Test B yields better performance than Test A on all four kinship relations. The reason is that Test B allows exploiting additional cues, such as hair and background, to help make correct kinship decisions. From this table, we also observe that human observers achieve higher verification accuracy than the metric learning-based methods on the KFVW dataset.

From the experimental results shown in Tables 4.2, 4.3 and 4.4 and Figs. 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 4.10, 4.11, 4.12, 4.13 and 4.14, we make the following observations:

  • State-of-the-art metric learning methods outperform the predefined metric-based method (i.e., Euclidean distance) for video-based kinship verification. The reason is that a metric learning method can learn a distance metric from the training data to increase the similarity of positive pairs and decrease the similarity of negative pairs in the learned metric space.

  • The LBP feature achieves better performance than the HOG feature for video-based kinship verification. The reason may be that the LBP feature encodes local texture characteristics of face images, which are more useful for video-based kinship verification than the gradient characteristics extracted by the HOG feature.

  • Metric learning methods and human observers achieve poorer performance on the F–D subset than on the other three subsets, which shows that kinship verification on the F–D subset is a more challenging task.

Table 4.4 The mean verification accuracy (%) of human observers on video-based kinship verification on the KFVW dataset for four types of kin relationships.