1 Introduction

With advances in computer vision and graphics, it has become possible to generate images and videos with realistic synthetic faces. Companies like Google, Baidu, Nvidia, Adobe and startups such as Voicecey have recently funded efforts to fabricate audio or video. These companies have released do-it-yourself software, and open source tools such as DeepFake are available on GitHub. Methods currently in use can generate manipulated videos in real time (Face2Face), synthesize video from audio input, or artificially animate static images. New technologies allow users to edit facial expressions. These capabilities have gained considerable attention in the context of fake-news discussions, and the results are raising concerns that face-swap technology can be used to spread misleading information.

According to a recent publication [1], “Right now, there is no tool that works all the time,” says Mikel Rodriguez, a researcher at MITRE Corp. An overview of face image synthesis approaches using deep learning techniques is presented in [2]. Most of the techniques used to swap faces generate a face image or a 3D face mask as output. For instance, it was virtually impossible to distinguish between the real Paul Walker and the computer-generated one in the film “The Fast and the Furious 7”. The death of the actor during filming led the director to use previously recorded digital 3D scan data to reconstruct Mr. Walker’s face for the unfinished scenes. Another example is Pro Evolution Soccer, a video game developed and published by Konami. Since the 2012 version, the images of the soccer players have been rendered so realistically that they look almost like real people [3].

In June 2017, NVIDIA created a generative adversarial network (GAN) that used the CelebA-HQ database of photos of famous people to generate images of people who do not actually exist. In 2018, NVIDIA proposed a new GAN that increases the variation in generated images [4]. GANs are also used to apply face aging, to generate new viewpoints, or to alter face attributes such as skin color [5].

Virtual worlds have been used constructively for the benefit of society. However, they also raise safety and security concerns, e.g., cyberterrorism, child pornography, and economic crimes such as money laundering.

Fabricated images and videos can be used to harm people or to gain political and/or economic advantage. For example, fake images or videos about aliens, disasters, statesmen, or businessmen can create confusion or change people’s opinions. Social media platforms such as Facebook, Twitter, Flickr, or YouTube are ideal environments for disseminating these fake images and videos widely.

To combat this threat, CG manipulation-detection software will need to become more sophisticated and useful in the future. This technology, along with robust training and clear guidelines about what is acceptable, will enable media organizations to hold the line against willful image manipulation, thus maintaining their credibility and reputation as purveyors of the truth. The challenges in creating such technologies are:

  1. The algorithms used to fabricate images and video are based on convolutional neural networks, which are widely used in object recognition. In deep learning approaches, features are learned automatically from training samples rather than being designed manually. However, deep-learning-based approaches mostly rely on supervised learning. Although they are promising, they are not yet mature in digital image forensics; a considerable amount of work remains to be done in this area.

  2. Lack of dataset sharing, maintenance, and availability. As we move toward a world where everything is connected (the Internet of Things, IoT), there is a need to collect data from streaming devices such as Roku or Apple TV and from unmanned aerial vehicles (UAVs), as well as variations of computer-generated images produced by new deep learning architectures.

We propose to investigate techniques based on perceptual judgments to detect image and video manipulation produced by deep learning architectures. The main objectives of this study are:

  • To develop techniques that distinguish between computer-generated and photographic faces based on facial expression analysis. The hypothesis is that facial emotions expressed by humans differ from facial expressions generated for fake faces. Humans can produce a large variety of facial expressions with a wide range of intensities.

  • To develop an entropy-based technique for forgery detection in CG human faces. The hypothesis is that natural images have special properties that differ from those of other types of images.

2 Related Work

Given the need for automated real-time verification of digital image and video content, researchers have presented several techniques. There are two major categories of digital image manipulation detection approaches: active approaches and passive approaches. Active approaches involve deriving various kinds of watermarks or fingerprints from the image content and embedding them in the digital image [6]. With the rising number of images shared on social networks, it is impractical to require all digital images on the internet to be watermarked before distribution. Therefore, passive forensics approaches have become a more popular choice.

Passive approaches detect changes in a digital image by analyzing specific inherent clues or patterns that arise during the modification stage. Passive approaches do not rely on any prior or preset information, so they have broader application in image forensics. These techniques have been applied successfully for tracking true and false news. In [7], the traces are classified into three groups: traces left in image acquisition, traces left in image storage, and traces left in image editing. Recently, a new category has become popular: images generated by computer graphics software.

For forensics specifically on faces, some methods have been proposed to distinguish computer-generated faces from natural ones [8] and to detect face retouching [9]. In biometrics, two pre-trained deep CNNs, VGG19 and AlexNet, are proposed to detect morphed faces [10]. In [11], the authors proposed detecting two different face swapping manipulations using a two-stream network: one stream detects low-level inconsistencies between image patches, while the other stream explicitly detects tampered faces.

Researchers from the Technical University of Munich have developed a deep learning algorithm, called XceptionNet, that can potentially identify forged face-swap videos on the internet. They trained the algorithm to detect face swaps using a large set of face swaps that they made themselves, creating the largest database of these kinds of images available [12].

In [13], different algorithms to detect and classify original and manipulated video are presented. This is a difficult task for humans and computers alike, especially when the videos are compressed and have low resolution, as often happens on social media. The authors also present a large-scale video dataset called “FaceForensics”.

Some forgery detection methods rely on statistical features, building on methods that use natural image statistics. Natural images have special properties that differ from those of other types of images [14]. In [15], CG faces and real faces are discriminated by analyzing the variation of facial expressions in a video through sets of feature points.

3 Facial Emotion

In the experiments, the FaceXpress software proposed in [16] is used to recognize the emotion in each frame of the FaceForensics dataset. The software starts by detecting the face using the Viola-Jones detector, followed by locating 116 facial landmarks using a multi-resolution Active Shape Model (ASM) tracker [17]. FaceXpress detects the facial triangulation points using the ASM tracker (see Fig. 1). Attributes are obtained by measuring the distances between the detected facial triangulation points. Some attributes, such as mouth width, mouth height, and the distance from the midpoint of the eye gap to the eyebrow midpoints, are obtained using the Mahalanobis distance. Other attributes, such as the domain of the vertical edge in the forehead and the domain of the horizontal edge in the mid forehead, are obtained by filtering with a Gaussian kernel. The locations of the tracking points are used to compute the changes in facial regions such as eyebrow wrinkles, forehead wrinkles, cheek wrinkles, the distance from eye to eyebrows, and the vertical and horizontal measures of the mouth [18] (see Fig. 2). Finally, a support vector machine (SVM) is used to classify the detected facial emotion as one of the seven universal emotions: surprise, anger, happiness, sadness, fear, disgust, and neutral.

Fig. 1. Facial landmarks detection using ASM tracker

Fig. 2. Facial attributes that are used to detect the regions of interest
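As a rough illustration of the distance-based attributes described in this section, the sketch below computes a Euclidean and a Mahalanobis distance between two 2-D landmark points. It is not the FaceXpress implementation: the function name `mahalanobis_2d`, the landmark coordinates, and the covariance matrices are hypothetical values chosen only to make the example self-contained.

```python
import math


def mahalanobis_2d(p, q, cov):
    """Mahalanobis distance between 2-D points p and q.

    cov is a 2x2 covariance matrix given as ((a, b), (c, d)).
    Computes sqrt((p - q)^T * inverse(cov) * (p - q)).
    """
    dx, dy = p[0] - q[0], p[1] - q[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    if a <= 0 or det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    # The inverse of a 2x2 matrix is (1/det) * [[d, -b], [-c, a]].
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(quad)


# Hypothetical landmark coordinates (e.g. the two mouth corners), in pixels.
left_mouth, right_mouth = (0.0, 0.0), (3.0, 4.0)

# Hypothetical covariance matrices of the landmark coordinates.
identity = ((1.0, 0.0), (0.0, 1.0))
stretched = ((4.0, 0.0), (0.0, 1.0))  # larger variance along x

width_euclid = math.dist(left_mouth, right_mouth)
width_identity = mahalanobis_2d(left_mouth, right_mouth, identity)
width_scaled = mahalanobis_2d(left_mouth, right_mouth, stretched)
```

With the identity covariance, the Mahalanobis distance coincides with the Euclidean distance; with a larger variance along one axis, displacements along that axis count for less. In practice the covariance would be estimated from landmark positions over many frames, which is an assumption here rather than something stated in the text.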

4 Entropy Based Histogram

Histogram processing alters an image by modifying its histogram. To flatten the histogram of an image, a normalization process is performed on both original and altered images from the FaceForensics dataset (see Sect. 5.1). This process is called contrast enhancement; the intensity transformation function is derived from histogram information, which is also used in tasks such as compression, description, and segmentation. For an image with intensity levels in the range [0, L−1], the histogram is the discrete function:

$$ h(r_{k} ) \, = n_{k} $$
(1)
  • \( r_{k} \) is the kth intensity value,

  • \( n_{k} \) is the number of pixels in the image with intensity \( r_{k} \),

  • \( h(r_{k} ) \) is the histogram of the digital image at gray level \( r_{k} \).

For an M × N image, the histogram is normalized by dividing each component by the total number of pixels, MN. The normalized value \( p(r_{k} ) \) is an estimate of the probability of occurrence of intensity level \( r_{k} \) in the image:

$$ p(r_{k} ) = \frac{{n_{k} }}{MN},\quad k = 0,1,2, \ldots ,L - 1 $$
(2)

The sum of all components of a normalized histogram is equal to 1 [19]. The histograms of the same frames in the original and altered videos are different (see Fig. 3).

Fig. 3. Results of applying image histogram on both original and altered frames
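The normalized histogram of Eqs. (1) and (2) can be sketched in a few lines. This is a minimal illustration, assuming a grayscale image stored as a 2-D list of integer intensity levels; the function name `normalized_histogram` and the toy image are hypothetical.

```python
from collections import Counter


def normalized_histogram(pixels, levels=256):
    """Return [p(r_0), ..., p(r_{L-1})] with p(r_k) = n_k / (M * N).

    pixels is a 2-D list (M rows, N columns) of integer gray levels
    in the range [0, levels - 1].
    """
    flat = [value for row in pixels for value in row]
    total = len(flat)  # M * N
    if total == 0:
        raise ValueError("image must contain at least one pixel")
    if any(value < 0 or value >= levels for value in flat):
        raise ValueError("intensity outside the range [0, levels - 1]")
    counts = Counter(flat)  # n_k for every observed r_k
    return [counts.get(k, 0) / total for k in range(levels)]


# Toy 2 x 4 image with L = 4 gray levels (made-up data).
image = [[0, 1, 1, 3],
         [3, 3, 1, 0]]
p = normalized_histogram(image, levels=4)
```

For a real frame, the same computation would be applied to the gray-level values of the decoded image; how the frames are decoded and converted to gray levels is not specified in the text.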

Histograms of original frames have a heavy-tailed distribution. For altered frames, the histograms are sharper because of the small values of the image’s vertical and horizontal edges. In addition, the mean information, or entropy, of an image is determined from its histogram. The entropy of an image serves as a measure for its automatic focusing. For any random variable X with probability density function f(x), the entropy is defined as:

$$ H(X) = - E[log\;f(X)] = - \int {f(x)log\;f(x)dx} $$
(3)

The range of the variable is divided into n intervals \( (l_{k}, u_{k}) \), \( k = 1, 2, \ldots, n \). When the density is represented as a histogram, the entropy definition above can be written as a sum over these intervals:

$$ H(X) = - \mathop \sum \limits_{k = 1}^{n} \int_{{l_{k} }}^{{u_{k} }} {f(x)log \;f(x)dx} $$
(4)

The kth term of the above summation corresponds to the kth bin of the histogram, whose width is:

$$ w_{k} = u_{k} - l_{k} $$
(5)

The bin probabilities \( p_{k} \), \( k = 1, 2, \ldots, n \), are defined as:

$$ p_{k} = \int_{{l_{k} }}^{{u_{k} }} {f(x)dx} $$
(6)

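Each bin probability can be approximated by \( w_{k} f(x_{k}) \), the area of a rectangle of width \( w_{k} \) and height \( f(x_{k}) \), where \( x_{k} \) is a representative value in the interval \( (l_{k}, u_{k}) \).

Similarly, the kth integral in Eq. (4) can be approximated by \( w_{k} f(x_{k}) \log f(x_{k}) \). Since \( f(x_{k}) \approx p_{k} / w_{k} \), the entropy can be rewritten in terms of the bin probabilities as: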

$$ H(X) = - \mathop \sum \limits_{k = 1}^{n} p_{k } log (p_{k} / w_{k} ) $$
(7)

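If \( w_{k} = 1 \), the above expression is the one given for a discrete distribution by Harris [20] and for a histogram by Rich and Tracy [21]. When \( w_{k} \) is constant (\( w_{k} = w \)) and not equal to 1, substituting into Eq. (7) and using the fact that the bin probabilities sum to 1 gives the form we used: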

$$ H(X) = - \mathop \sum \limits_{k = 1}^{n} p_{k } \;log\;p_{k} + log\;w $$
(8)
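The histogram entropy of Eqs. (7) and (8) can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used in the experiments: the natural logarithm is assumed (the text does not fix the base), empty bins are skipped using the convention 0 log 0 = 0, and the function names are hypothetical.

```python
import math


def entropy_general(probs, widths):
    """Eq. (7): H = -sum_k p_k * log(p_k / w_k), one width per bin."""
    if len(probs) != len(widths):
        raise ValueError("probs and widths must have the same length")
    return -sum(p * math.log(p / w) for p, w in zip(probs, widths) if p > 0)


def entropy_constant_width(probs, width=1.0):
    """Eq. (8): H = -sum_k p_k * log(p_k) + log(w), constant bin width w."""
    if width <= 0:
        raise ValueError("bin width must be positive")
    return -sum(p * math.log(p) for p in probs if p > 0) + math.log(width)


# Made-up bin probabilities: a flat histogram and a sharply peaked one.
flat = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]

h_flat = entropy_constant_width(flat)      # equals log(4) for unit-width bins
h_peaked = entropy_constant_width(peaked)  # lower than h_flat
h_wide = entropy_constant_width(flat, width=2.0)
```

For a constant width, both functions agree, which is a quick check that Eq. (8) follows from Eq. (7). A sharper (more peaked) histogram yields a lower entropy than a flat one.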

5 Results

5.1 FaceForensics Video Dataset

In this paper, we used the FaceForensics video dataset [13], which consists of about 500,000 face frames from around 1004 videos collected from YouTube. The videos have been manipulated using a state-of-the-art face editing approach, and the dataset is intended for forensic tasks including classification and segmentation. The original Face2Face reenactment approach is used, in which the mouth interior is selected from a mouth database depending on the target expression.

5.2 Facial Emotion

The FaceXpress software produces a CSV file containing the recognized emotion for each frame of the original and altered videos. To evaluate the differences in emotion between the original and altered videos from the FaceForensics dataset, we compute the mean square error (MSE) [22] between the emotions stored in the two CSV files: for each frame, the difference in emotion between the original and the altered video is squared, and the squared differences are averaged over all frames. Table 1 shows the MSE results for some videos in the FaceForensics dataset; the differences are clearly noticeable. Applying the FaceXpress software to some FaceForensics videos shows a clear difference between the emotions expressed in the original and altered videos (see Fig. 4).

Table 1. Results of the MSE calculation

Fig. 4. Result of applying FaceXpress on some FaceForensics videos
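The per-frame MSE comparison described in this subsection can be sketched as below. The CSV layout (columns `frame` and `emotion`) and the integer code assigned to each emotion are assumptions made for illustration; the text does not specify the FaceXpress output format or how emotions are converted to numbers.

```python
import csv
import io

# Hypothetical integer encoding of the seven universal emotions.
EMOTIONS = ["neutral", "surprise", "anger", "happiness",
            "sadness", "fear", "disgust"]
CODE = {name: index for index, name in enumerate(EMOTIONS)}


def read_emotion_codes(csv_text):
    """Parse CSV text with an 'emotion' column into a list of integer codes."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [CODE[row["emotion"]] for row in reader]


def emotion_mse(original, altered):
    """Mean of the squared per-frame differences between two code sequences."""
    if not original or len(original) != len(altered):
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(a - b) ** 2 for a, b in zip(original, altered)]
    return sum(squared) / len(squared)


# Made-up per-frame outputs for one original and one altered video.
original_csv = "frame,emotion\n0,neutral\n1,happiness\n2,happiness\n3,neutral\n"
altered_csv = "frame,emotion\n0,neutral\n1,neutral\n2,sadness\n3,neutral\n"

mse = emotion_mse(read_emotion_codes(original_csv),
                  read_emotion_codes(altered_csv))
```

Because emotion labels are categorical, the resulting value depends on the chosen numeric encoding; identical videos always give an MSE of 0, and any frame whose label changes increases it.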

5.3 Entropy Based Histogram

Another measure for evaluating image quality is the entropy of the original and altered videos. We applied the entropy formula in Eq. (8) to the same frames of both the original and altered videos. Table 2 shows that the entropy values of the altered frames are lower than those of the corresponding original frames.

Table 2. Results of computing Entropy value for three selected frames

6 Conclusion

In this paper, we applied two different methods to assess the quality of, and the differences in emotion between, the original and altered videos in the FaceForensics dataset. In the first method, we used the FaceXpress software to recognize the emotions in the videos for comparison. The differences in emotion between the original and altered videos were calculated using the MSE. The MSE values show that the differences in emotion between the original and altered videos are clear and noticeable. In the second method, we computed entropy values from the frames’ histograms to test the quality of the videos. The results of the second method show that the entropy values of the altered videos are lower than those of the original videos. Histograms of original frames have a heavy-tailed distribution, while the histograms of altered frames are sharper because of the small values of the image’s vertical and horizontal edges.