A Machine Learning Based Approach for Deepfake Detection in Social Media Through Key Video Frame Extraction


In the last few years, with the advent of deepfake videos, image forgery has become a serious threat. In a deepfake video, a person’s face, emotion or speech is replaced by someone else’s face, different emotion or speech, using deep learning technology. These videos are often so sophisticated that traces of manipulation are difficult to detect. They can have a heavy social, political and emotional impact on individuals, as well as on society. Social media are the most common and serious targets, as they are vulnerable platforms that can be exploited to blackmail or defame a person. There are some existing works for detecting deepfake videos, but very few attempts have been made for videos in social media. The first step to preempt such misleading deepfake videos in social media is to detect them. Our paper presents a novel neural network-based method to detect fake videos. We applied a key video frame extraction technique to reduce the computation in detecting deepfake videos. A model, consisting of a convolutional neural network (CNN) and a classifier network, is proposed along with the algorithm. The Xception net has been chosen over two other structures, InceptionV3 and ResNet50, for pairing with our classifier. Our model is a visual artifact-based detection technique. The feature vectors from the CNN module are used as the input of the subsequent classifier network for classifying the video. We used the FaceForensics++ and Deepfake Detection Challenge datasets to reach the best model. Our model detects highly compressed deepfake videos in social media with very high accuracy and lowered computational requirements. We achieved 98.5% accuracy with the FaceForensics++ dataset and 92.33% accuracy with a combined dataset of FaceForensics++ and Deepfake Detection Challenge. Any autoencoder-generated video can be detected by our model. Our method detects almost all fake videos that possess more than one key video frame.
The accuracy reported here is for detecting fake videos when the number of key video frames is one. The simplicity of the method will help people check the authenticity of a video. Our work focuses on, but is not limited to, addressing the social and economic issues caused by fake videos in social media. In this paper, we achieve high accuracy without training the model with an enormous amount of data. The key video frame extraction method reduces the computations significantly compared to existing works.


Image and video forgery are posing a threat to society in today’s world. People can artificially create any audio or video clip. Artificial intelligence, mainly machine learning, manipulates images and videos in such a way that they are often visually indistinguishable from real ones [1,2,3]. There are some prevalent techniques which are widely used to manipulate images/videos. Some are computer graphics based (e.g. Photoshop, GIMP, and Canva) and the rest are content changing. Deepfake, a deep learning-based method, is a serious contender among the content-changing video falsification techniques. The term “deepfake” originates from the words “deep learning” and “fake”. The use of deep neural networks (DNNs) has made the process of creating convincing fake images and videos increasingly easier and faster. It is a technique in which a video or image of a person is manipulated with the image of another person using deep learning methods [4, 5].

In today’s life, social networks/media play a significant role. They can affect someone’s mental health [6] and social status, though it shouldn’t be that way [7, 8]. Mobile or mini cameras let people take pictures or videos anywhere, anytime. Commercial photo editing tools [9, 10] allow anyone to create fake images/videos. So, amid multimedia forgery, we need some countermeasures to protect our privacy and identity, especially in social media, where a person is vulnerable [11]. In social media, when images or videos are uploaded, they get compressed and resized. So the techniques applicable to uncompressed videos might not work for highly compressed videos. In this paper, we propose a novel method to detect compressed deepfake videos in social media.

The rest of this paper is organized as follows: “Deepfake is a social and economic issue” presents the motivation of our work. “Novel contributions of the current paper” focuses on the novel contributions of this paper. “Related prior works” reviews related works in this field. Our detailed work for deepfake detection is described in “The proposed novel method for deepfake detection”. “The proposed method” presents the theoretical perspective. “Experimental validation” discusses experiments and results. “Conclusions and future works” states the conclusions of this paper with some discussion of directions for future work.

Deepfake is a Social and Economic Issue

Our face is our identity. People remember someone by their face. So, when image and video forgery come into play, face manipulation becomes the most targeted kind. In the last two decades, face forgery in multimedia has increased enormously. Among the reported works, an image-based approach was used by Bregler et al. [12] in 1997 to generate a video. Face replacement of an actor without changing the expression was presented by Garrido et al. [13]. Real-time expression transfer by Thies et al. [14] in 2015 is also important. The work of Suwajanakorn et al. [15] on lip syncing helps people understand how serious video forgery is.

Recent advances in deep learning changed the whole scenario of multimedia forgery. In 2017, a Reddit user named Deepfake created some fake videos using deep learning networks. The use of convolutional autoencoders [16] and generative adversarial networks (GANs) [17] made manipulated images/videos so sophisticated that the synthesized videos are often visually indistinguishable from real ones. Multimedia forgery is now rampant. Today, smartphone applications to manipulate images are easily available to anybody. Some of these applications are the following: FaceApp, AgingBooth, Meitu, MSQRD, Reflect-Face Swap, and Face Swap Live.

The ability to distort reality has exceeded acceptable limits with deepfake technology. This disruptive technological change affects the truth. Many deepfakes are intended to be funny, but others are not. They could be a threat to national security, democracy, and an individual’s identity [18, 19]. A deepfake video can defame a person and invade their privacy [20, 21]. People have started to lose faith in the news or images/videos brought to them by the media. This can create political tension or violence. It can ruin any politician’s career or a teenager’s dream. The corporate world is interested in protecting its businesses from fraud or stock manipulation [22]. But deepfake technology has a dual nature and can also be used positively. A hearing-impaired person who cannot follow a telephone conversation can converse on a phone or smartphone with the help of an app which generates lip movements according to the audio. It can also help in making realistic multilingual movies. Unfortunately, negative uses are prevailing.

To stop catastrophic consequences of deepfake videos, Facebook, Microsoft, AWS, the Partnership on AI, and some academic institutions came together to organize the Deepfake Detection Challenge and started building the Deepfake Detection Challenge (DFDC) dataset for research purposes. Google, in collaboration with Jigsaw, contributed another dataset to the FaceForensics++ (FF++) benchmark for deepfake video detection. Figure 1 shows a deepfake video frame from Facebook.

Fig. 1

Deepfakes created by Facebook to fight against a disinformation disaster (source: Facebook)

Novel Contributions of the Current Paper

In this paper we propose a novel technique for detecting deepfake videos in social media using a classifier with lower computational requirements. Eventually it can be applied as an edge fake-video detection tool. Our classifier network applies mainly to autoencoder-generated videos.

First we tried to detect each frame with our previously proposed Algorithm 1 [23]. Then, instead of detecting each and every frame, we applied our newly proposed Algorithm 2, in which a key video frame extraction technique is utilized. These two algorithms are discussed in detail in Sect. 7. As our detection method is based on finding changes of visual artifacts in a frame due to deepfake forgery, we assume that key frames contain all the artifacts. This simple assumption helps us to propose an algorithm with good accuracy and much lower computational cost. Lower computational cost means that the algorithm can be deployed at the edge, since the limited memory of a smartphone will not be a barrier.

The proposed network consists of two modules: (1) a convolutional neural network (CNN) for frame feature extraction, and (2) a classifier network consisting of GlobalAveragePooling2D and fully connected layers. To choose the best CNN module we followed our previous work [23]. We experimented mainly with three networks: Xception [24], InceptionV3 [25], and ResNet50 [26], and finally chose the Xception network as the feature extractor. We chose only these three networks over other available CNN modules because of their smaller size. During the selection of the CNN module, we used only the FF++ dataset [27]. Then we applied the newly proposed technique to the chosen CNN module with a larger dataset consisting of the FF++ and DFDC (partial) datasets. A system-level overview of the network is shown in Fig. 2. Our model works on the proposed algorithm.

Fig. 2

System level overview of the proposed network

The Problem Addressed in the Current Paper

The problem addressed in this paper lies in the very nature of how a deepfake video is created. Deepfake videos are very sophisticated and are created using different deep learning techniques, namely autoencoders and generative adversarial networks (GANs). It is practically impossible for a human to distinguish between a real and a forged video when these are uploaded to social media.

Social media videos are highly compressed. Our goal is to apply our proposed technique, with its lower computational burden, to detect these deepfake videos, mostly generated by autoencoders, at any compression level in social media. Our main goal is to lower the computations as much as possible, so that in the future we can apply the algorithm at edge devices. People can then check the authenticity of videos with a limited-memory device, such as a smartphone. In our previous paper [23], we tested our model only on the FF++ dataset [27]. In this paper we extend our work by testing the network with an additional dataset, namely DFDC [28].

The Challenges in Solving the Problem

In social media, uploaded images/videos are highly compressed. People check their social media accounts from smartphones or tablets, along with computers. The existing solutions to detect deepfake images/videos are mostly for uncompressed data, and the models are not suitable for social media videos. As people check their social media accounts from smartphones, the model should be of smaller size too. The challenge was to solve three problems simultaneously: detecting deepfake videos, developing a model applicable to compressed video, and making a lighter version of it. In our previous paper [23] we tried to solve the first two problems. In this paper we address the third problem too by using a computer vision technique for lower computational effort.

The Solution Proposed in the Current Paper

To address the above-mentioned challenges, we propose a novel technique of applying a key video frame approach to our previously proposed neural network based model [23], which can detect deepfake videos in social media at any compression level. We limit the data by extracting only key video frames from each video. This reduces the number of frames to be checked for authenticity without compromising accuracy significantly. During detection, instead of checking each frame for authenticity, we check the visual artifact changes only for the key video frames. This reduces the computations. As the detection process involves fewer data computations, this approach is one step toward applying our network at the edge. We propose a lightweight approach compared to the highly computationally expensive existing works. Our main contributions are as follows:

  • An algorithm of lower complexity to detect deepfake videos.

  • Initially we used three different CNN networks, smaller in size, for feature extraction: (1) Xception, (2) InceptionV3, and (3) ResNet50. We compared them and finally selected the Xception network [24] as our CNN module. The detailed work has been discussed in our previous work [23]. In Xception, introduced by Chollet in 2016, depthwise separable convolution is used, which makes it accurate and computationally cheaper.

  • The feature vector from the CNN is used as the input of a GlobalAveragePooling2D layer with a dropout layer, followed by a fully connected (FC) layer with dropout, and lastly a softmax classification layer. This classifier is used to classify the video.

  • Our novelty here is to combine a well known method of computer vision to our simple classifier model for detecting deepfake videos. The existing works have very complex structures [29,30,31] for detection. Our main goal is to reduce the computational cost of detection without excessively sacrificing accuracy.

  • A novel technique of training without a very large training dataset is reported.

  • For training and testing, we primarily used the FF++ and DFDC datasets. We used the compressed deepfake and original videos of the former. These compressed videos have two different compression levels: one is low loss and the other is high loss. The datasets are a good representation of social media scenarios. To obtain a better generalized model, we added another dataset. But to limit the training time we used one-third of the DFDC dataset. We compressed the DFDC dataset at three compression levels with the H.264 video compressor. Finally we trained our network with this mixed dataset.

The Novelty of the Solution Proposed

Our goal is to obtain a model for detecting social media deepfake videos/images suitable for use at the edge. To achieve that goal we apply a computer vision technique and a simple classifier with the Xception network. The proposed algorithm reduces the computation during detection. By applying the key video frame extraction during data processing we reduced the number of frames significantly and still received a high accuracy. We tried to reduce the data to be processed, as at any edge device memory is a limiting factor.

Related Prior Works

Identifying manipulated and falsified content is technically demanding and challenging. In the past two decades, substantial work has been done in media forensics to detect image and video forgery. Most of the solutions proposed for video forensics are for simple manipulations, such as copy-move [32], dropped or duplicated frames [33], or varying interpolation [34]. But the use of autoencoders or GANs has made image/video forgery sophisticated. These computer-generated forged videos are hard to detect with previously existing detection techniques. Stacked autoencoders, CNNs, long short-term memory (LSTM) networks, and GANs have been explored in detection models to detect video manipulation. Some of the existing works are summarized in Table 1.

Table 1 A Comparative perspective with existing works on deepfake video detection

Among deep learning solutions, some are temporal feature based and some are based on visual artifacts. In visual-artifact based works, videos are processed frame-by-frame. Each frame contains different features which generate various inconsistencies in the manipulated region of an image. These features are first extracted and then used as input to a deep learning classifier, since CNN models can detect these artifacts. The classifiers are ResNet152 [26], VGG16 [41], InceptionV3 [25], DenseNet, etc. Certain works are associated with detection techniques based on eye blinking rate [37], noting the difference between the head pose [42] of an original video and a fake video, and detecting the artifacts of eyes, teeth and face [40]. The human blinking pattern has also been used in another recent paper [43]. A general capsule network based method has been proposed to detect manipulated images and videos [29]. A VGG19 [41] network has been used for latent feature extraction along with a capsule network to detect different spoofs, replay attacks, etc. Two inception modules along with two classic convolution layers followed by maxpooling layers have been explored in [38]. This approach works at a mesoscopic level. The audio and video parts of a video clip have been used to compute emotion embeddings to detect fake videos [44]. A comparative study among different approaches has been provided in [45], where the authors evaluated existing techniques.

Other than the visual artifact based works, there is another parallel type of work that is prevalent. These are based on temporal features of a video. A combined network of CNN and LSTM architectures has been explored [36]. The CNN module extracts features from the input image sequence. The feature vector is fed into the LSTM network. Here, the long short-term memory generates a sequence descriptor from the feature vector. Finally, a fully connected layer classifies the video as manipulated or real. Another combined network was used in a different paper to classify forged videos [35]. A DenseNet structure combined with a recurrent neural network (RNN) has been used. A blockchain based approach to detect forged videos has been proposed by Hasan and Salah [46]. Each video is linked to a smart contract and it has a hierarchical relation to its child video. A video is called pristine if the original smart contract is traced. Unique hashes help to store the data on the InterPlanetary File System (IPFS) peer-to-peer network. This model claims to be extendable to audio or images. In [47], the ownership of a video has been established by detecting fake video and distinguishing it from real video. Spatio-temporal features have also been used in detecting deepfake videos [48].

In another recent work [30], deepfake videos are detected using a Convolution-LSTM network. Visual counterfeits have been analyzed. A triplet architecture has been used in detecting deepfake videos at high compression levels [31]. Sharp multiple instance learning has lately been used in detecting partial face attacks in deepfake videos [49]. The performance of the detectors has been improved by clustering face and non-face images and removing the latter [50].

From Table 1, it is clear that not much work has been done for compressed videos, which are predominantly used in social media. The FF++ dataset [27] has a huge collection of deepfake videos at different compression levels. The success of XceptionNet on the FF++ dataset motivated us to find a better deep neural model with higher accuracy that is not limited to a particular dataset. We call a video manipulated when at least one frame of the video is forged. This simple assumption, along with our technique, makes our video classifier less computationally intensive.

Why are Deepfakes Hard to Detect?

There are two main ways to create deepfake videos: by autoencoders and by GANs. Both are deep learning methods. Inconsistencies in these videos are not recognizable by the naked eye and thus are hard to detect. Before going into the detection algorithm, we will discuss these two methods.

By autoencoders: the creation of a deepfake video consists of three steps: extraction, training, and creation. In the extraction process all frames are extracted from a video clip and faces are identified and aligned. An autoencoder is a combination of an encoder and a decoder. When an image is given to the encoder as input, a latent face, or base vector of lower dimension, is created. This vector is then fed into the decoder part of the autoencoder and the input image is reconstructed. The shape of the network, such as the number of layers and nodes, decides the quality of the picture. The description of the network is saved as weights. The training stage is shown in Fig. 3a. During training those weights are optimized. To make a deepfake video, two autoencoders are needed: one for the source face and one for the target face. During training, both share weights for the common features of the source and target faces, while there are two separate decoders for the two sets of images. As common features for both image sets are created by a shared encoder, the encoder learns common features by itself, which explains the name. After training is complete, a latent face from image A is passed to decoder B. As a result, decoder B tries to recreate image B from the relative information of image A. The creation of deepfake video frames is shown in Fig. 3b. This whole process is repeated for all frames to make a deepfake video.
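The shared-encoder, two-decoder arrangement described above can be sketched in Keras as follows. This is an illustrative sketch only, not the code of any actual deepfake tool; the layer sizes, the 256-dimensional latent vector, and the 64 × 64 face-crop size are assumptions made for the example.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

IMG = (64, 64, 3)  # assumed face-crop size

def make_encoder():
    inp = layers.Input(shape=IMG)
    x = layers.Conv2D(32, 5, strides=2, padding="same", activation="relu")(inp)
    x = layers.Conv2D(64, 5, strides=2, padding="same", activation="relu")(x)
    x = layers.Flatten()(x)
    latent = layers.Dense(256)(x)          # the "latent face" (base vector)
    return Model(inp, latent, name="shared_encoder")

def make_decoder(name):
    inp = layers.Input(shape=(256,))
    x = layers.Dense(16 * 16 * 64, activation="relu")(inp)
    x = layers.Reshape((16, 16, 64))(x)
    x = layers.Conv2DTranspose(32, 5, strides=2, padding="same", activation="relu")(x)
    out = layers.Conv2DTranspose(3, 5, strides=2, padding="same", activation="sigmoid")(x)
    return Model(inp, out, name=name)

encoder = make_encoder()                 # shared between the two pipelines
decoder_a = make_decoder("decoder_A")    # trained on faces of person A
decoder_b = make_decoder("decoder_B")    # trained on faces of person B

# Training pairs encoder+decoder_A on A's faces and encoder+decoder_B on
# B's faces; at creation time the swap is encoder -> decoder_B on A's frames:
face_a = layers.Input(shape=IMG)
swapped = decoder_b(encoder(face_a))     # B's face with A's pose/expression
swap_model = Model(face_a, swapped)
```

Because the encoder is shared, the latent vector encodes pose and expression in a way both decoders understand, which is what makes the swap possible.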

Fig. 3

Deepfake video creation by autoencoder

By Generative Adversarial Networks (GANs): the creation of deepfake videos using GANs is popular, but a GAN is difficult to train. A GAN consists of two neural networks: the generator (G) and the discriminator (D). G plays a role analogous to the decoder of an autoencoder, generating images from a latent vector. As there is a minimax game between G and D, G tries to surpass D by always trying to make a better picture. After good training, the generator generates images of such quality that the discriminator cannot distinguish between real and fake images.
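The minimax game can be made concrete with a small numerical sketch of the GAN value function V(D, G) = E[log D(x)] + E[log(1 − D(G(z)))]; the discriminator scores below are made-up numbers for illustration.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Monte-Carlo estimate of the GAN minimax objective
    V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))],
    given the discriminator's scores on real and generated samples."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# D tries to maximize V; G tries to minimize it. A well-trained generator
# pushes D's scores on fakes toward 0.5, where D can no longer tell
# real from fake, and the value of the game drops.
weak_g   = gan_value(np.array([0.9, 0.8]), np.array([0.1, 0.2]))
strong_g = gan_value(np.array([0.6, 0.5]), np.array([0.5, 0.4]))
```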

Certain inconsistencies are added to the forged video when it is created using autoencoders. This is because the two videos are shot in different environments with different devices. During training, the encoder learns the standard deviation (spread) and center (mean) of the latent space distribution separately. Once the latent space vector is created, the decoder comes into play. It tries to recreate the same image as the source by distorting one of the centroids with the standard deviation and adding a random error, or noise. The decoder finally generates the image, but not exactly as the source. During creation of the fake image, sometimes the tone of the skin color does not match well, or the edge of spectacles does not fit at the exact position of the nose or ear. We took advantage of these discrepancies in our model as accurately as possible through extensive data processing. Therefore, when the feature extractor extracts features, it gives an accurate feature vector.

The Proposed Novel Method for Deepfake Detection

Our main goal in this work is to obtain a model which can detect deepfake videos in social media and show the promise to be extended eventually to a model for using at the edge. In this paper, we initially followed our previous paper [23] to choose the Xception network paired with our classifier among ResNet50, InceptionV3 and Xception modules. The framework of our proposed method is shown in Fig. 4.

Fig. 4

A detailed representation of the proposed model

The final framework consists of an Xception network and a GlobalAveragePooling2D layer with a dropout layer, followed by a fully connected layer with 1024 nodes with dropout, followed by a softmax layer for classification. The CNN module extracts the spatial features, and those feature vectors are fed into the classifier part. Finally, a classification result comes out of the classifier. To create a model which does not overfit easily, average pooling and dropout layers are added to the network accordingly.

The FF++ dataset consists of videos at two different compression levels, so it is suitable for our experiments. For DFDC, we re-compressed the videos at three different levels. Since compressed data is used in social media, frame extraction is done on the compressed video only. No decompression has been done.

Key Video Frame Extraction In a video, there are many elements that don’t change in consecutive frames. So, processing each frame and checking it for authenticity wastes a lot of resources. A key frame, intra-frame, or i-frame is a frame that indicates the beginning or ending of a transition; subsequent frames contain only the difference in information. To make the model computationally less complex, we extracted only key video frames from videos. As our work mainly focuses on visual artifacts that change with forgery, we assume that dealing only with key frames will be good enough for our model to detect a deepfake video. Figure 5 shows the key frames from a 20 s video.
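Key (I) frames can be listed directly from a compressed stream without fully decoding the video. The paper does not state which tool was used for extraction; the sketch below is one possible approach using ffprobe (assumed to be installed as part of FFmpeg), whose per-frame output is parsed for the `I` picture type.

```python
import subprocess

def iframe_times_from_ffprobe_output(csv_text):
    """Parse the CSV output of `ffprobe -show_entries frame=pts_time,pict_type`
    (lines of the form "frame,<pts_time>,<pict_type>") and return the
    timestamps of key (I) frames only."""
    times = []
    for line in csv_text.strip().splitlines():
        parts = line.split(",")
        if len(parts) >= 3 and parts[2].strip() == "I":
            times.append(float(parts[1]))
    return times

def iframe_times(video_path):
    """Run ffprobe on a video file and return its key-frame timestamps."""
    cmd = ["ffprobe", "-v", "quiet", "-select_streams", "v:0",
           "-show_entries", "frame=pts_time,pict_type", "-of", "csv",
           video_path]
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    return iframe_times_from_ffprobe_output(out)
```

The returned timestamps can then be used to seek to and decode only the key frames, which is what keeps the per-video workload small.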

Fig. 5

The generated key video frames from a 20 s video

Data processing Data processing plays a significant role in our work. For the first part of our work we followed the same techniques as before [23]. For our newly proposed work, after extracting the key video frames from each video we perform additional data processing. To increase the accuracy of the model we detect all faces and crop them from each frame. Finally all frames are normalized and resized as per the input requirement of the CNN module. The image size is kept at (299, 299, 3) for InceptionV3 and Xception and (224, 224, 3) for ResNet50. The data processing diagram is shown in Fig. 6.
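The per-frame processing (crop the detected face, resize to the CNN input size, and normalize) can be sketched as follows. The bounding box is assumed to come from a separate face detector, and the nearest-neighbour resize is used only to keep the sketch dependency-free; a real pipeline would typically use `cv2.resize` and a detector such as MTCNN.

```python
import numpy as np

def crop_resize_normalize(frame, box, size=(299, 299)):
    """Crop a detected face and prepare it for the CNN input.
    `box` = (top, left, height, width) is assumed to come from an external
    face detector. Nearest-neighbour resize via integer index maps keeps
    this example free of OpenCV; values are normalized to [0, 1]."""
    t, l, h, w = box
    face = frame[t:t + h, l:l + w]
    rows = np.arange(size[0]) * h // size[0]   # source row for each target row
    cols = np.arange(size[1]) * w // size[1]   # source col for each target col
    resized = face[rows][:, cols]
    return resized.astype("float32") / 255.0

# A dummy 720p frame with a hypothetical face box:
frame = np.random.randint(0, 256, (720, 1280, 3), dtype=np.uint8)
x = crop_resize_normalize(frame, (100, 400, 200, 200))
```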

Fig. 6

The proposed flow of video processing

Xception Network The Xception net is used as the feature extractor in our model. It is an extension of the Inception architecture in which spatial convolution is replaced by depthwise separable convolution. The difference between InceptionV3 and Xception is the order of the \(3\times 3\) spatial channel-wise convolutions and the cross-channel correlation mapping point-wise \(1\times 1\) convolutions. The original Xception network has 36 convolution layers which are structured in 14 blocks. Each block, except the first and last, has a linear residual connection. The network extracts features from all frames and gives a 2048-dimensional feature vector for each frame, which goes to the classifier network. Figure 7 shows the original Xception network introduced by Chollet.

Fig. 7

Xception architecture used as CNN module in the proposed work

Classification network Figure 8 shows the classifier network. As the classification network, we chose a combination of layers to get better accuracy. The layers are a GlobalAveragePooling2D layer followed by a dropout layer with rate 0.5, a fully connected layer with 0.5 dropout and “relu” activation, and finally a “softmax” layer which classifies the detected video as real or manipulated. Figure 9 shows how the classifier works.
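The full model described above (Xception backbone plus the classifier head) can be assembled in Keras roughly as follows. This is a sketch of the described architecture, not the authors' released code; `weights=None` merely keeps the example self-contained, and the choice of initial weights and optimizer settings beyond "adam" are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import Xception

# Xception backbone as the feature extractor (top classification layers removed).
base = Xception(weights=None, include_top=False, input_shape=(299, 299, 3))

x = layers.GlobalAveragePooling2D()(base.output)   # 2048-d feature vector
x = layers.Dropout(0.5)(x)
x = layers.Dense(1024, activation="relu")(x)       # FC layer with 1024 nodes
x = layers.Dropout(0.5)(x)
out = layers.Dense(2, activation="softmax")(x)     # real vs. manipulated

model = Model(base.input, out)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```

During inference, each preprocessed key frame is fed through `model.predict`, and a video is flagged as manipulated if any key frame is classified as fake.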

Fig. 8

Classifier network architecture

Fig. 9

Classifier network work flow

The Proposed Method

In the previous Section, the proposed novel technique to detect deepfake videos in social media was presented. The theoretical perspective, along with the algorithm, is discussed in the current Section.

A Theoretical Perspective

Depthwise Separable Convolution

There are three elements in a convolution operation:

  • Input image

  • Feature detector or Kernel or Filter

  • Feature map

The kernel, or filter, or feature detector is a small matrix of numbers. When it is passed over the input image, new feature maps are generated from the convolution operation between the filter values and the pixel values of the input image at each point (x, y), as in the following expression:

$$\begin{aligned} (I * h)(x,y) = \int _{0}^{x} \int _{0}^{y} I(x-i, y-j)h(i,j) \mathrm{{d}}i \mathrm{{d}}j, \end{aligned}$$

where I is the input image and h is the kernel.

The complexity of the standard convolution operation is expressed as \(N \times D_{G}^{2} \times D_{K}^{2} \times M\), where \(D_{F} \times D_{F} \times M\) is the size of the input image and \(D_{K} \times D_{K} \times M\) is the filter size, with N filters. M is the number of channels in the input image. The size of the output feature map is \(D_{G} \times D_{G} \times N\).

The complexity is decreased in depthwise separable convolution. It divides the convolution operation into two parts: (1) depthwise convolution (the filtering stage), and (2) pointwise convolution (the combination stage). In depthwise convolution the complexity is \(M \times D_{G}^{2} \times D_{K}^{2}\), while for pointwise convolution it is \(N \times D_{G}^{2} \times M\). The overall complexity can be estimated as the following:

$$\begin{aligned} \mathrm{{Total complexity}} &= M \times D_{G}^{2} \times D_{K}^{2} \nonumber \\&+ N \times D_{G}^{2} \times M \end{aligned}$$

The ratio of the complexities of the two convolutions is the following:

$$\begin{aligned} \dfrac{\mathrm{{Complexity Depthwise Separable Conv.}}}{\mathrm{{Complexity Standard Conv.}}} = \dfrac{1}{N} + \dfrac{1}{D_{K}^{2}} \end{aligned}$$

It is evident from Eq. (3) that the complexity of standard convolution is much higher than that of depthwise separable convolution. This means that the Xception network provides much faster and cheaper convolution than standard convolution.
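A quick numerical check of Eq. (3): counting multiply-accumulate operations for illustrative (assumed) layer dimensions reproduces the 1/N + 1/D_K^2 ratio exactly.

```python
def standard_cost(M, N, Dk, Dg):
    """Multiply-accumulate count of a standard convolution layer."""
    return N * Dg**2 * Dk**2 * M

def separable_cost(M, N, Dk, Dg):
    """Depthwise (filtering) plus pointwise (combination) cost."""
    return M * Dg**2 * Dk**2 + N * Dg**2 * M

# Example dimensions (assumed): M=64 input channels, N=128 filters,
# a 3x3 kernel, and a 56x56 output feature map.
M, N, Dk, Dg = 64, 128, 3, 56
ratio = separable_cost(M, N, Dk, Dg) / standard_cost(M, N, Dk, Dg)
# ratio equals 1/N + 1/Dk**2, i.e. roughly an 8-9x saving here
```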

Global Average Pooling Layer

It helps to reduce the number of parameters and eventually to minimize overfitting. It down-samples by computing the mean over the width and height dimensions of the input. The global average pooling layer has no parameters to learn. It takes the average of each feature map and returns a vector which is directly fed into the next layer. It is more robust as it summarizes the spatial information.
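A minimal example of what the layer computes: averaging an (H, W, C) feature map over its spatial dimensions yields a C-dimensional vector, with no learnable parameters involved.

```python
import numpy as np

def global_average_pooling_2d(feature_maps):
    """Average each channel over its spatial (height, width) dimensions,
    turning an (H, W, C) tensor into a C-dimensional vector."""
    return feature_maps.mean(axis=(0, 1))

# A toy 2x2 feature map with 3 channels:
fmap = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)
vec = global_average_pooling_2d(fmap)   # one average per channel
```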

Dropout Layer

It is very common for a deep network to overfit. The dropout layer helps prevent overfitting of a neural network. The least square loss for a single-layer linear network with activation function \(f(x) = x\) is expressed as the following [51, 52]:

$$\begin{aligned} E_{N} = \frac{1}{2} \left( t - \sum _{i=1}^{n}w'_{i}I_{i} \right) ^{2}. \end{aligned}$$

The least square error of that network with a dropout layer is expressed as the following [51, 52]:

$$\begin{aligned} E_{D} = \frac{1}{2} \left( t - \sum _{i=1}^{n} \delta _{i}w_{i}I_{i} \right) ^{2}, \end{aligned}$$

where \(\delta _{i} \thicksim \mathrm{{Bernoulli}}(p)\). The expectation of the gradient of the dropout network is expressed as:

$$\begin{aligned} E \left[ \frac{\partial {E_{D}}}{\partial {w_{i}}} \right]= & {} {} -tp_{i}I_{i} + w_{i}p_{i}^{2}I_{i}^{2} + w_{i}\mathrm{{Var}}(\delta _{i})I_{i}^2 \nonumber \\&+ \sum _{j=1,j\ne i}^{n}w_{j}p_{i}p_{j}I_{i}I_{j}, \end{aligned}$$
$$\begin{aligned}= & {} \frac{\partial {E_{N}}}{\partial {w_{i}}} + w_{i}p_{i}(1-p_{i})I_{i}^2. \end{aligned}$$

In the above expressions, \(w' = p \cdot w\), so the expectation of the gradient with dropout becomes equal to the gradient of a regularized linear network:

$$\begin{aligned} E_{R} = \frac{1}{2} \left( t - \sum _{i=1}^{n}p_{i}w_{i}I_{i} \right) ^{2} + \frac{1}{2}\sum _{i=1}^{n}p_{i}(1-p_{i})w_{i}^2I_{i}^{2} . \end{aligned}$$

The regularization term in Eq. (8) reaches its maximum for \(p=0.5\).

Soft-Max Layer

To predict the class of the video (pristine or manipulated), a softmax layer is used at the end of the network. It takes an M-dimensional vector and creates another vector of the same size, with values ranging from 0 to 1 that sum to 1. If the probability distribution over the two classes provided by the softmax layer is \(P(y_{k})\) for input \(y_{k}\), the output \({\hat{y}}\) can be predicted by the following expression:

$$\begin{aligned} {\hat{y}} = \mathop {\mathrm{{arg\,max}}}\limits _{k \in \{1,2\}} P\left( y_{k} \right) . \end{aligned}$$
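A small numerical sketch of the softmax prediction in Eq. (9), with made-up logits for the two classes:

```python
import numpy as np

def softmax(z):
    """Map raw scores to probabilities that sum to 1; subtracting the
    maximum first is the standard numerical-stability trick."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, -1.0])          # hypothetical scores: (real, manipulated)
probs = softmax(logits)
pred = int(np.argmax(probs))            # 0 -> real, 1 -> manipulated
```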

Training Loss

During training, we minimize the categorical cross-entropy loss to get the optimal parameters of the network to best predict the class. It measures the performance of a classification model whose output is a probability between 0 and 1. In a binary classification problem like ours, the cross-entropy loss is expressed as:

$$\begin{aligned} {\mathcal {L}} = - \left( y\log ({\hat{y}})+(1-y)\log (1-{\hat{y}}) \right) . \end{aligned}$$

We use the Adam optimizer to minimize the loss stated in Eq. (10). One mini-batch is processed at each iteration. After several epochs, when the loss function \({\mathcal {L}}\) is minimized, the network parameters have been learned to their optimal values.
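For a single prediction, the binary cross-entropy of Eq. (10) can be computed as follows; the probabilities are illustrative values, not the paper's training outputs.

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # y is the true label (0 or 1); y_hat is the predicted probability.
    # eps guards against log(0) for saturated predictions.
    y_hat = min(max(y_hat, eps), 1 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# A confident correct prediction yields a small loss...
print(binary_cross_entropy(1, 0.9))   # ~0.105
# ...while a confident wrong prediction is heavily penalized.
print(binary_cross_entropy(1, 0.1))   # ~2.303
```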

Details of the Proposed Algorithms

In our initial Algorithm 1, deepfake videos are checked frame by frame [23]. The accuracy obtained is high, but a very large number of frames must be processed. Our proposed algorithm originates from this necessity: in the current paper, we also propose Algorithm 2, which detects deepfake videos with reduced computation.

  • The novelty of our first algorithm is that it keeps the complexity of detecting a forged video low. The time complexity is \({\mathbf {O}}(n)\), where n is the number of frames extracted from the video.

  • The reason for proposing Algorithm 2 is to reduce the complexity further. Since key frame extraction greatly reduces the number of frames extracted from a video, the time complexity is reduced accordingly.
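The selection logic behind key frame extraction can be illustrated on toy data. The paper does not spell out its exact key frame criterion at this point; a common approach keeps only frames whose content differs sharply from the previous frame. The sketch below applies that idea to synthetic per-frame histograms, with an arbitrary illustrative threshold.

```python
def key_frame_indices(frames, threshold):
    """Return indices of frames whose absolute difference from the previous
    frame exceeds `threshold`. Frame 0 is always treated as a key frame."""
    keys = [0]
    for i in range(1, len(frames)):
        diff = sum(abs(a - b) for a, b in zip(frames[i], frames[i - 1]))
        if diff > threshold:
            keys.append(i)
    return keys

# Synthetic "frames" (coarse intensity histograms); a scene change at index 3.
frames = [[10, 10], [10, 11], [11, 10], [40, 5], [41, 5]]
print(key_frame_indices(frames, threshold=20))  # -> [0, 3]
```

Only the selected key frames are passed to the CNN, which is what shrinks the O(n) frame count that Algorithm 1 must process.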


Experimental Validation

In this section, we report the experiments and the corresponding results. We start with the details of the datasets and the experimental parameters, and finally analyze the results.


Several datasets are available for image manipulation, such as the “Dresden image database” [53], the MFCC_F200 dataset [54], the First IEEE Image Forensics Challenge dataset, and the Wild Web dataset [55], but few exist for video. Recently, two major datasets have been created for deepfake detection research. Google, teaming up with Jigsaw, produced a deepfake detection dataset, which was added to the FaceForensics benchmark of the Technical University of Munich and the University Federico II of Naples in September 2019. In the same year, Facebook Inc., in collaboration with Microsoft and academic institutions such as Cornell, MIT, and the University of Oxford, started the Deepfake Detection Challenge (DFDC) [56]. The FF++ [57] and DFDC [28] datasets are a good starting point for establishing deepfake video detection models.

Table 2 Dataset details for initial work
Table 3 Dataset details for final work

For selecting the CNN module, we trained the network with only FF++ data at compression level c = 23, as shown in Table 2. To obtain a generalized model, in the final part of our work we trained our neural network on a mixed-compression dataset constructed from the DFDC and FF++ datasets, as shown in Table 3. We focused on compressed data because any image or video loses clarity as compression increases. FF++ provides videos at two different compression levels, which represents a realistic social media scenario, and we additionally changed the compression levels of the DFDC videos when building our dataset. The dataset details are shown in Table 3.

  • FaceForensics++ (FF++) dataset For our work, we used 2000 deepfake videos and 2000 original videos at different compression levels. As videos uploaded to social media are compressed, the FaceForensics++ dataset is a good representative of the social media scenario. We used video sets at two compression levels: one with a quantization parameter (constant rate factor) of 23 and the other of 40.

  • Deepfake detection challenge (DFDC) dataset We used part of the 470 GB dataset: 5765 manipulated videos and 5773 original videos. We re-encoded them at three compression levels, \(c=15\), \(c=23\), and \(c=40\), keeping the number of \(c=15\) videos to a minimum. As loss increases with the compression level, we wanted to train our network more on \(c=23\) and \(c=40\) videos than on \(c=15\) videos. The compression levels were changed with an H.264 encoder using the FFmpeg software [58].
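This re-encoding step maps to a one-line FFmpeg invocation. The sketch below only builds the command, with placeholder file names; `-c:v libx264 -crf` are standard FFmpeg options for H.264 constant-rate-factor encoding.

```python
def ffmpeg_crf_cmd(src, dst, crf):
    # Build the argv to re-encode `src` with the H.264 (libx264) encoder at
    # constant rate factor `crf` (higher CRF = stronger compression, more loss).
    return ["ffmpeg", "-i", src, "-c:v", "libx264", "-crf", str(crf), dst]

# Placeholder paths; c = 40 matches the highest compression level used here.
cmd = ffmpeg_crf_cmd("input.mp4", "output_c40.mp4", 40)
print(" ".join(cmd))
```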

We constructed our dataset with 7773 pristine and 7765 forged videos at different compression levels. We kept the number of videos almost the same for each class (manipulated and real) to avoid any preference or bias in the data. We set aside 600 mixed-compression original and manipulated videos for testing the accuracy of the model; the rest were used for training and validation. Figure 10a shows the key video frames from a 10 s fake video in the DFDC dataset, and Fig. 10b shows the key video frames of a 24 s fake video in the FaceForensics++ dataset.

Fig. 10

Key Video frames from different length videos

Experimental Setup

In this section, we discuss the implementation setup. The whole work consists of two parts: in the first part, we chose our feature extractor, and in the second part we introduced a unique way to detect deepfake videos using a key frame extraction technique to reduce the computations. The first part was trained on a smaller dataset.

Transfer learning We used transfer learning for better accuracy and to save training time, taking a pre-trained model approach. Initially, ResNet50, InceptionV3, and the Xception net were chosen as candidate feature extractors, each pre-trained on the ImageNet dataset. Having been trained on 1,000,000 images across 1000 classes, they have already learned to detect basic and general image features: lower layers extract basic features such as lines or edges, whereas middle and higher layers extract more complex and abstract features that define the classification.

To train our network, we first trained the classifier while keeping the weights of the feature extractor frozen, and then fine-tuned the whole network end-to-end. We repeated this process for each of our three CNN modules and finally chose the Xception network as our feature extractor. The overall work flow is presented in Fig. 11, and Fig. 12 shows the steps followed for training and validation. Table 4 shows the number of frames used for training and validation in our work.
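The two-phase schedule (freeze the extractor, train the head, then fine-tune end to end) can be illustrated framework-free. The toy model below, with a one-weight "extractor" and a one-weight "head" on made-up data, is only a schematic of the schedule, not the actual Keras training code.

```python
# Toy model: feature = w_f * x, prediction = w_h * feature.
# Phase 1 freezes w_f (the pre-trained "extractor"); phase 2 fine-tunes both.
def train(xs, ys, w_f, w_h, lr, epochs, freeze_extractor):
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            pred = w_h * (w_f * x)
            err = pred - y                  # d(0.5*err^2)/d(pred)
            w_h -= lr * err * (w_f * x)     # the head always trains
            if not freeze_extractor:
                w_f -= lr * err * (w_h * x) # extractor updates only in phase 2
    return w_f, w_h

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # synthetic target relation y = 2x
w_f, w_h = 1.0, 0.1                        # "pre-trained" extractor, fresh head
w_f, w_h = train(xs, ys, w_f, w_h, 0.05, 50, freeze_extractor=True)   # phase 1
w_f, w_h = train(xs, ys, w_f, w_h, 0.01, 50, freeze_extractor=False)  # phase 2
print(w_f * w_h)  # the composed model approaches the target slope 2.0
```

Freezing first prevents large gradients from an untrained head from destroying the pre-trained features; the same reasoning motivates the freeze-then-fine-tune schedule used with the real CNN modules.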

Fig. 11

End-to-end work flow

Fig. 12

The specific steps followed for testing and validation purposes

Table 4 Frames details for training and validation

Implementation details We implemented our proposed framework in Keras with the TensorFlow backend. FFmpeg is used to clip the videos, and the 68-landmark model in the dlib package is used for face detection. For training we used a Tesla T4 GPU with 64 GB of memory; a GeForce RTX 2060 was used to evaluate the model.

Experimental Verification

Initially, we verified our model with unseen data from FF++. Our model with the Xception net gave the best accuracy among the three CNN modules, and the accuracy at compression level c = 23 was better than at c = 40. Once the feature extractor was finalized, we trained the model using our newly proposed Algorithm 2 along with our customized dataset built from FF++ and DFDC, and verified it with unseen data from the FF++ and DFDC test sets. We changed the compression levels of the DFDC test data to represent social media videos.


Figure 13a shows the accuracy of the different CNN networks; the Xception net clearly performed best with our classifier. Once the feature extractor was chosen, we moved on to the final work. Figure 14b–d show the outputs of different layers of the Xception net for the key video frame in Fig. 14a.

Fig. 13

Results of the first experiment with only FaceForensics ++ dataset

Fig. 14

Sample view of CNN layer output

In our initial experiment, we tested 200 videos of unseen FF++ data to choose our feature extractor, obtaining 96% accuracy for videos at compression level \(c=23\). We then tested the model with our newly proposed Algorithm 2 on two sets of data: first on the same 200 FF++ videos used initially, achieving 98.5% accuracy, and then on 600 mixed-compression videos from the FF++ and DFDC test datasets, most of them highly compressed (c = 40). Even with these high-loss videos, we achieved an accuracy of 92.33%. The results are shown in Fig. 15.

Fig. 15

Comparison of results between two experiments

Fig. 16

Test accuracy calculation for final experiment

Analysis of Results

As our task is binary classification, to visualize the performance of our model we define the confusion matrix as in Table 5.

Table 5 Confusion matrix—definition of TP, TN, FP and FN

The first performance metric we can derive from the confusion matrix is accuracy as defined in Eq. (11):

$$\begin{aligned} \mathrm{{Accuracy}} = \left( \dfrac{\mathrm{{TP}} + \mathrm{{TN}}}{\mathrm{{TP}} + \mathrm{{TN}} + \mathrm{{FP}} + \mathrm{{FN}}} \right) . \end{aligned}$$

To calculate the accuracy of the model we follow Table 5. The detailed results for calculating test accuracy are shown in Table 6.

Table 6 Data to calculate test accuracy

For the initial part, all three CNN modules were paired with our classifier network. The Xception net combined with our classifier gave the best accuracy, 96.00%, for compression level \(c = 23\). Test accuracy for both experiments is reported in Table 7. The number of FNs is lowest in the case of Xception, at 1%, which brings the test accuracy close to the training and validation accuracy.

Table 7 Test accuracy

Test accuracy for our model applying Algorithm 2 is shown in Fig. 16. Precision, Recall and F1-score for our model are defined by the following expressions:

$$\begin{aligned} \mathrm{{Precision}}&= \dfrac{\mathrm{{TP}}}{\mathrm{{TP}} + \mathrm{{FP}}}, \\ \mathrm{{Recall}}&= \dfrac{\mathrm{{TP}}}{\mathrm{{TP}} + \mathrm{{FN}}}, \\ \mathrm{{F1\text{-}score}}&= \dfrac{2}{\frac{1}{\mathrm{{Precision}}}+\frac{1}{\mathrm{{Recall}}}}. \end{aligned}$$

For our classification model, the above metrics are calculated using Table 6 and are shown in Fig. 17 and Table 8.
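All four metrics follow directly from the confusion-matrix counts. A minimal sketch with illustrative counts (not the paper's actual Table 6 values):

```python
def classification_metrics(tp, tn, fp, fn):
    # Accuracy, precision, recall, and F1 from confusion-matrix counts.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 / (1 / precision + 1 / recall)  # harmonic mean of precision and recall
    return accuracy, precision, recall, f1

# Illustrative counts only.
acc, prec, rec, f1 = classification_metrics(tp=90, tn=95, fp=5, fn=10)
print(acc, prec, rec, f1)
```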

Fig. 17

Metric calculation

Table 8 Accuracy, precision, recall and F1-score calculation for test videos

Our Algorithm 1 is applicable to videos of any length, as we process frames at 24 fps. In most fake videos, since only the face is changed, the number of key frames is low. Our Algorithm 2 can detect a fake video even if only one key frame is extracted from it, but its accuracy increases enormously (almost all results were correct) when the video has more than one key frame. Our reported accuracy for Algorithm 2 considers all cases, including videos with only one key frame; that is why we report an accuracy of 92.33%.

If a video is very hazy, our model might not produce accurate results; most false negatives come from hazy fake videos. The number of hazy videos in the training and validation datasets was not sufficient for our model to learn from, which resulted in uncertain predictions for those videos.

The accuracy of our model was lower when we added the DFDC dataset to the FF++ dataset because we trained our model with only part of the DFDC dataset. As DFDC is a vast dataset, we believe that training our model on the entire dataset would have resulted in better accuracy.


We compared our results with existing results [27, 30, 31]; the detailed comparison is shown in Table 9, and Fig. 18 gives a comparative view of our work against existing works. We achieved 98.5% and 92.33% accuracy with our proposed algorithm on two different test datasets. We believe that incorporating more manipulated videos with different performers and different lighting and noise conditions into the training dataset would further improve the performance of our model.

Fig. 18

Accuracy comparison

Table 9 Performance comparison of Xception network paired with proposed classifier

Conclusions and Future Works

In this paper, we presented a deep learning based approach to detect deepfake videos in social media with a low computational burden. Our final goal is an edge-based fake video detection method, and we believe the proposed algorithm is the first founding step toward achieving it.

Initially, we chose three CNN modules as candidate feature extractors and finally selected the Xception network as the feature extractor of our model. First, we classified pristine and manipulated videos using an algorithm that processes every frame of a video. We worked with the specific compression level c = 23, as it is a mid-range, low-loss compression level, and achieved high accuracy with this algorithm. Its complexity is proportional to the number of frames extracted from the video. To reduce the number of computations, we proposed a second algorithm that utilizes the key video frame technique from computer vision. We evaluated it on a much larger unseen dataset and achieved good accuracy even for highly compressed, high-loss data.

In this paper:

  • We proposed a deep neural method for detecting social media deepfake videos.

  • We introduced an algorithm that cuts down the computational burden significantly.

  • We avoided training with enormous amounts of data even though we accommodated a large number of videos.

  • We achieved high accuracy even for highly compressed video.

We evaluated our algorithm with the well-established FF++ and DFDC datasets. As our algorithm reduces the computations significantly, we believe that it can be deployed on edge devices with appropriate modifications.

Embedded deep learning is a growing field, with serious demand from various application domains pushing today's cloud-dependent deep learning practice toward the edge. As our algorithm detects fake videos from key video frames, it substantially reduces the computation, which is one step toward deploying a video detection model on an edge device. However, deep neural network structures are typically too large to fit on edge devices because of their memory limitations, so reducing the run-time memory and the model size would be a worthwhile future effort. Our immediate future research focuses on deepfake detection in other media, such as national IDs.


References

1. Zhu J, Park T, Isola P, Efros AA. Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE international conference on computer vision (ICCV); 2017. p. 2242–2251.

2. Karras T, Laine S, Aittala M, Hellsten J, Lehtinen J, Aila T. Analyzing and improving the image quality of StyleGAN. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR); 2020. p. 8107–8116.

3. Zhao Y, Chen C. Unpaired image-to-image translation via latent energy transport. 2020. arXiv:2012.00649.

4. Faceswap. https://github.com/deepfakes/faceswap. Accessed 19 Jan 2021.

5. DFaker. https://github.com/dfaker/df. Accessed 19 Jan 2021.

6. Chou HTG, Edge N. “They are happier and having better lives than I am”: the impact of using Facebook on perceptions of others’ lives. Cyberpsychol Behav Soc Netw. 2012;15(2):117–21.

7. Baccarella C, Wagner T, Kietzmann J, McCarthy I. Social media? It’s serious! Understanding the dark side of social media. Eur Manag J. 2018;36:431–8.

8. Allcott H, Gentzkow M. Social media and fake news in the 2016 election. J Econ Perspect. 2017;31(2):211–36.

9. Faceswap. https://faceswap.dev/. Accessed 19 Jan 2021.

10. DeepFaceLab. https://github.com/iperov/DeepFaceLab. Accessed 19 Jan 2021.

11. Deepfake video. https://edition.cnn.com/2019/06/11/tech/zuckerberg-deepfake/index.html. Accessed 19 Jan 2021.

12. Bregler C, Covell M, Slaney M. Video rewrite: driving visual speech with audio. In: SIGGRAPH ’97: Proceedings of the 24th annual conference on computer graphics and interactive techniques; 1997. p. 353–360.

13. Garrido P, Valgaerts L, Rehmsen O, Thormaehlen T, Perez P, Theobalt C. Automatic face reenactment. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR); 2014. p. 4217–4224.

14. Thies J, Zollhofer M, Niessner M, Valgaerts L, Stamminger M, Theobalt C. Realtime expression transfer for facial reenactment. ACM Trans Graph (TOG). 2015;34, Art. No. 183.

15. Suwajanakorn S, Seitz SM, Kemelmacher-Shlizerman I. Synthesizing Obama: learning lip sync from audio. ACM Trans Graph (TOG). 2017;36(4).

16. Tewari A, Zollhofer M, Kim H, Garrido P, Bernard F, Perez P, Theobalt C. MoFA: model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In: Proceedings of the IEEE international conference on computer vision (ICCV); 2017. p. 3735–3744.

17. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. In: Advances in neural information processing systems; 2014. p. 2672–2680.

18. Chesney R, Citron D. Deepfakes: a looming crisis for national security, democracy and privacy? 2018. https://www.lawfareblog.com/deepfakes-looming-crisis-national-security-democracy-and-privacy

19. Kietzmann J, Lee LW, McCarthy IP, Kietzmann TC. Deepfakes: trick or treat? Bus Horiz. 2020;63(2):135–46.

20. Reid S. The deepfake dilemma: reconciling privacy and first amendment protections. Univ Pa J Const Law. 2020. https://ssrn.com/abstract=3636464 (forthcoming).

21. Gerstner E. Face/off: “DeepFake” face swaps and privacy laws. Def Couns J. 2020;87:1.

22. Westerlund M. The emergence of deepfake technology: a review. Technol Innov Manag Rev. 2019;9:39–52. https://doi.org/10.22215/timreview/1282.

23. Mitra A, Mohanty SP, Corcoran P, Kougianos E. A novel machine learning based method for deepfake video detection in social media. In: Proceedings of the 6th IEEE international symposium on smart electronic systems (iSES); 2020. (In press).

24. Chollet F. Xception: deep learning with depthwise separable convolutions. 2016. arXiv:1610.02357.

25. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. 2015. arXiv:1512.00567.

26. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR); 2016. p. 770–778.

27. Rossler A, Cozzolino D, Verdoliva L, Riess C, Thies J, Niessner M. FaceForensics++: learning to detect manipulated facial images. In: Proceedings of the IEEE international conference on computer vision; 2019. p. 1–11.

28. Dolhansky B, Bitton J, Pflaum B, Lu J, Howes R, Wang M, Ferrer CC. The DeepFake Detection Challenge dataset. 2020.

29. Nguyen HH, Yamagishi J, Echizen I. Capsule-forensics: using capsule networks to detect forged images and videos. In: Proceedings of the IEEE international conference on acoustics, speech and signal processing (ICASSP); 2019. p. 2307–2311.

30. Hashmi MF, Ashish BKK, Keskar AG, Bokde ND, Yoon JH, Geem ZW. An exploratory analysis on visual counterfeits using Conv-LSTM hybrid architecture. IEEE Access. 2020;8:101293–308.

31. Kumar A, Bhavsar A, Verma R. Detecting deepfakes with metric learning. In: Proceedings of the 8th international workshop on biometrics and forensics (IWBF); 2020. p. 1–6.

32. D’Amiano L, Cozzolino D, Poggi G, Verdoliva L. A PatchMatch-based dense-field algorithm for video copy-move detection and localization. IEEE Trans Circuits Syst Video Technol. 2019;29(3):669–82.

33. Gironi A, Fontani M, Bianchi T, Piva A, Barni M. A video forensic technique for detecting frame deletion and insertion. In: Proceedings of the IEEE international conference on acoustics, speech and signal processing (ICASSP); 2014. p. 6226–6230.

34. Ding X, Yang G, Li R, Zhang L, Li Y, Sun X. Identification of motion-compensated frame rate up-conversion based on residual signals. IEEE Trans Circuits Syst Video Technol. 2018;28(7):1497–512.

35. Sabir E, Cheng J, Jaiswal A, AbdAlmageed W, Masi I, Natarajan P. Recurrent convolutional strategies for face manipulation detection in videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops; 2019. p. 80–87.

36. Güera D, Delp EJ. Deepfake video detection using recurrent neural networks. In: Proceedings of the 15th IEEE international conference on advanced video and signal based surveillance; 2018. p. 1–6.

37. Li Y, Chang M, Lyu S. In ictu oculi: exposing AI created fake videos by detecting eye blinking. In: Proceedings of the IEEE international workshop on information forensics and security; 2018. p. 1–7.

38. Afchar D, Nozick V, Yamagishi J, Echizen I. MesoNet: a compact facial video forgery detection network. In: Proceedings of the IEEE international workshop on information forensics and security (WIFS); 2018. p. 1–7.

39. Li Y, Lyu S. Exposing deepfake videos by detecting face warping artifacts. CoRR. 2018.

40. Matern F, Riess C, Stamminger M. Exploiting visual artifacts to expose deepfakes and face manipulations. In: Proceedings of the IEEE winter applications of computer vision workshops; 2019. p. 83–92.

41. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. 2014. arXiv:1409.1556.

42. Yang X, Li Y, Lyu S. Exposing deep fakes using inconsistent head poses. In: Proceedings of the IEEE international conference on acoustics, speech and signal processing (ICASSP); 2019. p. 8261–8265.

43. Jung T, Kim S, Kim K. DeepVision: deepfakes detection using human eye blinking pattern. IEEE Access. 2020;8:83144–54.

44. Mittal T, Bhattacharya U, Chandra R, Bera A, Manocha D. Emotions don’t lie: an audio-visual deepfake detection method using affective cues. In: Proceedings of the 28th ACM international conference on multimedia; 2020. p. 2823–2832.

45. Korshunov P, Marcel S. DeepFakes: a new threat to face recognition? Assessment and detection. 2018.

46. Hasan HR, Salah K. Combating deepfake videos using blockchain and smart contracts. IEEE Access. 2019;7:41596–606.

47. Sethi L, Dave A, Bhagwani R, Biwalkar A. Video security against deepfakes and other forgeries. J Discrete Math Sci Cryptogr. 2020;23:349–63.

48. Ganiyusufoglu I, Ngo LM, Savov N, Karaoglu S, Gevers T. Spatio-temporal features for generalized detection of deepfake videos. 2020.

49. Li X, Lang Y, Chen Y, Mao X, He Y, Wang S, Xue H, Lu Q. Sharp multiple instance learning for deepfake video detection. In: Proceedings of the 28th ACM international conference on multimedia; 2020. p. 1864–1872.

50. Charitidis P, Kordopatis-Zilos G, Papadopoulos S, Kompatsiaris I. Investigating the impact of pre-processing and prediction aggregation on the deepfake detection task. 2020. arXiv:2006.07084.

51. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15(1):1929–58.

52. Baldi P, Sadowski PJ. Understanding dropout. In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, editors. Advances in neural information processing systems, vol. 26; 2013. p. 2814–22.

53. Gloe T, Böhme R. The “Dresden image database” for benchmarking digital image forensics. In: Proceedings of the ACM symposium on applied computing; 2010. p. 1584–1590.

54. Amerini I, Ballan L, Caldelli R, Del Bimbo A, Serra G. A SIFT-based forensic method for copy-move attack detection and transformation recovery. IEEE Trans Inf Forensics Secur. 2011;6(3):1099–110.

55. Zampoglou M, Papadopoulos S, Kompatsiaris Y. Detecting image splicing in the wild (web). In: Proceedings of the IEEE international conference on multimedia & expo workshops (ICMEW); 2015. p. 1–6.

56. Dolhansky B, Howes R, Pflaum B, Baram N, Canton-Ferrer C. The Deepfake Detection Challenge (DFDC) preview dataset. 2019. arXiv:1910.08854.

57. Rössler A, Cozzolino D, Verdoliva L, Riess C, Thies J, Nießner M. FaceForensics: a large-scale video dataset for forgery detection in human faces. 2018. arXiv:1803.09179.

58. Bellard F, Niedermayer M. FFmpeg. 2012.



This article is an extended version of our previous conference paper presented at [23].

Author information



Corresponding author

Correspondence to Saraju P. Mohanty.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest and there was no human or animal testing or participation involved in this research. All data were obtained from public domain sources.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Mitra, A., Mohanty, S.P., Corcoran, P. et al. A Machine Learning Based Approach for Deepfake Detection in Social Media Through Key Video Frame Extraction. SN COMPUT. SCI. 2, 98 (2021). https://doi.org/10.1007/s42979-021-00495-x

Keywords


  • Deepfake
  • Deep learning
  • Key video frame extraction
  • Depthwise separable convolution
  • Convolution neural network (CNN)
  • Transfer learning
  • Social media
  • Compressed video