Sami Seppälä Review of deepfake detection methods Final report Vaasa 2025 School of Technology and Inno- vations Bachelor’s thesis in Energy and Information Technology Automation and Information Technology 2 UNIVERSITY OF VAASA School of Technology and Innovation Author: Sami Seppälä Title of the Thesis: Review of deepfake detection methods : Final report Degree: Bachelor of Engineering Sciences Program: Automation and Information Technology Supervisor: Tomi Pasanen Year: 2025 Page Count: 31 ABSTRACT: This bachelor's thesis reviews methods for detecting deepfake media, examining peer-reviewed literature published between 2020 and 2025. This review also briefly covers deepfake genera- tion methods such as GANs, autoencoders, and diffusion models. Detection methods are grouped into twelve categories, including CNN-based classification and physiological signal anal- ysis. While many methods score above 95% accuracy on standard benchmarks, they struggle when tested on data they were not trained on, and video compression makes detection even harder. Methods based on physiological signal measurement and identity features performed better in robustness testing. The review concludes with recommendations for future research priorities. KEYWORDS: Deepfake, Artificial Intelligence, Deep learning, Information integrity 3 Contents 1 Introduction 6 1.1 Background and significance 6 1.2 Literature search and selection strategy 7 2 Literature review 9 2.1 Technical Foundations of Deepfake Creation 9 2.1.1 Convolutional Neural Networks (CNNs) 9 2.1.2 Generative Adversarial Networks (GANs) 10 2.1.3 Encoder-Decoders, Autoencoders, and VAEs 11 2.1.4 Diffusion models 12 2.2 Deepfake Detection Methods 13 2.2.1 CNN-Based Detection Methods 13 2.2.2 Transformer-Based Approaches 14 2.2.3 Physiological Measurement Methods 14 2.2.4 Multimodal and Audio-Visual Methods 15 2.2.5 Temporal Consistency and Recurrent Approaches 15 2.2.6 Identity-Based and Metric Learning Methods 16 2.2.7 Artifact-Based and Self-Consistency Methods 17 2.2.8 Frame Inference and Prediction Methods 18 2.2.9 Adversarial Training and Robustness Methods 18 2.2.10 Color Space and Frequency Domain Methods 18 2.2.11 Continual and Reinforcement Learning Approaches 19 2.2.12 Dual-Level and Multi-Task Frameworks 19 3 Synthesis of findings 20 3.1 Performance Across Different Approaches 20 3.2 Method Specific Strengths and Limitations 21 3.2.1 CNN-Based Methods 21 3.2.2 Transformer Approaches 21 3.2.3 Multimodal Methods 22 3.2.4 Physiological Signals 22 4 4 Conclusion 23 4.1 Summary of findings 23 4.2 Recommendations for future research 24 References 25 5 Abbreviations General/Technical: AI – Artificial Intelligence AUC – Area Under the Curve CNN – Convolutional Neural Network DCGAN – Deep Convolutional Generative Adversarial Network GAN – Generative Adversarial Network RL – Reinforcement Learning rPPG – Remote Photoplethysmography VAE – Variational Autoencoder Datasets: DFDC – DeepFake Detection Challenge DFDCP – DeepFake Detection Challenge Preview Method-Specific: ADAL – Artifacts-Disentangled Adversarial Learning ADT – Anti-Deepfake Transformer AST – Audio Spectrogram Transformer I2G – Inconsistency Image Generator PCL – Pair-wise self-Consistency Learning SBI – Self-Blended Image VFD – Voice-Face Matching Detection 6 1 Introduction Deepfakes are fake pieces of media, either images, videos, or audio, that are altered in a way that the person or people depicted in the media take on some other individual’s likeness or voice believably, but without participating (Zhang, 2022). Deepfakes can also be used just to alter a person’s facial expression, without changing their likeness to re- semble someone else. They are created using methods based on artificial intelligence and machine learning (Kietzmann, 2020), and they are becoming more believable and easier to create, as the technology is becoming more common and more accessible. Currently, deepfakes present many opportunities for malicious activities, and this threat raises many security concerns (Masood et al., 2023). This has led to many meth- ods being developed for detecting and combating deepfakes, often employing different approaches to the problem. These methods are split between machine learning based methods and feature-based detection methods (Zhang, 2022). Different methods are used for detecting fake images, video, and audio (Nguyen et al., 2022). This review aims to investigate literature regarding methods for detecting deepfakes currently in use or development and briefly summarize deepfakes as a phenomenon and as a piece of technology. It is structured as follows: Section 2 provides the literature re- view, beginning with the technical foundations of deepfake creation before examining current detection methods organized by approach. Section 3 synthesizes the findings, comparing performance across methods and identifying key factors that influence de- tection accuracy. Section 4 concludes with a summary and recommendations for future research. 1.1 Background and significance Manipulated pieces of media have existed since the early days of photography, film, and audio recording. However, recent advancements in artificial intelligence (AI), machine learning, and deep learning have introduced highly sophisticated new tools and tech- niques for creating synthetic media (Rana et al., 2022). These tools and techniques have become increasingly accessible and produce more realistic results (Zhang, 2022), making 7 it possible for almost anyone to create altered images, video, or audio in an instant that is difficult to distinguish from real content. The first appearance of deepfakes happened in the year 2017 when a Reddit user, named “deepfakes”, began posting pornographic videos where the performers’ faces were substituted with celebrity faces, without their consent, using deep learning (Mirsky & Lee, 2022). Later, in 2018, a video of former United States president Barack Obama giving a speech on the matter was published by BuzzFeed; it was created using the same software used to create the videos posted to Reddit. Since these events, large numbers of deepfake videos have begun to emerge, and continue to do so to this day. As a piece of technology, deepfakes have multiple positive applications, and for ex- ample, are being used for enhancements in filmmaking and virtual reality (Yu et al., 2021). Deepfake technology has legitimate uses, but it is far more often discussed in the context of misuse, as in privacy violations, political manipulation, and misinformation. Deepfakes have already been used to damage personal privacy and spread political misinformation, and as the technology improves, the potential for abuse grows with it (Zhang, 2022). As deepfake quality improves, both humans and algorithms are finding it harder to tell real from fake (Nguyen et al., 2022). This makes reliable detection methods more important than ever. Addressing this has led to the development of deepfake detection methods that often employ the same deep learning techniques that are used to create them in the first place, such as generative adversarial networks (GANs) and convolutional neural networks (CNNs) (Rana et al., 2022). 1.2 Literature search and selection strategy This literature review integrates findings from peer-reviewed articles published between 2020 and 2025. These articles focus on different detection methods for deepfake media. Relevant sources were retrieved from well-established academic databases and pub- lisher platforms, including IEEE Xplore, arXiv, ACM Digital Library, PubMed, PubMed Cen- tral, ScienceDirect, and SpringerLink, as well as Taylor & Francis, Wiley Online Library, CVF Open Access, MDPI, and AAAI Digital Library. 8 To identify relevant studies, Semantic Scholar and Google Scholar were used to ex- plore databases. Keyword searches included search terms such as “Deepfake detection”, “AI-generated content”, “synthetic media detection”, “deepfake video detection”, “audio deepfake recognition”, “machine learning in deepfake detection”, “GAN-based deepfake detection”, and “deep learning based deepfake detection”. Results were filtered based on publication year, peer-review status, and their relevance to deepfake detection tech- niques. 9 2 Literature review This section reviews recent research on deepfake detection, focusing on detection meth- ods, benchmarking datasets, and key challenges; it also briefly discusses methods for generating deepfakes to provide context for detection approaches. 2.1 Technical Foundations of Deepfake Creation To better understand and develop methods for detecting deepfakes, it is essential to know how synthetic media is created. This section provides a technical overview of the primary machine learning approaches used for deepfake generation. Understanding the creation methods helps identify the characteristic artifacts and patterns that detection algorithms exploit to distinguish fake content. Deepfakes can be categorized by media type and the nature of manipulation. Visual deepfakes can be split into four categories: reenactment, replacement, editing, and syn- thesis (Mirsky & Lee, 2022). An example of reenactment would be the use of a source face or body to drive the expression or pose of a target. Face swapping would be classi- fied as replacement. Editing focuses on methods for facial attribute editing, for example, while synthesis involves generating photorealistic images or videos of individuals who do not exist in real life. Audio deepfakes commonly include text-to-speech synthesis, in which the target’s voice reads written text aloud, and voice conversion, in which the speech of a source speaker is altered to resemble the target’s voice. The following subsections examine the core architectural components that underpin these techniques. 2.1.1 Convolutional Neural Networks (CNNs) Convolutional neural networks (CNNs) are a fundamental deep learning architecture that has become the backbone of computer vision applications (Zhang, 2022), as their ability to capture hierarchical patterns within data makes them highly effective for processing 10 visual information (Mirsky & Lee, 2022). In deepfake generation, CNNs serve as building blocks within larger generative frameworks, providing the computational mechanisms for feature extraction and image reconstruction. A CNN produces a feature map by slid- ing learned filters systematically across the input image, detecting local patterns at each position. The feature map represents the presence and location of specific visual pat- terns across the image (Mirsky & Lee, 2022). Unlike generative adversarial networks (GANs), which constitute a complete adver- sarial training framework, CNNs are most often used as components within deepfake generation pipelines. They are commonly integrated into encoder-decoder architectures, autoencoders, and as the generator or discriminator networks within GANs themselves (Masood et al., 2023; Mirsky & Lee, 2022). This CNN-based approach with GANs was first introduced by Radford et al. as Deep Convolutional GAN (DCGAN) (Radford et al., 2016), shortly after GANs themselves were first introduced (Goodfellow et al., 2014). CNNs can adapt to various deepfake creation tasks, offering clear benefits but also several drawbacks. CNNs excel in image processing efficiency, especially when compared to fully connected networks, and they produce high-fidelity images with realistic features within the deepfake generation context (Masood et al., 2023; Mirsky & Lee, 2022). By stacking multiple convolutional layers, CNNs are able to capture more complex variations in human faces, with each layer learning increasingly complex feature representations from the previous layer (Nguyen et al., 2022). Furthermore, CNNs are able to handle different levels of detail simultaneously – from fine-grained textures to overall facial structure (Nguyen et al., 2022). However, CNN-based methods require large amounts of training data, and are often subject-specific (Masood et al., 2023). They are also prone to visual artifacts in generated content, especially when significant modification is needed (Masood et al., 2023). 2.1.2 Generative Adversarial Networks (GANs) Generative Adversarial Networks (GANs) were first introduced by Goodfellow et al. (Goodfellow et al., 2014) as a framework for generating realistic synthetic data. The framework uses an adversarial training process in which a generator and a discriminator 11 compete. The generator learns to produce synthetic data that mimics real data as closely as possible to fool the discriminator, while the discriminator is trained to distinguish be- tween fake and real samples. This process improves the generator’s ability to produce convincing synthetic outputs that the discriminator struggles to differentiate from real data. Once training is complete, the discriminator is no longer required, and the gener- ator can synthesize new data independently (Mirsky & Lee, 2022). In the context of deepfake creation, GANs have become the dominant tool for face manipulation. The generator network learns to map facial features from a source to a target person, while the discriminator ensures the output appears realistic. Popular deepfake applications use variants of GANs, such as CycleGAN (Zhu et al., 2020) and StyleGAN (Karras et al., 2019), which have been specifically adapted for manipulation tasks and facial synthesis. The adversarial training process makes GANs particularly effective for deepfake gen- eration because the discriminator acts as a quality control mechanism, pushing the gen- erator to synthesize better samples of fake faces that eventually fool the automated training system and potentially a human observer presented with images created by the generator. 2.1.3 Encoder-Decoders, Autoencoders, and VAEs Encoder-decoders are a fundamental architecture used in deepfake generation. Specifi- cally, they are a type of neural network architecture designed to learn efficient represen- tations of data through compression and reconstruction; the minimum structural re- quirement is that they consist of at least two networks: the encoder and the decoder. The encoder compresses the input data into a lower-dimension latent representation – a compact numerical form capturing essential features, while the decoder produces the output from this compressed representation (Mirsky & Lee, 2022). An autoencoder is a specific type of encoder-decoder network that aims to recon- struct the input as output. To facilitate reconstruction, the encoder and decoder need to be dimensionally compatible (Minaee et al., 2020; Mirsky & Lee, 2022). Autoencoders were used in the original 2017 Reddit deepfake generation network, where a shared 12 encoder was paired with person-specific decoders. These components were trained in parallel as two separate autoencoders. This enabled the encoder to map the features of its inputs, such as pose and expression, into a shared latent space. Face swapping was then achieved by encoding the source person’s face with the shared encoder, and recon- structing it with the target person’s decoder (Mirsky & Lee, 2022). The shared encoder learns identity-independent facial features, while each decoder learns person-specific reconstruction, thus enabling identity transfer while preserving expression and pose. Variational autoencoders (VAEs) are an advanced form of autoencoders that include probabilistic elements by learning distributions in latent space rather than fixed points (Mirsky & Lee, 2022). First introduced in 2013 (Kingma & Welling, 2013), VAEs did away with mapping features to specific coordinates, instead encoding to a distribution – typi- cally Gaussian – characterized by mean and variance. This probabilistic approach enables smooth interpolation between expressions and generation of natural variations not pre- sent in training data, thus producing more realistic deepfakes as a result (Pei et al., 2024). 2.1.4 Diffusion models Diffusion models represent a more recent approach to deepfake generation. They are probabilistic generative models that create realistic high-quality content by reversing a noise-infusion process, gradually transforming random noise into realistic images or vid- eos (Ayodele R. Akinyele et al., 2024; Chen et al., 2024). These models work through forward and reverse processes; first adding noise to data over multiple steps, then learn- ing to denoise the data and generate new realistic content (Ayodele R. Akinyele et al., 2024). This approach enables high-quality synthesis with greater stability and control compared to earlier methods, such as GANs and VAEs, and has been widely adopted in computer vision and audio generation applications (Chen et al., 2024). However, diffu- sion models are computationally expensive due to their multi-step generation process, which is a considerable limit regarding their scalability (Ayodele R. Akinyele et al., 2024). 13 2.2 Deepfake Detection Methods The following subsections present detection methods organized by technical approach, with consistent reporting of accuracy metrics to enable comparison in Section 3. Detec- tion methods are typically evaluated in two settings: within-dataset, where training and testing use different portions of the same dataset, and cross-dataset, where the model is tested on entirely different data than it was trained on. 2.2.1 CNN-Based Detection Methods Convolutional Neural Networks have become the foundation of most detection ap- proaches. Researchers commonly use CNN architectures such as XceptionNet and Effi- cientNet (Dolhansky et al., 2020), ResNet (T. Zhao et al., 2021), and Inception-ResNet (Singh et al., 2021) as backbone networks, the base model that extracts features before classification, for binary classification tasks, distinguishing real from fake content. To improve detection capabilities, several studies have combined CNNs with special- ized modules. The multi-attentional network by Zhao et al. used multiple spatial atten- tion heads, components that identify the most relevant regions in an image, to focus on local discriminative features while enhancing textural information from shallow features (H. Zhao et al., 2021). This approach achieved 97.60% accuracy and 99.29% AUC (area under the curve, a classification metric where 1.0 is perfect and 0.5 is random chance) on high-quality FaceForensics++ data, demonstrating state-of-the-art performance (H. Zhao et al., 2021). Raza et al. proposed a hybrid combining a VGG16 with another CNN architecture, achieving 95% precision and 94% accuracy (Raza et al., 2022). For spatiotemporal analysis, some researchers turned to 3D CNNs. Almestekawy et al. employed an enhanced 3D CNN with spatiotemporal attention in a Siamese architecture, which processes two inputs through identical networks to compare them, combining texture features with deep learning. This approach improved accuracy by 7.9% and achieved AUC scores of up to 97.51% in the same-dataset scenarios and 95.44% in cross- dataset evaluation (Almestekawy et al., 2024). 14 2.2.2 Transformer-Based Approaches Transformer architectures have emerged as alternatives to CNNs. Khan et al. developed a hybrid transformer network with early feature fusion, combining XceptionNet and Ef- ficientNet-B4 as feature extractors trained jointly as a single system with a BERT-styled transformer. Their model achieved 95.00% accuracy on face swapping and 90.00% on NeuralTextures in FaceForensics++ (Khan & Dang-Nguyen, 2022). Wang et al. took a different approach with the Anti-Deepfake Transformer (ADT), which modeled both global and local information via attention-based modules, multi- forensic modules, and variant residual connections. By capturing both fine details and overall structure, this approach addressed the limitations of CNN methods that rely too heavily on local texture information, achieving state-of-the-art performance in cross-da- taset evaluation (P. Wang et al., 2022). Transformers have also shown promise for audio deepfakes. Channing et al. evaluated transformer-based models, including AST and Wav2Vec-based architectures. AST achieved 85% accuracy on FakeAVCeleb with 0.985 AUC, while Wav2Vec reached 81% accuracy with 0.990 AUC; both of these transformer models outperforming traditional methods such as Gradient Boosting Decision Trees (Channing et al., 2024). 2.2.3 Physiological Measurement Methods Physiological signals present in videos provide distinctive cues for detection. Hernandez- Ortega et al. developed DeepFakesON-Phys using remote photoplethysmography (rPPG) to analyze subtle color changes in skin that reveal blood flow patterns (Hernandez- Ortega et al., 2022). Their method employed a Convolutional Attention Network to ex- tract spatial and temporal information from video frames. Single-frame detection achieved over 98% AUC on both Celeb-DF v2 and DFDC databases, and when combining scores from consecutive frames, the system reached 100% accuracy on Celeb-DF v2 (Hernandez-Ortega et al., 2022). The strength of these methods lies in exploiting physiological signals that are difficult for current deepfake generation techniques to replicate (Juefei-Xu et al., 2022). However, 15 they require careful extraction of physiological features and can be sensitive to video quality and lighting conditions. 2.2.4 Multimodal and Audio-Visual Methods Several approaches have leveraged consistency between audio and visual modalities. Cheng et al. proposed voice-face matching based on the principle that individuals exhibit high homogeneity between voice and face, while deepfakes often involve mismatched identities (Cheng et al., 2023). Their VFD method first trained on a large general dataset, then refined the model on deepfake-specific data, achieving 86.11% AUC on FakeAVCe- leb and outperforming baselines by nearly 2% (Cheng et al., 2023). Taking a different approach, Wang et al. developed ART-AVDF using articulatory rep- resentation learning, employing an audio encoder to extract articulatory features and a lip encoder trained through self-supervised learning, where the model learns patterns from unlabeled data (Y. Wang & Huang, 2024). Their system integrated a multimodal joint-fusion module to exploit inherent audio-visual consistency, achieving significant performance improvements over comparable models. Muppalla et al. combined audio-visual features with fine-grained deepfake classifica- tion, categorizing samples into four types based on modality-specific labels (Muppalla et al., 2023). Using Capsule networks and Swin Transformers, their approach achieved 99.20% accuracy with Capsule Forensics on FakeAVCeleb. They tested both feature fu- sion, combining raw data representations, and score fusion, combining final classifica- tion outputs. 2.2.5 Temporal Consistency and Recurrent Approaches Temporal analysis exploits inconsistencies across frames in deepfake videos. Chintha et al. combined convolutional latent representations with bidirectional recurrent struc- tures and entropy-based cost functions. Their XcepTemporal model achieved 100% ac- curacy on FaceForensics++ for both frame-level and video-level detection, enabling iden- tification of both spatial and temporal signatures of deepfakes (Chintha et al., 2020). 16 To address the specific challenge of compressed deepfakes, Hu et al. proposed a two- stream method that analyzes frame-level and temporal-level features. The frame-level stream gradually pruned the network to prevent overfitting, as in learning noise patterns specific to training data rather than general features, to compression artifacts, while the temporality-level stream extracted temporal correlation features. This approach outper- formed state-of-the-art methods on compressed videos (Hu, Liao, Wang, et al., 2022). Liu et al. developed a detector that leverages temporal consistency to distinguish be- tween clean and perturbed videos, achieving 100% detection accuracy for weakly adver- sarial deepfakes. Their work also introduced a framework for generating high-quality ad- versarial deepfake videos using optical flow to restrict the temporal coherence of adver- sarial perturbations (Liu et al., 2023). Caldelli et al. took a more direct approach, utilizing optical flow fields, a technique which tracks pixel movement between consecutive frames, to detect motion dissimilar- ities in video sequences. Their method achieved 97% accuracy for uncompressed videos (C0), 91% for lightly compressed (C23), and 76% for heavily compressed video (C40) in the same-forgery scenarios. Notably, this approach demonstrated superior robustness in cross-forgery scenarios compared to frame-based methods (Caldelli et al., 2021). 2.2.6 Identity-Based and Metric Learning Methods Identity-aware approaches characterize individuals through biometric traits. Cozzolino et al. introduced ID-Reveal, which learned temporal facial features specific to how indi- viduals move while talking through metric learning, training the model to measure sim- ilarity between examples, and adversarial training. What makes this approach notable is that it required only real videos for training, not fake data. ID-Reveal achieved more than 15% average improvement in accuracy for facial reenactment on highly compressed vid- eos compared to supervised approaches (Cozzolino et al., 2021). Dong et al. identified a different problem: “Implicit Identity Leakage”, where binary classifiers unexpectedly learned identity representations rather than forgery artifacts, hindering generalization. Their ID-unaware Deepfake Detection Model with Artifact 17 Detection Module addressed this issue, achieving 99.70% AUC on FaceForensics++ with ResNet-34 and 93.88% on Celeb-DF with EfficientNet-b4 (Dong et al., 2023). Rather than analyzing faces in isolation, Nirkin et al. detected face swapping by iden- tifying discrepancies between faces and their context. Their method employed a face identification network analyzing the tightly segmented face region, alongside a context recognition network examining surrounding features such as hair, ears, and the neck. This approach achieved state-of-the-art results on FaceForensics++ and Celeb-DF-v2 benchmarks (Nirkin et al., 2022). 2.2.7 Artifact-Based and Self-Consistency Methods Several approaches have focused on detecting manipulation artifacts. Zhao et al. pro- posed pair-wise self-consistency learning (PCL), which detects deepfakes by measuring the inconsistency of source features within forged images. Their method introduced an inconsistency image generator (I2G) for creating training data, improving the average AUC from 96.45% to 98.05% in within-dataset evaluation and from 86.03% to 92.18% in cross-dataset evaluation (T. Zhao et al., 2021). Li et al. tackled the challenge of separating meaningful artifacts from noise with the Artifacts-Disentangled Adversarial Learning (ADAL) framework. Their Multi-scale Feature Separator precisely transmitted artifact features, while Artifacts Cycle Consistency Loss enabled pixel-level supervised training. ADAL achieved 97.71% accuracy and 99.51% AUC on FaceForensics++, though performance dropped to 79.87% accuracy and 84.62% AUC on the more challenging Celeb-DFv2 (X. Li et al., 2023). Shiohara et al. took a novel approach with self-blended images (SBIs), synthetic train- ing data generated by blending pseudo-source and target images from a single pristine image. This technique reproduced common forgery artifacts, such as blending bounda- ries, without requiring actual forged photos for training. The results were promising, out- performing baselines by 4.90% on DFDC and 11.78% on DFDCP in cross-dataset evalua- tion (Shiohara & Yamasaki, 2022). A different strategy came from Li et al., who proposed forensic symmetry using a multi-stream learning architecture with two feature extractors capturing symmetry and 18 similarity features from face patches. Their approach achieved 95.99% accuracy at the image level and 99.43% at the video level for DF-TIMIT, with a maximum AUC of 99.44% at the image level (G. Li et al., 2023). 2.2.8 Frame Inference and Prediction Methods Hu et al. developed FInfer, a frame-inference-based detection framework designed for high-quality deepfakes. The approach learned to reference representations of current and future frames, using an autoregressive model to predict upcoming facial represen- tations from current ones. The key insight being that real videos show higher correlation between predicted and actual frames than fake ones. FInfer achieved 90.47% accuracy and 93.30% AUC on Celeb-DF (Hu, Liao, Liang, et al., 2022). 2.2.9 Adversarial Training and Robustness Methods Adversarial training has shown promise for improving model generalization to unseen manipulations. Wang et al. used additive, spatial-transformed, and blurring-based ad- versarial examples to strengthen detection methods. Their approach with two genera- tors (Two-Gen-BAT) improved EfficientNet accuracy from 81.35% to 84.10% and Xception accuracy from 96.77% to 97.45% on FaceForensics++. More importantly, cross-dataset performance saw significant gains, Xception jumped from 54.88% to 64.85% on Celeb- DF (Z. Wang et al., 2022). 2.2.10 Color Space and Frequency Domain Methods Mo et al. explored a different angle, analyzing differences in color space components to improve discrimination rates. By combining color-space channel recombinations with a channel attention mechanism, their Xception-based model achieved up to 99.10% accu- racy on the same face generation task. Notably, the model maintained 98.71% accuracy even with JPEG compression factor of 100, demonstrating strong robustness (Mo et al., 2022). 19 2.2.11 Continual and Reinforcement Learning Approaches As new deepfake methods emerge, detectors need to adapt without forgetting previous ones. Li et al. (2023) addressed this through continual learning, evaluating XceptionNet, ResNet-50, and various incremental learning strategies, including NSCIL, LRCIL, iCaRL, and LUCIR. Vision Transformer-based methods like DyTox achieved the best results, around 86% average accuracy with a memory budget of 100, outperforming CNN-based methods (C. Li et al., 2023). Nadimpalli et al. took an unconventional approach, formulating deepfake detection as a hybrid of supervised learning and reinforcement learning. Their RL agent selected optimal augmentations for each test sample individually, with classification scores aver- aged to determine the final result. This approach achieved 0.952 AUC on DeeperForen- sics-1.0 and 0.669 on Celeb-DF in cross-dataset evaluation (Nadimpalli & Rattani, 2022). 2.2.12 Dual-Level and Multi-Task Frameworks Pu et al. proposed a dual-level collaborative framework that tackles frame-level and video-level forgeries simultaneously, using a joint loss function optimizing both the AUC score and error rate. The key advantage of this multitask structure is that frame-level and video-level detection reinforce each other. Their AUC-based loss function also han- dled imbalanced data better than focal loss, resulting in improved robustness to video quality variations and stronger cross-dataset generalization (Pu et al., 2022). 20 3 Synthesis of findings 3.1 Performance Across Different Approaches Detection performance varied substantially across methods and evaluation scenarios. Within-dataset evaluation typically yielded excellent results, with many CNN-based ap- proaches achieving over 95% accuracy on FaceForensics++ (H. Zhao et al., 2021). How- ever, methods optimized for specific datasets often experienced dramatic performance degradation when tested on unseen data (Cozzolino et al., 2021). Factors that explain these differences in performance can be identified as: • Dataset quality and manipulation diversity • Compression and video quality effects • Temporal vs. spatial analysis trade-offs • Artifact-specific vs. generic feature learning Studies using high-quality datasets like Celeb-DF, which contains more realistic deep- fakes, showed lower detection rates compared to studies that used FaceForensics++ (Nadimpalli & Rattani, 2022). Wang et al. found that when trained on FaceForensics++ and tested on Celeb-DF, Xception baseline performance dropped to 54.88%, though ad- versarial training improved this to 64.85% (Z. Wang et al., 2022). This pattern is more indicative of the quality gap between training and testing data rather than inherent method limitations. Detection accuracy suffered significantly with increasing compression. Caldelli et al. demonstrated that optical flow methods maintained 97% accuracy on uncompressed videos (C0) but dropped to 91% for C23 and 76% for C40 compression levels (Caldelli et al., 2021). Similarly, Zhao et al. noted their multi-attentional network achieved 97.60% accuracy on high-quality FaceForensics++ but only 88.69% on low-quality versions (H. Zhao et al., 2021). Physiological measurement approaches proved particularly robust to compression, with DeepFakesON-Phys maintaining over 98% AUC even on compressed data (Hernandez-Ortega et al., 2022). Methods emphasizing temporal consistency generally showed better cross-manipu- lation generalization but required more computational resources. Chintha et al.’s 21 recurrent approach achieved perfect 100% accuracy on FaceForensics++ (Chintha et al., 2020), while optical flow-based methods demonstrated superior cross-forgery robust- ness (Caldelli et al., 2021). However, single-frame methods like the multi-attentional net- work traded some temporal robustness for computational efficiency while still achieving competitive performance (H. Zhao et al., 2021). Methods learning manipulation-specific artifacts often excelled within their training domain but struggled with novel forgery techniques. In contrast, approaches like self- blended images that learned generic forgery patterns showed improved cross-dataset generalization, outperforming baselines by 4.90% on DFDC and 11.78% on DFDCP (Shiohara & Yamasaki, 2022). Similarly, identity-based methods like ID-Reveal achieved more than 15% improvement on compressed videos by focusing on identity-level fea- tures rather than manipulation-specific artifacts (Cozzolino et al., 2021). 3.2 Method Specific Strengths and Limitations 3.2.1 CNN-Based Methods CNN-based methods form the backbone of many detection systems; however, they do display some fundamental limitations. Wang et al. noted that CNNs’ overreliance on lo- cal texture information hindered generalization to unseen data (P. Wang et al., 2022). This limitation manifested as the “Implicit Identity Leakage” phenomenon identified by Dong et al., in which binary classifiers unexpectedly learned identity representations ra- ther than forgery artifacts. The ID-unaware approach addressing this issue improved cross-dataset AUC from 91.15% to 93.88% on Celeb-DF (Dong et al., 2023). 3.2.2 Transformer Approaches Transformer-based methods showed promise for improved generalization by modeling both global and local information (P. Wang et al., 2022). Khan et al.’s hybrid transformer achieved 95% accuracy across multiple FaceForensics++ subsets (Khan & Dang-Nguyen, 2022). For audio deepfakes, transformers like AST and Wav2Vec substantially 22 outperformed traditional methods, though they lacked the interpretability of hand- crafted feature approaches (Channing et al., 2024). 3.2.3 Multimodal Methods Audio-visual approaches leveraged cross-modal consistency but did not always surpass single-modality detection (Muppalla et al., 2023). Cheng et al.’s voice-face matching achieved 86.11% AUC on FakeAVCeleb, representing a notable improvement over vision- only baselines. However, these methods faced challenges when both modalities were manipulated simultaneously (Cheng et al., 2023). The effectiveness of multimodal ap- proaches depended critically on the availability of synchronized, high-quality audio-vis- ual data. 3.2.4 Physiological Signals Biological signal-based detection exploits features that are difficult for current deepfake techniques to replicate (Juefei-Xu et al., 2022). Hernandez-Ortega et al.’s rPPG-based method achieved over 98% AUC, demonstrating the power of physiological cues (Hernandez-Ortega et al., 2022). However, these approaches required careful feature ex- traction and could be sensitive to video quality and lighting conditions. 23 4 Conclusion 4.1 Summary of findings This review covered detection approaches ranging from CNN-based classifiers to physi- ological signal analysis. Most of these methods work well in controlled settings but ap- plying them to real-world content is a different matter. Within-dataset evaluation consistently achieves high detection accuracy, with many methods often surpassing 95% accuracy on standard benchmarks, such as FaceForen- sics++. However, cross-dataset generalization remains an issue, as methods trained on one dataset frequently experience dramatic performance degradation when tested on unseen data. In other words, many detectors learn to recognize artifacts from specific datasets or manipulation methods. They fail to learn general signs of tampering. Another persistent challenge is presented by video compression. Detection accuracy suffers greatly as compression increases, with some methods losing over 20 percentage points between uncompressed and heavily compressed video. This poses problems for real-world deployments of these approaches, as many deepfakes are distributed via so- cial media platforms that apply compression to shared content to reduce the content’s data footprint. Malicious actors can also use this to their advantage and reduce the video quality on purpose to evade deepfake detection systems. Physiological approaches, especially remote photoplethysmography (rPPG), handled compression well. These methods detect subtle signals like heartbeat patterns in skin color. Current deepfake generation technologies cannot replicate these signals reliably. Identity-based methods that focus on behavioral consistency instead of manipulation artifacts showed improved performance on compressed video. Approaches using syn- thetic training data designed to capture generic manipulation patterns, such as self- blended images, showed better cross-dataset generalization than methods trained on specific manipulation types. Methods that extend beyond single-frame spatial analysis each address part of the problem but introduce their own tradeoffs. Temporal analysis methods traded high com- putational cost for better cross-manipulation performance when compared to frame- 24 level approaches. Transformers can capture both global and local detail, which may help them avoid the texture-dependence problem that limits CNNs. Multimodal approaches leverage anomalies in audio-visual consistency and are effective in situations where only one modality is manipulated but are less effective when both are faked simultaneously. A single approach is rarely enough to cover all scenarios where detection takes place. 4.2 Recommendations for future research Current literature leaves several gaps that require further investigation. Poor cross-da- taset generalization limits practical applications. This needs more attention in future work. Evaluation should be done using only previously unseen manipulation types rather than held-out samples from the same distribution. Diffusion-based generation is a newer problem. Most current detectors were built to catch GAN or autoencoder outputs, and it is unclear how well they handle diffusion-generated content. Real-world deployment considerations receive only minor attention in current re- search. Most studies evaluate on curated datasets under controlled conditions, which leaves questions regarding computational efficiency, latency requirements and integra- tion into current content moderation pipelines unexplored. There is also a practical need for methods that can run locally on a phone or laptop, without needing to send every video to a cloud server for analysis. On a broader level, generation and detection methods are locked in an arms race. As detection methods become more sophisticated, the generation methods evolve to evade them. A long-term solution would likely require content authenticity verification at the point of creation, in addition to the detection methods discussed in this paper. Finally, the interpretability of detection decisions remains underdeveloped. Most cur- rent methods function as black boxes, providing binary classifications without explana- tions on how they reached that conclusion. This would enhance trust in these methods in real-world use, such as content moderation or legal proceedings. As deepfakes continue to evolve, detection methods must develop at a similar pace. If detection does not keep up, the credibility of all digital media is at risk, and with it, the public’s trust in what they see or hear online. 25 References Almestekawy, A., Zayed, H. H., & Taha, A. (2024). Deepfake detection: Enhancing perfor- mance with spatiotemporal texture and deep learning feature fusion. Egyptian Informatics Journal, 27, 100535. https://doi.org/10.1016/j.eij.2024.100535 Ayodele R. Akinyele, Frederick Ogunseye, Adewale Asimolowo, Geoffrey Munyaneza, Oluwatosin Mudele, & Oluwole Olakunle Ajayi. (2024). Advancements in dif- fusion models for high-resolution image and short form video generation. GSC Advanced Research and Reviews, 21(2), 508–520. https://doi.org/10.30574/gscarr.2024.21.2.0441 Caldelli, R., Galteri, L., Amerini, I., & Del Bimbo, A. (2021). Optical Flow based CNN for detection of unlearnt deepfake manipulations. Pattern Recognition Letters, 146, 31–37. https://doi.org/10.1016/j.patrec.2021.03.005 Channing, G., Sock, J., Clark, R., Torr, P., & Witt, C. S. de. (2024). Toward Robust Real- World Audio Deepfake Detection: Closing the Explainability Gap (No. ar- Xiv:2410.07436). arXiv. https://doi.org/10.48550/arXiv.2410.07436 Chen, M., Mei, S., Fan, J., & Wang, M. (2024). Opportunities and challenges of diffusion models for generative AI. National Science Review, 11(12), nwae348. https://doi.org/10.1093/nsr/nwae348 Cheng, H., Guo, Y., Wang, T., Li, Q., Chang, X., & Nie, L. (2023). Voice-Face Homogeneity Tells Deepfake. ACM Trans. Multimedia Comput. Commun. Appl., 20(3), 76:1- 76:22. https://doi.org/10.1145/3625231 Chintha, A., Thai, B., Sohrawardi, S. J., Bhatt, K., Hickerson, A., Wright, M., & Ptucha, R. (2020). Recurrent Convolutional Structures for Audio Spoof and Video Deepfake 26 Detection. IEEE Journal of Selected Topics in Signal Processing, 14(5), 1024–1037. IEEE Journal of Selected Topics in Signal Processing. https://doi.org/10.1109/JSTSP.2020.2999185 Cozzolino, D., Rössler, A., Thies, J., Nießner, M., & Verdoliva, L. (2021). ID-Reveal: Identity- aware DeepFake Video Detection. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 15088–15097. https://doi.org/10.1109/ICCV48922.2021.01483 Dolhansky, B., Bitton, J., Pflaum, B., Lu, J., Howes, R., Wang, M., & Ferrer, C. C. (2020). The DeepFake Detection Challenge (DFDC) Dataset (No. arXiv:2006.07397). arXiv. https://doi.org/10.48550/arXiv.2006.07397 Dong, S., Wang, J., Ji, R., Liang, J., Fan, H., & Ge, Z. (2023). Implicit Identity Leakage: The Stumbling Block to Improving Deepfake Detection Generalization. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3994–4004. https://doi.org/10.1109/CVPR52729.2023.00389 Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative Adversarial Networks (No. arXiv:1406.2661). arXiv. https://doi.org/10.48550/arXiv.1406.2661 Hernandez-Ortega, J., Tolosana, R., Fierrez, J., & Morales, A. (2022). DeepFakes Detection Based on Heart Rate Estimation: Single- and Multi-frame. In C. Rathgeb, R. To- losana, R. Vera-Rodriguez, & C. Busch (Eds.), Handbook of Digital Face Manipula- tion and Detection (pp. 255–273). Springer International Publishing. https://doi.org/10.1007/978-3-030-87664-7_12 27 Hu, J., Liao, X., Liang, J., Zhou, W., & Qin, Z. (2022). FInfer: Frame Inference-Based Deepfake Detection for High-Visual-Quality Videos. Proceedings of the AAAI Con- ference on Artificial Intelligence, 36(1), Article 1. https://doi.org/10.1609/aaai.v36i1.19978 Hu, J., Liao, X., Wang, W., & Qin, Z. (2022). Detecting Compressed Deepfake Videos in Social Networks Using Frame-Temporality Two-Stream Convolutional Network. IEEE Transactions on Circuits and Systems for Video Technology, 32(3), 1089– 1102. IEEE Transactions on Circuits and Systems for Video Technology. https://doi.org/10.1109/TCSVT.2021.3074259 Juefei-Xu, F., Wang, R., Huang, Y., Guo, Q., Ma, L., & Liu, Y. (2022). Countering Malicious DeepFakes: Survey, Battleground, and Horizon. International Journal of Computer Vision, 130(7), 1678–1734. https://doi.org/10.1007/s11263-022-01606-8 Karras, T., Laine, S., & Aila, T. (2019). A Style-Based Generator Architecture for Generative Adversarial Networks (No. arXiv:1812.04948). arXiv. https://doi.org/10.48550/arXiv.1812.04948 Khan, S. A., & Dang-Nguyen, D.-T. (2022). Hybrid Transformer Network for Deepfake De- tection. International Conference on Content-Based Multimedia Indexing, 8–14. CBMI 2022: International Conference on Content-based Multimedia Indexing. https://doi.org/10.1145/3549555.3549588 Kietzmann, J. (2020). Deepfakes: Trick or treat? Business Horizons, 63(2), 135. https://doi.org/10.1016/j.bushor.2019.11.006 Kingma, D. P., & Welling, M. (2013). Auto-Encoding Variational Bayes (No. ar- Xiv:1312.6114). arXiv. https://doi.org/10.48550/arXiv.1312.6114 28 Li, C., Huang, Z., Paudel, D. P., Wang, Y., Shahbazi, M., Hong, X., & Van Gool, L. (2023). A Continual Deepfake Detection Benchmark: Dataset, Methods, and Essentials. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 1339–1349. https://doi.org/10.1109/WACV56688.2023.00139 Li, G., Zhao, X., & Cao, Y. (2023). Forensic Symmetry for DeepFakes. IEEE Transactions on Information Forensics and Security, 18, 1095–1110. IEEE Transactions on Infor- mation Forensics and Security. https://doi.org/10.1109/TIFS.2023.3235579 Li, X., Ni, R., Yang, P., Fu, Z., & Zhao, Y. (2023). Artifacts-Disentangled Adversarial Learning for Deepfake Detection. IEEE Transactions on Circuits and Systems for Video Tech- nology, 33(4), 1658–1670. IEEE Transactions on Circuits and Systems for Video Technology. https://doi.org/10.1109/TCSVT.2022.3217950 Liu, H., Zhou, W., Chen, D., Fang, H., Bian, H., Liu, K., Zhang, W., & Yu, N. (2023). Coherent adversarial deepfake video generation. Signal Processing, 203, 108790. https://doi.org/10.1016/j.sigpro.2022.108790 Masood, M., Nawaz, M., Malik, K. M., Javed, A., Irtaza, A., & Malik, H. (2023). Deepfakes generation and detection: State-of-the-art, open challenges, countermeasures, and way forward. Applied Intelligence, 53(4), 3974–4026. https://doi.org/10.1007/s10489-022-03766-z Minaee, S., Boykov, Y., Porikli, F., Plaza, A., Kehtarnavaz, N., & Terzopoulos, D. (2020). Image Segmentation Using Deep Learning: A Survey (No. arXiv:2001.05566). ar- Xiv. https://doi.org/10.48550/arXiv.2001.05566 Mirsky, Y., & Lee, W. (2022). The Creation and Detection of Deepfakes: A Survey. ACM Computing Surveys, 54(1), 1–41. https://doi.org/10.1145/3425780 29 Mo, S., Lu, P., & Liu, X. (2022). AI-Generated Face Image Identification with Different Co- lor Space Channel Combinations. Sensors (Basel, Switzerland), 22(21), 8228. https://doi.org/10.3390/s22218228 Muppalla, S., Jia, S., & Lyu, S. (2023). Integrating Audio-Visual Features for Multimodal Deepfake Detection (No. arXiv:2310.03827). arXiv. https://doi.org/10.48550/ar- Xiv.2310.03827 Nadimpalli, A. V., & Rattani, A. (2022). On Improving Cross-dataset Generalization of Deepfake Detectors (No. arXiv:2204.04285). arXiv. https://doi.org/10.48550/ar- Xiv.2204.04285 Nguyen, T. T., Nguyen, Q. V. H., Nguyen, D. T., Nguyen, D. T., Huynh-The, T., Nahavandi, S., Nguyen, T. T., Pham, Q.-V., & Nguyen, C. M. (2022). Deep learning for deepfakes creation and detection: A survey. Computer Vision and Image Unders- tanding, 223, 103525. https://doi.org/10.1016/j.cviu.2022.103525 Nirkin, Y., Wolf, L., Keller, Y., & Hassner, T. (2022). DeepFake Detection Based on Discre- pancies Between Faces and Their Context. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10), 6111–6121. IEEE Transactions on Pattern Ana- lysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2021.3093446 Pei, G., Zhang, J., Hu, M., Zhang, Z., Wang, C., Wu, Y., Zhai, G., Yang, J., Shen, C., & Tao, D. (2024). Deepfake Generation and Detection: A Benchmark and Survey (No. ar- Xiv:2403.17881; Version 4). arXiv. https://doi.org/10.48550/arXiv.2403.17881 Pu, W., Hu, J., Wang, X., Li, Y., Hu, S., Zhu, B., Song, R., Song, Q., Wu, X., & Lyu, S. (2022). Learning a deep dual-level network for robust DeepFake detection. Pattern Re- cognition, 130, 108832. https://doi.org/10.1016/j.patcog.2022.108832 30 Radford, A., Metz, L., & Chintala, S. (2016). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks (No. arXiv:1511.06434). ar- Xiv. https://doi.org/10.48550/arXiv.1511.06434 Rana, M. S., Nobi, M. N., Murali, B., & Sung, A. H. (2022). Deepfake Detection: A Syste- matic Literature Review. IEEE Access, 10, 25494–25513. https://doi.org/10.1109/ACCESS.2022.3154404 Raza, A., Munir, K., & Almutairi, M. (2022). A Novel Deep Learning Approach for Deepfake Image Detection. Applied Sciences, 12(19), Article 19. https://doi.org/10.3390/app12199820 Shiohara, K., & Yamasaki, T. (2022). Detecting Deepfakes with Self-Blended Images. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 18699–18708. 2022 IEEE/CVF Conference on Computer Vision and Pattern Re- cognition (CVPR). https://doi.org/10.1109/CVPR52688.2022.01816 Singh, R. K., Sarda, P. V., Aggarwal, S., & Vishwakarma, D. K. (2021). Demystifying deepfakes using deep learning. 2021 5th International Conference on Computing Methodologies and Communication (ICCMC), 1290–1298. https://doi.org/10.1109/ICCMC51019.2021.9418477 Wang, P., Liu, K., Zhou, W., Zhou, H., Liu, H., Zhang, W., & Yu, N. (2022). ADT: Anti- Deepfake Transformer. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2899–2903. https://doi.org/10.1109/ICASSP43922.2022.9746888 31 Wang, Y., & Huang, H. (2024). Audio–visual deepfake detection using articulatory repre- sentation learning. Computer Vision and Image Understanding, 248, 104133. https://doi.org/10.1016/j.cviu.2024.104133 Wang, Z., Guo, Y., & Zuo, W. (2022). Deepfake Forensics via an Adversarial Game. IEEE Transactions on Image Processing, 31, 3541–3552. IEEE Transactions on Image Processing. https://doi.org/10.1109/TIP.2022.3172845 Yu, P., Xia, Z., Fei, J., & Lu, Y. (2021). A Survey on Deepfake Video Detection. IET Biometrics, 10(6), 607–624. https://doi.org/10.1049/bme2.12031 Zhang, T. (2022). Deepfake generation and detection, a survey. Multimedia Tools and Applications, 81(5), 6259–6276. https://doi.org/10.1007/s11042-021-11733-y Zhao, H., Wei, T., Zhou, W., Zhang, W., Chen, D., & Yu, N. (2021). Multi-attentional Deepfake Detection. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2185–2194. https://doi.org/10.1109/CVPR46437.2021.00222 Zhao, T., Xu, X., Xu, M., Ding, H., Xiong, Y., & Xia, W. (2021). Learning Self-Consistency for Deepfake Detection. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 15003–15013. https://doi.org/10.1109/ICCV48922.2021.01475 Zhu, J.-Y., Park, T., Isola, P., & Efros, A. A. (2020). Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks (No. arXiv:1703.10593). arXiv. https://doi.org/10.48550/arXiv.1703.10593