Fighting AI With AI: Can Deepfake Detection Keep Up With Rapidly Evolving Generative Models?
By Ankit Prasad
In early 2024, a Hong Kong finance employee fell victim to an elaborate scam during a seemingly routine video conference call. Using hyper-realistic deepfake technology, fraudsters generated multiple fake participants, tricking the employee into transferring $25.6 million across 15 transactions. This case underscores how far deepfakes have evolved — from basic face swaps to real-time, multi-modal deception.
As generative AI continues to push the boundaries of realism, deepfake detection faces a critical question: Can it keep up?
Surprisingly, while generation requires increasingly complex and costly computing power, detection is evolving in a more resource-efficient direction. This asymmetry may ultimately tilt the balance in favour of detection, helping AI-powered safeguards stay one step ahead of malicious deepfake creation.
Deepfake Evolution: From GANs to Diffusion Models
The term "deepfake" first emerged in 2017 when early face-swapping technology was used to manipulate videos. Since then, deepfake generation has progressed from Generative Adversarial Networks (GANs) to Diffusion Models, such as Stable Diffusion, which can produce hyper-realistic content. These advancements have significantly reduced the telltale signs of fakery, such as unnatural eye movements or lighting mismatches. However, achieving this level of realism demands exponentially greater computational resources.
The Hong Kong case highlights a new frontier in deepfake fraud — real-time generation of multiple fake participants in video calls. This level of sophistication requires immense computing power to ensure consistency across modalities like video and audio, making real-time deepfake production an expensive and resource-intensive challenge.
The Challenge of Real-Time Detection
While real-time generation is becoming more feasible with powerful AI architectures, real-time detection remains difficult. Many detection systems rely on batch analysis of datasets, which is impractical for live applications like video calls or social media streams. Identifying deepfakes in real time demands faster processing, robustness to compressed media, and minimal false positives.
A promising solution lies in dynamic frame-based inference, which processes only essential frames based on confidence thresholds. However, achieving reliable real-time detection remains an ongoing challenge that requires further innovation in both algorithms and hardware efficiency.
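The idea behind dynamic frame-based inference can be sketched in a few lines. In this illustrative pure-Python version, `score_frame` is a stand-in for any per-frame detector, and the thresholds and stride are hypothetical values that would be calibrated on a validation set in practice:

```python
def dynamic_frame_inference(frames, score_frame, low=0.2, high=0.8, stride=10):
    """Score only every `stride`-th frame; fall back to scanning the
    surrounding frames when a sampled score lands in the uncertain band."""
    flagged = []
    i = 0
    while i < len(frames):
        score = score_frame(frames[i])
        if score >= high:        # confident fake: flag it and jump ahead
            flagged.append(i)
            i += stride
        elif score <= low:       # confident real: skip ahead cheaply
            i += stride
        else:                    # uncertain: inspect the neighbouring frames
            for j in range(i, min(i + stride, len(frames))):
                if score_frame(frames[j]) >= high:
                    flagged.append(j)
            i += stride
    return flagged
```

The saving comes from running the expensive detector on a fraction of frames whenever it is confident, while still scanning densely around ambiguous moments.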
Resource Efficiency in Detection
Unlike deepfake generation, detection methods are becoming more resource-efficient. Early deepfake detectors relied on Convolutional Neural Networks (CNNs) trained on limited datasets. However, as deepfake sophistication grew, researchers found that expanding training datasets was more effective than designing new architectures from scratch.
For instance, fine-tuning detection models with additional images significantly improved their accuracy against newer deepfakes without requiring expensive computational resources. This strategy demonstrates how dataset diversification enables detectors to adapt efficiently to evolving threats.
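As a toy illustration of that adaptation step, the sketch below fine-tunes the weights of a simple logistic detector on a batch of newly collected, labelled examples rather than retraining from scratch. The linear model and training loop are deliberately minimal stand-ins for a real detector:

```python
import numpy as np

def finetune_logistic(w, X_new, y_new, lr=0.1, epochs=200):
    """Adapt an existing linear detector's weights `w` to newly collected
    examples with a few gradient steps on the logistic loss."""
    w = w.copy()                                      # keep the original weights intact
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X_new @ w))          # sigmoid predictions
        w -= lr * X_new.T @ (p - y_new) / len(y_new)  # logistic-loss gradient step
    return w
```

A real detector would be a deep network, but the principle is the same: a handful of gradient updates on fresh data is vastly cheaper than designing and training a new architecture.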
Another key approach is sequential model waterfalls, where lightweight models handle simpler cases, engaging more complex models only when necessary. This layered strategy optimises both speed and computational efficiency, making detection scalable and practical.
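A sequential waterfall can be expressed as a short escalation loop. Here `stages` would hold detectors ordered from cheapest to most expensive, and the uncertainty band is an illustrative choice:

```python
def waterfall_detect(sample, stages, escalate_band=(0.3, 0.7)):
    """Run detectors from cheapest to most expensive, stopping as soon
    as one returns a verdict outside the uncertain band."""
    low, high = escalate_band
    score = 0.5                        # neutral default if no stage runs
    for stage in stages:
        score = stage(sample)
        if score <= low or score >= high:
            break                      # confident verdict: no need to escalate
    return score >= high, score        # (is_fake, final score)
```

Most content gets resolved by the cheap first stage; only ambiguous samples ever pay the cost of the heavyweight model.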
Fine-Tuning: Adapting Quickly with Less
A universal deepfake detector initially achieved over 95 per cent accuracy but dropped to 60 per cent when tested against newer face-swapped imagery. Rather than designing a new model, researchers restored its accuracy by fine-tuning it on data from the new imagery. This highlights how efficient adaptation can outpace deepfake complexity, enabling detection to stay one step ahead with minimal resource demands.
As deepfake technology shifts from GANs to Diffusion models, the gap between generation and detection widens further. Detection systems leverage simpler techniques — such as transfer learning and dataset expansion — to maintain high accuracy while avoiding the massive computational costs required for increasingly realistic deepfakes.
Hybrid Approaches: Smarter, Not Harder
Detection methods are also becoming smarter through hybrid techniques. For instance, transformers combined with GAN-based detectors can analyse frequency-domain inconsistencies, catching subtle artefacts such as unnatural lighting or noise patterns. Another approach combines Multiple Collaborative GANs (MCGAN) with transfer learning to analyse multimedia content simultaneously, flagging mismatched lip movements or unnatural facial expressions. By integrating multiple techniques, these hybrid models enhance detection capabilities while requiring far fewer resources than the increasingly complex architectures needed for generation.
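To make the frequency-domain idea concrete, here is a minimal numpy sketch of one common cue: synthetic imagery often carries abnormal energy away from the low-frequency centre of the 2-D spectrum. The patch-size split and quarter-width "low band" are illustrative choices, not a published recipe:

```python
import numpy as np

def highfreq_energy_ratio(patch):
    """Fraction of spectral energy outside the low-frequency centre of
    the shifted 2-D FFT of a greyscale patch."""
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(patch))) ** 2
    h, w = spectrum.shape
    ch, cw = h // 4, w // 4
    low = spectrum[h//2 - ch : h//2 + ch, w//2 - cw : w//2 + cw].sum()
    return 1.0 - low / spectrum.sum()
```

A smooth natural region concentrates its energy near the centre, while noise-like generator artefacts spread it outward, so the ratio acts as a cheap one-number feature that a hybrid model can fold in alongside learned features.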
Multimodal Analysis: Turning Complexity Against Itself
Deepfake generation across multiple modalities — video, audio, and even text — demands significant computing power to ensure consistency. Detection systems capitalise on this by employing multimodal analysis, which examines discrepancies between visual and auditory elements. Rather than developing separate models for each modality, these approaches leverage cross-modal relationships to detect inconsistencies efficiently. Multimodal techniques improve accuracy while reducing computational costs, making them a practical solution for combating increasingly sophisticated deepfake threats.
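One simple cross-modal check can be sketched as a correlation between a per-frame mouth-openness track and the audio loudness envelope. Both input signals, the feature names, and the 0.3 threshold are hypothetical; real systems learn these relationships rather than hard-coding them:

```python
import numpy as np

def av_sync_score(mouth_openness, audio_energy):
    """Pearson correlation between per-frame mouth openness and the audio
    loudness envelope; genuine speech tends to correlate strongly."""
    m = (mouth_openness - mouth_openness.mean()) / (mouth_openness.std() + 1e-8)
    a = (audio_energy - audio_energy.mean()) / (audio_energy.std() + 1e-8)
    return float(np.mean(m * a))

def flag_av_mismatch(mouth_openness, audio_energy, threshold=0.3):
    """Flag a clip whose visual and audio tracks barely co-vary."""
    return av_sync_score(mouth_openness, audio_energy) < threshold
```

The point is the asymmetry the article describes: a generator must keep every modality consistent, while a detector only needs to find one seam between them.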
Widening the Resource Gap: The Future of Detection
As AI advances, the resource gap between deepfake generation and detection continues to widen. Several factors drive this trend:
- Real-time generation is resource-heavy, while detection benefits from dynamic inference and sequential processing.
- Dataset expansion improves detection without proportional increases in computing power.
- Transfer learning enables quick adaptation to emerging deepfake techniques.
- Collaborative frameworks distribute workloads, making detection more scalable.
- Lightweight models handle simpler cases, reserving advanced techniques for high-risk scenarios.
Conclusion: Efficiency Will Prevail
While deepfake generation advances in complexity, it also becomes more computationally demanding. In contrast, detection is learning to do more with less, leveraging transfer learning, multimodal analysis, and lightweight inference techniques to identify fakes efficiently. This widening resource gap suggests a promising future: while creating hyper-realistic deepfakes will require ever-greater investment, detecting them will become faster, cheaper, and more accessible.
In the fight against deepfakes, efficiency, not brute force, will be the key to victory.
(The author is the Founder and CEO, Bobble AI)
Disclaimer: The opinions, beliefs, and views expressed by the various authors and forum participants on this website are personal and do not reflect the opinions, beliefs, and views of ABP Network Pvt. Ltd.