Authors: Lakshay Bhardwaj, Rishabh Jain, Kritika, Ritesh Kumar
Abstract: Deepfakes have emerged as a major challenge in digital media forensics because modern generative models can produce highly realistic fake facial content that is difficult to distinguish from authentic media. Their rapid growth has increased the risk of misinformation, identity misuse, and security threats in online environments, motivating the need for reliable forensic datasets and detection frameworks [1], [7], [10]. Existing deepfake detection methods often perform well only under controlled settings: many CNN-based models achieve high accuracy on the datasets they are trained on but show limited robustness on unseen manipulations. Prior work based on compact CNNs, frequency-aware learning, and multi-branch detection has shown promising performance, but cross-dataset generalization remains a major challenge [2]–[5], [8]. To address this issue, this work proposes a multi-branch deepfake detection framework that jointly learns from spatial appearance, residual noise traces, and frequency-domain artifacts. The spatial branch uses a pretrained EfficientNet-B0 backbone to capture facial inconsistencies [6]; the noise branch extracts forensic residual cues using SRM-based filtering, inspired by image manipulation detection methods [9]; and the frequency branch analyzes the log-magnitude spectrum obtained through an FFT to reveal spectral anomalies commonly associated with forged content [3]. An attention-based fusion module combines these complementary representations and adaptively emphasizes the most discriminative branch for each sample, following prior multi-domain and multi-branch approaches [4], [5]. The model is trained and evaluated on frame-level samples derived from the video sequences of the FaceForensics++ dataset [1].
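The noise- and frequency-branch inputs described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the single high-pass kernel stands in for the full SRM filter bank, the function names are hypothetical, and numpy is used in place of a deep-learning framework.

```python
import numpy as np

def fft_log_magnitude(gray):
    """Frequency-branch input: log-magnitude of the centered 2-D FFT."""
    spec = np.fft.fftshift(np.fft.fft2(gray))
    return np.log1p(np.abs(spec))

# One SRM-style second-order residual kernel; the actual noise branch
# would apply a bank of such filters (illustrative choice).
SRM_KERNEL = np.array([[0,  0, 0],
                       [1, -2, 1],
                       [0,  0, 0]], dtype=np.float32) / 2.0

def srm_residual(gray, kernel=SRM_KERNEL):
    """Noise-branch input: high-pass residual via valid 2-D convolution."""
    h, w = gray.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1), dtype=np.float32)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(gray[i:i + kh, j:j + kw] * kernel)
    return out
```

Both transforms are content-suppressing: a smooth, authentic region yields a near-zero residual and a concentrated spectrum, while blending and upsampling artifacts leave high-frequency traces that the corresponding branch can learn from.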
Experimental results show that the proposed framework achieves a final test accuracy of 63.75%, demonstrating that multi-domain feature fusion is effective for improving deepfake detection performance. The results further indicate that attention-guided fusion helps the classifier exploit complementary forensic evidence beyond conventional RGB-only models.
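The attention-guided fusion referred to above can be illustrated as a softmax-weighted combination of per-branch embeddings. This is a hedged sketch under simplifying assumptions: the scoring vector stands in for a learned attention layer, and the function names are illustrative rather than taken from the paper.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_fusion(branch_feats, score_w):
    """Fuse per-branch embeddings with sample-adaptive attention weights.

    branch_feats: (n_branches, d) array, one row per branch
                  (spatial, noise, frequency)
    score_w:      (d,) scoring vector standing in for a learned layer
    Returns the (d,) fused vector and the branch weights.
    """
    scores = branch_feats @ score_w      # one scalar score per branch
    weights = softmax(scores)            # emphasize the most discriminative branch
    fused = weights @ branch_feats       # convex combination of branch features
    return fused, weights
```

Because the weights are computed per sample, the classifier can lean on the frequency branch for one forgery and on the noise residual for another, which is the intuition behind exploiting complementary forensic evidence.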
International Journal of Science, Engineering and Technology