Deepfake Voice Detection Using CNN And LSTM

25 Apr

Authors: Baby Saral G, Vibin Dev Anand, Hariharan A, Nehaansh Ladiwala

Abstract: Voice synthesis has advanced rapidly. In just a few years, AI-generated speech has gone from obviously robotic to convincing enough that a casual listener can no longer reliably distinguish a real recording from a synthetic one. That shift creates a real problem for voice authentication systems, digital media verification, and any context where the authenticity of an audio recording actually matters. In this work, we explored whether a hybrid CNN–BiLSTM architecture, operating on Mel spectrogram representations, could reliably separate genuine human speech from AI-generated audio. The idea was to let convolutional layers pick up spectral irregularities that synthesis systems tend to leave behind, while bidirectional LSTM layers model how those patterns evolve across time, something a frame-by-frame analysis would miss entirely. We trained and evaluated the model on the Fake-or-Real (FoR) dataset, which contains 17,870 labelled clips split evenly between authentic recordings and outputs from various neural TTS systems. On the held-out test partition, the model achieved an overall accuracy of 99.02%, with precision, recall, and F1-score each at 0.98 and an AUC of 0.9998, higher than we anticipated going in. To check whether the recurrent component was actually contributing, we also trained a CNN-only variant under identical conditions; its accuracy dropped to 96.4%, which suggests the temporal modelling captures something the convolutional layers alone cannot. We describe the full pipeline, from raw audio input through to REST API deployment, and we try to be upfront about where this approach currently falls short.

DOI: https://doi.org/10.5281/zenodo.19757403
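The abstract summarizes the pipeline without implementation detail. The sketch below illustrates the general shape of such a system: a log-Mel spectrogram front end feeding a small CNN whose output sequence is passed to a bidirectional LSTM. It is a minimal illustration only, not the authors' published code; the sample rate, clip length, layer sizes, frame count, and the helper names audio_to_mel and build_cnn_bilstm are assumptions made for this example.

```python
# Minimal sketch of a Mel-spectrogram -> CNN -> BiLSTM classifier.
# All hyperparameters below are illustrative assumptions, not the paper's settings.
import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras import layers, models

SAMPLE_RATE = 16000    # assumed sample rate; not specified in the abstract
CLIP_SECONDS = 2       # assumed fixed clip length
N_MELS = 128           # assumed number of Mel bands

def audio_to_mel(path: str) -> np.ndarray:
    """Load one clip and convert it to a fixed-size log-Mel spectrogram."""
    y, _ = librosa.load(path, sr=SAMPLE_RATE, duration=CLIP_SECONDS)
    # Zero-pad short clips so every example yields the same number of frames
    # (63 frames here, given librosa's default hop length of 512 samples).
    y = np.pad(y, (0, max(0, SAMPLE_RATE * CLIP_SECONDS - len(y))))
    mel = librosa.feature.melspectrogram(y=y, sr=SAMPLE_RATE, n_mels=N_MELS)
    return librosa.power_to_db(mel, ref=np.max)   # shape: (N_MELS, time_frames)

def build_cnn_bilstm(n_mels: int = N_MELS, time_frames: int = 63) -> tf.keras.Model:
    """Convolutional layers over the spectrogram, then a BiLSTM over the time axis."""
    inputs = layers.Input(shape=(n_mels, time_frames, 1))
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)
    # Reorder to (time, frequency, channels) and flatten frequency into features,
    # so the BiLSTM receives one feature vector per down-sampled time step.
    x = layers.Permute((2, 1, 3))(x)
    t, f = time_frames // 4, n_mels // 4          # two 2x2 poolings halve each axis twice
    x = layers.Reshape((t, f * 64))(x)
    x = layers.Bidirectional(layers.LSTM(64))(x)  # temporal modelling across frames
    outputs = layers.Dense(1, activation="sigmoid")(x)   # genuine vs. synthetic
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

A CNN-only baseline of the kind compared against in the abstract could be obtained from the same sketch by replacing the Bidirectional LSTM layer with a pooling or flattening step (for example layers.GlobalAveragePooling1D()), leaving the convolutional front end and training setup unchanged.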