Deepfake Audio Detection in Voice Authentication: A Spectral and CNN-Based Comprehensive Review

Ali Osman Mohammed Salih; Abdelmajid Hassan Mansour Emam; Alwalid Bashier Gism Elseed Ahmed; Mahmoud Khalifa; Abdelrazig Suliman; Nissrein Babiker Mohammed Babiker

doi:10.48084/etasr.13400

Authors

Ali Osman Mohammed Salih, Department of Information Systems and Cyber Security, College of Computing and Information Technology, University of Bisha, Bisha 61922, P.O. Box 551, Saudi Arabia
Abdelmajid Hassan Mansour Emam, Department of Information Technology, College of Computing and Information Technology - Khulais, University of Jeddah, 2841 - Ad Duf, Khulays 25535 - 7419, Saudi Arabia
Alwalid Bashier Gism Elseed Ahmed, Department of Computer Science and Artificial Intelligence, College of Computing and Information Technology, University of Bisha, Bisha 61922, P.O. Box 551, Saudi Arabia
Mahmoud Khalifa, Department of Computer Science and Artificial Intelligence, College of Computing and Information Technology, University of Bisha, Bisha 61922, P.O. Box 551, Saudi Arabia
Abdelrazig Suliman, Department of Information Systems and Cyber Security, College of Computing and Information Technology, University of Bisha, Bisha 61922, P.O. Box 551, Saudi Arabia
Nissrein Babiker Mohammed Babiker, Department of Information Systems and Cyber Security, College of Computing and Information Technology, University of Bisha, Bisha 61922, P.O. Box 551, Saudi Arabia

Volume: 15 | Issue: 6 | Pages: 29824-29832 | December 2025 | https://doi.org/10.48084/etasr.13400

Received: 14 July 2025 | Revised: 29 August 2025 | Accepted: 9 September 2025 | Online: 30 October 2025

Corresponding author: Ali Osman Mohammed Salih

Abstract

As voice authentication systems become integral to critical domains such as banking, smart assistants, and remote identity verification, they face escalating threats from AI-generated audio, commonly referred to as audio deepfakes. These synthetic voices, produced by advanced text-to-speech and voice conversion technologies, can convincingly imitate human speech and thereby undermine the reliability and security of authentication frameworks. This study provides a comprehensive review of spectral-based techniques for deepfake audio detection, highlighting the roles of spectrograms, Mel-Frequency Cepstral Coefficients (MFCC), and the Constant-Q Transform (CQT) in exposing time-frequency anomalies. Placing a Convolutional Neural Network (CNN)-based spoof detection module before identity verification is identified as a critical architectural strategy for enhancing system resilience. The review also outlines prevailing challenges, including vulnerability to emerging generative models, limited interpretability of deep learning classifiers, and reduced robustness under realistic or noisy conditions. To advance the field, it emphasizes promising research directions such as hybrid modeling approaches, adversarial training techniques, and the development of multilingual open-access deepfake audio datasets. By critically synthesizing existing research, this review aims to inform the design of more robust, generalizable, and transparent voice authentication systems capable of withstanding the evolving landscape of audio-based threats.
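
The architectural idea named in the abstract, a spoof detector placed in front of identity verification and fed with a time-frequency representation, can be illustrated with a minimal sketch. It is not the reviewed systems' implementation: the function names `log_spectrogram` and `authenticate`, the thresholds, and the frame sizes are assumptions made for illustration, and the two scoring callables stand in for a trained CNN countermeasure and a speaker verifier. A naive DFT is used so that the example needs only the Python standard library.

```python
import cmath
import math
from typing import Callable, List, Sequence

Spectrogram = List[List[float]]


def log_spectrogram(signal: Sequence[float], frame_len: int = 64, hop: int = 32) -> Spectrogram:
    """Log-magnitude spectrogram from a naive DFT over Hann-windowed frames."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1)) for n in range(frame_len)]
    frames: Spectrogram = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):  # keep non-negative frequencies only
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            bins.append(math.log(abs(acc) + 1e-10))  # small offset avoids log(0)
        frames.append(bins)
    return frames


def authenticate(
    signal: Sequence[float],
    spoof_score: Callable[[Spectrogram], float],    # hypothetical CNN countermeasure
    speaker_score: Callable[[Spectrogram], float],  # hypothetical speaker verifier
    spoof_threshold: float = 0.5,    # assumed value, would be tuned on held-out data
    speaker_threshold: float = 0.5,  # assumed value, would be tuned on held-out data
) -> str:
    """Run spoof detection first; verify identity only if the audio looks genuine."""
    features = log_spectrogram(signal)
    if spoof_score(features) >= spoof_threshold:
        return "rejected: suspected spoof"
    if speaker_score(features) < speaker_threshold:
        return "rejected: speaker mismatch"
    return "accepted"
```

In practice the spectral front end would more likely come from a signal-processing library, and MFCC or CQT features could replace the plain spectrogram without changing the gating logic.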

Keywords:

audio deepfakes, voice authentication, spoof detection, spectral features, CNN, ASVspoof
