Enhancing Arabic Speaker Recognition with ECAPA-TDNN

Mahmoud Ayman; Fahad A. Aloufi

doi:10.48084/etasr.13902

Authors

Mahmoud Ayman, Research and Innovation Department, T2 Company, Riyadh, Saudi Arabia
Fahad A. Aloufi, Department of Cybersecurity, College of Computer, Qassim University, Qassim, Saudi Arabia | Research and Innovation Department, T2 Company, Riyadh, Saudi Arabia

Volume: 16 | Issue: 3 | Pages: 36747-36752 | June 2026 | https://doi.org/10.48084/etasr.13902

Received: 6 August 2025 | Revised: 15 September 2025, 27 October 2025, 8 March 2026, 29 March 2026, 30 March 2026, and 2 April 2026 | Accepted: 3 April 2026 | Online: 6 June 2026

Corresponding author: Mahmoud Ayman

Abstract

This paper presents a fine-tuned Emphasized Channel Attention, Propagation and Aggregation - Time Delay Neural Network (ECAPA-TDNN) model for Arabic speaker recognition, with a focus on enhancing performance in noisy environments. The model was trained on the Voice of Celebrities 1 (VoxCeleb1) and VoxCeleb2 corpora combined with Arabic data from the Qatar Computing Research Institute (QCRI) Aljazeera Speech Resource (QASR), and was evaluated on the VoxCeleb1 test protocol (Vox1-O), the Arab Celebrity (ArabCeleb) dataset, a held-out QASR test split, and an in-house Arabic dataset of authentic recordings. Through targeted fine-tuning and data augmentation techniques, the proposed approach reduces the Equal Error Rate (EER) on Arabic datasets and improves robustness to noise, while maintaining satisfactory performance on English datasets. These findings indicate that careful adaptation can support the development of more balanced multilingual speaker verification systems, particularly for underrepresented languages such as Arabic.

Keywords

ECAPA-TDNN, speaker verification, speaker embeddings, noise
