Development of a Robust Neural Network-Based VAD System under Low Signal-to-Noise Ratio Conditions
Corresponding author: Aigul Nurlankyzy
Abstract
This study investigates the development and evaluation of robust Voice Activity Detection (VAD) systems under low Signal-to-Noise Ratio (SNR) conditions, a significant challenge for modern telecommunications and voice-interface systems operating in noisy acoustic environments. The work is motivated by the limited investigation of contemporary hybrid neural network architectures for VAD in low-resource languages such as Kazakh, particularly across a wide range of SNR levels, including extreme values below -10 dB. The central research question is which modern hybrid neural network architecture offers the best balance between accuracy and computational efficiency for speech detection in the Kazakh language under severe noise conditions. Five architectures were developed and tested, CNN+BiGRU, CNN+GRU, CNN+LSTM, CNN+BiLSTM, and CNN+TDNN, based on the KSC2 corpus augmented with synthetic noise across an SNR range from -18 dB to +30 dB, with separate analyses at the fixed levels of 10 dB and -10 dB. MFCC features were used as input, and training and testing were performed with noise samples from the ESC-50 dataset. Experimental results demonstrated that the CNN+BiGRU, CNN+GRU, and CNN+LSTM architectures achieved the highest F1-score (99.6%) and remained robust at SNR levels above -12 dB, whereas CNN+TDNN provided comparable quality with minimal computational complexity and the shortest training time (164 s). The analysis at fixed SNR levels revealed the limited generalization of models trained on a single noise level, highlighting the necessity of incorporating a wide SNR range in training. In conclusion, the hybrid architectures CNN+BiGRU and CNN+TDNN are recommended for deployment in VAD systems for the Kazakh language in highly noisy environments.
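As a rough illustration of the noise-augmentation step described in the abstract, the sketch below mixes a noise sample into a speech signal at a target SNR. It is a minimal sketch, not the authors' implementation: the function name, the plain-list signal representation, and the power-based scaling are illustrative assumptions.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech/noise mixture has the requested
    SNR (in dB) relative to `speech`, then return the mixed signal.
    Both signals are equal-length sequences of float samples.
    Illustrative helper; names are not taken from the paper."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Target noise power from SNR_dB = 10 * log10(P_speech / P_noise')
    target_noise_power = p_speech / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / p_noise)
    return [s + gain * n for s, n in zip(speech, noise)]
```

Sweeping `snr_db` over the paper's range (e.g. -18 dB to +30 dB) on clean corpus utterances would produce the kind of augmented training material the study describes.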
Keywords:
Voice Activity Detection (VAD), Low Signal-to-Noise Ratio (SNR), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), BiGRU, BiLSTM
References
R. M. Patil and C. M. Patil, "Unveiling the State-of-the-Art: A Comprehensive Survey on Voice Activity Detection Techniques," in 2024 Asia Pacific Conference on Innovation in Technology (APCIT), MYSORE, India, Jul. 2024, pp. 1–5. DOI: https://doi.org/10.1109/APCIT62007.2024.10673721
Y. Lu and P. C. Loizou, "A geometric approach to spectral subtraction," Speech Communication, vol. 50, no. 6, pp. 453–466, June 2008. DOI: https://doi.org/10.1016/j.specom.2008.01.003
Y. Hu and P. C. Loizou, "A comparative intelligibility study of single-microphone noise reduction algorithms," The Journal of the Acoustical Society of America, vol. 122, no. 3, pp. 1777–1786, Sept. 2007. DOI: https://doi.org/10.1121/1.2766778
M. A. Hasby, A. G. Putrada, and F. Dawani, "The Quality Comparison of WebRTC and SIP Audio and Video Communications with PSNR," Indonesian Journal on Computing, vol. 6, no. 1, pp. 73–84, Apr. 2021.
R. Çolak and R. Akdeniz, "A Novel Voice Activity Detection for Multi-Channel Noise Reduction," IEEE Access, vol. 9, pp. 91017–91026, 2021. DOI: https://doi.org/10.1109/ACCESS.2021.3086364
Z. Zhu, L. Zhang, K. Pei, and S. Chen, "A robust and lightweight voice activity detection algorithm for speech enhancement at low signal-to-noise ratio," Digital Signal Processing, vol. 141, Sept. 2023, Art. no. 104151. DOI: https://doi.org/10.1016/j.dsp.2023.104151
S. M. Kim, "Auditory Device Voice Activity Detection Based on Statistical Likelihood-Ratio Order Statistics," Applied Sciences, vol. 10, no. 15, Jan. 2020, Art. no. 5026. DOI: https://doi.org/10.3390/app10155026
F. Liu and A. Demosthenous, "A Computation Efficient Voice Activity Detector for Low Signal-to-Noise Ratio in Hearing Aids," in 2021 IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), Dec. 2021, pp. 524–528. DOI: https://doi.org/10.1109/MWSCAS47672.2021.9531915
Y. Iqbal et al., "A Hybrid Speech Enhancement Technique Based on Discrete Wavelet Transform and Spectral Subtraction," IEEE Access, vol. 13, pp. 39765–39781, 2025. DOI: https://doi.org/10.1109/ACCESS.2025.3546434
B. G. Nagaraja, G. T. Yadava, P. Kabballi, and C. M. Patil, "VAD system under uncontrolled environment: A solution for strengthening the noise robustness using MMSE-SPZC," International Journal of Speech Technology, vol. 27, no. 2, pp. 309–317, Jun. 2024. DOI: https://doi.org/10.1007/s10772-024-10104-w
M. Aliouat and M. Djendi, "A new deep learning forward BSS (D-FBSS) algorithm for acoustic noise reduction and speech enhancement," Applied Acoustics, vol. 230, Feb. 2025, Art. no. 110413. DOI: https://doi.org/10.1016/j.apacoust.2024.110413
U. Shrawankar and V. Thakare, "Voice Activity Detector and Noise Trackers for Speech Recognition System in Noisy Environment," International Journal of Advancements in Computing Technology, vol. 2, no. 4, pp. 107–114, Oct. 2010. DOI: https://doi.org/10.4156/ijact.vol2.issue4.11
T. Hughes and K. Mierle, "Recurrent neural networks for voice activity detection," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 7378–7382. DOI: https://doi.org/10.1109/ICASSP.2013.6639096
X. Miao, I. McLoughlin, and Y. Yan, "A New Time-Frequency Attention Mechanism for TDNN and CNN-LSTM-TDNN, with Application to Language Identification," in Interspeech 2019, Sep. 2019, pp. 4080–4084. DOI: https://doi.org/10.21437/Interspeech.2019-1256
Y. Tan and X. Ding, "Heterogeneous Convolutional Recurrent Neural Network with Attention Mechanism and Feature Aggregation for Voice Activity Detection," APSIPA Transactions on Signal and Information Processing, vol. 13, no. 1, 2024. DOI: https://doi.org/10.1561/116.00000158
I. Han, C. N. Om, and U. I. Kim, "A gated recurrent unit based robust voice activity detector," Multimedia Tools and Applications, vol. 83, no. 14, pp. 41939–41949, Apr. 2024. DOI: https://doi.org/10.1007/s11042-023-17123-w
N. Wilkinson and T. Niesler, "A Hybrid CNN-BiLSTM Voice Activity Detector," in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), June 2021, pp. 6803–6807. DOI: https://doi.org/10.1109/ICASSP39728.2021.9415081
D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, "Deep Neural Network Embeddings for Text-Independent Speaker Verification," in Interspeech 2017, Aug. 2017, pp. 999–1003. DOI: https://doi.org/10.21437/Interspeech.2017-620
S. Braun and I. Tashev, "On training targets for noise-robust voice activity detection," in 2021 29th European Signal Processing Conference (EUSIPCO), Dec. 2021, pp. 421–425. DOI: https://doi.org/10.23919/EUSIPCO54536.2021.9616082
A. Mnassri, M. Bennasr, and C. Adnane, "A Robust Feature Extraction Method for Real-Time Speech Recognition System on a Raspberry Pi 3 Board," Engineering, Technology & Applied Science Research, vol. 9, no. 2, pp. 4066–4070, Apr. 2019. DOI: https://doi.org/10.48084/etasr.2533
B. G. Nagaraja and G. T. Yadava, "Enhancing Voice Activity Detection in Noisy Environments Using Deep Neural Networks," Circuits, Systems, and Signal Processing, vol. 44, no. 7, pp. 5220–5234, July 2025. DOI: https://doi.org/10.1007/s00034-025-03055-3
Y. Hu and P. C. Loizou, "Subjective comparison and evaluation of speech enhancement algorithms," Speech Communication, vol. 49, no. 7, pp. 588–601, July 2007. DOI: https://doi.org/10.1016/j.specom.2006.12.006
P. Cherukuru and M. B. Mustafa, "CNN-based noise reduction for multi-channel speech enhancement system with discrete wavelet transform (DWT) preprocessing," PeerJ Computer Science, vol. 10, Feb. 2024, Art. no. e1901. DOI: https://doi.org/10.7717/peerj-cs.1901
N. K. Singh and Y. J. Chanu, "Robust Voice Activity Detection Algorithm based on Long Term Dominant Frequency and Spectral Flatness Measure," International Journal of Image, Graphics and Signal Processing, vol. 9, no. 8, pp. 50–58, Aug. 2017. DOI: https://doi.org/10.5815/ijigsp.2017.08.06
J. Ramírez, J. M. Górriz, J. C. Segura, "Voice Activity Detection. Fundamentals and Speech Recognition System Robustness," in Robust Speech Recognition and Understanding, 2007. DOI: https://doi.org/10.5772/4740
J. Sohn, N. S. Kim, and W. Sung, "A statistical model-based voice activity detection," IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1–3, Jan. 1999.
"Recommendation G.729: A silence compression scheme for G.729 optimized for terminals conforming to Recommendation V.70," International Telecommunication Union (ITU), [Online]. Available: https://www.itu.int/rec/T-REC-G.729-199610-S!AnnB/en.
"Specification # 26.194." Third Generation Partnership Project (3GPP) - European Telecommunication Standards Institute (ETSI), [Online]. Available: https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=1428.
B. Karan, J. Jansen van Vüren, F. de Wet, and T. Niesler, "A Transformer-Based Voice Activity Detector," in Interspeech 2024, Kos, Greece, 2024, pp. 3819–3823. DOI: https://doi.org/10.21437/Interspeech.2024-1019
S. Kumar et al., "Comparative Analysis of Personalized Voice Activity Detection Systems: Assessing Real-World Effectiveness," arXiv, June 12, 2024.
D. Wang, X. Xiao, N. Kanda, T. Yoshioka, and J. Wu, "Target Speaker Voice Activity Detection with Transformers and Its Integration with End-To-End Neural Diarization," in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), June 2023, pp. 1–5. DOI: https://doi.org/10.1109/ICASSP49357.2023.10095185
Y. Zhao and B. Champagne, "An Efficient Transformer-Based Model for Voice Activity Detection," in 2022 IEEE 32nd International Workshop on Machine Learning for Signal Processing (MLSP), Dec. 2022, pp. 1–6. DOI: https://doi.org/10.1109/MLSP55214.2022.9943501
C. C. Wang, E. L. Yu, J. W. Hung, S. C. Huang, and B. Chen, "SincQDR-VAD: A Noise-Robust Voice Activity Detection Framework Leveraging Learnable Filters and Ranking-Aware Optimization," arXiv, Aug. 28, 2025.
S. Mussakhojayeva, Y. Khassanov, and H. A. Varol, "KSC2: An Industrial-Scale Open-Source Kazakh Speech Corpus," Interspeech 2022, Incheon, Korea, 2022, pp. 1367–1371. DOI: https://doi.org/10.21437/Interspeech.2022-421
K. J. Piczak, ''ESC: Dataset for Environmental Sound Classification,'' in Proceedings of the 23rd ACM international conference on Multimedia, Brisbane, Australia, Oct. 2015, pp. 1015–1018. DOI: https://doi.org/10.1145/2733373.2806390
A. Nurlankyzy et al., "The dependence of the effectiveness of neural networks for recognizing human voice on language," Eastern-European Journal of Enterprise Technologies, vol. 1, no. 9 (127), pp. 72–81, Feb. 2024. DOI: https://doi.org/10.15587/1729-4061.2024.298687
B. Medetov et al., "Evaluating the effectiveness of a voice activity detector based on various neural networks," Eastern-European Journal of Enterprise Technologies, vol. 133, no. 5, pp. 19–28, Jan. 2025. DOI: https://doi.org/10.15587/1729-4061.2025.321659
License
Copyright (c) 2025 Aigul Kulakayeva, Bekbolat Medetov, Ainur Zhetpisbayeva, Aigul Nurlankyzy

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.
