Acoustic Signal Enhancement Using Deep Neural Networks
Received: 14 February 2025 | Revised: 8 May 2025 | Accepted: 12 May 2025 | Online: 2 August 2025
Corresponding author: Shibani Kar
Abstract
Background noise in acoustic signals, such as speech and audio recordings, degrades listening quality and causes listening fatigue. Standard enhancement methods perform well only under high-SNR conditions, while deep neural networks have demonstrated significant performance gains in image processing and speech recognition. This motivates the use of deep neural networks for denoising speech corrupted by multiple noise types under low-SNR conditions (0 dB). This study applied two types of deep neural networks, convolutional neural networks and deep generative networks, to remove background noise from speech signals at low SNR. The noise reduction networks were trained to estimate the noise component of the signal, which was then subtracted to obtain the denoised speech. Two convolutional architectures, the U-Net and the Convolutional Encoder-Decoder network (CED), and two deep generative networks, the Variational Autoencoder (VAE) and the Vector Quantized Variational Autoencoder (VQ-VAE), were trained on STFT magnitude features of noisy signal frames. Four objective measures were used to assess the quality of the enhanced speech: Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), Segmental Signal-to-Noise Ratio (SSNR), and SNR improvement. Spectral subtraction and logMMSE served as baseline methods for evaluating the networks on two datasets. The comparative results show that the CED is superior for denoising and enhancing speech corrupted by multiple noise types under low-SNR conditions, with far fewer model parameters than the other methods, for both seen and unseen noise conditions.
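The estimate-and-subtract pipeline described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the trained network is replaced here by a hypothetical low-energy-frame noise estimator, the STFT parameters are assumed, and the noisy phase is reused for reconstruction.

```python
import numpy as np
from scipy.signal import stft, istft

def denoise(noisy, fs=16000, nperseg=512):
    """Magnitude-domain noise subtraction on STFT frames (illustrative)."""
    # STFT of the noisy signal; the magnitude would be the network input,
    # and the noisy phase is kept for reconstruction.
    _, _, Z = stft(noisy, fs=fs, nperseg=nperseg)
    mag, phase = np.abs(Z), np.angle(Z)

    # Stand-in for the trained network: estimate the noise magnitude per
    # frequency bin as the average of the 10% lowest-energy frames.
    frame_energy = mag.sum(axis=0)
    quiet = mag[:, frame_energy <= np.percentile(frame_energy, 10)]
    noise_mag = quiet.mean(axis=1, keepdims=True)

    # Subtract the estimated noise magnitude, flooring at zero.
    clean_mag = np.maximum(mag - noise_mag, 0.0)

    # Inverse STFT with the original noisy phase.
    _, clean = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return clean
```

In the paper's approach the noise-magnitude estimate comes from a trained network (U-Net, CED, VAE, or VQ-VAE) rather than the quiet-frame heuristic used here.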
Keywords:
background noise estimation, speech signal, deep neural networks, deep generative networks
License
Copyright (c) 2025 Shibani Kar, Vishwajeet Mukherjee

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.
