Enhancement and Reconstruction of Dysphonic Kannada Speech Using a Generative Adversarial Network and a SepFormer Model

Authors

  • P. Rajeswari, JSS Science and Technology University, Manasagangotri, Mysuru, Karnataka, India
  • N. Shankaraiah, S.J. College of Engineering, JSS Science and Technology University, Manasagangotri, Mysuru, Karnataka, India, https://orcid.org/0000-0003-2810-3872
  • S. Rathnakara, S.J. College of Engineering, JSS Science and Technology University, Manasagangotri, Mysuru, Karnataka, India, https://orcid.org/0000-0001-7156-0468
Volume: 15 | Issue: 6 | Pages: 29097-29102 | December 2025 | https://doi.org/10.48084/etasr.14812

Abstract

Human speech is the most effective form of communication, enabling individuals to convey their thoughts, ideas, and emotions clearly to others. However, many people suffer from speech disorders, among which dysphonia is one of the most common. Dysphonia not only hampers everyday interactions but also diminishes an individual's overall quality of life. Although researchers have developed various modern tools to convert dysphonic speech into normal speech, little attention has been paid to dysphonia in languages other than English. This paper presents deep learning-based methods designed to improve dysphonic speech signals in Kannada, one of the most widely spoken languages in South India. Two models, a Generative Adversarial Network (GAN) and a SepFormer, are used to enhance and reconstruct Kannada dysphonic speech signals. The SepFormer outperforms the GAN in terms of objective evaluation metrics.
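The abstract does not list which objective evaluation metrics were used (speech-enhancement papers commonly report measures such as PESQ, STOI, or segmental SNR). As a minimal illustration of this kind of reference-based scoring, the sketch below computes segmental SNR between a clean reference and an enhanced signal in pure Python; the function name and frame length are illustrative choices, not taken from the paper.

```python
import math

def segmental_snr(clean, enhanced, frame_len=160, eps=1e-10):
    """Mean per-frame SNR in dB between a clean reference and an enhanced signal.

    Both inputs are equal-length sequences of samples; frame_len=160 is 10 ms
    at a 16 kHz sampling rate. Higher values indicate a better reconstruction.
    """
    snrs = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        ref = clean[start:start + frame_len]
        est = enhanced[start:start + frame_len]
        sig_energy = sum(x * x for x in ref)                      # reference power
        err_energy = sum((x - y) ** 2 for x, y in zip(ref, est))  # residual power
        snrs.append(10.0 * math.log10((sig_energy + eps) / (err_energy + eps)))
    return sum(snrs) / len(snrs)

# Toy check: a 1 kHz tone sampled at 16 kHz, with and without additive noise.
clean = [math.sin(2 * math.pi * 1000 * n / 16000) for n in range(1600)]
noisy = [s + 0.05 * math.sin(2 * math.pi * 3333 * n / 16000) for n, s in enumerate(clean)]
print(segmental_snr(clean, noisy))  # score of the degraded signal; higher is better
```

In practice, such a metric would be computed between each enhanced Kannada utterance and its clean reference, so that the GAN and SepFormer outputs can be compared on the same scale.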

Keywords:

dysphonia, Generative Adversarial Network (GAN), SepFormer, speech enhancement




How to Cite

[1]
P. Rajeswari, N. Shankaraiah, and S. Rathnakara, “Enhancement and Reconstruction of Dysphonic Kannada Speech Using a Generative Adversarial Network and a SepFormer Model”, Eng. Technol. Appl. Sci. Res., vol. 15, no. 6, pp. 29097–29102, Dec. 2025.
