Enhanced Deep Learning Techniques for Real-Time Speech Emotion Recognition in Multilingual Contexts
Received: 11 October 2024 | Revised: 26 October 2024 and 30 October 2024 | Accepted: 3 November 2024 | Online: 26 November 2024
Corresponding author: Fahd M. Aldosari
Abstract
Emotion recognition from speech is crucial for advancing human-computer interaction, enabling more natural and empathetic communication. This study proposes a novel Speech Emotion Recognition (SER) framework that integrates Convolutional Neural Networks (CNNs) with transformer-based architectures to capture both local and contextual speech features. The model demonstrates strong classification performance, particularly for prominent emotions such as anger, sadness, and happiness; however, challenges persist in detecting less frequent emotions, such as surprise and calm, highlighting areas for improvement. Limitations of current datasets, notably their limited linguistic diversity, are also discussed. The findings underscore the model's robustness and identify avenues for enhancement, including the incorporation of more diverse datasets and the use of transfer learning. Future work will explore multimodal approaches and real-time implementation on edge devices to improve the system's adaptability to real-world scenarios.
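The abstract describes the hybrid architecture but the page carries no code. As a rough illustration only, the sketch below shows one common way a CNN-plus-transformer SER model of the kind described can be assembled in PyTorch; the log-mel spectrogram input, layer widths, and eight-class emotion set are assumptions made for the example, not the authors' reported configuration.

    # Hypothetical sketch of a CNN + transformer SER model of the kind the
    # abstract describes. Input features, layer sizes, and the emotion set
    # are illustrative assumptions, not the paper's actual configuration.
    import torch
    import torch.nn as nn

    class CNNTransformerSER(nn.Module):
        def __init__(self, n_mels=64, d_model=128, n_heads=4,
                     n_layers=2, n_classes=8):
            super().__init__()
            # CNN front-end: captures local time-frequency patterns.
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
            )
            self.proj = nn.Linear(64 * (n_mels // 4), d_model)
            # Transformer encoder: models long-range contextual dependencies
            # across frames (positional encoding omitted for brevity).
            layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)
            self.head = nn.Linear(d_model, n_classes)

        def forward(self, x):
            # x: (batch, 1, n_mels, time) log-mel spectrogram
            z = self.cnn(x)                       # (batch, 64, n_mels/4, time/4)
            z = z.permute(0, 3, 1, 2).flatten(2)  # (batch, time/4, 64*n_mels/4)
            z = self.proj(z)
            z = self.encoder(z)                   # contextual frame features
            return self.head(z.mean(dim=1))       # utterance-level logits

    # Usage: logits = CNNTransformerSER()(torch.randn(4, 1, 64, 200))

In such a design, the convolutional stage compresses local time-frequency detail while the self-attention stage relates distant frames, matching the split the abstract draws between local and contextual feature extraction.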
Keywords:
CNN, deep learning, speech emotion recognition, multilingual, real time
License
Copyright (c) 2024 Donia Y. Badawood, Fahd M. Aldosari
This work is licensed under a Creative Commons Attribution 4.0 International License.