Real-Time Speech Emotion Recognition with a CNN-BiLSTM-Attention Deep Learning Model
Received: 12 January 2026 | Revised: 9 February 2026, 28 February 2026, 16 March 2026, 27 March 2026, and 31 March 2026 | Accepted: 1 April 2026 | Online: 8 April 2026
Corresponding author: Hmad Zennou
Abstract
Speech Emotion Recognition (SER) aims to automatically identify human emotions from audio signals by leveraging advanced artificial intelligence techniques. Speech carries multiple layers of information, such as prosodic variation, voice quality, and spectral patterns, captured through continuous and spectral features. Selecting the most informative features is crucial for accurately modeling emotional expression. Many SER systems rely primarily on spectral features such as MFCCs; this study instead combines MFCC and root mean square energy (RMSE) features to construct a richer emotional representation. A hybrid CNN-BiLSTM-Attention architecture is proposed, integrating convolutional layers that extract local spectral patterns, a bidirectional LSTM that captures long-range temporal dependencies, and a soft attention mechanism that emphasizes the most relevant segments of speech. Experimental evaluation on the RAVDESS dataset shows that the proposed model achieves 98.10% accuracy, 97.95% precision, 98.02% recall, and a 97.98% F1-score, outperforming baseline CNN-LSTM models. Although the model is lightweight and designed with real-time deployment in mind, explicit inference latency and throughput measurements are reserved for future work. These results confirm that integrating attention improves the recognition of emotionally salient cues, yielding a robust and compact framework suitable for practical SER applications.
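To make the feature pipeline concrete, the following is a minimal sketch, assuming a standard librosa workflow, of how per-frame MFCC and RMSE features could be extracted and stacked into one representation; the sampling rate, frame parameters, and coefficient count are illustrative choices, not the paper's reported configuration.

```python
# Hypothetical feature-extraction sketch: per-frame MFCCs plus RMS energy,
# stacked into a single time-by-feature matrix. The sampling rate, frame
# parameters, and n_mfcc are illustrative assumptions.
import numpy as np
import librosa

def extract_features(path, sr=22050, n_mfcc=40, n_fft=2048, hop_length=512):
    """Return a (frames, n_mfcc + 1) matrix of MFCC and RMSE features."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length)
    # Same hop length as the MFCCs, so the frame counts align.
    rms = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop_length)
    return np.vstack([mfcc, rms]).T  # shape: (frames, n_mfcc + 1)
```

Likewise, the Keras sketch below shows one plausible realization of the CNN-BiLSTM-Attention design described above; the layer widths, kernel sizes, and the eight-class output (matching the eight RAVDESS emotion categories) are assumptions for illustration, not the authors' exact architecture.

```python
# Hypothetical architecture sketch: Conv1D blocks extract local spectral
# patterns, a BiLSTM models long-range temporal context, and soft attention
# pools the sequence into a single context vector for classification.
from tensorflow.keras import layers, models

def build_model(time_steps, n_features, n_classes=8):
    inp = layers.Input(shape=(time_steps, n_features))

    # Local spectral pattern extraction (filter counts are assumptions).
    x = layers.Conv1D(64, 5, padding="same", activation="relu")(inp)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Conv1D(128, 5, padding="same", activation="relu")(x)
    x = layers.MaxPooling1D(2)(x)

    # Long-range temporal dependencies in both directions.
    h = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)

    # Soft attention: score each time step, normalize over the time axis,
    # and take the attention-weighted sum of the BiLSTM states.
    scores = layers.Dense(1, activation="tanh")(h)   # (batch, T', 1)
    weights = layers.Softmax(axis=1)(scores)         # attention over time
    context = layers.Dot(axes=1)([weights, h])       # (batch, 1, 256)
    context = layers.Flatten()(context)

    out = layers.Dense(n_classes, activation="softmax")(context)
    model = models.Model(inp, out)
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Example: 41 features per frame = 40 MFCCs + 1 RMSE value.
model = build_model(time_steps=228, n_features=41)
```

In this formulation, the Dense(1) scorer followed by a time-axis Softmax yields one attention weight per BiLSTM step, and the Dot layer collapses the sequence into an attention-weighted context vector, which is what lets the classifier emphasize emotionally salient frames.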
Keywords:
speech emotion recognition, attention, CNN, BiLSTM, deep learning, human emotions
License
Copyright (c) 2026 Hmad Zennou, Raja Ouadad, Mohamed Ouhda, Mohamed Baslam

This work is licensed under a Creative Commons Attribution 4.0 International License.
