DeepEmoNet: A Lightweight Context-Aware CNN for Multimodal Emotion Recognition
Received: 1 October 2025 | Revised: 16 November 2025, 13 December 2025, and 17 December 2025 | Accepted: 18 December 2025 | Online: 7 February 2026
Corresponding author: Sumitra A. Jakhete
Abstract
Multimodal emotion recognition in real-world environments remains challenging due to occlusions, class imbalance, and the high computational cost of existing deep models. This paper presents DeepEmoNet, a lightweight multimodal Convolutional Neural Network (CNN) designed to integrate facial, gait, scene, and socio-dynamic depth cues through an early-fusion architecture based on Depthwise Separable Convolutions (DSCs). The model aims to achieve robust emotion recognition while maintaining low computational overhead suitable for real-time applications. Experiments on the GroupWalk dataset comprising 3,544 annotated agents across 45 environments demonstrate that DeepEmoNet achieves 91.3% accuracy and 86.5% mean Average Precision (mAP), outperforming Inception V3, ResNet-50, MobileNetV2, and recent multimodal baselines. Extended ablation studies highlight the importance of contextual modalities and early fusion, with four DSC modules offering the best accuracy–efficiency balance. Inference analysis further shows a latency of 14.8 ms/frame (~67 frames per second (FPS)), supporting real-time deployment. Overall, DeepEmoNet offers an efficient, context-aware multimodal CNN framework for emotion recognition in surveillance, smart environments, and human–computer interaction.
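The efficiency claim rests on Depthwise Separable Convolutions, which factor a standard convolution into a per-channel depthwise filter followed by a 1x1 pointwise projection. A minimal sketch of the resulting parameter savings is shown below; the kernel size and channel counts are illustrative assumptions, not layer sizes reported in the paper.

```python
# Parameter-count comparison: standard convolution vs. a
# Depthwise Separable Convolution (DSC). Biases are omitted.
# The layer sizes (3x3 kernel, 64 -> 128 channels) are hypothetical.

def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution mixing all channels."""
    return k * k * c_in * c_out

def dsc_params(k: int, c_in: int, c_out: int) -> int:
    """One k x k depthwise filter per input channel,
    plus a 1x1 pointwise projection to c_out channels."""
    return k * k * c_in + c_in * c_out

standard = conv_params(3, 64, 128)  # 3*3*64*128 = 73,728 weights
dsc = dsc_params(3, 64, 128)        # 3*3*64 + 64*128 = 8,768 weights
print(standard, dsc, round(standard / dsc, 1))  # ~8.4x fewer parameters
```

This roughly k*k-fold reduction in weights (and multiply-accumulates) is what makes stacking several DSC modules cheap enough for the real-time latency figures the abstract reports.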
Keywords:
affective computing, Convolutional Neural Network (CNN), computational efficiency, deep learning, Depthwise Separable Convolution (DSC), emotion recognition, multimodal fusion, real-time inference
References
S. Zhao, G. Jia, J. Yang, G. Ding, and K. Keutzer, "Emotion Recognition From Multiple Modalities: Fundamentals and methodologies," IEEE Signal Processing Magazine, vol. 38, no. 6, pp. 59–73, Nov. 2021. DOI: https://doi.org/10.1109/MSP.2021.3106895
S. A. Jakhete and N. Kulkarni, "A Comprehensive Survey and Evaluation of MediaPipe Face Mesh for Human Emotion Recognition," in 2024 8th International Conference on Computing, Communication, Control and Automation, Pune, India, 2024, pp. 1–8. DOI: https://doi.org/10.1109/ICCUBEA61740.2024.10775188
K. R. Scherer, "What are emotions? And how can they be measured?," Social Science Information, vol. 44, no. 4, pp. 695–729, Dec. 2005. DOI: https://doi.org/10.1177/0539018405058216
S. Kaur and N. Kulkarni, "A Deep Learning Technique for Emotion Recognition Using Face and Voice Features," in 2021 IEEE Pune Section International Conference (PuneCon), Pune, India, 2021, pp. 1–6. DOI: https://doi.org/10.1109/PuneCon52575.2021.9686510
Y.-H. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L.-P. Morency, and R. Salakhutdinov, "Multimodal Transformer for Unaligned Multimodal Language Sequences," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019, pp. 6558–6569. DOI: https://doi.org/10.18653/v1/P19-1656
J. J. Deng and C. H. C. Leung, "Towards Learning a Joint Representation from Transformer in Multimodal Emotion Recognition," in 4th International Conference on Brain Informatics, Online, 2021, pp. 179–188. DOI: https://doi.org/10.1007/978-3-030-86993-9_17
B. Nojavanasghari, T. Baltrušaitis, C. E. Hughes, and L.-P. Morency, "EmoReact: a multimodal approach and dataset for recognizing emotional responses in children," in Proceedings of the 18th ACM International Conference on Multimodal Interaction, Tokyo, Japan, 2016, pp. 137–144. DOI: https://doi.org/10.1145/2993148.2993168
B. Li et al., "Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition," in Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, Canada, 2023, pp. 5923–5934. DOI: https://doi.org/10.1145/3581783.3612053
S. Srivastava, S. A. S. Lakshminarayan, S. Hinduja, S. R. Jannat, H. Elhamdadi, and S. Canavan, "Recognizing Emotion in the Wild using Multimodal Data," in Proceedings of the 2020 International Conference on Multimodal Interaction, Virtual Event, Netherlands, 2020, pp. 849–857. DOI: https://doi.org/10.1145/3382507.3417970
S. A. Jakhete and N. Kulkarni, "Enhanced Human Emotion Recognition through Multimodal Data using Deep Learning and Late Fusion Technique," International Journal of Engineering, vol. 39, no. 9, pp. 2177–2188, Sept. 2026. DOI: https://doi.org/10.5829/ije.2026.39.09c.09
S. Ullah, Y. Xie, J. Ou, Z. Wang, and W. Tian, "A Robust Lightweight Compound Emotion Recognition Approach Using Depthwise Separable CNN." Research Square, May 08, 2024. DOI: https://doi.org/10.21203/rs.3.rs-4354821/v1
J. Li, Z. Liu, W. Zhou, A. U. Haq, and A. Saboor, "FERmc: Facial expression recognition framework based on multi-branch fusion and depthwise separable convolution," Information Fusion, vol. 124, Dec. 2025, Art. no. 103416. DOI: https://doi.org/10.1016/j.inffus.2025.103416
S. K. D'Mello and A. Graesser, "Multimodal semi-automated affect detection from conversational cues, gross body language, and facial features," User Modeling and User-Adapted Interaction, vol. 20, no. 2, pp. 147–187, June 2010. DOI: https://doi.org/10.1007/s11257-010-9074-4
Y. Wu, Q. Mi, and T. Gao, "A Comprehensive Review of Multimodal Emotion Recognition: Techniques, Challenges, and Future Directions," Biomimetics, vol. 10, no. 7, June 2025, Art. no. 418. DOI: https://doi.org/10.3390/biomimetics10070418
D. Li, Y. Wang, K. Funakoshi, and M. Okumura, "Joyful: Joint Modality Fusion and Graph Contrastive Learning for Multimodal Emotion Recognition," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 2023, pp. 16051–16069. DOI: https://doi.org/10.18653/v1/2023.emnlp-main.996
W. Ai, Y. Shou, T. Meng, and K. Li, "DER-GCN: Dialog and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Dialog Emotion Recognition," IEEE Transactions on Neural Networks and Learning Systems, vol. 36, no. 3, pp. 4908–4921, Mar. 2025. DOI: https://doi.org/10.1109/TNNLS.2024.3367940
M. Palash and B. Bhargava, "EMERSK - Explainable Multimodal Emotion Recognition With Situational Knowledge," IEEE Transactions on Multimedia, vol. 26, pp. 2785–2794, 2024. DOI: https://doi.org/10.1109/TMM.2023.3304015
H. Lian, C. Lu, S. Li, Y. Zhao, C. Tang, and Y. Zong, "A Survey of Deep Learning-Based Multimodal Emotion Recognition: Speech, Text, and Face," Entropy, vol. 25, no. 10, Oct. 2023, Art. no. 1440. DOI: https://doi.org/10.3390/e25101440
Z. Zhao, Y. Wang, G. Shen, Y. Xu, and J. Zhang, "TDFNet: Transformer-Based Deep-Scale Fusion Network for Multimodal Emotion Recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 3771–3782, 2023. DOI: https://doi.org/10.1109/TASLP.2023.3316458
M.-H. Yi, K.-C. Kwak, and J.-H. Shin, "HyFusER: Hybrid Multimodal Transformer for Emotion Recognition Using Dual Cross Modal Attention," Applied Sciences, vol. 15, no. 3, Jan. 2025, Art. no. 1053. DOI: https://doi.org/10.3390/app15031053
Z. Cheng, X. Bu, Q. Wang, T. Yang, and J. Tu, "EEG-based emotion recognition using multi-scale dynamic CNN and gated transformer," Scientific Reports, vol. 14, no. 1, Dec. 2024, Art. no. 31319. DOI: https://doi.org/10.1038/s41598-024-82705-z
M. P. A. Ramaswamy and S. Palaniswamy, "Multimodal emotion recognition: A comprehensive review, trends, and challenges," WIREs Data Mining and Knowledge Discovery, vol. 14, no. 6, Nov. 2024, Art. no. e1563. DOI: https://doi.org/10.1002/widm.1563
A. A. Wafa, M. M. Eldefrawi, and M. S. Farhan, "Advancing multimodal emotion recognition in big data through prompt engineering and deep adaptive learning," Journal of Big Data, vol. 12, no. 1, Aug. 2025, Art. no. 210. DOI: https://doi.org/10.1186/s40537-025-01264-w
S. Woo, M. Zubair, S. Lim, and D. Kim, "Deep multimodal emotion recognition using modality-aware attention and proxy-based multimodal loss," Internet of Things, vol. 31, May 2025, Art. no. 101562. DOI: https://doi.org/10.1016/j.iot.2025.101562
A. Khalane, R. Makwana, T. Shaikh, and A. Ullah, "Evaluating significant features in context-aware multimodal emotion recognition with XAI methods," Expert Systems, vol. 42, no. 1, Jan. 2025, Art. no. e13403. DOI: https://doi.org/10.1111/exsy.13403
S. Zhang, Y. Yang, C. Chen, X. Zhang, Q. Leng, and X. Zhao, "Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: A systematic review of recent advancements and future prospects," Expert Systems with Applications, vol. 237, Mar. 2024, Art. no. 121692. DOI: https://doi.org/10.1016/j.eswa.2023.121692
T. Mittal, P. Guhan, U. Bhattacharya, R. Chandra, A. Bera, and D. Manocha, "EmotiCon: Context-Aware Multimodal Emotion Recognition Using Frege's Principle," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 2020, pp. 14222–14231. DOI: https://doi.org/10.1109/CVPR42600.2020.01424
D. Yang et al., "Emotion Recognition for Multiple Context Awareness," in 17th European Conference on Computer Vision, Tel Aviv, Israel, 2022, pp. 144–162. DOI: https://doi.org/10.1007/978-3-031-19836-6_9
D. Yang, K. Yang, H. Kuang, Z. Chen, Y. Wang, and L. Zhang, "Towards Context-Aware Emotion Recognition Debiasing From a Causal Demystification Perspective via De-Confounded Training," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 10663–10680, Dec. 2024. DOI: https://doi.org/10.1109/TPAMI.2024.3443129
License
Copyright (c) 2026 Sumitra A. Jakhete, Nilima Kulkarni

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.
