A Performance Comparison of 1D, 2D, and 3D CNN Architectures for Robot Voice Command Classification
Received: 14 August 2025 | Revised: 8 October 2025, 3 November 2025, and 24 November 2025 | Accepted: 25 November 2025 | Online: 9 February 2026
Corresponding author: Djoko Purwanto
Abstract
This study presents a comparative analysis of one-dimensional (1D), two-dimensional (2D), and three-dimensional (3D) Convolutional Neural Network (CNN) architectures for robotic voice command recognition using the Google Speech Commands dataset. Each architecture was evaluated in terms of classification accuracy, test loss, and computational efficiency to assess the trade-off between performance and resource demands. The experimental results show that the 3D CNN achieved the highest accuracy (89.61%) and the lowest test loss (0.406), demonstrating superior capability in modeling spatiotemporal correlations within stacked spectrogram frames. The 2D CNN achieved an accuracy of 87.61% with balanced generalization and inference time. In comparison, the 1D CNN exhibited the lowest accuracy (68.90%) but the fastest inference speed (0.63 ms/sample), making it suitable for real-time robotic systems with limited computational resources. Qualitative evaluation confirmed that higher-dimensional CNNs yielded fewer misclassifications, especially for acoustically similar commands. Overall, the results indicate that the 2D CNN provides the best compromise between accuracy and efficiency, while the 3D CNN offers the highest recognition capability. Future work will focus on lightweight 3D CNN or transformer-based models to improve real-time performance on embedded robotic platforms.
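The practical difference between the three architectures lies in how the audio features are shaped before convolution: a 1D CNN convolves along time over a feature sequence, a 2D CNN treats a single spectrogram as an image, and a 3D CNN convolves over a stack of spectrogram frames. The sketch below illustrates these input conventions in PyTorch; the tensor shapes (40 feature bins, 98 time steps, 7 stacked frames) and layer widths are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

batch = 8  # illustrative batch size

# 1D CNN: feature vectors (e.g., MFCCs) as channels, convolution over time.
# Shape: (batch, n_features, time) -- 40 bins, 98 frames are assumed values.
x1 = torch.randn(batch, 40, 98)
conv1d = nn.Conv1d(in_channels=40, out_channels=64, kernel_size=3, padding=1)
y1 = conv1d(x1)  # time axis preserved by padding: (batch, 64, 98)

# 2D CNN: one spectrogram treated as a single-channel image.
# Shape: (batch, 1, freq, time)
x2 = torch.randn(batch, 1, 40, 98)
conv2d = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=3, padding=1)
y2 = conv2d(x2)  # (batch, 64, 40, 98)

# 3D CNN: a depth axis of stacked spectrogram frames, so the kernel also
# spans adjacent frames and can model spatiotemporal correlations.
# Shape: (batch, 1, depth, freq, time) -- 7 stacked frames of 14 steps each.
x3 = torch.randn(batch, 1, 7, 40, 14)
conv3d = nn.Conv3d(in_channels=1, out_channels=64, kernel_size=3, padding=1)
y3 = conv3d(x3)  # (batch, 64, 7, 40, 14)
```

The shapes make the cost trade-off reported in the abstract concrete: the 3D kernel slides over an extra depth axis, multiplying the multiply-accumulate count relative to the 1D case, which is why the 1D model is fastest and the 3D model most accurate but most expensive.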
Keywords:
1D CNN, 2D CNN, 3D CNN, voice command classification, robot interaction, deep learning
License
Copyright (c) 2025 Santoso, Tri Arief Sardjono, Djoko Purwanto

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.
