Dangerous Sound Detection Using Convolutional Feature Extraction and Temporal Modeling with BiLSTM

Nurzhan Omarov; Aigerim Altayeva

doi:10.48084/etasr.13068

Authors

Nurzhan Omarov, Al-Farabi Kazakh National University, Kazakhstan
Aigerim Altayeva, Al-Farabi Kazakh National University, Kazakhstan

Volume: 15 | Issue: 6 | Pages: 28850-28855 | December 2025 | https://doi.org/10.48084/etasr.13068

Received: 30 June 2025 | Revised: 3 August 2025 and 16 August 2025 | Accepted: 26 August 2025 | Online: 8 October 2025

Corresponding author: Aigerim Altayeva

Abstract

Dangerous sound detection is essential to improving public safety through automated surveillance systems that can identify and classify multiple hazardous acoustic events. This study proposes a hybrid deep learning framework that integrates a one-dimensional Convolutional Neural Network (1D-CNN) for spatial feature extraction with a Bidirectional Long Short-Term Memory (BiLSTM) network for temporal sequence modeling. The system performs multi-class classification over eight categories of dangerous sounds: gunshots, explosions, screaming, crying, glass breaking, fire, emergency alarms, and weapon handling. A comprehensive set of audio features, such as mel-spectrograms, MFCCs, chroma, spectral contrast, and temporal descriptors, is extracted to capture the spectral, tonal, and temporal characteristics of each class. The model achieves high accuracy while maintaining low training and validation losses, demonstrating strong generalization across classes with varying acoustic similarity. Experimental results confirm that the system robustly distinguishes acoustically similar sounds and handles class imbalance effectively. The architecture, supported by a structured preprocessing pipeline, is optimized for scalability and real-time deployment in complex urban environments. These findings highlight the potential of combining convolutional and recurrent deep learning techniques for robust, multi-class acoustic event detection. Future work will focus on lightweight model adaptation, expanded datasets, and integration of multimodal contextual information to further enhance performance and operational reliability.
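The data flow described in the abstract (frame-level features, a 1D convolution along time, a BiLSTM, then an eight-way classifier) can be illustrated with a minimal NumPy forward pass. This is a sketch under stated assumptions, not the authors' implementation: the layer sizes, kernel width, pooling step, mean-over-time readout, and class label strings are all hypothetical, the weights are random and untrained, and the input is a random stand-in for a matrix of stacked per-frame features.

```python
import numpy as np

# Label strings are illustrative names for the eight classes in the abstract.
CLASSES = ["gunshot", "explosion", "screaming", "crying",
           "glass_breaking", "fire", "emergency_alarm", "weapon_handling"]

def conv1d_relu(x, w, b):
    """Valid 1D convolution along time. x: (T, C_in), w: (K, C_in, C_out)."""
    k = w.shape[0]
    windows = np.stack([x[t:t + k] for t in range(x.shape[0] - k + 1)])
    return np.maximum(np.tensordot(windows, w, axes=([1, 2], [0, 1])) + b, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling along time; drops a trailing partial window."""
    t = (x.shape[0] // size) * size
    return x[:t].reshape(-1, size, x.shape[1]).max(axis=1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm(x, w, u, b):
    """One LSTM direction. x: (T, D), w: (D, 4H), u: (H, 4H), b: (4H,) -> (T, H)."""
    h = np.zeros(u.shape[0])
    c = np.zeros(u.shape[0])
    outputs = []
    for x_t in x:
        i, f, g, o = np.split(x_t @ w + h @ u + b, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)

def bilstm(x, fwd, bwd):
    """Concatenate a forward pass and a time-reversed pass: (T, D) -> (T, 2H)."""
    forward = lstm(x, *fwd)
    backward = lstm(x[::-1], *bwd)[::-1]
    return np.concatenate([forward, backward], axis=1)

def init_params(n_features, n_filters=16, kernel=3, hidden=32, n_classes=8, seed=0):
    """Random, untrained weights; every size here is an assumption."""
    rng = np.random.default_rng(seed)
    mat = lambda *shape: rng.normal(0.0, 0.1, shape)
    lstm_params = lambda: (mat(n_filters, 4 * hidden), mat(hidden, 4 * hidden),
                           np.zeros(4 * hidden))
    return {"conv_w": mat(kernel, n_features, n_filters), "conv_b": np.zeros(n_filters),
            "fwd": lstm_params(), "bwd": lstm_params(),
            "out_w": mat(2 * hidden, n_classes), "out_b": np.zeros(n_classes)}

def predict_proba(features, p):
    """features: (T frames, n_features) -> softmax probabilities over the classes."""
    x = max_pool(conv1d_relu(features, p["conv_w"], p["conv_b"]))
    seq = bilstm(x, p["fwd"], p["bwd"])
    logits = seq.mean(axis=0) @ p["out_w"] + p["out_b"]
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Stand-in input: 100 frames of 40 stacked features (random, not real audio).
frames = np.random.default_rng(1).normal(size=(100, 40))
params = init_params(n_features=40)
probs = predict_proba(frames, params)
print(CLASSES[int(probs.argmax())], probs.shape)
```

With untrained weights the predicted label is meaningless; the point is the shape of the computation. In practice the same structure would be built and trained in a deep learning framework, with real per-frame features in place of the random matrix.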

Keywords:

dangerous sound detection, deep learning, CNN, BiLSTM, mel-spectrogram, MFCC, audio classification, public safety, real-time surveillance
