Real Time Speech Recognition based on PWP Thresholding and MFCC using SVM

W. Helali; Z. Hajaiej; A. Cherif

doi:10.48084/etasr.3759

Authors

W. Helali Faculty of Sciences of Tunis, University Tunis El-Manar, Tunisia
Z. Hajaiej Faculty of Sciences of Tunis, University Tunis El-Manar, Tunisia
A. Cherif Research Unit of Processing and Analysis of Electrical and Energetic Systems, Faculty of Sciences, University of Tunis El Manar, Tunisia

Volume: 10 | Issue: 5 | Pages: 6204-6208 | October 2020 | https://doi.org/10.48084/etasr.3759

Published online first on September 29, 2020.

Corresponding author: W. Helali

Abstract

Real-time Automatic Speech Recognition (ASR) is challenging: it demands high computing capability and substantial memory. Achieving robust performance under unavoidable adverse conditions such as speaker variation, accents, and noise is a difficult task, so it is crucial to develop new and efficient approaches for speech signal pre-processing and feature extraction. To reduce the strong dependency on the subsequent processing steps in ASR and to improve the quality of the extracted features, noise robustness can be addressed within the ASR feature extraction block itself, implicitly removing the need for additional compensation parameters or data collection. This paper proposes a new robust acoustic feature extraction approach based on a hybrid technique combining Perceptual Wavelet Packet (PWP) and Mel Frequency Cepstral Coefficients (MFCCs). The proposed system was implemented on a Raspberry Pi board and its performance was evaluated in a clean environment, reaching 99% average accuracy. Under real noisy conditions, the recognition rate improved (from 80% to 99%) for the majority of positive Signal-to-Noise Ratios (SNRs), and the results improved considerably for negative SNRs in particular.
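The abstract distinguishes positive from negative SNRs. As a reminder of what that distinction means, the following is a minimal sketch, not the authors' evaluation code, of how a noise recording can be scaled so that a test utterance reaches a chosen SNR, using the standard definition SNR = 10·log10(P_signal / P_noise) in dB. The function names (`power`, `snr_db`, `mix_at_snr`), the 440 Hz tone, the 16 kHz sampling rate, and the Gaussian noise are all illustrative assumptions and do not come from the paper.

```python
import math
import random


def power(x):
    """Mean squared amplitude of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def snr_db(signal, noise):
    """SNR in decibels: 10 * log10(P_signal / P_noise)."""
    return 10.0 * math.log10(power(signal) / power(noise))


def mix_at_snr(signal, noise, target_snr_db):
    """Scale `noise` so that signal + noise has the requested SNR.

    Returns (noisy_signal, scaled_noise).
    """
    # Gain g is chosen so that P_signal / (g^2 * P_noise) = 10^(target / 10).
    ratio = 10.0 ** (target_snr_db / 10.0)
    gain = math.sqrt(power(signal) / (power(noise) * ratio))
    scaled = [gain * n for n in noise]
    noisy = [s + n for s, n in zip(signal, scaled)]
    return noisy, scaled


# Hypothetical test data: a 440 Hz tone at 16 kHz stands in for speech,
# and seeded Gaussian samples stand in for recorded noise.
rng = random.Random(0)
fs = 16000
tone = [math.sin(2 * math.pi * 440 * t / fs) for t in range(fs)]
noise = [rng.gauss(0.0, 1.0) for _ in range(fs)]

for target in (10, 0, -5):
    _, scaled = mix_at_snr(tone, noise, target)
    # The measured SNR should match the requested one up to rounding error.
    print(target, snr_db(tone, scaled))
```

At a negative SNR such as -5 dB the scaled noise carries more power than the signal itself, which is why recognition at negative SNRs is the harder case.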

Keywords:

automatic speech recognition, perceptual wavelet packet transform, Mel frequency cepstral coefficients, SVM, Raspberry Pi 3
