Real Time Speech Recognition based on PWP Thresholding and MFCC using SVM

Abstract: Real-time Automatic Speech Recognition (ASR) is challenging because it demands high computing capability and substantial memory. Achieving robust performance under unavoidable adverse conditions such as speaker variation, accents, and noise is a difficult task, so new and efficient approaches to speech pre-processing and feature extraction are crucial. To reduce the strong dependency on later processing stages in ASR and to improve the quality of the extracted features, noise robustness can be addressed within the ASR feature extraction block itself, which implicitly removes the need for additional compensation parameters or data collection. This paper proposes a new robust acoustic feature extraction approach based on a hybrid technique combining the Perceptual Wavelet Packet (PWP) transform and Mel Frequency Cepstral Coefficients (MFCCs). The proposed system was implemented on a Raspberry Pi board and evaluated in a clean environment, reaching 99% average accuracy. Under real noisy conditions, the recognition rate improved (from 80% to 99%) for the majority of positive Signal-to-Noise Ratios (SNRs), and the results improved considerably for negative SNRs in particular.
Keywords: automatic speech recognition; perceptual wavelet packet transform; Mel frequency cepstrum coefficients; SVM; Raspberry Pi 3

I. INTRODUCTION
Speech recognition technology is an active and widespread research area. Automatic Speech Recognition (ASR) has been widely used in many human-machine interaction applications, such as electronics [1], mobile robots [2][3][4], car audio systems [5], manipulators in industrial assembly lines [6], and security systems [7]. Nonetheless, robust performance remains a real concern for any real-time application because of adverse conditions such as background noise, accents, and speaker variation. As a result, the need for accurate, high-performance, and fast embedded ASR is growing continuously. Many projects have invested in ASR techniques in order to achieve proficient embedded systems that are able to imitate human behavior at all levels. ASR accuracy obtained in laboratory environments is quite high, but once the recognition system is placed in a real environment, the recognition rate drops considerably. Several embedded voice recognition systems have been reported, some implemented in Field Programmable Gate Arrays (FPGAs) [8][9][10] and others in Digital Signal Processors (DSPs) [11,12], all of them with a modest accuracy rate. State-of-the-art ASR systems achieve their performance only under reasonable and controlled training conditions. Once noise is taken into account, system accuracy may become unacceptably low in some sensitive environments. Several researchers have investigated speech feature extraction methods such as Linear Prediction Coefficients (LPC) [13], Perceptual Linear Prediction (PLP) [14], and Linear Predictive Cepstral Coefficients (LPCC), which are used for their effectiveness and simplicity in speech/speaker recognition [15][16].
Mel Frequency Cepstral Coefficients (MFCCs) are the most popular acoustic features used in speech recognition [17]. Beyond its good performance in clean conditions, MFCC-based feature extraction has also been extended to improve speech recognition performance in noisy environments. The most cited methods are Cepstral Mean Subtraction (CMS) [18], Power-Normalized Cepstral Coefficients (PNCCs) [19], and Cepstral Mean Normalization (CMN) [20], which is a popular feature compensation method for dealing with convolutional noise. In the same context, most published works report that wavelet-based feature extraction [21][22][23][24] improves performance over traditional cepstral features in noisy environments. These wavelet-based techniques rely on the multiresolution properties of the PWP and combine the MFCC features extracted from the various frequency sub-bands into a single feature vector.
In this paper, a new method for real-time speech recognition under both clean and noisy conditions is proposed and implemented on a Raspberry Pi 3 board. The proposed method is based on MFCC extraction from the speech signal after wavelet thresholding. The main idea is to exploit the coefficients produced by the wavelet transform decomposition after eliminating the small coefficients associated with noise, which is usually located in the high frequencies. Then, the MFCC method is applied to the signal. Finally, a feature vector is formed by concatenating the obtained MFCCs, and this vector serves as the input to the SVM used for classification.
Our main contribution lies in ensuring a good recognition rate, close to 100%, for positive SNRs. Under real noise, particularly within the SNR range [0, -10 dB], promising results have been reached using the proposed real-time approach. The real-time implementation on a Raspberry Pi gives excellent recognition performance in both clean and noisy conditions.

II. FEATURE EXTRACTION
Feature extraction is the process of retaining the useful information in a speech signal while rejecting redundant and unwanted information. It amounts to a parameterization of the speech signal. This process includes:
• Segmentation of the speech signal into windows.
• Frequency decomposition of the speech signal into critical bands by applying the PWP transform.
MFCC feature extraction is the most widely used, thanks to its good ASR performance and low computational complexity under standard conditions. Both MFCC and its hybrid feature extraction technique with PWP are employed here. A brief outline of the proposed method is given in Figure 1.

A. Mel Frequency Cepstral Coefficients Meaning
MFCCs are frequency-domain features based on the scale of the human ear. This scale [25] is approximately linear up to 1 kHz and logarithmic at higher frequencies. These frequency-domain features [26] offer more accuracy than time-domain ones. In this technique, the same information can be captured in fewer coefficients, making the representation more compact. The calculation proceeds as described in our previous work [27]. In brief, the FFT is computed for each speech frame so that the frequency components of the time-domain signal can be extracted. Then, the Mel-scaled filter bank is applied to the FFT of the frame, the logarithm of the filter bank energies is taken, and the DCT is applied to these log energies. Only the first thirteen DCT coefficients are kept and the rest are discarded. The DCT decorrelates the features and arranges them in decreasing order of information.
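The chain described above (windowing, FFT, Mel filter bank, logarithm, DCT, truncation to thirteen coefficients) can be illustrated with a minimal single-frame sketch. It is not the implementation from [27]: the Mel formula with constants 2595 and 700, the 26 triangular filters, and the naive DFT used in place of an FFT are all assumptions chosen to keep the example dependency-free.

```python
import cmath
import math

def hz_to_mel(f_hz):
    # Common Mel formula (assumed; the paper does not state which variant it uses).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Apply a Hamming window, then return |DFT|^2 for bins 0..N/2 (naive DFT)."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n)
                  for i in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [f * n_fft / sample_rate for f in edges_hz]  # fractional bin positions
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))   # rising edge
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))   # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def dct2(values, n_keep):
    """Unnormalised DCT-II, returning only the first n_keep coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * c * (i + 0.5) / n)
                for i, v in enumerate(values))
            for c in range(n_keep)]

def mfcc_frame(frame, sample_rate=16000, n_filters=26, n_ceps=13):
    spec = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    energies = [sum(w * p for w, p in zip(row, spec)) for row in bank]
    log_energies = [math.log(e + 1e-10) for e in energies]  # floor avoids log(0)
    return dct2(log_energies, n_ceps)
```

Calling `mfcc_frame` on a 256-sample frame returns a list of thirteen coefficients; a full front end would repeat this over overlapping frames and use an FFT routine for speed.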

B. Perceptual Wavelet Packet
Wavelets offer a technique for representing a signal in the time-frequency domain. They are commonly used to decompose a signal into high- and low-frequency components. The wavelet coefficients measure the similarity in frequency content between a chosen wavelet function and the given signal. They are calculated as a convolution of the signal with the scaled wavelet function, which can be interpreted as a dilated band-pass filter because of its band-pass spectrum [27][28]. The resulting wavelet transforms are then used as a filter bank, named the Perceptual Wavelet Packet (PWP). The PWP yields a non-redundant representation, which gives better spectral and spatial localization of the signal structure. Compared with other multi-scale representations such as the Gaussian and Laplacian pyramids, the PWP has the advantage of multilevel decomposition, where the signal is decomposed into 'approximation' and 'detail' coefficients at each level [29] through a process equivalent to low-pass and high-pass filtering. As mentioned above, the wavelet transform was introduced for time-frequency analysis of transient signals, and it was extended to multiresolution wavelet transform theory via a Finite Impulse Response (FIR) filter approximation. The discrete wavelets used in multiresolution analysis constitute an orthonormal basis. The PWP decomposition steps take into account both the detail and the approximation coefficients.
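The multilevel split into 'approximation' and 'detail' coefficients can be sketched as follows. This is only an illustration under stated assumptions: the Haar wavelet is used because the mother wavelet is not specified in this section, and the tree is split uniformly at every node, whereas a perceptual wavelet packet would prune the tree so that the sub-bands follow the critical bands.

```python
import math

def haar_split(signal):
    """One analysis step: low-pass (approximation) and high-pass (detail) halves."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal) - 1, 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal) - 1, 2)]
    return approx, detail

def wavelet_packet(signal, levels):
    """Full packet tree: every node is split again, not only the approximation.

    The signal length is assumed to be a multiple of 2**levels. Leaves are
    returned in natural (filter) order, not sorted by frequency.
    """
    nodes = [list(signal)]
    for _ in range(levels):
        next_nodes = []
        for node in nodes:
            approx, detail = haar_split(node)
            next_nodes.extend([approx, detail])
        nodes = next_nodes
    return nodes

def energy(values):
    return sum(v * v for v in values)
```

Because the Haar basis is orthonormal, the total energy of the leaves equals the energy of the input, which is a convenient sanity check on any implementation of the decomposition.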

III. REAL TIME IMPLEMENTATION SLANT
The block diagram of the proposed speech recognition system is illustrated in Figure 2, and the system steps are explained in this section. The microphone input speech is sampled at 16 kHz. First, a Voice Activity Detector (VAD) is used as a noise estimator. The VAD output is a binary signal resulting from the comparison between the input speech signal and a threshold value. The VAD value is true (VAD=1) when the measured input is greater than the threshold, in which case the frame is considered a voiced frame; otherwise the VAD value is false (VAD=0) and the frame is considered a noise frame.
The second step consists of decomposing the speech signal with the PWP. The PWP outcome is a multilevel decomposition in which the signal is divided into 'approximation' and 'detail' coefficients at every stage, a process similar to low-pass and high-pass filtering. The simplest way to remove noise is to operate on the wavelet coefficients that result from the wavelet transform decomposition: the small coefficients associated with the noise are eliminated by a thresholding step. The purpose of the threshold is to recover the useful components from the noisy signal, given an estimate of the noise level. There are various thresholding methods, among which hard and soft thresholding are the most commonly used. Both are adopted in this work and are modeled, in their standard form, by:
$$y_{\mathrm{hard}} = \begin{cases} x, & |x| > \lambda \\ 0, & |x| \le \lambda \end{cases} \qquad y_{\mathrm{soft}} = \begin{cases} \operatorname{sign}(x)\,(|x| - \lambda), & |x| > \lambda \\ 0, & |x| \le \lambda \end{cases}$$
where x, y, and λ denote respectively the input signal, the thresholded signal, and the threshold value.
The MFCCs are computed on the signal after the thresholding and concatenation steps. The signal is filtered and windowed with a Hamming window before the FFT. Next, the signal passes through a Mel filter bank to obtain the twelve cepstral coefficients. Finally, the resulting cepstral coefficients are concatenated to construct the SVM classifier input.
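The VAD decision and the hard and soft threshold rules described in this section can be sketched as below. The function names are illustrative, and the use of mean frame energy as the "measured input" of the VAD is an assumption, since the quantity compared against the threshold is not specified here.

```python
import math

def hard_threshold(coeffs, lam):
    """Keep coefficients whose magnitude exceeds lam; zero the rest."""
    return [c if abs(c) > lam else 0.0 for c in coeffs]

def soft_threshold(coeffs, lam):
    """Zero small coefficients and shrink the remaining ones toward zero by lam."""
    return [math.copysign(abs(c) - lam, c) if abs(c) > lam else 0.0
            for c in coeffs]

def vad(frame, threshold):
    """Return 1 (voiced) when the mean frame energy exceeds threshold, else 0.

    Mean energy is an assumed measure, not necessarily the one used in the paper.
    """
    mean_energy = sum(s * s for s in frame) / len(frame)
    return 1 if mean_energy > threshold else 0
```

For example, with λ = 1 the coefficients 0.5, -2 and 3 become 0, -2, 3 under the hard rule and 0, -1, 2 under the soft rule; the soft rule avoids the discontinuity at |x| = λ at the cost of biasing large coefficients.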
Similarly, this technique is also applied to our own training speech database, which contains spoken words recorded by a single speaker. To improve the performance of the proposed speech recognition algorithm, a denoising module was added to the system to enhance its robustness. This denoising module relies on Adaptive Median Filtering (AMF) [30], which is able to eliminate speckle noise from the data without harming the embedded sharp contrasts. The noise impact can be significantly reduced by applying the AMF to the temporal modulation spectrum, which is the Fourier transform of either the real or the imaginary acoustic spectrogram along the time axis. Thus, the resulting speech features can be more noise-robust and give better speech recognition performance. Figure 3 represents the modified speech recognition system in which the AMF is introduced as a speech denoising module.
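To make the idea of adaptive median filtering concrete, the following is a generic one-dimensional sketch of the classic adaptive median rule: the window grows until its median is not itself an extreme value, and a sample is replaced only when it looks like an impulse. It is not the method of [30], which operates on the temporal modulation spectrum; the function name and the `max_window` parameter are assumptions.

```python
def adaptive_median_1d(values, max_window=7):
    """Adaptive median filter over a 1-D sequence (windows truncated at the edges)."""
    n = len(values)
    out = []
    for i, x in enumerate(values):
        half = 1
        while True:
            window = sorted(values[max(0, i - half): min(n, i + half + 1)])
            z_min, z_med, z_max = window[0], window[len(window) // 2], window[-1]
            if z_min < z_med < z_max:
                # The median is not an impulse: keep x unless x is an extreme value.
                result = x if z_min < x < z_max else z_med
                break
            half += 1
            if 2 * half + 1 > max_window:
                # Window limit reached: fall back to the last median.
                result = z_med
                break
        out.append(result)
    return out
```

Applied to a slowly varying sequence with one isolated spike, the filter replaces the spike with a neighbouring median while leaving interior samples that are not local extremes unchanged.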

IV. TEST PERFORMING AND OBTAINED RESULTS
In order to build the speech recognition system, voice commands and speech models have to be optimized based on a solid training database. In this experiment, the training database contained eleven voice commands, each recorded five times by a single speaker in a silent environment. Each recording consisted of up to 4 s of utterance. A speech recognition application requires more than simulation, so the proposed algorithm was tested on a suitable, flexible platform: the complete setup was implemented and tested on a Raspberry Pi 3 board.

A. Used Raspberry Pi Card Synopsis and System Pattern
The Raspberry Pi 3 is a credit-card-sized board [31] containing a micro-controller and a CPU. Its core is a Broadcom BCM2837 System-on-Chip (SoC) multimedia processor, which has a 64-bit quad-core ARMv8 Cortex-A53 with 1 GB of RAM. It is also equipped with 16 GB of storage expandable to 128 GB, an SD card slot, a 1.2 GHz SoC processor, a VideoCore IV GPU, 4 USB ports, 1 HDMI port, 40 GPIO pins that can be configured as digital inputs or outputs, and an audio jack output. The Raspberry Pi runs a modified version of Linux (Raspbian) optimized for the ARM architecture. As Raspbian is based on Debian, it provides all the compatibility and features required for the program. Python 2.7 and 3.5 are already installed in the Raspbian operating system, so a new installation is not required. Python 2.7 was selected because it had broader community support available than Python 3.5. The project requires some external Python packages that need to be installed separately. Additional measurement packages were also installed in order to evaluate the program performance. All specifications are listed in Table I.

B. Real-time Performance
To validate the performance of the proposed speech recognition algorithm based on MF-PWP/MFCC, the improvements in recognition accuracy obtained with several types of features (MFCC, PWP/MFCC, and MF-MFCC) were compared. The recognition experiments were performed using noisy test data under various noise conditions: white Gaussian and babble noise, with a Signal-to-Noise Ratio (SNR) from -10 dB to +10 dB. Figure 4 compares the results for speech in white noise (Figure 4(a)) and babble noise (Figure 4(b)) under different SNRs for several methods. Specifically, the recognition accuracy was compared for the PWP/MFCC and MF-PWP/MFCC methods described above, along with the baseline MFCC and MF-MFCC. The PWP/MFCC processing provides better accuracy than MFCC features for all the tested noises, although the improvements are small at high SNRs. The lack of improvement for clean speech and high SNRs is a common observation for many approaches to robust speech recognition. The denoising module provides a slight improvement in recognition accuracy, especially at lower SNRs. Finally, the proposed features based on MF-PWP/MFCC perform better than the other features under all test conditions. With both clean and noisy test data, a high and promising recognition rate is obtained with MF-PWP/MFCC for a real-time speech recognition system.
To validate the performance of the proposed speech recognition algorithm against the other feature extraction methods, the memory use, the CPU use, and the execution time were measured. Table II presents the CPU and memory use. These measurements were obtained using htop, a popular Linux text-mode utility for monitoring system processes and performance metrics. In addition, the maximum difference in execution time between the proposed algorithm and the other algorithms does not exceed 15 ms. This small difference in execution time and resource consumption did not affect the robustness of the proposed algorithm.
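Noisy test conditions of the kind used in this subsection (white Gaussian noise at SNRs from -10 dB to +10 dB) can be generated by scaling the noise so that the ratio of signal power to noise power matches the target. The sketch below is one common way to do this and is not taken from the paper; the function names and the fixed random seed are assumptions.

```python
import math
import random

def mean_power(samples):
    return sum(v * v for v in samples) / len(samples)

def add_white_noise(signal, target_snr_db, seed=0):
    """Return (noisy_signal, scaled_noise) with 10*log10(Ps/Pn) == target_snr_db."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    # Noise power required for the requested SNR.
    wanted_noise_power = mean_power(signal) / (10.0 ** (target_snr_db / 10.0))
    scale = math.sqrt(wanted_noise_power / mean_power(noise))
    scaled = [scale * v for v in noise]
    noisy = [s + v for s, v in zip(signal, scaled)]
    return noisy, scaled

def measured_snr_db(signal, noise):
    return 10.0 * math.log10(mean_power(signal) / mean_power(noise))
```

The same scaling applies to recorded babble noise: replace the generated Gaussian samples with a noise segment of the same length as the utterance.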

C. Recognition Rate Comparison
The recognition rate at negative SNRs was given particular attention, and it represents the main contribution of this study. The comparison with the work in [30] is shown in Table III. Generally, published works do not take into account the range [0, -10 dB], because words in this noisy range are very hard to recognize. By contrast, the recognition rate within the range [0, 10 dB] reaches nearly 100%, which is also the case here.

V. CONCLUSION
A new real-time speech recognition algorithm has been proposed in this paper. The proposed algorithm combines the PWP with MFCC to extract speech features, followed by an SVM classification block. The proposed method proves effective, reaching a recognition rate of about 100% in a clean environment. The recognition rate ranges from 98.18% to 100% even in noisy environments from 0 dB to 10 dB when the adaptive median filter is used as a denoising module. Under real noise, principally within the range [0, -10 dB], good results have been reached with the proposed real-time method. For real-time experimentation, a Raspberry Pi has been used as the hardware platform. The performance of the proposed system was sufficient for a wide range of speech-controlled applications. As future work, resource consumption and its impact on embedded speech applications, in addition to accuracy and timing, will be investigated.