Efficient Feature Extraction Algorithms to Develop an Arabic Speech Recognition System

Abstract—This paper studies three feature extraction methods, Mel-Frequency Cepstral Coefficients (MFCC), Power-Normalized Cepstral Coefficients (PNCC), and the Modified Group Delay Function (ModGDF), for the development of an Automatic Speech Recognition (ASR) system for Arabic. The extracted features were processed by the Support Vector Machine (SVM) algorithm. These algorithms extract speech characteristics, ModGDF computing the group delay function directly from the voice signal, and were applied to audio recorded from Arabic speakers. PNCC provided the best recognition results for Arabic speech in comparison with the other methods, and simulation results showed that both PNCC and ModGDF were more accurate than MFCC for Arabic speech recognition.

I. INTRODUCTION
Speech is the most common and widely used form of communication. Much research focuses on developing reliable systems that can understand and accept commands through speech. Nowadays computers are involved in almost every aspect of our lives, and since communication between people is mostly vocal, people anticipate the same mode of interaction with computers [1]. Speech has the capacity to be an important mode of human-computer interaction, and interest in developing computers that can accept speech as input is growing. The substantial research effort in speech recognition worldwide and the increasing computational power available at lower cost could result in many more speech recognition applications in the near future [3]. Arabic is the most widely spoken language in the Arab world, and the Arabic alphabet is also used in other languages such as Persian, Urdu, and Malay [2].
Research in human-computer speech interaction has focused mostly on developing better speech recognition systems and on gains in precision and productivity [4]. This research applied three distinct feature extraction methods to an Arabic speech dataset, namely Mel-Frequency Cepstral Coefficients (MFCC), Power-Normalized Cepstral Coefficients (PNCC), and the Modified Group Delay Function (ModGDF). The extracted features were classified by a Support Vector Machine (SVM), and the results of the three feature extraction techniques were compared in order to identify the most efficient and accurate one. Each technique has its own properties; ModGDF, for example, is additive and offers high resolution: the additive property allows different functions to be combined in the group delay domain, and the high-resolution property sharpens the peaks of the group delay spectrum [5].
II. BACKGROUND
Speech awareness and evaluation have captivated researchers from Fletcher's early works [6] and the first voice identification devices [7] to the present day. Nevertheless, high-precision machine speech recognition is achieved mostly in quiet settings, as the efficiency of a typical speech recognizer degrades significantly in noisy settings [8]. Environmental influence and other variables were explored in [9]. As technology progresses, speech recognition is being embedded in more devices used in everyday activities, where environmental variables play a major part, such as mobile phone voice recognition applications [10], cars [11], integrated access control and information systems [12], emotion identification systems [13], application monitoring [14], assistance for the disabled [15], and intelligent technology. Besides voice, many acoustic applications are also essential in diverse engineering problems [16][17][18][19][20][21][22]. A noise reduction method can be deployed to enhance efficiency in real-world noisy settings [23][24][25][26]. Machine performance degrades with noise, channel variance, and spontaneous speech far more than human performance does [27]. Automatic Speech Recognition (ASR) has not surpassed human performance in precision and robustness, but we continue to benefit from it by understanding the central principles behind the recognition of Human Speech (HS) [28]. Despite the advancements in auditory processing and popular frontends for ASR devices, only a few elements of noise handling in the auditory periphery are modeled and simulated [29]. For instance, common methods such as MFCC use auditory features like a filter bank of varying bandwidth and a compressive nonlinearity. Perceptual Linear Prediction (PLP) coefficients focus on perceptual processing by using critical-band resolution curves, equal-loudness scaling, and the cube-root law of loudness applied to Linear Prediction Coefficients (LPC) [30].
Synaptic adaptation is an example of auditorily motivated enhancements of speech representation. Standard MFCC or PLP coefficients could be substituted by coefficients based on a cochlear model, in order to better represent the human auditory periphery. The model of synaptic adaptation proposed in [31] showed important improvements in speech recognition efficiency. PNCC, proposed in [32], was based on auditory processing and incorporated new characteristics: a power-law nonlinearity, a noise-suppression algorithm relying on asymmetric filtering, and temporal masking. The experimental findings exhibited improved recognition accuracy compared to MFCC and PLP. Another strategy for feature extraction was based on Deep Neural Networks (DNN). The noise robustness of acoustic models relying on DNNs was evaluated in [33]. Recurrent Neural Networks (RNN) for cleaning distorted input features were applied in [34]. The use of LSTM-RNNs was suggested in [35] to manage highly non-stationary additive noise. An all-inclusive overview of deep learning for robust speech recognition was presented in [36]. Many studies utilized PNCC and MFCC to extract the most significant features from speech signals [37][38][39]. The Modified Group Delay Function (ModGDF) was also used to extract speech features, being more efficient than MFCC. In this work, audio from Arabic speakers was given as input to the system, and three feature extraction techniques, MFCC, PNCC, and ModGDF, were applied to extract significant features of Arabic speech. The SVM algorithm was used for training and classification, and performance measures were employed to evaluate these algorithms.

IV. DATABASE
A speech database was created, populated with utterances from volunteer Yemeni students studying at Dr. Babasaheb Ambedkar Marathwada University in Aurangabad, India. Tables I and II include the demographic information of the volunteers and the basic parameters of the recordings.

A. Recording Procedure
The database was recorded using high-quality headsets (Sennheiser PC360) and the PRAAT software, in a quiet environment. Speech samples were recorded in mono mode at a 16000Hz sampling rate. The microphone was placed at a distance of about 3cm from the volunteer's mouth. Table III displays the hardware and software used during the recording of the speech samples.

V. FEATURE EXTRACTION ALGORITHMS
Feature extraction is vital for developing a speech recognition system. Its main objective is to extract the most significant features for identifying Arabic speakers. Three feature extraction algorithms were applied: PNCC, ModGDF, and MFCC.

A. Power Normalized Cepstral Coefficients (PNCC)
The PNCC feature extraction algorithm for speech recognition is described in [3]. PNCC has two components: initial processing, and temporal integration for environmental analysis.

1) Initial Processing
This stage applies a pre-emphasis filter. Subsequently, a Short-Time Fourier Transform (STFT) is computed using Hamming windows. A DFT size of 1024 was used, corresponding to a frame length of 25.6ms, with 10ms between frames. By weighting the magnitude-squared STFT outputs, the spectral power in 40 analysis bands was obtained for positive frequencies. The center frequencies are linearly spaced between 200Hz and 8000Hz on the Equivalent Rectangular Bandwidth (ERB) scale, using gammatone filters [3].
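The initial-processing chain can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation: the pre-emphasis coefficient 0.97 is an assumption (the text does not give the filter coefficient), and the gammatone filter-bank weighting is omitted, stopping at the magnitude-squared STFT.

```python
import numpy as np

def preemphasis(x, alpha=0.97):
    """First-order pre-emphasis y[n] = x[n] - alpha * x[n-1].
    alpha=0.97 is an assumed, commonly used value."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def stft_power(x, fs=16000, n_fft=1024, frame_ms=25.6, hop_ms=10.0):
    """Hamming-windowed STFT; returns the magnitude-squared spectrum
    of each frame (25.6 ms frames, 10 ms hop, 1024-point DFT)."""
    frame_len = int(fs * frame_ms / 1000)   # ~410 samples at 16 kHz
    hop = int(fs * hop_ms / 1000)           # 160 samples
    window = np.hamming(frame_len)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    power = np.empty((n_frames, n_fft // 2 + 1))
    for m in range(n_frames):
        frame = x[m * hop : m * hop + frame_len] * window
        spec = np.fft.rfft(frame, n=n_fft)   # zero-padded to 1024 points
        power[m] = np.abs(spec) ** 2         # magnitude-squared STFT
    return power
```

In full PNCC processing these per-frame power spectra would then be weighted by the 40 gammatone filter responses to obtain the band powers P[m, l].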

2) Temporal Integration for Environmental Analysis
Most speech recognition systems use analysis frames with lengths between 20 and 30ms. It is often found that longer analysis windows deliver better noise modeling and environmental normalization [6], because the degradation associated with most background conditions changes more slowly than the instantaneous power of speech. In PNCC processing, a quantity referred to as "medium-time power" Q[m,l] is estimated by computing the running average of P[m,l], the power observed in a single frame of analysis, where m is the index of the frame and l is the index of the channel.
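A minimal sketch of the medium-time power estimate: a symmetric running average of the frame power over 2M+1 frames. The value M=2 is an assumption (it is used in the original PNCC literature but is not stated in the text above).

```python
import numpy as np

def medium_time_power(P, M=2):
    """Medium-time power Q[m, l]: running average of the single-frame
    power P[m, l] over a window of 2M+1 frames (truncated at the edges).
    P has shape (n_frames, n_channels); M=2 is an assumed value."""
    n_frames = P.shape[0]
    Q = np.empty_like(P, dtype=float)
    for m in range(n_frames):
        lo = max(0, m - M)
        hi = min(n_frames, m + M + 1)
        Q[m] = P[lo:hi].mean(axis=0)  # average over available neighbors
    return Q
```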

B. Modified Group Delay Function (ModGDF)
This method is discussed in detail in [7][8][9][10][11][12][13][14][15]. It should be noted that the group delay function is different from the phase spectrum: it is defined as the negative derivative of the phase, and it can be used effectively to extract various system parameters when the signal is a minimum-phase signal. This is mainly because, for a minimum-phase signal, the magnitude spectrum and the group delay function resemble each other. Figure 2 shows the process of the ModGDF algorithm for extracting speech features.

C. Mel-Frequency Cepstral Coefficients (MFCC)
Figure 3 shows the processing steps of MFCC for feature extraction. Pre-emphasis is the first step of MFCC; it boosts the high-frequency energy that is attenuated during sound generation. Framing trims the sound signal into shorter segments. Windowing is used to avert the signal discontinuities produced by the framing step. The Fast Fourier Transform (FFT) converts each frame from the time domain to the frequency domain. The filter bank is a set of overlapping band-pass filters. The final step is the Discrete Cosine Transform (DCT), which produces the MFCC coefficients [18]. MFCC is computed from the speech signal using the following three steps:
• Compute the FFT power spectrum of the speech signal
• Apply a Mel-spaced filter bank to the power spectrum to get the band energies
• Compute the DCT of the log filter-bank energies to get uncorrelated MFCCs
The speech signal is first divided into time frames comprising a given number of samples. In most systems, overlapping frames are used to smooth the transition from frame to frame. Each time frame is then windowed with a Hamming window to eliminate discontinuities at the edges [17]. The coefficients w(n) of a Hamming window of length N are computed according to:
w(n) = 0.54 - 0.46 cos(2πn/(N-1)), 0 ≤ n ≤ N-1
w(n) = 0, otherwise (2)
where N is the total number of samples, and n is the current sample.
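The three MFCC steps listed above can be sketched as follows for a single frame. The FFT size (512), filter-bank size (26), and number of cepstral coefficients (13) are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular band-pass filters spaced evenly on the Mel scale."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                 # rising edge
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                 # falling edge
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(frame, fs=16000, n_fft=512, n_filters=26, n_ceps=13):
    """MFCC for one frame: Hamming window -> FFT power spectrum
    -> Mel filter-bank energies -> DCT of the log energies."""
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n=n_fft)) ** 2
    energies = mel_filterbank(n_filters, n_fft, fs) @ power
    log_e = np.log(energies + 1e-10)          # floor avoids log(0)
    # Type-II DCT decorrelates the log filter-bank energies
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1)
                 / (2 * n_filters))
    return dct @ log_e
```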
The Mel scale relates the perceived frequency, or pitch, of a pure tone to its actual measured frequency. Humans discern small changes in pitch better at lower frequencies, so integrating this scale makes the features match more closely what humans hear. The formula for converting from frequency f (in Hz) to the Mel scale is:
m = 2595 log10(1 + f/700)
while the formula for converting back from the Mel scale to frequency is:
f = 700 (10^(m/2595) - 1)

VI. CLASSIFICATION
SVM is principally a binary classifier, but with the following two approaches it can be extended to multi-class tasks: the first is 1-vs-all, i.e., comparing each class to the rest, and the second is 1-vs-1, i.e., comparing each class to each other class separately [20]. In this study, the 1-vs-all approach was used, consisting of as many binary SVMs as there are classes. Each SVM is trained with one class against all the others and is taken into consideration during testing. The decision is eventually made based on the distances between the test data and the hyperplanes of all the SVMs.
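The 1-vs-all decision rule described above can be sketched as follows, assuming the K binary SVMs have already been trained and are represented by (hypothetical) hyperplane parameters `W` and `b`; each test point gets the class whose hyperplane yields the largest signed distance.

```python
import numpy as np

def one_vs_all_predict(X, W, b):
    """1-vs-all SVM decision.
    W[k] (shape (K, d)) and b[k] (shape (K,)) are the hyperplane normal
    and bias of the k-th binary SVM (class k vs. all remaining classes),
    assumed already trained. X has shape (n, d)."""
    # signed distance of each point to each hyperplane: (w.x + b) / ||w||
    distances = (X @ W.T + b) / np.linalg.norm(W, axis=1)
    return np.argmax(distances, axis=1)
```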

VII. SIMULATION RESULTS
Several experiments were conducted, employing the speech database, for classification and recognition using MFCC, PNCC, and ModGDF for feature extraction. The training procedure used 60% of the data, while 40% was used for testing. The test procedure was implemented in Matlab 2016, and screenshots are shown in Figures 4 and 5. Evaluation and testing were performed using accuracy rate, specificity, sensitivity, precision, and execution time.
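The evaluation measures named above can be computed from binary (one-vs-rest) confusion-matrix counts. This is a generic sketch of the standard definitions, not the paper's Matlab code.

```python
def performance_measures(tp, fp, tn, fn):
    """Standard measures from binary confusion-matrix counts
    (true/false positives and negatives for one class vs. the rest)."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # recall / true-positive rate
    specificity = tn / (tn + fp)   # true-negative rate
    precision   = tp / (tp + fp)
    return accuracy, sensitivity, specificity, precision
```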

A. Analysis for Arabic Digits
The feature extraction methods were applied on the digit samples, and the results are shown in Table VII. Figure 6 illustrates the methods' performance. As can be observed, ModGDF with SVM obtained better results regarding time cost. PNCC and MFCC with SVM obtained good recognition results, but their execution time was much higher. It is concluded that ModGDF had the lowest time cost, as it reduced the execution time complexity. Table VIII shows the confusion matrix of PNCC for the recognition of Arabic digits. Figure 7 displays a sample of ModGDF with SVM for the recognition of an Arabic digit ("Khamsah").

B. Analysis for Arabic Words
Table IX shows the results on the recognition of Arabic words. The results of ModGDF with SVM were not satisfactory, but its time cost was much lower than that of the other feature extraction methods. PNCC with SVM performed better, but its time cost turned out to be significantly higher. The results are also shown in Figure 8. Table X shows the confusion matrix of PNCC/SVM for the recognition of Arabic words. The confusion matrix confirms that PNCC is more robust in identifying Arabic words. Figure 9 illustrates the performance of PNCC on the recognition of an Arabic word ("Dirham").

C. Analysis for Arabic Sentences
The confusion matrix of PNCC/SVM for the recognition of Arabic sentences is given in Table XII; it shows that PNCC/SVM is capable of recognizing sentences with satisfactory results. The results are also shown in Figure 10, while Figure 11 illustrates the performance of PNCC on the recognition of an Arabic sentence ("What are the available majors?").

VIII. CONCLUSION
In this paper, a speech recognition system for the Arabic language was presented, evaluating three feature extraction algorithms, namely MFCC, PNCC, and ModGDF, with an SVM used for classification. Results showed that PNCC was the most efficient, while ModGDF had moderate accuracy. Both PNCC and ModGDF complemented the SVM well, achieving greater accuracy than MFCC: PNCC had a 93-97% accuracy rate, ModGDF 90%, and MFCC 88%.