Development of a baseline system for a phoneme recognition task

Maros Jakubec, Eva Lieskovska, Roman Jarina, Michal Chmulik, Michal Kuba
Department of Multimedia and Information-Communication Technologies, University of Zilina
Univerzitna 8215/1, 010 26 Zilina, Slovak Republic

Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Phoneme recognition is one of the fundamental problems in automatic speech recognition. Despite the great progress in speech recognition, discrimination of isolated phonemes is still a challenging task due to coarticulation and the great variability in speaking style. The aim of this work is to develop a system for the classification of isolated English vowels from the TIMIT dataset. In the paper, the following conventional methods are compared: a) the k-Nearest Neighbours approach as a simple nonlinear instance-based classifier, and b) the Gaussian Mixture Model, which belongs to the class of probabilistic acoustic modelling techniques. As a front-end, we applied standard mel-frequency cepstral coefficients with their time derivatives. Various experimental techniques, such as trimming of audio data and cross-validation, were used to increase recognition precision and the reliability of system evaluation. The developed system will be used as a baseline for comparison with newer state-of-the-art approaches.

1 Introduction

Despite the significant progress in automatic speech recognition (ASR) in recent decades, phoneme recognition remains a challenging task. Many experiments have been made to improve the performance of phoneme recognition, including the use of better features or combinations of multiple features, improved statistical models and training criteria, or the modelling of pronunciation, noise, language and more [1].

In this paper, we present ongoing work on the development of a system for the classification of isolated English vowels from the TIMIT dataset. The developed system will be used as a baseline for comparison with more advanced state-of-the-art approaches. We discuss system performance using a) k-nearest neighbours (k-NN) as a simple nonlinear instance-based classifier, and b) a probabilistic approach based on the Gaussian Mixture Model (GMM). The speech spectrum is represented by conventional mel-frequency cepstral coefficients (MFCC).

1.1 Related works

Sha and Saul [2] introduced a system for phoneme recognition. They trained GMMs for multiway classification using the basic principle of SVMs. With MFCCs, including their deltas (time derivatives), and 16 Gaussian mixtures they achieved 69.9% accuracy. Deng and Yu [3] used the Hidden Trajectory Model on a phone recognition task. Similarly, the feature vectors consist of joint static cepstra and their deltas. The resulting accuracy was 75.17%. Hifny and Renals [4] introduced a phonetic recognition system based on the TIMIT database where acoustic modelling is achieved through augmented conditional random fields. They achieved 73.4% accuracy on the core test set and 77.0% on the complete test set. A publication by Mohamed et al. [5] reports the use of neural networks for acoustic modelling. The outcome is 79.3% accuracy on the core test.

The above-mentioned works are focused on different types of phone sets from the TIMIT database. Several studies regarding vowel classification have also been made. Weenink [6] proposed a vowel classification improvement by including information about the known speaker in the process. The goal was to reduce the variance in the vowel space. The 13 monophthong vowels were selected similarly as in [7]. Linear discriminant analysis on bark-scale filter bank energies was used as the classification method. They reported that information about spectral dynamics improved the classification process. Reduction of the between-speaker variance and the within-speaker variance resulted in higher classification accuracy.

An empirical comparison of five classifiers was presented in [8]. SVM, k-NN, Naive Bayes, Quadratic Bayes Normal (QDC) and Nearest Mean algorithms were tested for vowel recognition using the TIMIT corpus. MFCCs were used for signal parameterization. The results of this experiment show that the SVM classifier achieved the best performance, while the QDC classifier had the lowest accuracy. The error rate of the QDC method was decreased by about 10% by using the combination k-NN-QDC-NB. Such a combination of classifiers can be an efficient way to boost the performance of a machine learning method.

Amami et al. [9] conducted a study on different SVM kernels for multi-class vowel recognition from the TIMIT corpus. The optimal parameters of the kernel tricks and the regularization parameter were investigated. Two different feature types, MFCC and PLP, were also applied. Selection of the middle frames of the vowels and Fuzzy c-means clustering (FCM) were evaluated to determine the appropriate front-end analysis. The method based on middle frames outperforms the FCM method. Three middle frames turned out to give the best recognition accuracy. Interestingly, the results showed that the recognition accuracy decreased as the number of frames increased. Regarding SVM classification, the accuracy of the vowel system and the runtime improve with smaller values of the kernel width and the regularization parameter.

Palaz et al. [10] claim that an ASR system based on a neural network can be modelled by an end-to-end training procedure, without the need for separation into feature extraction and classifier parts. In the proposed method, the raw speech waveform was used as the input to the CNN-based speech recognition system. According to the results on the TIMIT phoneme and the Aurora2 connected-words recognition tasks, the CNN-based end-to-end system yields better performance than a standard spectral-feature-extraction-based system.

Although it is not always possible to compare existing systems exactly, Table 1 summarizes some of the most important systems in the field of TIMIT phoneme recognition over the last twenty years. The presented survey is ranked according to system accuracy, including the methods used and the sets of features.

Table 1. Comparison of existing works related to phoneme classification

| Authors | Proposed Methods | Descriptors | Classes | Accuracy |
| --- | --- | --- | --- | --- |
| Biswas et al. [24] | Hidden Markov Model (HMM) | wavelet-based features (84 - PCA) | 21 phonemes | 88.90 % |
| Karsmakers et al. [13] | SVM - RBF kernel | 181-dimensional | 39 phonemes | 82.90 % |
| Mohamed et al. [5] | Monophone Deep Belief Networks | MFCC, Δ, ΔΔ, energy (39) | 39 phonemes | 79.30 % |
| Siniscalchi et al. [14] | TRAPs, temporal context division + lattice rescoring | MFCC, Δ, ΔΔ, energy (39) | 39 phonemes | 79.04 % |
| Hifny & Renals [4] | HMM | 13 MFCC, Δ, ΔΔ (39) | 39 phonemes | 77.00 % |
| Deng & Yu [3] | Hidden Trajectory Models | static / delta cepstra | 39 phonemes | 75.17 % |
| Sha & Saul [2] | GMMs trained as SVMs | 13 MFCC, Δ, ΔΔ (39) | 39 phonemes | 69.90 % |
| Fredj & Ouni [26] | HMM | 13 MFCC, Δ, ΔΔ, PLP (39) | 39 phonemes | 67.60 % |
| Palaz et al. [12] | two-layer MLP + HMM decoder | MFCC, Δ, ΔΔ, energy (39) | 39 phonemes | 66.65 % |
| Palaz et al. [10] | Convolutional neural network + HMM decoder | raw speech | 39 phonemes | 65.50 % |
| Weenink [6] | Linear discriminant analysis | 54-dimensional | 13 vowels | 60.30 % |
| Amami et al. [8] | SVM - RBF kernel + middle frames selection | MFCC, Δ, ΔΔ (36) | 20 vowels | 51.60 % |

2 Proposed methods

2.1 Dataset

The TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC) [1, 15] was used for classification. The TIMIT corpus contains read speech and is primarily designed for studying acoustic-phonetic phenomena and for testing automatic speech recognition systems. 630 people participated in the creation of this database, each contributing by reading 10 phonetically rich sentences. The recordings cover the eight main dialects of American English. Audio files are recorded at 16 000 Hz, 16 bit. Each audio file is accompanied by metadata files containing phonetic and lexical transcriptions.

2.2 Feature extraction methods

The extraction of appropriate features is one of the basic tasks of object recognition. In the conventional ASR front-end, speech is represented by a sequence of feature vectors retaining particularly useful information from the signal. There is a large number of approaches and feature extraction methods in ASR. The features used in our algorithm are described in the following section.

Mel Frequency Cepstral Coefficients (MFCC) are the most commonly used acoustic features in ASR. MFCCs are designed to respect the non-linear sound perception of the human ear [16]. In our system, the MFCCs are computed as follows (Fig. 1). Pre-emphasis is applied to the speech signal in order to emphasize its high-frequency components. The next step is to divide the signal into 16 ms long frames with an overlap of 1/2 of the frame length. The given frame length was selected based on previous studies on isolated phoneme recognition [8, 11, 24]. The number of signal samples per frame (256) is chosen as a power of 2 due to the use of the FFT. A Hamming window is applied to the frames to maintain the continuity of the first and last points in the frames. The signal is converted to the frequency domain using the FFT algorithm, and the magnitude frequency response is then calculated. The spectrum values are multiplied by a series of 20 triangular bandpass filters, summed over the individual filters and then logarithmized. The triangular filter bank has a linear frequency distribution on the mel scale:

$$\mathrm{mel}(f) = 1125 \cdot \ln\left(1 + \frac{f}{700}\right) \tag{1}$$

where f [Hz] is the frequency on the linear scale and mel(f) [mel] corresponds to the frequency on the mel scale. The last step is to calculate the coefficients using the discrete cosine transform (DCT).

Fig. 1. Block diagram of the MFCC computation
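To make the pipeline of Fig. 1 concrete, the following is a minimal Python/NumPy sketch rather than the Voicebox routines actually used in the experiments; the pre-emphasis coefficient (0.97), the 0 Hz to 8 kHz filter-bank range and the omission of the 0th cepstral coefficient are common defaults assumed here, not values stated in the text.

```python
import numpy as np
from scipy.fftpack import dct

def mel(f):                       # Eq. (1)
    return 1125.0 * np.log(1.0 + f / 700.0)

def mel_inv(m):                   # inverse of Eq. (1)
    return 700.0 * (np.exp(m / 1125.0) - 1.0)

def mfcc(signal, fs=16000, frame_len=256, n_filters=20, n_ceps=12):
    """Minimal MFCC sketch: 16 ms Hamming frames, 50% overlap, 20 mel filters."""
    # Pre-emphasis boosts high-frequency components (0.97 is an assumed default).
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing with 1/2-frame overlap and Hamming windowing.
    hop = frame_len // 2
    n_frames = 1 + (len(sig) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([sig[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Magnitude spectrum via the FFT.
    mag = np.abs(np.fft.rfft(frames, frame_len))          # (n_frames, 129)
    # Triangular filters spaced linearly on the mel scale (Eq. 1).
    edges = mel_inv(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((frame_len + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_filters, frame_len // 2 + 1))
    for j in range(1, n_filters + 1):
        l, c, r = bins[j - 1], bins[j], bins[j + 1]
        fbank[j - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising slope
        fbank[j - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling slope
    # Log filter-bank energies, then DCT to obtain the cepstral coefficients.
    feats = np.log(np.maximum(mag @ fbank.T, 1e-10))
    return dct(feats, type=2, axis=1, norm='ortho')[:, 1:n_ceps + 1]
```

For a 1 s utterance at 16 kHz, mfcc(signal) returns roughly 124 frames of 12 static coefficients; the log energy and the delta features of Eqs. (2) and (3) below are then appended to reach the 39-dimensional vectors used in this work.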
An important parameter is also the energy of the frame. Log energy is usually added as the 13th feature to the MFCCs. The short-term energy function is defined by:

$$E_n = \sum_{k=-\infty}^{\infty} \left[ s(k)\, w(n-k) \right]^2 \tag{2}$$

where s(k) is the signal sample at time k and w(n) is the corresponding window. It is then possible to obtain an average energy value for each frame. The disadvantage of this characteristic is its high sensitivity to rapid changes in the signal level. Its values can also be used to separate silence segments from speech segments.

Static features, which are obtained using the procedure above, do not capture inter-frame changes along the time index. Therefore, dynamic (or delta) features are commonly appended to the feature vectors. Delta features are usually estimates of the time derivatives of the static features and are computed as follows [17]:

$$\Delta_k[i] = \frac{\sum_{m=1}^{M} m \left( c_k[i+m] - c_k[i-m] \right)}{2 \sum_{m=1}^{M} m^2} \tag{3}$$

where Δ_k[i] is the delta coefficient of the k-th feature at frame i, c_k is the static coefficient, and a typical value for M is 1.

In the developed system, the complete feature vector consists of 39 elements per frame:
- 12 MFCC,
- 12 delta (ΔMFCC),
- 12 delta-delta (ΔΔMFCC),
- 3 log energy (static, Δ and ΔΔ).

2.3 Classification

The classification process can be divided into a learning and a testing phase; thus, the data set needs to be divided into two subsets. Because of the 10-fold cross-validation evaluation process (Section 2.4), we selected the same number of vowels from each class. Once the data were split, models of the selected vowels were trained and tested according to the chosen method. The general classification scheme is shown in Fig. 2.

Fig. 2. Block diagram of the classification scheme

There are several methods suitable for the phoneme classification task addressed in this work. The following well-established classifiers, namely the Gaussian mixture model (GMM), the Gaussian mixture model with a universal background model (GMM-UBM), and k-nearest neighbours (k-NN), were chosen for the baseline system development due to their easy implementation and good classification properties. We recall a description of these methods in the following sections.

The Gaussian Mixture Model works on the principle of probabilistic modelling of audio features in the feature space. A GMM is defined as the probability density function formed by a linear superposition of K Gaussian components [18][19]:

$$p(\boldsymbol{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \tag{4}$$

where the probability density function of the multivariate Gaussian distribution for an n-dimensional vector x is given by:

$$\mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{n/2}\, |\boldsymbol{\Sigma}|^{1/2}} \exp\left( -\frac{1}{2} (\boldsymbol{x}-\boldsymbol{\mu})^{T} \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}-\boldsymbol{\mu}) \right) \tag{5}$$

with mean vector μ ∈ R^n and covariance matrix Σ ∈ R^(n×n). The π_k are the mixing coefficients, which must satisfy the following conditions:

$$0 \le \pi_k \le 1 \quad \text{and} \quad \sum_{k=1}^{K} \pi_k = 1 \tag{6}$$

The classification function of the proposed GMM classifier has the following form:

$$f(\boldsymbol{x}) = \arg\max_{C} \; p_C(\boldsymbol{x}) \tag{7}$$

where p_C(x) is the GMM of class C. Thus, we are looking for the maximal probability over all classes C.

The training algorithm, which returns a set of parameters Θ = {μ, Σ, π} for each class, is based on the Maximum Likelihood (ML) criterion. Given the model p(x, Θ) with unknown parameters, the aim is to derive its parameters from the training data, a set of feature vectors {x_1, x_2, ..., x_N}. The ML method uses the likelihood function, defined as:

$$F(\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_N \mid \boldsymbol{\theta}) = \prod_{n=1}^{N} p(\boldsymbol{x}_n \mid \boldsymbol{\theta}) \tag{8}$$

The maximum of this function with respect to the unknown parameters θ can be formalized as follows:

$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \sum_{n=1}^{N} \log p(\boldsymbol{x}_n \mid \boldsymbol{\theta}) \tag{9}$$

The maximization defined by (9) is a complicated task that does not have an explicit solution. The expectation-maximization (EM) algorithm [18] is used to find maximum likelihood solutions.
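As an illustration of the decision rule (7) with EM training (9), the sketch below fits one GMM per vowel class and labels a segment by its highest total log-likelihood. scikit-learn's GaussianMixture stands in here for the Netlab implementation used in the experiments (Netlab's 'ppca' covariance option, used in Table 3, has no direct scikit-learn counterpart).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class GMMClassifier:
    """One GMM per class; decision by Eq. (7) on summed log-likelihoods."""
    def __init__(self, n_components=32, covariance_type='full'):
        self.n_components = n_components
        self.covariance_type = covariance_type
        self.models = {}

    def fit(self, features_by_class):
        # features_by_class: dict {class_label: (n_frames, 39) feature array}
        for label, X in features_by_class.items():
            gmm = GaussianMixture(n_components=self.n_components,
                                  covariance_type=self.covariance_type,
                                  max_iter=100, random_state=0)
            self.models[label] = gmm.fit(X)      # EM training, Eq. (9)
        return self

    def predict(self, X):
        # Score a whole vowel segment: sum of per-frame log p(x_n | theta_C),
        # then pick the class with the maximal likelihood, Eq. (7).
        labels = list(self.models)
        scores = [self.models[c].score_samples(X).sum() for c in labels]
        return labels[int(np.argmax(scores))]
```

For example, GMMClassifier(n_components=32, covariance_type='full').fit(train).predict(frames) corresponds to the best 5-vowel GMM setting reported in Section 3.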
Training a separate GMM for each single vowel is demanding in terms of both computing power and memory. Fitting the model also suffers from the lack of a sufficient amount of training data. It is therefore advisable to train a universal generic model (the so-called Universal Background Model, UBM), which represents the possible distribution of the features over a wide group of sounds, and then to derive from it the class-specific model for an individual vowel (GMM-UBM). Maximum likelihood estimation of the model parameters is used for UBM training [20]. The maximum a posteriori (MAP) estimate is used to adapt the UBM to a vowel model (i.e. a class-specific GMM). In the presented experiments, only the vectors of mean values of the UBM were adjusted to obtain the individual models.

Given a sequence of feature vectors O = {o_1, o_2, ..., o_N} from one class of vowels, the score is expressed by (10), where θ_v and θ_UBM denote the actual vowel model and the universal model, respectively. According to (10), the greater the probability p(o_n | θ_v) relative to the background model for as many feature vectors as possible, the more the hypothesis that the recognized audio sample belongs to the given vowel class is supported.

$$\text{score} = \frac{1}{N} \sum_{n=1}^{N} \log \frac{p(\boldsymbol{o}_n \mid \boldsymbol{\theta}_v)}{p(\boldsymbol{o}_n \mid \boldsymbol{\theta}_{UBM})} \tag{10}$$
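A brief sketch of the mean-only MAP adaptation and of the scoring rule (10), following the classical relevance-MAP formulation; the relevance factor r = 16 is a conventional value assumed here, since the paper does not state it.

```python
import numpy as np
from copy import deepcopy

def map_adapt_means(ubm, X, relevance=16.0):
    """Derive a class GMM from a fitted UBM by adapting only the means."""
    resp = ubm.predict_proba(X)                  # (N, K) component posteriors
    n_k = resp.sum(axis=0)                       # soft counts per component
    # Sufficient statistics: expected feature mean per component.
    ex_k = (resp.T @ X) / np.maximum(n_k[:, None], 1e-10)
    alpha = n_k / (n_k + relevance)              # data-dependent adaptation weight
    adapted = deepcopy(ubm)                      # weights and covariances kept
    adapted.means_ = alpha[:, None] * ex_k + (1.0 - alpha[:, None]) * ubm.means_
    return adapted

def llr_score(X, vowel_gmm, ubm):
    """Average log-likelihood ratio over the segment, Eq. (10)."""
    return np.mean(vowel_gmm.score_samples(X) - ubm.score_samples(X))
```

At test time, the vowel model with the highest llr_score over the segment is selected.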
The k-Nearest Neighbours (k-NN) algorithm is a simple nonlinear instance-based classification method and one of the most popular classical approaches in pattern recognition. It classifies an unknown sample based on the known classification of its neighbours [21][22]. The model itself is essentially made up of the training set, and the learning process consists of storing the patterns of all training samples in one model. Given an unknown sample, the distances between the unknown sample and all the samples in the training set can be computed. Input attributes must be numeric so that the distance between any two patterns can be calculated. Samples from the training set have n attributes, and each sample represents a point in the n-dimensional space. To determine the target attribute of an unknown sample, the classifier searches the training set for the k samples that are closest to that unknown sample. The training set can be defined as:

$$\{\boldsymbol{x}_i, C_i\}_{i=1,\ldots,K}, \quad C_i \in \{1, 2, \ldots, L\} \tag{11}$$

where x_i is a sample with its corresponding label C_i, K is the size of the whole training set, and L is the number of classes (i.e. the number of vowels). Given an unknown sample x, we look for the sample x_k according to the following formula:

$$\|\boldsymbol{x}_k - \boldsymbol{x}\| = \min_{i=1,\ldots,K} \|\boldsymbol{x}_i - \boldsymbol{x}\| \tag{12}$$

Subsequently, the sample x is assigned to the class that x_k belongs to.

In the proposed system we used the Euclidean distance, which is the most commonly used metric for distance determination, as well as the city-block, Chebyshev and cosine distance metrics. They are defined as follows:

$$d_{Euclidean}(\boldsymbol{x}, \boldsymbol{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{13}$$

$$d_{city\text{-}block}(\boldsymbol{x}, \boldsymbol{y}) = \sum_{i=1}^{n} |x_i - y_i| \tag{14}$$

$$d_{Chebyshev}(\boldsymbol{x}, \boldsymbol{y}) = \max_{i} |x_i - y_i| \tag{15}$$

$$d_{cosine}(\boldsymbol{x}, \boldsymbol{y}) = 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}} \tag{16}$$

From the above it is obvious that two important factors play a role in successful classification:
• the choice of the distance function,
• the choice of the value of the parameter k (i.e. the number of neighbours).

It is advisable to choose an odd number for k to avoid the scenario in which two class labels achieve the same score. Some issues need to be considered when selecting the value of k. Classes with a great number of samples can overwhelm small ones and bias the results, so it is not recommended to set a large k. On the other hand, the advantage of having many samples in the training set is not exploited if k is too small [21]. The disadvantage of this classifier is the calculation of all distances for each classification, which can considerably slow down the process and be computationally expensive if the training set or the number of unknown samples is large.

2.4 k-fold cross-validation

If there is not a sufficient number of observations, an appropriate approach to determining the optimal training/testing split is the so-called cross-validation technique [23]. The data set is divided into k parts, with one part always being used for testing and the remaining k-1 parts being used for training. The process is repeated so that each part is used for testing exactly once (Fig. 3). The advantage of cross-validation is a relatively accurate estimate of the classification success. Its disadvantage is that it requires more computer memory and consumes more time, because many calculations are needed at every step.

Fig. 3. k-fold cross-validation

3 Experimental setup and results

The evaluation of the proposed GMM, GMM-UBM and k-NN methods was performed. All the tests were evaluated on isolated vowels extracted from the TIMIT data set. Two sets of vowels were created. The first set consists of the 5 classes aa, eh, iy, ow, uh. This subset corresponds to the common vowels of most European languages (e.g. 'a', 'e', 'i', 'o', 'u' in Slovak) [25]. The second set consists of 18 American English vowels (see Table 4 for a list). The set of 5 classes was used in the first and second experiments. Finally, the performance of the developed system was evaluated on the second set of 18 classes. The proposed algorithms were implemented in MATLAB 2018b with the support of the Voicebox [27] and Netlab [28] toolboxes.

Classifier training and testing were performed by 10-fold cross-validation. The data were initially randomly divided into 10 equally large subsets, each containing approximately the same number of vowels represented by the feature vectors. Nine of them were used to train the model and the remaining one to test it. This was repeated 10 times, so that all 10 subsets were tested. All data were parameterized by 39 MFCCs (incl. deltas and delta-deltas) per 16 ms frame with 8 ms overlap. The feature matrix dimension for each vowel was 10800×39 (frames × features).
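A minimal sketch of this evaluation loop, assuming the 39-dimensional feature vectors and their vowel labels are already pooled into X and y; scikit-learn again stands in for the MATLAB implementation actually used, and the frame-level treatment of samples is an assumption.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def evaluate_knn(X, y, k=3, metric='cityblock', n_folds=10):
    """Mean 10-fold cross-validated accuracy of k-NN, as in the Table 2 experiments."""
    clf = KNeighborsClassifier(n_neighbors=k, metric=metric)
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    return cross_val_score(clf, X, y, cv=cv).mean()

# Hypothetical usage, sweeping the settings of Table 2:
# for k in (3, 5, 7):
#     for metric in ('chebyshev', 'cosine', 'euclidean', 'cityblock'):
#         print(k, metric, evaluate_knn(X, y, k, metric))
```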
The results of the experiments on 5-vowel classification using k-NN and the simply trained GMM are shown in Tables 2 and 3, respectively. The tables show the results achieved for various k-NN settings (type of metric and number of neighbours) and GMM settings (number of Gaussians and covariance matrix type). An effort was made to achieve a better classification accuracy by editing the data. The entire database was shuffled so that the speech dialects are evenly distributed between the training and test parts. Another data modification was vowel trimming, omitting the first and last frames of each vowel recording, so that silent parts as well as parts affected by coarticulation or imprecise vowel border detection are not taken into account. In addition, the middle frames are known to contain the most important information about the vowel. Such modified data are referred to as D2; D1 indicates the original data.

Table 2. The overall system accuracy [%] for 5-vowel recognition using the k-NN classifier and 2 data manipulation techniques: whole vowels (D1), trimmed vowels (D2)

| Metric | k=3, D1 | k=3, D2 | k=5, D1 | k=5, D2 | k=7, D1 | k=7, D2 |
| --- | --- | --- | --- | --- | --- | --- |
| Chebyshev | 73.54 | 89.48 | 74.53 | 86.61 | 74.81 | 84.37 |
| Cosine | 74.56 | 91.24 | 75.62 | 88.57 | 75.94 | 86.23 |
| Euclidean | 75.83 | 92.13 | 77.33 | 90.25 | 77.87 | 88.67 |
| City-block | 75.64 | 95.08 | 79.47 | 92.96 | 79.80 | 91.19 |

Table 3. The overall system accuracy [%] for 5-vowel recognition using the GMM classifier and 2 data manipulation techniques: whole vowels (D1), trimmed vowels (D2)

| Covariance type | n=16, D1 | n=16, D2 | n=32, D1 | n=32, D2 | n=64, D1 | n=64, D2 |
| --- | --- | --- | --- | --- | --- | --- |
| ppca | 80.42 | 81.64 | 79.28 | 80.37 | 78.21 | 80.86 |
| diag | 82.53 | 84.73 | 83.47 | 85.93 | 84.85 | 85.84 |
| full | 86.53 | 87.45 | 83.80 | 91.10 | 82.33 | 86.25 |

A significant improvement can be seen for both classification methods if only the stationary middle part of the vowels is analysed (D2). With the k-NN method, a success rate of 95.08% was achieved with k = 3 neighbours and the city-block metric. The GMM achieved its best success rate of 91.1% with n = 32 Gaussians and a full covariance matrix. The comparison of the best results for 5 vowels achieved by the above-mentioned methods is shown in Fig. 4.

Fig. 4. Comparison of the classification of the 5 selected vowels

In the last experiment, testing was performed on a larger set of classes: 18 vowels of American English were selected. The data needed for UBM training were selected from other recordings available in the database. A total of 4600 recordings from 510 speakers, with a total length of approximately 3 hours and 54 minutes, were used to train the UBM. The front-end and data manipulation are the same as in the experiments on the recognition of 5 vowels (referred to as D2 above). The GMM-UBM training/classification approach was also added to the experiments. Fig. 5 shows the best results achieved. Interestingly, the k-NN algorithm outperformed both the GMM and GMM-UBM approaches. It achieved 84.2% vowel recognition accuracy with k = 5 neighbours and the city-block metric. The second most successful system was GMM-UBM, which achieved a success rate of 78.1% with n = 256 Gaussians and a full covariance matrix. The worst performance was that of the simple GMM classifier, probably due to an insufficient amount of training data; it achieved a success rate of 75.5% with n = 16 Gaussians and a full covariance matrix.

Fig. 5. Comparison of the classification of the 18 vowels

Table 4 shows the classification of the individual vowels for the best k-NN model settings in the form of a confusion matrix. The data in the table indicate the performance of the algorithm as well as the falsely recognized vowels. This is the best way to see how the system behaves when recognizing individual vowels. The diagonal shows the correctly classified vowels; the off-diagonal entries of each row specify the incorrectly identified ones. The final success rate in percentage is also stated.

Table 4. Confusion matrix of phoneme recognition for the best k-NN model

The total number of correctly classified vowels was 4548 out of 5400, i.e. a success rate of 84.2%. As seen from Fig. 5 and Table 4, in the case of k-NN the vowels aa, ae, ao, aw, and ux were recognized best, while for the vowels ax, eh, and ix a considerable number of samples were misclassified. Note that with the GMM-UBM classifier, the largest recognition errors occurred in a different group of vowels (see Fig. 5).
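A cross-validated confusion matrix in the spirit of Table 4 can be obtained as in the following sketch, under the same assumptions as the evaluation loop above.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

def knn_confusion(X, y, k=5, metric='cityblock', n_folds=10):
    """Cross-validated confusion matrix for the best 18-vowel k-NN setting."""
    clf = KNeighborsClassifier(n_neighbors=k, metric=metric)
    y_pred = cross_val_predict(clf, X, y, cv=n_folds)
    cm = confusion_matrix(y, y_pred)        # rows: true vowel, cols: predicted
    accuracy = np.trace(cm) / cm.sum()      # e.g. 4548/5400 = 0.842 in Table 4
    return cm, accuracy
```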
The largest differences in recognition rate between k-NN and GMM-UBM occur for the vowels aa, ux and ix. From Fig. 5, a disbalance between the simple GMM and GMM-UBM can also be seen (theoretically, GMM-UBM should outperform GMM in all cases). Probably, further optimization of the GMM-UBM is required.

The phoneme recognition task on the TIMIT database has been the subject of many years of intensive research. A number of systems exist, and their classification success has naturally improved over time. The results presented in this paper are comparable to the existing research reported in the literature (see Section 1.1). However, it is not possible to compare these works directly with our system because of the different parameters and experimental settings that were used.

4 Conclusion

This work deals with the design of a system for the recognition of isolated vowels extracted from the TIMIT dataset and the subsequent optimization of the training algorithm. Three different approaches to phoneme classification were compared: k-NN, GMM, and GMM-UBM. The k-NN method achieved the best results, with an overall accuracy of 95.08% for 5-vowel and 84.2% for 18-vowel recognition. GMM-UBM gave comparable results for 18-vowel recognition, but the classification error was distributed differently among the vowel classes than in the case of k-NN. This recognition disbalance between the k-NN and GMM approaches needs further investigation.

Acknowledgment

This publication is the result of the project implementation: Centre of excellence for systems and services of intelligent transport II, ITMS 26220120050, supported by the Research & Development Operational Programme funded by the ERDF.

References

[1] C. Lopes, F. Perdigao, Phone recognition on the TIMIT database. Speech Technologies, IntechOpen, 2011, pp. 285-302.
[2] F. Sha, L. K. Saul, Large margin Gaussian mixture modelling for phonetic classification and recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), France, May 2006.
[3] L. Deng, D. Yu, Use of differential cepstra as acoustic features in hidden trajectory modelling for phonetic recognition. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2007.
[4] Y. Hifny, S. Renals, Speech recognition using augmented conditional random fields. IEEE Transactions on Audio, Speech & Language Processing, vol. 17, no. 2, 2009, pp. 354-365, ISSN 1558-7916.
[5] A. Mohamed, G. Dahl, G. Hinton, Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, 2011.
[6] D. Weenink, Vowel normalizations with the TIMIT acoustic phonetic speech corpus. Institute of Phonetic Sciences, University of Amsterdam, Proceedings 24, pp. 117-123, 2001.
[7] H.M. Meng, V.W. Zue, Signal representation comparison for phonetic classification. Proceedings of the IEEE ICASSP, Toronto, pp. 285-288, 1991.
[8] R. Amami, D.B. Ayed, N. Ellouze, An empirical comparison of SVM and some supervised learning algorithms for vowel recognition. International Journal of Intelligent Information Processing, 2012.
[9] R. Amami, D.B. Ayed, N. Ellouze, Practical selection of SVM supervised parameters with different feature representations for vowel recognition. International Journal of Digital Content Technology and its Applications, 7/2013, pp. 418-424.
[10] D. Palaz, M. Magimai-Doss, R. Collobert, Analysis of CNN-based speech recognition system using raw speech as input. Proceedings of the 16th Annual Conference of the International Speech Communication Association (Interspeech), Dresden, Germany, 6-10 Sept. 2015, pp. 11-15.
[11] O. Farooq, S. Datta, Phoneme recognition using wavelet based features. Information Sciences 150, 2003, pp. 5-15.
[12] D. Palaz, R. Collobert, M. Magimai-Doss, End-to-end phoneme sequence recognition using convolutional neural networks. Idiap, Dec. 2013.
[13] P. Karsmakers, K. Pelckmans, J. Suykens, H. Van Hamme, Fixed size kernel logistic regression for phone classification. Proceedings of Interspeech 2007, Belgium, 2007, ISSN 1990-9772.
[14] S.M. Siniscalchi, P. Schwarz, C.H. Lee, High-accuracy phone recognition by combining high-performance lattice generation and knowledge based rescoring. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2007.
[15] J.S. Garofolo, L.F. Lamel, W.M. Fisher, J.G. Fiscus, D.S. Pallett, N.L. Dahlgren, V. Zue, TIMIT Acoustic-Phonetic Continuous Speech Corpus. Linguistic Data Consortium, Philadelphia, 1993.
[16] R. Jang, Audio Signal Processing and Recognition: 12-2 MFCC, 2005. (available at: http://mirlab.org/jang/books/audiosignalprocessing/speechFeatureMfcc.asp?title=12-2%20MFCC)
[17] S. Young et al., The HTK Book (for HTK Version 3.4). Cambridge University Engineering Department, 2006.
[18] C.B. Do, The Multivariate Gaussian Distribution. Stanford, CA, USA, 2008.
[19] C.M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[20] A.R. Avilla, S.P. Milton, F.J. Fraga, D.D. O'Shaughnessy, T.H. Falk, Improving the performance of far-field speaker verification using multi-condition training: the case of GMM-UBM and i-vector systems. Proceedings of the Fifteenth Annual Conference of the International Speech Communication Association (Interspeech), Singapore, 2014.
[21] A. Mucherino, P.J. Papajorgji, P.M. Pardalos, Data Mining in Agriculture. Springer, ISBN 978-0-387-88614-5, pp. 83-8, 2009.
[22] P. Cunningham, S.J. Delany, k-Nearest neighbour classifiers. Technical Report UCD-CSI-2007-4, Dublin: Artificial Intelligence Group, 2007.
[23] Y. Bengio, Y. Grandvalet, No unbiased estimator of the variance of k-fold cross-validation. Journal of Machine Learning Research, 5:1089-1105, 2004.
[24] A. Biswas, P.K. Sahu, A. Bhowmick, M. Chandra, Feature extraction technique using ERB like wavelet sub-band periodic and aperiodic decomposition for TIMIT phoneme recognition. International Journal of Speech Technology, vol. 17, issue 4, pp. 389-399, December 2014.
[25] P. Grzybek, M. Rusko, Letter, grapheme and (allo-)phone frequencies: the case of Slovak. Glottotheory, vol. 2, no. 1, 2009, pp. 30-48.
[26] I. Ben Fredj, K. Ouni, Optimization of features parameters for HMM phoneme recognition of TIMIT corpus. International Journal of Advanced Research in Electrical, vol. 4, issue 8, Aug. 2015.
[27] M. Brookes, VOICEBOX: A speech processing toolbox for MATLAB. (available at http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html)
[28] I. Nabney, Netlab: Pattern analysis toolbox. (available at https://www.mathworks.com/matlabcentral/fileexchange/2654-netlab)