Development of a baseline system for a phoneme recognition task

Maros Jakubec, Eva Lieskovska, Roman Jarina, Michal Chmulik, Michal Kuba
Department of Multimedia and Information-Communication Technologies, University of Zilina
Univerzitna 8215/1, 010 26 Zilina, Slovak Republic

Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Phoneme recognition is one of the fundamental problems in automatic speech recognition. Despite the great progress in speech recognition, discrimination of isolated phonemes is still a challenging task due to coarticulation and the great variability in speaking style. The aim of this work is to develop a system for the classification of isolated English vowels from the TIMIT dataset. In the paper, the following conventional methods are compared: a) the k-Nearest Neighbours approach as a simple nonlinear instance-based classifier, and b) the Gaussian Mixture Model, which belongs to the class of probabilistic acoustic modelling techniques. As a front-end, we applied standard mel-frequency cepstral coefficients with their time derivatives. Various experimental techniques, such as trimming of audio data and cross-validation, were used to increase recognition precision and the reliability of system evaluation. The developed system will be used as a baseline for comparison with newer state-of-the-art approaches.

1 Introduction

Despite the significant progress in automatic speech recognition (ASR) in recent decades, phoneme recognition remains a challenging task. Many experiments have been made to improve the performance of phoneme recognition, including the use of better features or combinations of multiple features, improved statistical models and training criteria, or the modelling of pronunciation, noise, language and more [1].

In this paper, we present ongoing work on the development of a system for the classification of isolated English vowels from the TIMIT dataset. The developed system will be used as a baseline for comparison with more advanced state-of-the-art approaches. We discuss system performance using a) k-nearest neighbours (k-NN) as a simple nonlinear instance-based classifier, and b) a probabilistic approach based on the Gaussian Mixture Model (GMM). The speech spectrum is represented by conventional mel-frequency cepstral coefficients (MFCC).

1.1 Related works

Sha and Saul [2] introduced a system for phoneme recognition. They trained GMMs for multiway classification using the basic principle of SVMs. With MFCCs, including their deltas (time derivatives), and 16 Gaussian mixtures they achieved 69.9% accuracy. Deng and Yu [3] used the Hidden Trajectory Model on a phone recognition task. Similarly, the feature vectors consist of joint static cepstra and their deltas. The resulting accuracy was 75.17%. Hifny and Renals [4] introduced a phonetic recognition system based on the TIMIT database where acoustic modelling is achieved through augmented conditional random fields. They achieved 73.4% accuracy on the core test set and 77.0% on the complete test set. A publication by Mohamed et al. [5] reports the use of neural networks for acoustic modelling. The outcome is 79.3% accuracy on the core test.

The above-mentioned works are focused on different types of phone sets from the TIMIT database. Several studies regarding vowel classification have also been made. Weenink [6] proposed a vowel classification improvement by including information about the known speaker in the process. The goal was to reduce the variance in the vowel space. The 13 monophthong vowels were selected similarly as in [7]. Linear discriminant analysis on bark-scale filter bank energies was used as the classification method. They reported that information about spectral dynamics improved the classification process. Reduction of the between-speaker variance and the within-speaker variance resulted in higher classification accuracy.

An empirical comparison of five classifiers was presented in [8]. SVM, k-NN, Naive Bayes, Quadratic Bayes Normal (QDC) and Nearest Mean algorithms were tested for vowel recognition using the TIMIT corpus. MFCCs were used for signal parameterization. The results of this experiment show that the SVM classifier achieved the best performance, while the QDC classifier had the lowest accuracy. The error rate of the QDC method was decreased by about 10% by using the combination k-NN-QDC-NB. Such a combination of classifiers can be an efficient way to boost the performance of a machine learning method.

Amami et al. [9] conducted a study on different SVM kernels for multi-class vowel recognition from the TIMIT corpus. The optimal parameters of the kernel tricks and the regularization parameter were investigated. Two different feature types, MFCC and PLP, were also applied. Selection of the middle frames of the vowels and Fuzzy c-means clustering (FCM) were evaluated to determine the appropriate front-end analysis. The method based on middle frames outperforms the FCM method. Three middle frames turned out to give the best recognition accuracy. Interestingly, the results showed that the recognition accuracy decreased as the number of frames increased. Regarding SVM classification, the accuracy of the vowel system and the runtime improve with smaller values of the kernel width and the regularization parameter.

Palaz et al. [10] claim that an ASR system based on a neural network can be modelled by an end-to-end training procedure, without the need for separation into feature extraction and classifier parts. In the proposed method, the raw speech waveform was used as the input to the CNN-based speech recognition system. According to the results on the TIMIT phoneme and the Aurora2 connected-words recognition tasks, the CNN-based end-to-end system yields better performance than a standard spectral-feature-extraction-based system.

Although it is not always possible to compare existing systems exactly, Table 1 summarizes some of the most important systems in the field of TIMIT phoneme recognition over the last twenty years. The presented survey is ranked according to system accuracy, including the methods used and the sets of features.

Table 1. Comparison of existing works related to phoneme classification

| Authors | Proposed Methods | Descriptors | Classes | Accuracy |
| --- | --- | --- | --- | --- |
| Biswas et al. [24] | Hidden Markov Model (HMM) | wavelet-based features (84 - PCA) | 21 phonemes | 88.90 % |
| Karsmakers et al. [13] | SVM - RBF kernel | 181-dimensional | 39 phonemes | 82.90 % |
| Mohamed et al. [5] | Monophone Deep Belief Networks | MFCC, Δ, ΔΔ, energy (39) | 39 phonemes | 79.30 % |
| Siniscalchi et al. [14] | TRAPs, temporal context division + lattice rescoring | MFCC, Δ, ΔΔ, energy (39) | 39 phonemes | 79.04 % |
| Hifny & Renals [4] | HMM | 13 MFCC, Δ, ΔΔ (39) | 39 phonemes | 77.00 % |
| Deng & Yu [3] | Hidden Trajectory Models | static / delta cepstra | 39 phonemes | 75.17 % |
| Sha & Saul [2] | GMMs trained as SVMs | 13 MFCC, Δ, ΔΔ (39) | 39 phonemes | 69.90 % |
| Fredj & Ouni [26] | HMM | 13 MFCC, Δ, ΔΔ, PLP (39) | 39 phonemes | 67.60 % |
| Palaz et al. [12] | two-layer MLP + HMM decoder | MFCC, Δ, ΔΔ, energy (39) | 39 phonemes | 66.65 % |
| Palaz et al. [10] | Convolutional neural network + HMM decoder | raw speech | 39 phonemes | 65.50 % |
| Weenink [6] | Linear discriminant analysis | 54-dimensional | 13 vowels | 60.30 % |
| Amami et al. [8] | SVM - RBF kernel + middle frames selection | MFCC, Δ, ΔΔ (36) | 20 vowels | 51.60 % |

2 Proposed methods

2.1 Dataset

The TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC) [1, 15] was used for classification. The TIMIT corpus contains read speech and is primarily designed for studying acoustic-phonetic phenomena and for testing automatic speech recognition systems. 630 people participated in the creation of this database, each contributing by reading 10 phonetically rich sentences. The recordings cover the eight main dialects of American English. Audio files are recorded at 16 000 Hz, 16 bit. Each audio file is accompanied by metadata files containing phonetic and lexical transcriptions.

2.2 Feature extraction methods

The extraction of appropriate features is one of the basic tasks of object recognition. In the conventional ASR front-end, speech is represented by a sequence of feature vectors retaining particularly useful information from the signal. There is a large number of approaches and feature extraction methods in ASR. The features used in our algorithm are described in the following section.

Mel Frequency Cepstral Coefficients (MFCC) are the most commonly used acoustic features in ASR. MFCCs are designed to respect the non-linear sound perception of the human ear [16]. In our system, the MFCCs are computed as follows (Fig. 1). Pre-emphasis is applied to the speech signal in order to emphasize its high-frequency components. The next step is to divide the signal into 16 ms long frames with an overlap of 1/2 of the frame length. The given frame length was selected based on previous studies on isolated phoneme recognition [8, 11, 24]. The number of signal samples per frame (256) is chosen as a power of 2 due to the use of the FFT. A Hamming window is applied to the frames to maintain the continuity of the first and last points in the frames. The signal is converted to the frequency domain using the FFT algorithm, and the magnitude frequency response is then calculated. The spectrum values are multiplied by a series of 20 triangular bandpass filters, summed over the individual filters and then logarithmized. The triangular filter bank has a linear frequency distribution on the mel scale:

$$\mathrm{mel}(f) = 1125 \cdot \ln\left(1 + \frac{f}{700}\right) \tag{1}$$

where f [Hz] is the frequency on the linear scale and mel(f) [mel] corresponds to the frequency on the mel scale. The last step is to calculate the coefficients using the discrete cosine transform (DCT).

Fig. 1. Block diagram of the MFCC computation
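To make the pipeline of Fig. 1 concrete, the following is a minimal Python/NumPy sketch rather than the Voicebox routines actually used in the experiments; the pre-emphasis coefficient (0.97), the 0 Hz to 8 kHz filter-bank range and the omission of the 0th cepstral coefficient are common defaults assumed here, not values stated in the text.

```python
import numpy as np
from scipy.fftpack import dct

def mel(f):                       # Eq. (1)
    return 1125.0 * np.log(1.0 + f / 700.0)

def mel_inv(m):                   # inverse of Eq. (1)
    return 700.0 * (np.exp(m / 1125.0) - 1.0)

def mfcc(signal, fs=16000, frame_len=256, n_filters=20, n_ceps=12):
    """Minimal MFCC sketch: 16 ms Hamming frames, 50% overlap, 20 mel filters."""
    # Pre-emphasis boosts high-frequency components (0.97 is an assumed default).
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing with 1/2-frame overlap and Hamming windowing.
    hop = frame_len // 2
    n_frames = 1 + (len(sig) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([sig[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Magnitude spectrum via the FFT.
    mag = np.abs(np.fft.rfft(frames, frame_len))          # (n_frames, 129)
    # Triangular filters spaced linearly on the mel scale (Eq. 1).
    edges = mel_inv(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((frame_len + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_filters, frame_len // 2 + 1))
    for j in range(1, n_filters + 1):
        l, c, r = bins[j - 1], bins[j], bins[j + 1]
        fbank[j - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising slope
        fbank[j - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling slope
    # Log filter-bank energies, then DCT to obtain the cepstral coefficients.
    feats = np.log(np.maximum(mag @ fbank.T, 1e-10))
    return dct(feats, type=2, axis=1, norm='ortho')[:, 1:n_ceps + 1]
```

For a 1 s utterance at 16 kHz, mfcc(signal) returns roughly 124 frames of 12 static coefficients; the log energy and the delta features of Eqs. (2) and (3) below are then appended to reach the 39-dimensional vectors used in this work.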
An important parameter is also the energy of the frame. Log energy is usually added as the 13th feature to the MFCCs. The short-term energy function is defined by:

$$E_n = \sum_{k=-\infty}^{\infty} \left[ s(k)\, w(n-k) \right]^2 \tag{2}$$

where s(k) is the signal sample at time k and w(n) is the corresponding window. It is then possible to obtain an average energy value for each frame. The disadvantage of this characteristic is its high sensitivity to rapid changes in the signal level. Its values can also be used to separate silence segments from speech segments.

Static features, which are obtained using the procedure above, do not capture inter-frame changes along the time index. Therefore, dynamic (or delta) features are commonly appended to the feature vectors. Delta features are usually estimates of the time derivatives of the static features and are computed as follows [17]:

$$\Delta_k[i] = \frac{\sum_{m=1}^{M} m \left( c_k[i+m] - c_k[i-m] \right)}{2 \sum_{m=1}^{M} m^2} \tag{3}$$

where Δ_k[i] is the delta coefficient of the k-th feature at frame i, c_k is the static coefficient, and a typical value for M is 1.

In the developed system, the complete feature vector consists of 39 elements per frame:
- 12 MFCC,
- 12 delta (ΔMFCC),
- 12 delta-delta (ΔΔMFCC),
- 3 log energy (static, Δ and ΔΔ).

2.3 Classification

The classification process can be divided into a learning and a testing phase; thus, the data set needs to be divided into two subsets. Because of the 10-fold cross-validation evaluation process (Section 2.4), we selected the same number of vowels from each class. Once the data were split, models of the selected vowels were trained and tested according to the chosen method. The general classification scheme is shown in Fig. 2.

Fig. 2. Block diagram of the classification scheme

There are several methods suitable for the phoneme classification task addressed in this work. The following well-established classifiers, namely the Gaussian mixture model (GMM), the Gaussian mixture model with a universal background model (GMM-UBM), and k-nearest neighbours (k-NN), were chosen for the baseline system development due to their easy implementation and good classification properties. We recall a description of these methods in the following sections.

The Gaussian Mixture Model works on the principle of probabilistic modelling of audio features in the feature space. A GMM is defined as the probability density function formed by a linear superposition of K Gaussian components [18][19]:

$$p(\boldsymbol{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \tag{4}$$

where the probability density function of the multivariate Gaussian distribution for an n-dimensional vector x is given by:

$$\mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{n/2}\, |\boldsymbol{\Sigma}|^{1/2}} \exp\left( -\frac{1}{2} (\boldsymbol{x}-\boldsymbol{\mu})^{T} \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}-\boldsymbol{\mu}) \right) \tag{5}$$

with mean vector μ ∈ R^n and covariance matrix Σ ∈ R^(n×n). The π_k are the mixing coefficients, which must satisfy the following conditions:

$$0 \le \pi_k \le 1 \quad \text{and} \quad \sum_{k=1}^{K} \pi_k = 1 \tag{6}$$

The classification function of the proposed GMM classifier has the following form:

$$f(\boldsymbol{x}) = \arg\max_{C} \; p_C(\boldsymbol{x}) \tag{7}$$

where p_C(x) is the GMM of class C. Thus, we are looking for the maximal probability over all classes C.

The training algorithm, which returns a set of parameters Θ = {μ, Σ, π} for each class, is based on the Maximum Likelihood (ML) criterion. Given the model p(x, Θ) with unknown parameters, the aim is to derive its parameters from the training data, a set of feature vectors {x_1, x_2, ..., x_N}. The ML method uses the likelihood function, defined as:

$$F(\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_N \mid \boldsymbol{\theta}) = \prod_{n=1}^{N} p(\boldsymbol{x}_n \mid \boldsymbol{\theta}) \tag{8}$$

The maximum of this function with respect to the unknown parameters θ can be formalized as follows:

$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \sum_{n=1}^{N} \log p(\boldsymbol{x}_n \mid \boldsymbol{\theta}) \tag{9}$$

The maximization defined by (9) is a complicated task that does not have an explicit solution. The expectation-maximization (EM) algorithm [18] is used to find maximum likelihood solutions.
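As an illustration of the decision rule (7) with EM training (9), the sketch below fits one GMM per vowel class and labels a segment by its highest total log-likelihood. scikit-learn's GaussianMixture stands in here for the Netlab implementation used in the experiments (Netlab's 'ppca' covariance option, used in Table 3, has no direct scikit-learn counterpart).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class GMMClassifier:
    """One GMM per class; decision by Eq. (7) on summed log-likelihoods."""
    def __init__(self, n_components=32, covariance_type='full'):
        self.n_components = n_components
        self.covariance_type = covariance_type
        self.models = {}

    def fit(self, features_by_class):
        # features_by_class: dict {class_label: (n_frames, 39) feature array}
        for label, X in features_by_class.items():
            gmm = GaussianMixture(n_components=self.n_components,
                                  covariance_type=self.covariance_type,
                                  max_iter=100, random_state=0)
            self.models[label] = gmm.fit(X)      # EM training, Eq. (9)
        return self

    def predict(self, X):
        # Score a whole vowel segment: sum of per-frame log p(x_n | theta_C),
        # then pick the class with the maximal likelihood, Eq. (7).
        labels = list(self.models)
        scores = [self.models[c].score_samples(X).sum() for c in labels]
        return labels[int(np.argmax(scores))]
```

For example, GMMClassifier(n_components=32, covariance_type='full').fit(train).predict(frames) corresponds to the best 5-vowel GMM setting reported in Section 3.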
Training a separate GMM for each single vowel is demanding in terms of both computing power and memory. Fitting the model also suffers from the lack of a sufficient amount of training data. It is therefore advisable to train a universal generic model (the so-called Universal Background Model, UBM), which represents the possible distribution of the features over a wide group of sounds, and then to derive from it the class-specific model for an individual vowel (GMM-UBM). Maximum likelihood estimation of the model parameters is used for UBM training [20]. The maximum a posteriori (MAP) estimate is used to adapt the UBM to a vowel model (i.e. a class-specific GMM). In the presented experiments, only the vectors of mean values of the UBM were adjusted to obtain the individual models.

Given a sequence of feature vectors O = {o_1, o_2, ..., o_N} from one class of vowels, the score is expressed by (10), where θ_v and θ_UBM denote the actual vowel model and the universal model, respectively. According to (10), the greater the probability p(o_n | θ_v) relative to the background model for as many feature vectors as possible, the more the hypothesis that the recognized audio sample belongs to the given vowel class is supported.

$$\text{score} = \frac{1}{N} \sum_{n=1}^{N} \log \frac{p(\boldsymbol{o}_n \mid \boldsymbol{\theta}_v)}{p(\boldsymbol{o}_n \mid \boldsymbol{\theta}_{UBM})} \tag{10}$$
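A brief sketch of the mean-only MAP adaptation and of the scoring rule (10), following the classical relevance-MAP formulation; the relevance factor r = 16 is a conventional value assumed here, since the paper does not state it.

```python
import numpy as np
from copy import deepcopy

def map_adapt_means(ubm, X, relevance=16.0):
    """Derive a class GMM from a fitted UBM by adapting only the means."""
    resp = ubm.predict_proba(X)                  # (N, K) component posteriors
    n_k = resp.sum(axis=0)                       # soft counts per component
    # Sufficient statistics: expected feature mean per component.
    ex_k = (resp.T @ X) / np.maximum(n_k[:, None], 1e-10)
    alpha = n_k / (n_k + relevance)              # data-dependent adaptation weight
    adapted = deepcopy(ubm)                      # weights and covariances kept
    adapted.means_ = alpha[:, None] * ex_k + (1.0 - alpha[:, None]) * ubm.means_
    return adapted

def llr_score(X, vowel_gmm, ubm):
    """Average log-likelihood ratio over the segment, Eq. (10)."""
    return np.mean(vowel_gmm.score_samples(X) - ubm.score_samples(X))
```

At test time, the vowel model with the highest llr_score over the segment is selected.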
The k-Nearest Neighbours (k-NN) algorithm is a simple nonlinear instance-based classification method and one of the most popular classical approaches in pattern recognition. It classifies an unknown sample based on the known classification of its neighbours [21][22]. The model itself is essentially made up of the training set, and the learning process consists of storing the patterns of all training samples in one model. Given an unknown sample, the distances between the unknown sample and all the samples in the training set can be computed. Input attributes must be numeric so that the distance between any two patterns can be calculated. Samples from the training set have n attributes, and each sample represents a point in the n-dimensional space. To determine the target attribute of an unknown sample, the classifier searches the training set for the k samples that are closest to that unknown sample. The training set can be defined as:

$$\{\boldsymbol{x}_i, C_i\}_{i=1,\ldots,K}, \quad C_i \in \{1, 2, \ldots, L\} \tag{11}$$

where x_i is a sample with its corresponding label C_i, K is the size of the whole training set, and L is the number of classes (i.e. the number of vowels). Given an unknown sample x, we look for the sample x_k according to the following formula:

$$\|\boldsymbol{x}_k - \boldsymbol{x}\| = \min_{i=1,\ldots,K} \|\boldsymbol{x}_i - \boldsymbol{x}\| \tag{12}$$

Subsequently, the sample x is assigned to the class that x_k belongs to.

In the proposed system we used the Euclidean distance, which is the most commonly used metric for distance determination, as well as the city-block, Chebyshev and cosine distance metrics. They are defined as follows:

$$d_{Euclidean}(\boldsymbol{x}, \boldsymbol{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{13}$$

$$d_{city\text{-}block}(\boldsymbol{x}, \boldsymbol{y}) = \sum_{i=1}^{n} |x_i - y_i| \tag{14}$$

$$d_{Chebyshev}(\boldsymbol{x}, \boldsymbol{y}) = \max_{i} |x_i - y_i| \tag{15}$$

$$d_{cosine}(\boldsymbol{x}, \boldsymbol{y}) = 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}} \tag{16}$$

From the above it is obvious that two important factors play a role in successful classification:
• the choice of the distance function,
• the choice of the value of the parameter k (i.e. the number of neighbours).

It is advisable to choose an odd number for k to avoid the scenario in which two class labels achieve the same score. Some issues need to be considered when selecting the value of k. Classes with a great number of samples can overwhelm small ones and bias the results, so it is not recommended to set a large k. On the other hand, the advantage of having many samples in the training set is not exploited if k is too small [21]. The disadvantage of this classifier is the calculation of all distances for each classification, which can considerably slow down the process and be computationally expensive if the training set or the number of unknown samples is large.

2.4 k-fold cross-validation

If there is not a sufficient number of observations, an appropriate approach to determining the optimal training/testing split is the so-called cross-validation technique [23]. The data set is divided into k parts, with one part always being used for testing and the remaining k-1 parts being used for training. The process is repeated so that each part is used for testing exactly once (Fig. 3). The advantage of cross-validation is a relatively accurate estimate of the classification success. Its disadvantage is that it requires more computer memory and consumes more time, because many calculations are needed at every step.

Fig. 3. k-fold cross-validation

3 Experimental setup and results

The evaluation of the proposed GMM, GMM-UBM and k-NN methods was performed. All the tests were evaluated on isolated vowels extracted from the TIMIT data set. Two sets of vowels were created. The first set consists of the 5 classes aa, eh, iy, ow, uh. This subset corresponds to the common vowels of most European languages (e.g. 'a', 'e', 'i', 'o', 'u' in Slovak) [25]. The second set consists of 18 American English vowels (see Table 4 for a list). The set of 5 classes was used in the first and second experiments. Finally, the performance of the developed system was evaluated on the second set of 18 classes. The proposed algorithms were implemented in MATLAB 2018b with the support of the Voicebox [27] and Netlab [28] toolboxes.

Classifier training and testing were performed by 10-fold cross-validation. The data were initially randomly divided into 10 equally large subsets, each containing approximately the same number of vowels represented by the feature vectors. Nine of them were used to train the model and the remaining one to test it. This was repeated 10 times, so that all 10 subsets were tested. All data were parameterized by 39 MFCCs (incl. deltas and delta-deltas) per 16 ms frame with 8 ms overlap. The feature matrix dimension for each vowel was 10800×39 (frames × features).
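A minimal sketch of this evaluation loop, assuming the 39-dimensional feature vectors and their vowel labels are already pooled into X and y; scikit-learn again stands in for the MATLAB implementation actually used, and the frame-level treatment of samples is an assumption.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def evaluate_knn(X, y, k=3, metric='cityblock', n_folds=10):
    """Mean 10-fold cross-validated accuracy of k-NN, as in the Table 2 experiments."""
    clf = KNeighborsClassifier(n_neighbors=k, metric=metric)
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    return cross_val_score(clf, X, y, cv=cv).mean()

# Hypothetical usage, sweeping the settings of Table 2:
# for k in (3, 5, 7):
#     for metric in ('chebyshev', 'cosine', 'euclidean', 'cityblock'):
#         print(k, metric, evaluate_knn(X, y, k, metric))
```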
The results of the experiments on 5-vowel classification using k-NN and the simply trained GMM are shown in Tables 2 and 3, respectively. The tables show the results achieved for various k-NN settings (type of metric and number of neighbours) and GMM settings (number of Gaussians and covariance matrix type). An effort was made to achieve a better classification accuracy by editing the data. The entire database was shuffled so that the speech dialects are evenly distributed between the training and test parts. Another data modification was vowel trimming, omitting the first and last frames of each vowel recording, so that silent parts as well as parts affected by coarticulation or imprecise vowel border detection are not taken into account. In addition, the middle frames are known to contain the most important information about the vowel. Such modified data are referred to as D2; D1 indicates the original data.

Table 2. The overall system accuracy [%] for 5-vowel recognition using the k-NN classifier and 2 data manipulation techniques: whole vowels (D1), trimmed vowels (D2)

| Metric | k=3, D1 | k=3, D2 | k=5, D1 | k=5, D2 | k=7, D1 | k=7, D2 |
| --- | --- | --- | --- | --- | --- | --- |
| Chebyshev | 73.54 | 89.48 | 74.53 | 86.61 | 74.81 | 84.37 |
| Cosine | 74.56 | 91.24 | 75.62 | 88.57 | 75.94 | 86.23 |
| Euclidean | 75.83 | 92.13 | 77.33 | 90.25 | 77.87 | 88.67 |
| City-block | 75.64 | 95.08 | 79.47 | 92.96 | 79.80 | 91.19 |

Table 3. The overall system accuracy [%] for 5-vowel recognition using the GMM classifier and 2 data manipulation techniques: whole vowels (D1), trimmed vowels (D2)

| Covariance type | n=16, D1 | n=16, D2 | n=32, D1 | n=32, D2 | n=64, D1 | n=64, D2 |
| --- | --- | --- | --- | --- | --- | --- |
| ppca | 80.42 | 81.64 | 79.28 | 80.37 | 78.21 | 80.86 |
| diag | 82.53 | 84.73 | 83.47 | 85.93 | 84.85 | 85.84 |
| full | 86.53 | 87.45 | 83.80 | 91.10 | 82.33 | 86.25 |

A significant improvement can be seen for both classification methods if only the stationary middle part of the vowels is analysed (D2). With the k-NN method, a success rate of 95.08% was achieved with k = 3 neighbours and the city-block metric. The GMM achieved its best success rate of 91.1% with n = 32 Gaussians and a full covariance matrix. The comparison of the best results for 5 vowels achieved by the above-mentioned methods is shown in Fig. 4.

Fig. 4. Comparison of the classification of the 5 selected vowels

In the last experiment, testing was performed on a larger set of classes: 18 vowels of American English were selected. The data needed for UBM training were selected from other recordings available in the database. A total of 4600 recordings from 510 speakers, with a total length of approximately 3 hours and 54 minutes, were used to train the UBM. The front-end and data manipulation are the same as in the experiments on the recognition of 5 vowels (referred to as D2 above). The GMM-UBM training/classification approach was also added to the experiments. Fig. 5 shows the best results achieved. Interestingly, the k-NN algorithm outperformed both the GMM and GMM-UBM approaches. It achieved 84.2% vowel recognition accuracy with k = 5 neighbours and the city-block metric. The second most successful system was GMM-UBM, which achieved a success rate of 78.1% with n = 256 Gaussians and a full covariance matrix. The worst performance was that of the simple GMM classifier, probably due to an insufficient amount of training data; it achieved a success rate of 75.5% with n = 16 Gaussians and a full covariance matrix.

Fig. 5. Comparison of the classification of the 18 vowels

Table 4 shows the classification of the individual vowels for the best k-NN model settings in the form of a confusion matrix. The data in the table indicate the performance of the algorithm as well as the falsely recognized vowels. This is the best way to see how the system behaves when recognizing individual vowels. The diagonal shows the correctly classified vowels; the off-diagonal entries of each row specify the incorrectly identified ones. The final success rate in percentage is also stated.

Table 4. Confusion matrix of phoneme recognition for the best k-NN model

The total number of correctly classified vowels was 4548 out of 5400, i.e. a success rate of 84.2%. As seen from Fig. 5 and Table 4, in the case of k-NN the vowels aa, ae, ao, aw, and ux were recognized best, while for the vowels ax, eh, and ix a considerable number of samples were misclassified. Note that with the GMM-UBM classifier, the largest recognition errors occurred in a different group of vowels (see Fig. 5).
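A cross-validated confusion matrix in the spirit of Table 4 can be obtained as in the following sketch, under the same assumptions as the evaluation loop above.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

def knn_confusion(X, y, k=5, metric='cityblock', n_folds=10):
    """Cross-validated confusion matrix for the best 18-vowel k-NN setting."""
    clf = KNeighborsClassifier(n_neighbors=k, metric=metric)
    y_pred = cross_val_predict(clf, X, y, cv=n_folds)
    cm = confusion_matrix(y, y_pred)        # rows: true vowel, cols: predicted
    accuracy = np.trace(cm) / cm.sum()      # e.g. 4548/5400 = 0.842 in Table 4
    return cm, accuracy
```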
The largest differences in recognition rate between k-NN and GMM-UBM occur for the vowels aa, ux and ix. From Fig. 5, a disbalance between the simple GMM and GMM-UBM can also be seen (theoretically, GMM-UBM should outperform GMM in all cases). Probably, further optimization of the GMM-UBM is required.

The phoneme recognition task on the TIMIT database has been the subject of many years of intensive research. A number of systems exist, and their classification success has naturally improved over time. The results presented in this paper are comparable to the existing research reported in the literature (see Section 1.1). However, it is not possible to compare these works directly with our system because of the different parameters and experimental settings that were used.

4 Conclusion

This work deals with the design of a system for the recognition of isolated vowels extracted from the TIMIT dataset and the subsequent optimization of the training algorithm. Three different approaches to phoneme classification were compared: k-NN, GMM, and GMM-UBM. The k-NN method achieved the best results, with an overall accuracy of 95.08% for 5-vowel and 84.2% for 18-vowel recognition. GMM-UBM gave comparable results for 18-vowel recognition, but the classification error was distributed differently among the vowel classes than in the case of k-NN. This recognition disbalance between the k-NN and GMM approaches needs further investigation.

Acknowledgment

This publication is the result of the project implementation: Centre of excellence for systems and services of intelligent transport II, ITMS 26220120050, supported by the Research & Development Operational Programme funded by the ERDF.

References

[1] C. Lopes, F. Perdigao, Phone recognition on the TIMIT database. Speech Technologies, IntechOpen, 2011, pp. 285-302.
[2] F. Sha, L. K. Saul, Large margin Gaussian mixture modelling for phonetic classification and recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), France, May 2006.
[3] L. Deng, D. Yu, Use of differential cepstra as acoustic features in hidden trajectory modelling for phonetic recognition. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2007.
[4] Y. Hifny, S. Renals, Speech recognition using augmented conditional random fields. IEEE Transactions on Audio, Speech & Language Processing, vol. 17, no. 2, 2009, pp. 354-365, ISSN 1558-7916.
[5] A. Mohamed, G. Dahl, G. Hinton, Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, ISSN 1558-7916, 2011.
[6] D. Weenink, Vowel normalizations with the TIMIT acoustic phonetic speech corpus. Institute of Phonetic Sciences, University of Amsterdam, Proceedings 24, pp. 117-123, 2001.
[7] H.M. Meng, V.W. Zue, Signal representation comparison for phonetic classification. Proceedings of the IEEE ICASSP, Toronto, pp. 285-288, 1991.
[8] R. Amami, D.B. Ayed, N. Ellouze, An empirical comparison of SVM and some supervised learning algorithms for vowel recognition. International Journal of Intelligent Information Processing, 2012.
[9] R. Amami, D.B. Ayed, N. Ellouze, Practical selection of SVM supervised parameters with different feature representations for vowel recognition. International Journal of Digital Content Technology and its Applications, 7/2013, pp. 418-424.
[10] D. Palaz, M. Magimai-Doss, R. Collobert, Analysis of CNN-based speech recognition system using raw speech as input. Proceedings of the 16th Annual Conference of the International Speech Communication Association (Interspeech), Dresden, Germany, 6-10 Sept. 2015, pp. 11-15.
[11] O. Farooq, S. Datta, Phoneme recognition using wavelet based features. Information Sciences 150, 2003, pp. 5-15.
[12] D. Palaz, R. Collobert, M. Magimai-Doss, End-to-end phoneme sequence recognition using convolutional neural networks. Idiap, Dec. 2013.
[13] P. Karsmakers, K. Pelckmans, J. Suykens, H. Van Hamme, Fixed size kernel logistic regression for phone classification. Proceedings of Interspeech 2007, Belgium, 2007, ISSN 1990-9772.
[14] S.M. Siniscalchi, P. Schwarz, C.H. Lee, High-accuracy phone recognition by combining high-performance lattice generation and knowledge based rescoring. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2007.
[15] J.S. Garofolo, L.F. Lamel, W.M. Fisher, J.G. Fiscus, D.S. Pallett, N.L. Dahlgren, V. Zue, TIMIT Acoustic-Phonetic Continuous Speech Corpus. Linguistic Data Consortium, Philadelphia, 1993.
[16] R. Jang, Audio Signal Processing and Recognition: 12-2 MFCC, 2005. (available at: http://mirlab.org/jang/books/audiosignalprocessing/speechFeatureMfcc.asp?title=12-2%20MFCC)
[17] S. Young et al., The HTK Book (for HTK Version 3.4). Cambridge University Engineering Department, 2006.
[18] C.B. Do, The Multivariate Gaussian Distribution. Stanford, CA, USA, 2008.
[19] C.M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[20] A.R. Avilla, S.P. Milton, F.J. Fraga, D.D. O'Shaughnessy, T.H. Falk, Improving the performance of far-field speaker verification using multi-condition training: the case of GMM-UBM and i-vector systems. Proceedings of the Fifteenth Annual Conference of the International Speech Communication Association (Interspeech), Singapore, 2014.
[21] A. Mucherino, P.J. Papajorgji, P.M. Pardalos, Data Mining in Agriculture. Springer, ISBN 978-0-387-88614-5, pp. 83-8, 2009.
[22] P. Cunningham, S.J. Delany, k-Nearest neighbour classifiers. Technical Report UCD-CSI-2007-4, Dublin: Artificial Intelligence Group, 2007.
[23] Y. Bengio, Y. Grandvalet, No unbiased estimator of the variance of k-fold cross-validation. Journal of Machine Learning Research, 5:1089-1105, 2004.
[24] A. Biswas, P.K. Sahu, A. Bhowmick, M. Chandra, Feature extraction technique using ERB like wavelet sub-band periodic and aperiodic decomposition for TIMIT phoneme recognition. International Journal of Speech Technology, vol. 17, issue 4, pp. 389-399, December 2014.
[25] P. Grzybek, M. Rusko, Letter, grapheme and (allo-)phone frequencies: the case of Slovak. Glottotheory, vol. 2, no. 1, 2009, pp. 30-48.
[26] I. Ben Fredj, K. Ouni, Optimization of features parameters for HMM phoneme recognition of TIMIT corpus. International Journal of Advanced Research in Electrical, vol. 4, issue 8, Aug. 2015.
[27] M. Brookes, VOICEBOX: A speech processing toolbox for MATLAB. (available at http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html)
[28] I. Nabney, Netlab: Pattern analysis toolbox. (available at https://www.mathworks.com/matlabcentral/fileexchange/2654-netlab)