<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Development of a baseline system for phonemes recognition task</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Maros Jakubec, Eva Lieskovska, Roman Jarina, Michal Chmulik, Michal Kuba Department of Multimedia and Information-Communication Technologies, University of Zilina Univerzitna 8215/1</institution>
          ,
          <addr-line>010 26 Zilina</addr-line>
          ,
          <country>Slovak Republic</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Phoneme recognition is one of the fundamental problems in automatic speech recognition. Despite the great progress in speech recognition, discrimination of isolated phonemes is still a challenging task due to coarticulation and great variability in speaking style. The aim of this work is to develop a system for classification of isolated English vowels from the TIMIT dataset. In the paper, the following conventional methods are compared: a) the k-Nearest Neighbours approach, a simple nonlinear instance-based classifier, and b) the Gaussian Mixture Model, which belongs to the class of probabilistic acoustic modelling techniques. As a front-end, we applied standard mel-frequency cepstral coefficients with their time derivatives. Various experimental methods such as trimming of audio data and cross-validation were used to increase recognition precision and the reliability of system evaluation. The developed system will be used as a baseline for comparison with newer state-of-the-art approaches.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1.1 Related works</title>
      <p>
        Sha and Saul [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] introduced a system for phoneme
recognition. They trained GMMs for multiway classification
using the large-margin principle of SVMs. With MFCCs including
their deltas (time derivatives) and 16 Gaussian mixtures, they
achieved 69.9% accuracy. Deng and Yu [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] used the Hidden
Trajectory Model on a phone recognition task. Similarly,
their feature vectors consisted of joint static cepstra and their deltas.
The resulting accuracy was 75.17%. Hifny and Renals [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
introduced a phonetic recognition system based on the TIMIT
database in which acoustic modelling is achieved through
augmented conditional random fields. They achieved 73.4%
accuracy on the core test set and 77% on the
complete test set. A publication from Mohamed
et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] reports the use of neural networks for acoustic
modelling. The outcome is 79.3% accuracy on the core test set.
      </p>
      <p>Copyright ©2019 for this paper by its authors. Use permitted under
Creative Commons License Attribution 4.0 International (CC BY 4.0).
      </p>
      <p>
        The above-mentioned works focus on different types
of phone sets from the TIMIT database. Several studies
on vowel classification have also been carried out.
Weenink [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] proposed improving vowel classification by
incorporating information about the known speaker into the
process. The goal was to reduce the variance in the vowel space.
The 13 monophthong vowels were selected in the same way as in
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Linear discriminant analysis on bark-scale filter bank
energies was used as the classification method. The author reported
that information about spectral dynamics improved the
classification process. Reduction of the between-speaker
variance and the within-speaker variance resulted in higher
classification accuracy.
      </p>
      <p>
        An empirical comparison of five classifiers was presented
in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. SVM, k-NN, Naive Bayes, Quadratic Bayes Normal
(QDC) and Nearest Mean algorithms were tested for vowel
recognition using the TIMIT corpus. MFCCs were used for
signal parameterization. The results of this experiment show
that the SVM classifier achieved the best performance, while the
QDC classifier had the lowest accuracy. The error rate of
the QDC method decreased by about 10% when the
k-NN-QDC-NB combination was used. Such a combination of
classifiers can be an efficient way to boost the performance of
a machine learning method.
      </p>
      <p>
        Amami et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] conducted a study on different SVM
kernels for multi-class vowel recognition from the TIMIT
corpus. They investigated the optimal kernel parameters
and the regularization parameter. Two
feature types, MFCC and PLP, were also applied.
Middle frames of the vowels and Fuzzy c-means clustering
(FCM) were evaluated to determine the appropriate
front-end analysis. The method based on middle frames
outperformed the FCM method. Three middle frames turned out
to give the best recognition accuracy. Interestingly, the
results showed that the recognition accuracy decreased as the
number of frames increased. Regarding SVM classification,
the accuracy of the vowel system and the runtime improved
with smaller values of the kernel width and the regularization
parameter.
      </p>
      <p>
        Palaz et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] claim that an ASR system based on a
neural network can be built with an end-to-end training
procedure, without separating it into feature
extraction and classifier parts. In the proposed method, the raw
speech waveform was used as the input to a CNN-based
speech recognition system. According to the results on the
TIMIT phoneme recognition and Aurora2 connected-word
recognition tasks, the CNN-based end-to-end system yields
better performance than a system based on standard spectral feature
extraction.
      </p>
      <p>Although an exactly like-for-like comparison of existing systems is not always
possible, Table 1 summarizes some of the most important systems in the field of TIMIT
phoneme recognition over the last twenty years.</p>
      <p>The survey is ranked according to
system accuracy and lists the methods and the feature sets used.</p>
      <sec id="sec-2-1">
        <title>Proposed methods</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Dataset</title>
      <p>The TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC) database [1, 15] was used for classification.
The TIMIT speech corpus contains read speech and is
primarily designed for studying acoustic-phonetic
phenomena and for testing automatic speech recognition
systems. In total, 630 speakers participated in creating this database,
each contributing by reading 10 phonetically rich sentences.
The recordings cover the eight main dialects of American
English. Audio files are recorded at 16 000 Hz with 16-bit resolution. Each audio file is accompanied by metadata files containing phonetic and lexical transcriptions.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Feature extraction methods</title>
      <p>The extraction of appropriate features is one of the basic
tasks in object recognition. In the conventional ASR
front-end, speech is represented by a sequence of feature vectors
retaining particularly useful information from the signal.
A large number of feature extraction approaches are used in ASR. The features
used in our algorithm are described in the
following section.</p>
      <p>
        Mel-frequency cepstral coefficients (MFCCs) are the most
commonly used acoustic features in ASR. MFCCs are
designed to respect the non-linear sound perception of the human
ear [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>
        In our system, the MFCCs are computed as follows (Fig.
1): Pre-emphasis is applied to the speech signal in order
to emphasize its high-frequency components. The next step
is to divide the signal into 16 ms long frames with an overlap
of 1/2 of the frame length. The given frame length was
selected based on previous studies on isolated phoneme
recognition [
        <xref ref-type="bibr" rid="ref11 ref24 ref8">8, 11, 24</xref>
        ]. The number of signal samples per frame (256)
is chosen as a power of 2 due to the use of the FFT. A Hamming
window is applied to the frames to reduce the discontinuities at the
first and last points of each frame. The signal is converted to
the frequency domain using the FFT algorithm. The
magnitude frequency response is then calculated. The
spectrum values are multiplied by a series of 20 triangular
bandpass filters, summed for the individual filters and then
logarithmized.
      </p>
      <p>The triangular filter bank has a linear frequency
distribution on the mel frequency scale:</p>
      <disp-formula><label>(1)</label><tex-math>\mathrm{mel}(f) = 1125 \ln\left(1 + \frac{f}{700}\right)</tex-math></disp-formula>
      <p>where f [Hz] is the frequency in the linear scale and mel(f)
[mel] corresponds to the frequency in the mel scale.</p>
      <p>The last step is to calculate the cepstral coefficients using the
discrete cosine transform (DCT).</p>
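      <p>The framing step and the mel mapping of equation (1) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' MATLAB/Voicebox implementation: the function names and the 0 to 8000 Hz filter range are assumptions, and pre-emphasis, the FFT and the DCT are left out.</p>
      <preformat>
```python
import math

SAMPLE_RATE = 16000      # TIMIT sampling rate in Hz
FRAME_LEN = 256          # 16 ms at 16 kHz
HOP = FRAME_LEN // 2     # overlap of 1/2 of the frame length
NUM_FILTERS = 20         # number of triangular mel filters


def hz_to_mel(f_hz):
    """Equation (1): mel(f) = 1125 * ln(1 + f / 700)."""
    return 1125.0 * math.log(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of equation (1)."""
    return 700.0 * (math.exp(m / 1125.0) - 1.0)


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def split_into_frames(signal, frame_len=FRAME_LEN, hop=HOP):
    """Cut the signal into overlapping frames and apply the Hamming window."""
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames


def mel_filter_centres(num_filters=NUM_FILTERS, f_low=0.0, f_high=SAMPLE_RATE / 2):
    """Centre frequencies (Hz) of filters spaced uniformly on the mel scale.

    The frequency range is an assumption; the paper does not state it.
    """
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / (num_filters + 1)
    return [mel_to_hz(m_low + step * (i + 1)) for i in range(num_filters)]


if __name__ == "__main__":
    # Synthetic 440 Hz tone standing in for a vowel segment.
    tone = [math.sin(2.0 * math.pi * 440.0 * n / SAMPLE_RATE) for n in range(1024)]
    frames = split_into_frames(tone)
    print(len(frames), "frames of", len(frames[0]), "samples")
    print([round(f) for f in mel_filter_centres()[:5]])
```
      </preformat>
      <p>Because the centres are equidistant in mel, they are dense at low frequencies and sparse at high frequencies when expressed in Hz.</p>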
      <p>Fig. 1. Block diagram of the MFCC computation</p>
      <p>An important parameter is also the energy of the frame.
Log energy is usually added as the 13th feature to the MFCCs.
The short-term energy function is defined by:</p>
      <disp-formula><label>(2)</label><tex-math>E_n = \sum_{k=-\infty}^{\infty} \left[ s(k)\, w(n-k) \right]^2</tex-math></disp-formula>
      <p>where s(k) is the signal sample at time k and w(n) is the
corresponding window function. It is then possible to obtain an
average energy value for each frame. The disadvantage of
this characteristic is its high sensitivity to rapid changes in
the signal level. Values of this characteristic can also be used
to separate silence segments from speech segments.</p>
      <p>
        Static features, which are obtained using the procedure
above, do not capture inter-frame changes along the time index.
Therefore, dynamic (or delta) features are commonly
appended to the feature vectors. Delta features are usually
estimates of the time derivatives of the static features and are
computed as described in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
There, Δ[i] denotes the delta coefficient of frame i, which is derived from the
static coefficients of the frames up to M positions away; a typical value for M is 1.
      </p>
      <p>In the developed system, the feature vector consists of 39
elements per frame:
- 12 MFCC,
- 12 delta (ΔMFCC),
- 12 delta-delta (ΔΔMFCC),
- 3 log energy.</p>
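      <p>A minimal sketch of how such a 39-element vector could be assembled is given here. Several things in it are assumptions rather than the authors' method: the delta is a plain symmetric difference with M = 1, which may differ from the formula in [17]; the three log-energy elements are taken to be the log energy with its delta and delta-delta; and the elements are ordered as 13 static, 13 delta and 13 delta-delta values.</p>
      <preformat>
```python
def delta(features, m=1):
    """Symmetric difference delta[i] = c[i + m] - c[i - m], edges clamped.

    This simple form is an assumption; the exact formula of [17] may differ.
    """
    last = len(features) - 1
    out = []
    for i in range(len(features)):
        ahead = features[min(i + m, last)]
        behind = features[max(i - m, 0)]
        out.append([a - b for a, b in zip(ahead, behind)])
    return out


def stack_features(mfcc, log_energy):
    """Build 39-element vectors: 13 static + 13 delta + 13 delta-delta."""
    static = [c + [e] for c, e in zip(mfcc, log_energy)]   # 12 MFCC + log energy
    d1 = delta(static)
    d2 = delta(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


if __name__ == "__main__":
    # Toy input: 5 frames with 12 coefficients each (values are arbitrary).
    mfcc = [[float(i + j) for j in range(12)] for i in range(5)]
    log_energy = [0.1 * i for i in range(5)]
    vectors = stack_features(mfcc, log_energy)
    print(len(vectors), len(vectors[0]))
```
      </preformat>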
    </sec>
    <sec id="sec-5">
      <title>2.3 Classification</title>
      <p>The classification process can be divided into a learning
phase and a testing phase. Thus, the data set needs to be divided into two
subsets. Because of the 10-fold cross-validation evaluation
process (Section 2.4), we selected the same number of vowels from
each class.</p>
      <p>Once the data were split, models of the selected vowels were
trained and tested according to the chosen method. The
general classification scheme can be seen in Fig. 2.
The k-NN and GMM classifiers were chosen for
their easy implementation and good classification properties.
We recall a description of these methods in the following
section.</p>
      <p>
        The Gaussian mixture model (GMM) is used for
modelling of audio features in the feature
space. GMM is defined as the probability density function
formed by a linear superposition of K Gaussian components
[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ][
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] as follows:
      </p>
      <disp-formula><label>(4)</label><tex-math>p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)</tex-math></disp-formula>
      <p>where the probability density function of the multivariate
Gaussian distribution for an n-dimensional vector x is given by:</p>
      <disp-formula><label>(5)</label><tex-math>\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} \, |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x - \mu)^{T} \Sigma^{-1} (x - \mu)\right)</tex-math></disp-formula>
      <p>with mean vector µ ∈ R^n and covariance matrix Σ ∈ R^(n x n).
πk are mixing coefficients, which must satisfy the following
conditions:</p>
      <disp-formula><label>(6)</label><tex-math>0 \le \pi_k \le 1 \quad \text{and} \quad \sum_{k=1}^{K} \pi_k = 1</tex-math></disp-formula>
      <p>The classification function for the proposed GMM classifier
has the following form:</p>
      <disp-formula><label>(7)</label><tex-math>f(x) = \arg\max_{C} \left( p_C(x) \right)</tex-math></disp-formula>
      <p>where p_C(x) is the GMM of the class C.</p>
      <p>Thus, we are looking for the maximal probability over all
classes.</p>
      <p>The training algorithm, which returns a set of parameters
Θ = {µ, Σ, π} for each class, is based on the Maximum
Likelihood (ML) criterion. Given the model p(x | Θ) with
unknown parameters, the aim is to derive its parameters based on
the training data, i.e. the set of feature vectors {x1, x2, …, xm}.
The ML method uses the Fisher likelihood function, which is
defined as:</p>
      <disp-formula><label>(8)</label><tex-math>L(x_1, x_2, \dots, x_m \mid \Theta) = \prod_{i=1}^{m} p(x_i \mid \Theta)</tex-math></disp-formula>
      <p>The maximum of this function with respect to the unknown
parameters Θ can be formalized as follows:</p>
      <disp-formula><label>(9)</label><tex-math>\hat{\Theta} = \arg\max_{\Theta} \sum_{i=1}^{m} \log p(x_i \mid \Theta)</tex-math></disp-formula>
      <p>
        The maximization defined by (9) is a complicated task
that does not have a closed-form solution. The
expectation-maximization (EM) algorithm [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] is used for finding
maximum likelihood solutions.
      </p>
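      <p>To make the decision rule concrete, the following Python sketch evaluates a mixture density in the log domain and picks the class with the highest likelihood. It is a simplified illustration under stated assumptions, not the authors' Netlab-based implementation: covariances are diagonal (the paper also uses full matrices), the per-frame log-likelihoods of a vowel segment are simply summed, the model parameters are set by hand instead of being trained with EM, and all function names are hypothetical.</p>
      <preformat>
```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a multivariate Gaussian with diagonal covariance."""
    n = len(x)
    log_det = sum(math.log(v) for v in var)
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + maha)


def log_gmm(x, gmm):
    """Log of the mixture density; gmm is a list of (weight, mean, var)."""
    terms = [math.log(w) + log_gaussian_diag(x, mu, var) for w, mu, var in gmm]
    top = max(terms)                      # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))


def classify(frames, models):
    """Return the class whose GMM gives the highest total log-likelihood."""
    scores = {label: sum(log_gmm(x, gmm) for x in frames)
              for label, gmm in models.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Hand-set toy models in 2-D; real models would be trained with EM.
    models = {
        "aa": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
        "iy": [(0.7, [5.0, 5.0], [1.0, 1.0]), (0.3, [6.0, 4.0], [2.0, 2.0])],
    }
    print(classify([[0.2, 0.1], [0.9, 1.2]], models))
    print(classify([[5.1, 4.8]], models))
```
      </preformat>
      <p>Working with log-likelihoods avoids numerical underflow when many frames are combined.</p>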
      <p>Training a GMM statistical model for each single vowel
is demanding in terms of both computing power and memory.
Fitting the model also suffers from the lack of a sufficient
amount of training data. It is therefore advisable to train a
universal generic model (the so-called Universal Background
Model, UBM), which represents the possible distribution of
the features for a wide group of sounds, and then derive from
it the class-specific model for an individual vowel. An adaptation
procedure is used to adapt the UBM to the vowel model (i.e.
class-specific GMM). In the presented experiments, only the
mean vectors of the UBM were adjusted to obtain the individual
models.</p>
      <p>Given a sequence of feature vectors X = {x1, x2, ..., xT}
from one class of vowels, the score is expressed by (10),
where θv and θUBM denote the actual vowel model and the
universal model, respectively. According to (10), the greater
the probability p(xt | θv) relative to the background model for as
many feature vectors as possible, the stronger the support for
the hypothesis that the recognized audio sample belongs to
the given vowel class.</p>
      <disp-formula><label>(10)</label><tex-math>\mathrm{score}(X) = \frac{1}{T} \sum_{t=1}^{T} \log \frac{p(x_t \mid \theta_v)}{p(x_t \mid \theta_{UBM})}</tex-math></disp-formula>
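      <p>The score in (10) is an average log-likelihood ratio, which can be sketched as follows. The one-dimensional Gaussians standing in for the vowel model and the UBM, and the function names, are assumptions chosen only to keep the example short; in the paper both models are GMMs over 39-dimensional vectors.</p>
      <preformat>
```python
import math


def log_gauss_1d(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def llr_score(frames, log_p_vowel, log_p_ubm):
    """Mean over frames of log p(x | vowel) - log p(x | UBM), as in (10)."""
    total = sum(log_p_vowel(x) - log_p_ubm(x) for x in frames)
    return total / len(frames)


if __name__ == "__main__":
    # Toy stand-ins: a broad background model and a narrower vowel model.
    ubm = lambda x: log_gauss_1d(x, 0.0, 4.0)
    vowel = lambda x: log_gauss_1d(x, 1.0, 1.0)
    near = llr_score([0.8, 1.1, 1.3], vowel, ubm)
    far = llr_score([-3.0, -2.5, -3.5], vowel, ubm)
    print(near > 0.0, far > 0.0)
```
      </preformat>
      <p>A positive score means that the vowel model explains the frames better than the background model; a negative score means the opposite.</p>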
      <p>
        The k-Nearest Neighbours (k-NN) method is a simple nonlinear
instance-based classification method and is one of the most
popular classical approaches to pattern classification. It classifies
an unknown sample based on the known classification of its
neighbours [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ][
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
      </p>
      <p>The model itself is essentially made up of the training set,
and the learning process consists of storing the patterns from
all training samples in one model. Given an unknown
sample, the distances between the unknown sample and all
the samples in the training set can be computed. Input
attributes must be numeric so that the distance can be
calculated for any two patterns. Samples from the
training set have n numeric attributes, and each sample
represents a point in the n-dimensional space. To
determine the target attribute of an unknown
sample, the classifier searches the sample space of the training set
for the samples that are closest to that unknown sample. The training
set can be defined as:</p>
      <disp-formula><label>(11)</label><tex-math>\{x_i, C_i\}_{i=1,\dots,K}, \quad C_i \in \{1, 2, \dots, L\}</tex-math></disp-formula>
      <p>where xi is a sample with its corresponding label Ci, K is
the size of the whole training set and L is the number of classes
(i.e. the number of vowels). Given an unknown sample x, we are
looking for the sample xk according to the following formula:</p>
      <disp-formula><label>(12)</label><tex-math>\| x_k - x \| = \min_{i=1,\dots,K} \| x_i - x \|</tex-math></disp-formula>
      <p>Subsequently, the sample x is assigned to the same class that
xk belongs to.</p>
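      <p>In the proposed system we used the Euclidean distance,
which is the most commonly used metric for distance
determination, as well as the city-block, Chebyshev and
cosine distance metrics. For two n-dimensional samples x and y, they are defined as follows:</p>
      <disp-formula><tex-math>d_{\mathrm{Euclidean}}(x, y) = \sqrt{\sum_{j=1}^{n} (x_j - y_j)^2}, \quad d_{\mathrm{city\text{-}block}}(x, y) = \sum_{j=1}^{n} |x_j - y_j|</tex-math></disp-formula>
      <disp-formula><tex-math>d_{\mathrm{Chebyshev}}(x, y) = \max_{j} |x_j - y_j|, \quad d_{\mathrm{cosine}}(x, y) = 1 - \frac{x^{T} y}{\|x\| \, \|y\|}</tex-math></disp-formula>
      <p>Two important factors play a role in successful classification:
the choice of the distance function and
the choice of the value for the parameter k (i.e. the
number of neighbours).</p>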
      <p>
        It is advisable to choose an odd number for k to avoid the
scenario in which two class labels achieve the same score.
Some issues need to be considered when selecting the value of k.
Classes with a great number of samples can
overwhelm small ones and bias the results, so
a large value of k is not recommended. On the other hand, the advantage of
using many samples in the training set is not exploited if k is
too small [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
      <p>The disadvantage of this classifier is that all
distances must be calculated for each classification, which can considerably
slow down the process and be computationally
expensive if the training set or the number of unknown
samples is large.</p>
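      <p>The k-NN decision with the four distance metrics named above can be sketched as follows. This is a brute-force illustration with hypothetical function names and toy data, not the authors' MATLAB code; the cosine distance is taken as one minus the cosine similarity, which is an assumption about the convention used.</p>
      <preformat>
```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cityblock(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def cosine(a, b):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def knn_classify(sample, train_x, train_y, k=3, metric=cityblock):
    """Label `sample` by majority vote among its k nearest training samples."""
    order = sorted(range(len(train_x)), key=lambda i: metric(sample, train_x[i]))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy 2-D training set with two vowel labels.
    train_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
               [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
    train_y = ["aa", "aa", "aa", "iy", "iy", "iy"]
    for metric in (euclidean, cityblock, chebyshev, cosine):
        print(metric.__name__, knn_classify([0.2, 0.3], train_x, train_y, 3, metric))
```
      </preformat>
      <p>Every call sorts the distances to all training samples, which illustrates why the method becomes slow for large training sets.</p>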
    </sec>
    <sec id="sec-6">
      <title>2.4 k-fold cross-validation</title>
      <p>
        If there is not a sufficient number of observations, an
appropriate approach to training and testing
is the so-called cross-validation technique
[
        <xref ref-type="bibr" rid="ref23 ref28">23</xref>
        ].
      </p>
      <p>
        The data set is divided into k parts, with one part always
being used for testing, and the remaining k-1 parts being
used for training. The process is repeated so that each part is
used for testing exactly once (Fig. 3). The advantage of
cross-validation is a relatively accurate estimate of the
classification success. The disadvantage is that
it requires more computer memory and more time
because the model has to be retrained at every step.
      </p>
      <p>
        The experiments were performed
on isolated vowels extracted from the TIMIT data set. Two
sets of vowels were created. The first set consists of the 5
classes aa, eh, iy, ow, uh. This subset corresponds to the
common vowels of most European languages (e.g.
‘a’, ‘e’, ‘i’, ‘o’, ‘u’ in Slovak) [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. The second set consists of
18 American English vowels (see Table 4 for a list). The set
of 5 classes was used in the first and second experiments.
Finally, the performance of the developed system was evaluated on
the second set of 18 classes.
      </p>
      <p>
        The proposed algorithms were implemented in MATLAB
2018b with the support of the Voicebox [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] and Netlab [28]
toolboxes.
      </p>
      <p>Classifier training and testing were performed by 10-fold
cross-validation. The data were first randomly divided into 10
equally large subsets. Each of them contained approximately
the same number of vowels represented by the feature
vectors. Nine of them were used to train the model and the
remaining one to test it. This was repeated 10 times, so that all 10
subsets were tested. All data were parameterized by 39
MFCC-based features (incl. deltas and delta-deltas) per 16 ms frame with
8 ms overlap. The feature matrix dimension for each vowel
was 10800x39 (frames x features).</p>
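      <p>The 10-fold procedure can be sketched in Python as follows. The function names, the fixed random seed and the majority-label scorer are assumptions made for illustration; the split is a plain random one and does not balance classes or dialects across folds.</p>
      <preformat>
```python
import random


def k_fold_indices(num_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(samples, labels, train_and_score, k=10, seed=0):
    """Average the accuracy returned by `train_and_score` over k folds."""
    folds = k_fold_indices(len(samples), k, seed)
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(samples)) if i not in held_out]
        scores.append(train_and_score(
            [samples[i] for i in train_idx], [labels[i] for i in train_idx],
            [samples[i] for i in test_idx], [labels[i] for i in test_idx]))
    return sum(scores) / len(scores)


def majority_baseline(train_x, train_y, test_x, test_y):
    """Toy scorer: always predict the most frequent training label."""
    guess = max(set(train_y), key=train_y.count)
    return sum(1 for y in test_y if y == guess) / len(test_y)


if __name__ == "__main__":
    # 100 dummy samples: 60 labelled "aa" and 40 labelled "iy".
    samples = [[float(i)] for i in range(100)]
    labels = ["aa"] * 60 + ["iy"] * 40
    print(round(cross_validate(samples, labels, majority_baseline), 3))
```
      </preformat>
      <p>Any classifier can be plugged in through the scoring callback, so the same splitting code serves both the k-NN and the GMM experiments.</p>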
      <p>The results of the experiments with 5-vowel classification
using k-NN and the simple GMM are shown in Tables 2
and 3, respectively. The tables show the results achieved for
various k-NN settings (type of metric and number of
neighbours) and GMM settings (number of Gaussians and
covariance matrix type). An effort was made
to achieve better classification accuracy by editing the data.
First, the entire database was shuffled so that the speech
dialects were evenly distributed between the training and the
test part. Another data modification was vowel trimming, i.e.
omitting the first and last frames of each vowel recording,
so that silent parts as well as parts affected by coarticulation
or imprecise vowel boundary detection were not taken into
account. In addition, the middle frames are known to contain
the most important information about the vowel. The
modified data are referred to as D2, while D1 denotes the original data.
GMM achieved the best success rate of 91.1% with n = 32
Gaussians and a full covariance matrix. The comparison of the
best results for 5 vowels achieved by the above-mentioned
methods is shown in Fig. 4.</p>
      <p>Fig. 4. The comparison of classification of 5 selected
vowels</p>
      <p>In the last experiment, testing was performed on a larger
set of classes: 18 vowels of American English were
selected. The data needed for UBM training were selected from
other recordings available in the database. A total of 4600
recordings from 510 speakers, with a total length of
approximately 3 hours and 54 minutes, were used to train the
UBM. The front-end with data manipulation is the
same as in the experiments with the recognition of 5 vowels
(referred to as D2 in the text above). Experiments with the
GMM-UBM training/classification approach were also added.
Fig. 5 shows the best results achieved. Interestingly, the
k-NN algorithm outperformed both the GMM and GMM-UBM
approaches. It achieved 84.2% vowel recognition accuracy
with k = 5 neighbours and the city-block metric. The second
most successful system was GMM-UBM, which achieved a
success rate of 78.1% with n = 256 Gaussians and a full
covariance matrix. The GMM classifier had the worst performance,
probably due to an insufficient amount of training
data. It achieved a success rate of 75.5% with n = 16
Gaussians and a full covariance matrix.</p>
      <p>Table 4 shows the classification of the individual vowels
for the best k-NN model settings in the form of a confusion
matrix. The data in the table indicate the performance of the
algorithm as well as the falsely recognized vowels. This is the
best way to see how the system works when recognizing
individual vowels. The diagonal shows the correctly
classified vowels. The remaining entries in each line specify incorrectly identified
vowels. The final success rate in percent is also stated.</p>
      <p>A significant improvement can be seen for both classification
methods if only the stationary middle part of the vowels is
analysed (D2). With the k-NN method, a success rate of 95.08%
was achieved with k = 3 neighbours and the city-block metric.
In the 18-vowel experiment, the total number of correctly classified vowels was 4548
out of 5400, which corresponds to the success rate of 84.2%. As
seen from Fig. 5 and Table 4, in the case of k-NN, the
vowels aa, ae, ao, aw, and ux were recognized best, while
for the vowels ax, eh, and ix, a considerable number of
samples were misclassified. Note that with the GMM-UBM
classifier, the largest recognition errors occurred in a different group
of vowels (see Fig. 5). The largest difference in recognition
rate between k-NN and GMM-UBM is in the case of the
vowels aa, ux, ix. Fig. 5 also shows an imbalance between
simple GMM and GMM-UBM (theoretically,
GMM-UBM should outperform GMM in all cases).
Further optimization of GMM-UBM is probably required.</p>
      <p>The phoneme recognition task on the TIMIT database has been
the subject of many years of intensive research. A number
of systems exist, and their classification success has naturally
improved over time. The results presented in this paper are
comparable to the existing research reported in the literature
(see section 1.1). However, it is not possible to compare
these works directly with our system because of the different
parameters and experimental settings that have been used.</p>
      <sec id="sec-6-1">
        <title>4 Conclusion</title>
        <p>This work deals with the design of a system for
recognition of isolated vowels extracted from the TIMIT
dataset and subsequent optimization of the training
algorithm. Three different approaches to phoneme
classification were compared: k-NN, GMM, and GMM-UBM. The
k-NN method achieved the best results, with an overall accuracy
of 95.08% for 5-vowel and 84.2% for 18-vowel
recognition. GMM-UBM gave comparable results for 18-vowel
recognition, but the classification error was distributed
differently among the vowel classes than in the case of k-NN.
This recognition imbalance between the k-NN and GMM
approaches needs further investigation.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Acknowledgment</title>
        <p>This publication is the result of the project
implementation: Centre of excellence for systems and
services of intelligent transport II, ITMS 26220120050
supported by the Research &amp; Development Operational
Programme funded by the ERDF.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lopes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Perdigao</surname>
          </string-name>
          ,
          <article-title>Phone recognition on the TIMIT database</article-title>
          . Speech Technologies,
          <source>IntechOpen</source>
          <year>2011</year>
          , pp.
          <fpage>285</fpage>
          -
          <lpage>302</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Sha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. K.</given-names>
            <surname>Saul</surname>
          </string-name>
          ,
          <article-title>Large margin Gaussian mixture modelling for phonetic classification and recognition</article-title>
          .
          <source>Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing</source>
          ,
          <year>2006</year>
          (ICASSP), France, May
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Use of differential cepstra as acoustic features in hidden trajectory modelling for phonetic recognition</article-title>
          .
          <source>Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hifny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Renals</surname>
          </string-name>
          ,
          <article-title>Speech recognition using augmented conditional random fields</article-title>
          .
          <source>IEEE Transactions on Audio, Speech &amp; Language Processing</source>
          , vol.
          <volume>17</volume>
          , no.
          <issue>2</issue>
          ,
          <year>2009</year>
          , pp.
          <fpage>354</fpage>
          -
          <lpage>365</lpage>
          , ISSN 1558-7916.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          , G. Dahl, G. Hinton,
          <article-title>Acoustic Modeling using Deep Belief Networks</article-title>
          ,
          <source>IEEE Transactions on Audio, Speech, and Language Processing</source>
          , ISSN 1558-7916,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Weenink</surname>
          </string-name>
          ,
          <article-title>Vowels normalizations with the TIMIT acoustic phonetic speech corpus</article-title>
          .
          <source>Institute of Phonetic Sciences, University of Amsterdam, Proceedings</source>
          <volume>24</volume>
          ,
          <fpage>117</fpage>
          -
          <lpage>123</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.M.</given-names>
            <surname>Meng</surname>
          </string-name>
          , V.W. Zue, “
          <article-title>Signal representation comparison for phonetic classification”</article-title>
          ,
          <source>in IEEE Proc. ICASSP</source>
          , Toronto,
          <fpage>285</fpage>
          -
          <lpage>288</lpage>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Amami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.B.</given-names>
            <surname>Ayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ellouze</surname>
          </string-name>
          ,
          <article-title>An Empirical Comparison of SVM and Some Supervised Learning Algorithms for Vowel recognition</article-title>
          .
          <source>In: International Journal of Intelligent Information Processing</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Amami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.B.</given-names>
            <surname>Ayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ellouze</surname>
          </string-name>
          .
          <article-title>Practical selection of svm supervised parameters with different feature representations for vowel recognition</article-title>
          .
          <source>Int J Digit Content Technol Appl</source>
          ,
          <volume>7</volume>
          /2013, pp.
          <fpage>418</fpage>
          -
          <lpage>424</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Palaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Magimai.-Doss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Collobert</surname>
          </string-name>
          ,
          <article-title>Analysis of CNN-based Speech Recognition System using Raw Speech as Input</article-title>
          .
          <source>In Proceedings of the 16th Annual Conference of International Speech Communication Association (Interspeech)</source>
          , Dresden, Germany, 6-10 Sept.
          <year>2015</year>
          ; pp.
          <fpage>11</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>O.</given-names>
            <surname>Farooq</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Datta</surname>
          </string-name>
          ,
          <article-title>Phoneme recognition using wavelet based features</article-title>
          ,
          <source>Information Sciences</source>
          , vol.
          <volume>150</volume>
          ,
          <year>2003</year>
          , pp.
          <fpage>5</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Palaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Collobert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Magimai-Doss</surname>
          </string-name>
          ,
          <article-title>End-to-end Phoneme Sequence Recognition using Convolutional Neural Networks</article-title>
          . Idiap, Dec.
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Karsmakers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Pelckmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Suykens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Van Hamme</surname>
          </string-name>
          ,
          <article-title>Fixed size kernel logistic regression for phone classification</article-title>
          .
          <source>Proceedings of Interspeech 2007</source>
          , ISSN 1990-9772, Belgium,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.M.</given-names>
            <surname>Siniscalchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schwarz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>High-accuracy phone recognition by combining high-performance lattice generation and knowledge based rescoring</article-title>
          .
          <source>Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.S.</given-names>
            <surname>Garofolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.F.</given-names>
            <surname>Lamel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.M.</given-names>
            <surname>Fisher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.G.</given-names>
            <surname>Fiscus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.S.</given-names>
            <surname>Pallett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.L.</given-names>
            <surname>Dahlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Zue</surname>
          </string-name>
          ,
          <source>TIMIT Acoustic-Phonetic Continuous Speech Corpus</source>
          . Linguistic Data Consortium, Philadelphia,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <source>Audio Signal Processing and Recognition: 12-2 MFCC</source>
          (
          <year>2005</year>
          ), (available at: http://mirlab.org/jang/books/audiosignalprocessing/speechFeatureMfcc.asp?title=12-2%20MFCC).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Young</surname>
          </string-name>
          , et al.,
          <source>The HTK Book (for HTK Version 3.4)</source>
          , Cambridge University Engineering Department,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>C.B.</given-names>
            <surname>Do</surname>
          </string-name>
          ,
          <article-title>The Multivariate Gaussian Distribution</article-title>
          . Stanford, CA, USA,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>C.M.</given-names>
            <surname>Bishop</surname>
          </string-name>
          ,
          <source>Pattern Recognition and Machine Learning</source>
          . Springer,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Avila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sarria-Paja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. J.</given-names>
            <surname>Fraga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. D.</given-names>
            <surname>O'Shaughnessy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Falk</surname>
          </string-name>
          ,
          <article-title>Improving the Performance of Far-Field Speaker Verification Using Multi-Condition Training: The Case of GMM-UBM and i-vector Systems</article-title>
          .
          <source>Proceedings of the Fifteenth Annual Conference of the International Speech Communication Association</source>
          , Singapore,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mucherino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.J.</given-names>
            <surname>Papajorgji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.M.</given-names>
            <surname>Pardalos</surname>
          </string-name>
          ,
          <source>Data mining in agriculture</source>
          . Springer Dordrecht Heidelberg London New York, ISBN 978-0-387-88614-5, pp.
          <fpage>83</fpage>
          -
          <lpage>8</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>P.</given-names>
            <surname>Cunningham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.J.</given-names>
            <surname>Delany</surname>
          </string-name>
          ,
          <article-title>k-Nearest neighbour classifiers</article-title>
          .
          <source>Technical Report UCD-CSI-2007-4</source>
          , Dublin: Artificial Intelligence Group,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Grandvalet</surname>
          </string-name>
          .
          <article-title>No unbiased estimator of the variance of k-fold cross-validation</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>5</volume>
          :
          <fpage>1089</fpage>
          -
          <lpage>1105</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Biswas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.K.</given-names>
            <surname>Sahu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhowmick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chandra</surname>
          </string-name>
          ,
          <article-title>Feature extraction technique using ERB like wavelet sub-band periodic and aperiodic decomposition for TIMIT phoneme recognition</article-title>
          .
          <source>International Journal of Speech Technology</source>
          , Volume
          <volume>17</volume>
          , Issue 4, pp
          <fpage>389</fpage>
          -
          <lpage>399</lpage>
          , December
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>P.</given-names>
            <surname>Grzybek</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Rusko</surname>
          </string-name>
          ,
          <article-title>Letter, Grapheme and (Allo-)Phone Frequencies: The Case of Slovak</article-title>
          ,
          <source>Glottotheory</source>
          , vol.
          <volume>2</volume>
          , No.
          <issue>1</issue>
          ,
          <year>2009</year>
          , pp
          <fpage>30</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>I.</given-names>
            <surname>Ben Fredj</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Ouni</surname>
          </string-name>
          ,
          <article-title>Optimization of Features Parameters for HMM Phoneme Recognition of TIMIT Corpus</article-title>
          ,
          <source>International Journal of Advanced Research in Electrical</source>
          , Vol.
          <volume>4</volume>
          , Issue
          <issue>8</issue>
          , Aug.
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>M.</given-names>
            <surname>Brookes</surname>
          </string-name>
          ,
          <source>VOICEBOX: A speech processing toolbox for MATLAB</source>
          (available at http://www.ee.ic.ac.uk/... hp/staff/dmb/voicebox/voicebox.html).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>I.</given-names>
            <surname>Nabney</surname>
          </string-name>
          ,
          <source>Netlab: Pattern analysis toolbox</source>
          (available at https://www.mathworks.com/matlabcentral/fileexchange/2654-netlab).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>