<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A comparison of formant and CNN models for vowel frame recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ondrej Šuch</string-name>
          <email>ondrejs@savbb.sk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Santiago Barreda</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anton Mojsej</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Mathematical Institute, Slovak Academy of Sciences, 955 01 Banská Bystrica</institution>
          ,
          <country country="SK">Slovakia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of California Davis</institution>
          ,
          <addr-line>Davis, CA 95616</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Žilinská Univerzita v Žiline</institution>
          ,
          <addr-line>Univerzitná 8215/1, 010 26 Žilina</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Models for the classification of vowels are of continuing interest both in phonetics and in the development of automatic speech recognition (ASR) systems. Phonetics researchers favor linear classifiers based on formants, whereas ASR systems have adopted deep neural networks in recent years. In our work we compare the performance of several kinds of convolutional neural networks (CNN) with a linear classifier on the task of classifying short vowel frames from the TIMIT corpus. Our primary hypothesis was that the CNN models would prove significantly more precise than linear models, including during consonant-vowel and vowel-consonant transitions, while obviating the inherent difficulties of formant tracking. We confirmed the hypothesis, although the improvement was modest. Our secondary goal was to investigate the possibility of mitigating the loudness sensitivity of CNN models, and to determine whether the mitigation would have a deleterious effect on their classification performance. Our experiments indicate that the loudness-invariant CNN architecture performs as well as traditional spectrum-based convolutional models.</p>
      </abstract>
      <kwd-group>
        <kwd>MNIST</kwd>
        <kwd>convolutional network</kwd>
        <kwd>pairwise coupling</kwd>
        <kwd>one-on-one classification</kwd>
        <kwd>binary classification</kwd>
        <kwd>dropout</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        A difference in a single English vowel may convey
nearly a dozen different meanings (e.g. the hVd examples in
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). Models that people use to differentiate vowels have
been under study for many decades, both in phonetics and
in automated speech recognition (ASR) systems.
Recognition models can be broadly divided into two
paradigms based on their use of the dynamic information
in vowel sounds.
      </p>
      <p>
        First, one may idealize a vowel as a stationary sound.
This is a reasonable approximation since speakers can
stretch almost any vowel to many times its ordinary
duration without much effort. This view was adopted for
instance in classical studies of English vowels in [
        <xref ref-type="bibr" rid="ref1 ref2">2, 1</xref>
        ], or
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Second, one may view a vowel as a dynamic entity
evolving in time both to accommodate its acoustic
surroundings and to express inherent vowel dynamics.
This view fits much better the acoustics observed in
conversational speech, where higher articulation rates lead
speakers to shorten stationary vowel centers to the point
of nearly eliminating them. This view was already hinted
at in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], reiterated in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and thoroughly examined in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>A model belonging to the first paradigm is sometimes
called a frame recognition model, since it can be trained
based on extracted vowel windows (frames), and it can
categorize a vowel given a previously unseen sound
window (frame). We will call a model belonging to the second
paradigm a segmental model.</p>
      <p>
        There are two principal methodologies that are used to
create frame recognition models. For phonetics, the
typical approach is to create a linear model based on the first
three formants, which are known to be key determinants
of vowel quality. The second methodology uses a powerful
class of classification models that has garnered much
attention in automated speech recognition systems, namely
convolutional neural networks (CNN) [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ].
      </p>
      <p>
        The goal of this work is to compare, both qualitatively
and quantitatively, classification results obtained by the
classical phonetic approach based on formants and the
newer method based on convolutional neural networks,
with an eye to the suitability of adopting CNNs for
phonetic analyses. A key potential problem is the known
dependence of deep neural network models on large amounts
of data [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. While large amounts of data are commonly
used in ASR research, phonetics datasets are usually much
smaller.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Dataset</title>
      <p>
        The performance of classifiers is easiest to investigate in
cases when they are faced with a difficult recognition
problem. A well-known pair of vowels that repeatedly shows
up in listening experiments as difficult to distinguish is the
aa/ao pair [
        <xref ref-type="bibr" rid="ref1 ref3 ref8">1, 3, 8</xref>
        ]. Therefore we looked for vowels that
could be confused with this pair of vowels.
      </p>
      <p>
        We opted to use the TIMIT corpus [
        <xref ref-type="bibr" rid="ref10">9</xref>
        ], which has been
the focus of many automated speech recognition studies.
Compared to single syllable laboratory recordings, the
corpus has the advantage of being closer to natural speech,
and it comes with annotations dividing speech into
individual phonemes.
      </p>
      <p>
        We used vowel segments of male speakers from SA1
and SA2 sentences of the TIMIT corpus (we excluded
female speakers due to the well-known problem of F1
determination for voices with high fundamental frequency,
as well as due to likely separate models required for
different sexes [
        <xref ref-type="bibr" rid="ref11 ref12">10, 11</xref>
        ]). The segments were the first
vocalic segments in the words “dark”, “wash”, “water”, “all”,
from SA1 sentences and “don’t”, “oily” from SA2
sentences. We excluded segments from those instances of
the word “oily” that were annotated with three vocalic
segments. Tables 1 and 2 indicate the phonetic annotation of
the segments chosen by the TIMIT authors. The tables
indicate that the words are prototypes respectively for
segments /ao/, /aa/, /ow/ and /oy/.
      </p>
      <p>
        In this work we used classes implied by the words,
rather than the classes provided by TIMIT annotators.
The primary reason is that those classes are less
ambiguous, and possibly more reliable [
        <xref ref-type="bibr" rid="ref13">12</xref>
        ]. This approach also
matches the current trend in ASR to directly output lexical
labels [
        <xref ref-type="bibr" rid="ref14 ref15 ref16">13, 14, 15</xref>
        ]. Our decision makes little difference for
words “all”, “dark”, “don’t”, “oily”, because these are
typically annotated with a single TIMIT phoneme label. But
it may affect classification results of the phonemes from
words “wash”, “water”, which were mostly labelled with
/aa/ and /ao/ annotations in the TIMIT corpus.
      </p>
      <p>
        The analysis window was set to 24 ms (384
samples). For each vowel, 21 frames were extracted,
equally spaced within the segment duration as indicated
by the TIMIT segmentation boundaries. Altogether there were
109410 samples, divided between the TRAIN and TEST
sets approximately in the ratio 2:1. Extraction of formants
and LPC spectra was performed using the phonTools R
package [
        <xref ref-type="bibr" rid="ref17">16</xref>
        ]. We normalized features for the CNN, LSTM and
k-NN models, which we describe in detail in the next
section.
      </p>
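      <p>For illustration, the following Python/NumPy sketch shows one way the
frame extraction just described could be implemented: 21 equally spaced 24 ms
(384-sample) windows are cut from an annotated segment and a 192-point Fourier
log-periodogram is computed for each window. It is not the code used in this
study (formant and LPC extraction was done with the phonTools R package), and
the function names are ours.</p>
      <preformat>
# Illustrative sketch (not the code used in this study): cutting 21 equally
# spaced 24 ms frames (384 samples at 16 kHz) from one annotated vowel
# segment and computing a 192-point Fourier log-periodogram per frame.
import numpy as np

def extract_frames(waveform, seg_start, seg_end, n_frames=21, frame_len=384):
    """Return an (n_frames, frame_len) array of analysis windows.

    seg_start and seg_end are sample indices of the TIMIT segment
    boundaries; frame centres are spaced evenly across the segment.
    """
    centres = np.linspace(seg_start, seg_end, n_frames).astype(int)
    half = frame_len // 2
    frames = []
    for c in centres:
        start = max(c - half, 0)
        frame = waveform[start:start + frame_len]
        # zero-pad frames that run past the end of the recording
        frame = np.pad(frame, (0, frame_len - len(frame)))
        frames.append(frame)
    return np.stack(frames)

def log_periodogram(frame, n_bins=192):
    """Log-magnitude spectrum of one frame (the CNN-A style input)."""
    windowed = frame * np.hanning(len(frame))
    magnitudes = np.abs(np.fft.rfft(windowed))[:n_bins]
    return 20.0 * np.log10(magnitudes + 1e-10)
      </preformat>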
    </sec>
    <sec id="sec-2a">
      <title>Classification models</title>
      <p>
        We considered three types of frame classification models.
The first was the logistic generalized linear model (GLM)
[
        <xref ref-type="bibr" rid="ref20 ref21">19, 20</xref>
        ], which is of the kind commonly used in phonetic
studies of vowel quality. The second was a
convolutional neural network [
        <xref ref-type="bibr" rid="ref22">21</xref>
        ], which is a powerful
classifier used in image processing, but also commonly applied
in deep learning ASR systems. We expected that a
linear model might have a problem capturing the consonant-vowel
and vowel-consonant transitions at the starts and ends
of the segments. Therefore we also included the k-nearest
neighbor (k-NN) non-parametric classifier [
        <xref ref-type="bibr" rid="ref20 ref21">19, 20</xref>
        ], which
should be able to learn nonlinear boundaries.
      </p>
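      <p>As a hedged illustration of the formant baseline, the sketch below fits
a pairwise logistic GLM on the first three formants using scikit-learn; the
paper does not state the fitting software, and the formant values in the
example are invented.</p>
      <preformat>
# Hedged illustration of the pairwise formant baseline: a logistic GLM on
# F1-F3 for one word pair. scikit-learn is used only for brevity; the formant
# values below are invented for the example.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_pairwise_glm(formants, labels):
    """formants: (n_frames, 3) array of F1, F2, F3 in Hz; labels: 0/1 word identity."""
    model = LogisticRegression()
    model.fit(formants, labels)
    return model

X = np.array([[700.0, 1100.0, 2540.0],   # a frame from "dark" (/aa/-like)
              [570.0,  840.0, 2410.0]])  # a frame from "all" (/ao/-like)
y = np.array([1, 0])
glm = fit_pairwise_glm(X, y)
print(glm.predict_proba(X))
      </preformat>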
      <p>Neural network models. Currently, the search for neural
network architectures is an active research field. We used ad
hoc designs as shown in Figure 2.</p>
      <p>For convolutional networks we used the architectures
shown in the upper part of Figure 2. All convolutional
layers were 1-D with a kernel of length 3; we used 5 neurons
on the first layer and 10 on the second. ReLU
nonlinearity was used in both the convolutional and dense layers. The
dense layer had 32 neurons. Finally, we distinguish several
variations of the model based on the provided spectral input:
CNN-A. The input was a 192-point Fourier
log-periodogram.</p>
      <p>CNN-B. The input was an LPC spectrum of order 19,
sampled at 192 equally spaced frequencies.</p>
      <p>The purpose of the variations was to try providing the
CNN with data closer to the data that formant models use
(e.g. the third formant F3 almost never exceeds
3.5 kHz).</p>
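      <p>The following Keras sketch is one possible realization of the CNN-A
frame classifier described above (two 1-D convolutional layers with kernel
length 3 and 5 and 10 neurons, ReLU, and a 32-neuron dense layer). The output
layer, optimizer and the absence of pooling are our assumptions, not details
given in the text.</p>
      <preformat>
# A possible Keras realisation of CNN-A; only the layer sizes come from the
# text, the remaining choices are assumptions.
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn_a(input_bins=192):
    model = keras.Sequential([
        # 192-point log-periodogram treated as a 1-D signal with one channel
        layers.Conv1D(5, 3, activation="relu", input_shape=(input_bins, 1)),
        layers.Conv1D(10, 3, activation="relu"),
        layers.Flatten(),
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # pairwise (two-word) decision
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
      </preformat>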
      <p>Finally, let us explain the rationale behind the
intensity-invariant models CNN-C and CNN-D. If the sound source
is further away from the listener, the sound is obviously
quieter. In the context of frame classification, there should
be no change in prediction if the sound envelope is
uniformly shifted by a constant. Admittedly, the
prediction might change if more context were available,
because a decrease of intensity may then constitute a
phonological contrast (e.g. syllabic stress). Let us note that any
model based on formants satisfies this invariance property
by virtue of ignoring formant amplitudes. To achieve an
equivalent property with a CNN, one can transform the
input tensor. Thus instead of supplying the input vector s
representing the log-magnitudes of spectral components,
s = (s1, s2, ..., sM),   (1)
we provide as input the vector N(s) defined as follows:
N(s) := (s2 - s1, s3 - s1, ..., sM - s1).   (2)
The key feature of this transform is that sounds
differing purely in intensity, say by d decibels, map to the same
input vector, since</p>
      <p>N(s1 + d, s2 + d, ..., sM + d) = N(s1, s2, ..., sM).   (3)
Moreover, the topographic distribution of the input vector
components, which is important for a CNN, is maintained,
and only a very low frequency component is missing,
which should not be relevant for vowel identification.</p>
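      <p>A minimal sketch of the transform in equation (2), illustrating the
invariance stated in equation (3):</p>
      <preformat>
# Subtract the first log-magnitude bin from the rest; adding d dB to every
# bin then leaves the CNN input unchanged.
import numpy as np

def intensity_invariant(s):
    """s: array of log-magnitudes (s1, ..., sM); returns N(s) of length M-1."""
    return s[1:] - s[0]

s = np.array([10.0, 14.0, 9.0, 7.5])
d = 6.0  # a uniform level change in decibels
assert np.allclose(intensity_invariant(s), intensity_invariant(s + d))
      </preformat>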
      <sec id="sec-2-1">
        <title>Segment classification models</title>
        <p>
          We considered two kinds of segmental models. First, we
supplied frame information from two frames: the onset at
10% and the offset at the 90% position. This is inspired by the
research of Nearey and Morrison [
          <xref ref-type="bibr" rid="ref23">22</xref>
          ], who indicated that
onset + offset data fits well with perceptual experiments.
We considered two implementations of the model: a CNN
with stacked onset+offset spectral information, as shown in
the upper right of Figure 2, and a logistic regression (GLM) model.
        </p>
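        <p>A small sketch of how the stacked onset+offset input could be
formed; stacking the two spectra along a channel axis is our reading of
Figure 2, not a detail stated in the text.</p>
        <preformat>
# The log spectra of the 10% and 90% frames become two channels of one tensor.
import numpy as np

def stack_onset_offset(onset_spec, offset_spec):
    """Both arguments are length-192 log-spectra; the result has shape (192, 2)."""
    return np.stack([onset_spec, offset_spec], axis=-1)
        </preformat>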
        <p>
          Second, we trained an LSTM (long short-term memory)
network [
          <xref ref-type="bibr" rid="ref24">23</xref>
          ] with Fourier and LPC spectra (denoted
respectively LSTM-A and LSTM-B) with the architecture
shown at the bottom of Figure 2. The LSTM layer used
4 units.
        </p>
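        <p>A minimal Keras sketch consistent with the LSTM-A description (a
4-unit LSTM over the sequence of 21 frame spectra); the output layer and
training settings are our assumptions.</p>
        <preformat>
# Sketch of the LSTM-A segmental model; only the 4-unit LSTM and the
# 21-frame, 192-bin input come from the text.
from tensorflow import keras
from tensorflow.keras import layers

def build_lstm_a(n_frames=21, input_bins=192):
    model = keras.Sequential([
        layers.LSTM(4, input_shape=(n_frames, input_bins)),
        layers.Dense(1, activation="sigmoid"),  # pairwise (two-word) decision
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
        </preformat>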
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>
        We have opted for a pairwise classification paradigm, where
we classify frames (or segments) from a pair of words at
a time. This decision is motivated primarily by our desire
to elucidate strengths and weaknesses of various models,
which may be confounded in the multiclass setting. We
note that the paradigm has analogues and applications in
phonetics (concept of minimal word pairs [
        <xref ref-type="bibr" rid="ref25">24</xref>
        ], forced A/B
choice experimental design [
        <xref ref-type="bibr" rid="ref26">25</xref>
        ], or comparing accents by
contrasting pairs of vowels in various words [
        <xref ref-type="bibr" rid="ref27">26</xref>
        ]).
      </p>
      <sec id="sec-3-1">
        <title>Summary of accuracy with respect to location</title>
        <p>Interpretability is highly desirable for
recognition models in phonetics. For frame
classification models it is possible to analyze performance
based on the position within the vowel. A summary analysis is
presented in Figure 3, from which we draw the following
conclusions:</p>
        <p>Formant models (GLM as well as k-NN) in several
cases closely follow the convolutional neural networks
(e.g. all-dark, dark-wash, or dark-oily). This
indicates that formant positions are very often sufficient
for classification.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Accuracy improvement with segmental models</title>
        <p>We also trained segmental models in order to gauge the
effect of restricting classification to frames rather than
segments. The results are summarized in Table 3.</p>
        <p>Clearly, segmental models can make good use of the extra
information, since all six segments are well classified. The
most likely explanation is the ability of segmental models
to discriminate consonant-vowel/vowel-consonant
transitions.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Multi-class classification</title>
        <p>
          For ASR applications, especially for the systems based
on hidden Markov models [
          <xref ref-type="bibr" rid="ref28">27</xref>
          ], it is desirable to obtain
multi-class likelihoods. Passage from pairwise models
to multi-class ones can be achieved by any of multiple
pairwise-coupling techniques [
          <xref ref-type="bibr" rid="ref29 ref30 ref31">28, 29, 30</xref>
          ]. We performed
this coupling using the second method suggested in [
          <xref ref-type="bibr" rid="ref30">29</xref>
          ],
which is widely used in machine learning [
          <xref ref-type="bibr" rid="ref32">31</xref>
          ]. The
multiclass models were composed of 6 pairwise models trained
on frames from the first syllables of all possible pairs
of words “all”, “dark”, “don’t”, and “oily”. We used
those four words, because they are the prototypes of the
four phonemes /ao/, /aa/, /ow/, /oy/, that account for most
TIMIT annotations of our dataset (see Tables 1 and 2). The
likelihood plots are shown in Figure 4.
        </p>
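        <p>For illustration, the sketch below reimplements pairwise coupling
along the lines of the second method of [
        <xref ref-type="bibr" rid="ref30">29</xref>
        ], using the fixed-point iteration described there; it is a hedged
sketch, not the code used in this study.</p>
        <preformat>
# Hedged reimplementation of pairwise coupling in the spirit of the second
# method of Wu, Lin and Weng [29]: given pairwise estimates
# r[i, j] ~ P(class i | class i or j), a fixed-point iteration yields
# multi-class probabilities p.
import numpy as np

def couple_pairwise(r, n_sweeps=100):
    """r: (k, k) array with r[i, j] + r[j, i] = 1 for i != j; returns p of length k."""
    k = r.shape[0]
    Q = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            if i == j:
                Q[i, i] = np.sum(np.delete(r[:, i], i) ** 2)
            else:
                Q[i, j] = -r[j, i] * r[i, j]
    p = np.full(k, 1.0 / k)
    for _ in range(n_sweeps):
        for i in range(k):
            pQp = p @ Q @ p
            off_diag = Q[i] @ p - Q[i, i] * p[i]
            p[i] = (pQp - off_diag) / Q[i, i]
            p = p / p.sum()
    return p

# toy example with three classes
r = np.array([[0.0, 0.7, 0.6],
              [0.3, 0.0, 0.4],
              [0.4, 0.6, 0.0]])
print(couple_pairwise(r))
        </preformat>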
        <p>It is hard to argue which model's predicted likelihoods
are more correct, since there is no ground truth.
Subjectively, one may feel that the CNN models are too certain
about their predictions, especially since the predictions are
considerably less smooth than those of the formant-based
GLM model. The small size of the dataset may be a
possible cause.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>
        Our experiments indicate that it is feasible to construct
CNN models for frame classification from medium-sized
phonetic corpora. The resulting models showed performance
superior to formant-based models even without extensive
parameter optimization, which is in accordance with the
findings of a recent study on Korean vowels [
        <xref ref-type="bibr" rid="ref33">32</xref>
        ]. It is also
possible to make CNN models intensity-invariant, akin to
formant-based models, without noticeable loss of
performance. The only downside we observed (Section 4.3) was
that the likelihoods predicted by CNN models lacked
temporal smoothness compared to the formant model.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgment</title>
      <p>Work on this paper was partially supported by grant
VEGA 2/0144/18. We are thankful to Peter Tarábek for
his suggestion to use the stacked input architecture for
onset+offset CNN models, which proved much more
accurate than the previously investigated branched architecture.
</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Peterson</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barney</surname>
            ,
            <given-names>H.L.</given-names>
          </string-name>
          :
          <article-title>Control methods used in a study of the vowels</article-title>
          ,
          <source>J. Acoust. Soc. Am</source>
          .
          <volume>24</volume>
          (
          <year>1952</year>
          )
          <fpage>175</fpage>
          -
          <lpage>184</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Steinberg</surname>
            ,
            <given-names>J. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potter</surname>
            ,
            <given-names>R. K.</given-names>
          </string-name>
          :
          <article-title>Toward the specification of speech</article-title>
          ,
          <source>J. Acoust. Soc. Am</source>
          .
          <volume>22</volume>
          (
          <year>1950</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Hillenbrand</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Getty</surname>
            ,
            <given-names>L.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wheeler</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Acoustic characteristics of American English vowels</article-title>
          ,
          <source>J. Acoust. Soc. Am. 97 5</source>
          (
          <year>1995</year>
          )
          <fpage>3099</fpage>
          -
          <lpage>3111</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Morrison</surname>
            ,
            <given-names>G. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Assmann</surname>
            ,
            <given-names>P. F.</given-names>
          </string-name>
          :
          <article-title>Vowel inherent spectral change</article-title>
          , Springer (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Abdel-Hamid</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohamed</surname>
            ,
            <given-names>A-R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiang</surname>
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Penn</surname>
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Convolutional neural networks for speech recognition</article-title>
          ,
          <source>IEEE/ACM Transactions On Audio, Speech, and Language Processing</source>
          ,
          <volume>22</volume>
          (
          <issue>10</issue>
          )
          (
          <year>2014</year>
          )
          <fpage>1533</fpage>
          -
          <lpage>1545</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Tóth</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Convolutional deep rectifier neural nets for phone recognition</article-title>
          ,
          <source>INTERSPEECH</source>
          , (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Hestness</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          et al:
          <article-title>Deep learning scaling is predictable, empirically</article-title>
          , https://arxiv.org/abs/1712.00409
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Zahorian</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jagharghi</surname>
            ,
            <given-names>A.J.</given-names>
          </string-name>
          :
          <article-title>Spectral-shape features versus formants as acoustic correlates for vowels</article-title>
          ,
          <source>J. Acoust. Soc. Am. 94 4</source>
          (
          <year>1993</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>CNN−A model 0.0 0.4 0</source>
          .
          <fpage>8</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Garofolo</surname>
            ,
            <given-names>J. S.</given-names>
          </string-name>
          et al:
          <article-title>TIMIT Acoustic-phonetic continuous speech corpus LDC93S1</article-title>
          .
          <article-title>Web download</article-title>
          . Philadelphia: Linguistic Data Consortium (
          <year>1993</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Nearey</surname>
            <given-names>T.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Assmann</surname>
            ,
            <given-names>P.F.</given-names>
          </string-name>
          :
          <article-title>Probabilistic 'sliding template' models for indirect vowel normalization</article-title>
          , In Experimental Approaches to Phonology, Eds. M.J.
          <string-name>
            <surname>Solé</surname>
            ,
            <given-names>P.S.</given-names>
          </string-name>
          <string-name>
            <surname>Beddor</surname>
            , and
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Ohala</surname>
          </string-name>
          , Oxford University Press (
          <year>2007</year>
          )
          <fpage>246</fpage>
          -
          <lpage>269</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Barreda</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nearey</surname>
            ,
            <given-names>T.M.:</given-names>
          </string-name>
          <article-title>The direct and indirect roles of fundamental frequency in vowel perception</article-title>
          ,
          <source>J. Acoust. Soc. Am</source>
          .
          <volume>131</volume>
          (
          <issue>1</issue>
          ) (
          <year>2012</year>
          )
          <fpage>466</fpage>
          -
          <lpage>477</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Šuch</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beňuš</surname>
            ,
            <given-names>Š.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tinajová</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A new method to combine probability estimates from pairwise binary classifiers</article-title>
          ,
          <source>ITAT</source>
          <year>2015</year>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jaitly</surname>
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Towards end-to-end speech recognition with recurrent neural networks</article-title>
          ,
          <source>International conference on machine learning</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Chan</surname>
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jaitly</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            <given-names>Q.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinyals</surname>
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Listen, attend and spell: a neural network for large vocabulary conversational speech recognition</article-title>
          ,
          <source>ICASSP</source>
          <year>2016</year>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Park</surname>
            <given-names>D.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chan</surname>
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chiu</surname>
            <given-names>C-C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zoph</surname>
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cubuk</surname>
            <given-names>E.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            <given-names>Q-V.</given-names>
          </string-name>
          :
          <article-title>SpecAugment: A simple data augmentation method for automatic speech recognition</article-title>
          , arXiv:1904.08779
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Barreda</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          : phonTools R package, version 0.2-2.1 (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [17]
          <article-title>R software for statistical computing</article-title>
          , https://www.rproject.org
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Chollet</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          et al, Keras, https://keras.io (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [19]
          <string-name>
            <surname>James</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hastie</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tibshirani</surname>
            ,
            <given-names>R.:</given-names>
          </string-name>
          <article-title>An introduction to statistical learning</article-title>
          , Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Hastie</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tibshirani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>The Elements of Statistical Learning - Data Mining, Inference, and Prediction</article-title>
          . New York: Springer (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [21]
          <string-name>
            <surname>LeCun</surname>
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Convolutional networks for images, speech, and time-series</article-title>
          . In M. A. Arbib, editor,
          <source>The Handbook of Brain Theory and Neural Networks</source>
          . MIT Press (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Morrison</surname>
            <given-names>G. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nearey</surname>
            ,
            <given-names>T. M.:</given-names>
          </string-name>
          <article-title>Testing theories of spectral change</article-title>
          ,
          <source>J. Acoust. Soc. Am.</source>
          ,
          <volume>122</volume>
          (
          <year>2007</year>
          ),
          <fpage>EL15</fpage>
          -
          <lpage>EL22</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Long short-term memory</article-title>
          ,
          <source>Neural computation</source>
          ,
          <volume>9</volume>
          (
          <issue>8</issue>
          ) (
          <year>1997</year>
          ),
          <fpage>1735</fpage>
          --
          <lpage>1780</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Ladefoged</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          : Vowels and consonants,
          <source>Blackwell</source>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Jeffress</surname>
            ,
            <given-names>L. A.</given-names>
          </string-name>
          : Masking, in J. V. Tobias ed.,
          <source>Foundations of modern auditory theory</source>
          , Academic Press (
          <year>1970</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Huckvale</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>ACCDIST: a metric for comparing speakers' accents</article-title>
          ,
          <source>In Proc. Int. Conf. on Spoken Lang. Proc.</source>
          ,
          Jeju Island, Korea
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Rabiner</surname>
            ,
            <given-names>L. R.:</given-names>
          </string-name>
          <article-title>A tutorial on hidden Markov models and selected applications in speech recognition</article-title>
          ,
          <source>Proceedings of the IEEE</source>
          ,
          <volume>77</volume>
          , no.
          <issue>2</issue>
          (
          <year>1989</year>
          )
          <fpage>257</fpage>
          -
          <lpage>286</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [28]
          <string-name>
            <surname>Hastie</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tibshirani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Classification by pairwise coupling</article-title>
          ,
          <source>The Annals of Statistics</source>
          ,
          <volume>26</volume>
          (
          <issue>2</issue>
          ) (
          <year>1998</year>
          )
          <fpage>451</fpage>
          --
          <lpage>471</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Wu</surname>
            <given-names>T-F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            <given-names>C-J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weng</surname>
            <given-names>R.C.</given-names>
          </string-name>
          :
          <article-title>Probability estimates for multi-class classification by pairwise coupling</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          (
          <year>2004</year>
          )
          <fpage>975</fpage>
          -
          <lpage>1005</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [30]
          <string-name>
            <surname>Šuch</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barreda</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Bayes covariant multi-class classification</article-title>
          ,
          <source>Pattern Recognition Letters</source>
          ,
          <volume>84</volume>
          (
          <year>2016</year>
          )
          <fpage>99</fpage>
          -
          <lpage>106</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [31]
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>C-C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            <given-names>C-J.</given-names>
          </string-name>
          ,
          <article-title>LIBSVM: A library for support vector machines</article-title>
          ,
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          , vol.
          <volume>2</volume>
          , issue
          <issue>3</issue>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [32]
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>W. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yoo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ahn</surname>
            ,
            <given-names>K.H.</given-names>
          </string-name>
          :
          <article-title>Origin of the higher difficulty in the recognition of vowels compared to handwritten digits in deep neural networks</article-title>
          ,
          <source>J. Korean Physical Soc</source>
          .
          <volume>74</volume>
          , No.
          <issue>1</issue>
          (
          <year>2019</year>
          )
          <fpage>12</fpage>
          -
          <lpage>18</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>