A comparison of formant and CNN models for vowel frame recognition

Ondrej Šuch 1,2, Santiago Barreda 3, and Anton Mojsej 2

1 Mathematical Institute, Slovak Academy of Sciences, 955 01 Banská Bystrica, Slovakia
  ondrejs@savbb.sk
2 Žilinská univerzita v Žiline, Univerzitná 8215/1, 010 26 Žilina, Slovakia
3 University of California, Davis, CA 95616, USA
Abstract: Models for the classification of vowels are of continuing interest both in phonetics and for the development of automatic speech recognition (ASR) systems. Phonetics researchers favor linear classifiers based on formants, whereas ASR systems have adopted deep neural networks in recent years. In our work we compare the performance of several kinds of convolutional neural networks (CNN) with linear classifiers on the task of classifying short vowel frames from the TIMIT corpus.
   Our primary hypothesis was that the CNN models would prove significantly more precise than linear models, including during consonant-vowel and vowel-consonant transitions, while obviating the inherent difficulties of formant tracking. We confirmed the hypothesis, although the improvement was modest.
   Our secondary goal was to investigate the possibility of mitigating the loudness sensitivity of CNN models, and to determine whether the mitigation would have a deleterious effect on their classification performance. Our experiments indicate that the loudness-invariant CNN architecture performs equally well as traditional spectrum-based convolutional models.

Keywords: MNIST, convolutional network, pairwise coupling, one-on-one classification, binary classification, dropout

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1    Introduction

The difference in a single English vowel may convey nearly a dozen different meanings (e.g. the hVd examples in [1]). The models that people use to differentiate vowels have been under study for many decades, both in phonetics and in automated speech recognition (ASR) systems. Recognition models can be broadly divided between two paradigms based on their use of the dynamic information in vowel sounds.
   First, one may idealize a vowel as a stationary sound. This is a reasonable approximation, since speakers can stretch almost any vowel to many times its ordinary duration without much effort. This view was adopted, for instance, in the classical studies of English vowels in [2, 1] or [3].
   Second, one may view a vowel as a dynamic entity evolving in time, both to accommodate its acoustic surroundings and to express inherent vowel dynamics. This view fits much better the acoustics observed in conversational speech, where higher articulation rates lead speakers to shorten stationary vowel centers to the point of nearly eliminating them. This view was already hinted at in [2], reiterated in [3], and thoroughly examined in [4].
   A model belonging to the first paradigm is sometimes called a frame recognition model, since it can be trained on extracted vowel windows (frames) and it can categorize a vowel given a previously unseen sound window (frame). We will call a model belonging to the second paradigm a segmental model.
   There are two principal methodologies used to create frame recognition models. In phonetics, the typical approach is to create a linear model based on the first three formants, which are known to be key determinants of vowel quality. The second methodology is a powerful class of classification models that has garnered much attention in automated speech recognition systems, namely convolutional neural networks (CNN) [5, 6].
   The goal of this work is to compare, both qualitatively and quantitatively, the classification results obtained by the classical phonetic approach based on formants and by the newer method based on convolutional neural networks, with an eye on the suitability of CNNs for phonetic analyses. A key potential problem is the known dependence of deep neural network models on large amounts of data [7]. While large amounts of data are commonly used in ASR research, phonetics datasets are usually much smaller.

2    Dataset

The performance of classifiers is easiest to investigate in cases when they are faced with a difficult recognition problem. A well-known pair of vowels that repeatedly shows up in listening experiments as difficult to distinguish is the aa/ao pair [1, 3, 8]. We therefore looked for vowels that could be confused with this pair.
word     aa   ae   ah   ao   ax   eh   er   ow   uh
all      24    0    0  850    0    0    0    2    0
dark    858    0   10    6    0    2    0    0    0
wash    408    2    8  454    0    0    2    0    2
water   104    0   34  724    4    0    0    0   10

Table 1: TIMIT annotations of the first syllables in selected words in SA1 sentences.

word     ah   ao   ax   ix   ow   oy   uh
don't    10    2    2    2  856    0    4
oily      0   90    0    0   66  674    0

Table 2: TIMIT annotations of the first syllables in selected words in SA2 sentences.

[Figure 1 shows six panels (wash, water, don't, oily, all, dark) plotting frequency against internal percentage.]
Figure 1: Time evolution of the medians of the first three formants (solid colors). Data of one speaker (MREB0) is indicated in broken lines.

   We opted to use the TIMIT corpus [9], which has been the focus of many automated speech recognition studies. Compared to single-syllable laboratory recordings, the corpus has the advantage of being closer to natural speech, and it comes with annotations dividing speech into individual phonemes.
   We used the vowel segments of male speakers from the SA1 and SA2 sentences of the TIMIT corpus (we excluded female speakers due to the well-known problem of F1 determination for voices with high fundamental frequency, as well as due to the separate models likely required for different sexes [10, 11]). The segments were the first vocalic segments in the words “dark”, “wash”, “water”, “all” from the SA1 sentences and “don’t”, “oily” from the SA2 sentences. We excluded segments from those instances of the word “oily” which were annotated with three vocalic segments. Tables 1 and 2 show the phonetic annotations of the chosen segments made by the TIMIT authors. The tables indicate that the words are prototypes respectively for the segments /ao/, /aa/, /ow/ and /oy/.
   In this work we used the classes implied by the words, rather than the classes provided by the TIMIT annotators. The primary reason is that those classes are less ambiguous, and possibly more reliable [12]. This approach also matches the current trend in ASR to directly output lexical labels [13, 14, 15]. Our decision makes little difference for the words “all”, “dark”, “don’t”, “oily”, because these are typically annotated with a single TIMIT phoneme label. But it may affect the classification results for the phonemes from the words “wash” and “water”, which were labelled mostly with /aa/ and /ao/ annotations in the TIMIT corpus.
   The analysis window was set to 24 ms (384 samples). For each vowel, altogether 21 frames were extracted, equally spaced within the segment duration as indicated by the TIMIT segmentation boundaries. Thus there were altogether 109410 samples, divided among the TRAIN and TEST datasets approximately in the ratio 2:1. Extraction of formants and LPC spectra was performed using the phonTools R package [16]. We normalized the features for the CNN, LSTM and k-NN models, which we describe in detail in the next section.
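   To make the extraction step concrete, the sketch below computes 21 equally spaced 24 ms (384-sample) frames from one annotated segment and a 192-point log-periodogram for each frame (the representation later fed to model CNN-A). It is an illustration only, written in base R with our own function and variable names; the authors' actual feature extraction relied on the phonTools package [16].

    # Minimal sketch (our naming, not the authors' code): extract 21 equally spaced
    # 384-sample frames from one vowel segment and compute a 192-point
    # log-periodogram for each frame.
    frame_spectra <- function(samples, seg_start, seg_end, n_frames = 21, win = 384) {
      centers <- round(seq(seg_start, seg_end, length.out = n_frames))
      centers <- pmin(pmax(centers, win / 2), length(samples) - win / 2)
      hamming <- 0.54 - 0.46 * cos(2 * pi * (0:(win - 1)) / (win - 1))
      t(sapply(centers, function(ctr) {
        frame <- samples[(ctr - win / 2 + 1):(ctr + win / 2)] * hamming
        20 * log10(Mod(fft(frame))[1:192] + 1e-12)   # log-magnitude spectrum (dB)
      }))                                            # returns an n_frames x 192 matrix
    }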
3    Classification models

Model creation was done using the R software [17] and the Keras machine learning library [18].

3.1    Frame classification models

We considered three types of frame classification models. The first was the logistic generalized linear model (GLM) [19, 20], which is of the kind commonly used in phonetic studies of vowel quality. The second was the convolutional neural network [21], a powerful classifier used in image processing that is also commonly applied in deep learning ASR systems. We expected that a linear model might have a problem capturing the consonant-vowel and vowel-consonant transitions at the starts and ends of the segments. Therefore we also included the k-nearest neighbor (k-NN) non-parametric classifier [19, 20], which should be able to learn nonlinear boundaries.
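   As a concrete illustration of the two formant-based baselines, the sketch below fits a binary logistic GLM on the first three formants and a k-NN classifier on the same features after normalization. The data-frame layout, the column names and the value of k are our assumptions and do not come from the paper.

    library(class)    # provides knn()

    # Hedged sketch of the formant-based baselines for a single word pair.
    # 'train' and 'test' are assumed data frames with formant columns F1, F2, F3
    # and a two-level factor 'word'.
    formant_glm <- glm(word ~ F1 + F2 + F3, family = binomial, data = train)
    glm_prob    <- predict(formant_glm, newdata = test, type = "response")
    glm_pred    <- levels(train$word)[1 + (glm_prob > 0.5)]   # 2nd level if prob > 0.5

    feat <- c("F1", "F2", "F3")
    mu   <- colMeans(train[, feat])
    sdv  <- apply(train[, feat], 2, sd)
    knn_pred <- knn(scale(train[, feat], center = mu, scale = sdv),
                    scale(test[,  feat], center = mu, scale = sdv),
                    cl = train$word, k = 15)                  # k chosen arbitrarily here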
Neural network models.  Currently, the search for neural network architectures is an active research field. We used ad hoc designs, as shown in Figure 2.
   For the convolutional networks we used the architectures shown in the upper part of Figure 2. All convolutional layers were 1-D with a kernel of length 3; we used 5 neurons in the first layer and 10 in the second. ReLU nonlinearity was used in both the convolutional and the dense layers. The dense layer had 32 neurons. Finally, we distinguish several variations of the model based on the provided spectral input:

• CNN-A. The input was a 192-point Fourier log-periodogram.

• CNN-B. The input was an LPC spectrum of order 19, sampled at 192 equally spaced frequencies.
• CNN-C. The input was as in CNN-B, but intensity-normalized per equation (2) below.

• CNN-D. The input was as in CNN-C, but limited to components below 3.5 kHz.

   The purpose of these variations was to try to provide the CNN with data closer to the data that formant models use (e.g. the third formant F3 almost never exceeds 3.5 kHz).

[Figure 2 shows three block diagrams: a frame CNN model (frame spectrum → CONV1 → CONV2 → DENSE → SOFTMAX → classification), an onset + offset CNN model (stacked onset and offset spectra → CONV1 → CONV2 → DENSE → SOFTMAX → classification), and a segmental LSTM model (the initial spectrum, the spectrum at 5%, ..., the final spectrum → an LSTM chain with hidden states h0, ..., h20 → time-distributed layer → SOFTMAX → classification).]
Figure 2: Schematic layout of the three deep neural network architectures considered in the paper.
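   For concreteness, a minimal sketch of the frame CNN in the R interface to Keras [18] follows. The layer sizes (two 1-D convolutions with kernel length 3 and 5 and 10 filters, reading the "neurons" above as filters, a dense layer with 32 ReLU units, and a softmax output over the two words of a pair) follow the description above; everything else (optimizer, number of epochs, object names) is our assumption and not taken from the paper.

    library(keras)

    # Hedged sketch of the frame CNN (CNN-A/CNN-B style) for one binary word pair.
    # x_train: array of shape (n_frames, 192, 1) holding spectra;
    # y_train: one-hot labels, e.g. produced with to_categorical().
    frame_cnn <- keras_model_sequential() %>%
      layer_conv_1d(filters = 5,  kernel_size = 3, activation = "relu",
                    input_shape = c(192, 1)) %>%      # 192 spectral components per frame
      layer_conv_1d(filters = 10, kernel_size = 3, activation = "relu") %>%
      layer_flatten() %>%
      layer_dense(units = 32, activation = "relu") %>%
      layer_dense(units = 2,  activation = "softmax")

    frame_cnn %>% compile(loss = "categorical_crossentropy",
                          optimizer = "adam", metrics = c("accuracy"))
    frame_cnn %>% fit(x_train, y_train, epochs = 20, batch_size = 64)   # illustrative settings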
   Finally, let us explain the rationale behind the intensity-invariant models CNN-C and CNN-D. If the sound source is further away from the listener, the sound is obviously quieter. In the context of frame classification there should therefore be no change in prediction if the sound envelope is uniformly shifted by a constant. Admittedly, the prediction might change if more context were available, because then a decrease of intensity may constitute a phonological contrast (e.g. syllabic stress). Let us note that any model based on formants satisfies this invariance property by virtue of ignoring formant amplitudes. To achieve an equivalent property with a CNN, one can transform the input tensor. Thus, instead of supplying the input vector s representing the log-magnitudes of the spectral components

    s = (s1, s2, ..., sM)                                              (1)

we provide as input the vector N(s) defined as follows:

    N(s) := (s2 − s1, s3 − s1, ..., sM − s1)                           (2)

The key feature of this transform is that sounds differing purely in intensity, say by d decibels, map to the same input vector, since

    N(s1 + d, s2 + d, ..., sM + d) = N(s1, s2, ..., sM)                (3)

Moreover, the topographic distribution of the input vector components, which is important for a CNN, is maintained, and only a very low-frequency component is missing, which should not be relevant for vowel identification.
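   In code, the transform N of equation (2) is a one-liner. The sketch below (our own naming) applies it to a single spectrum and, in a comment, row-wise to a matrix of spectra.

    # N(s) from equation (2): subtract the first log-magnitude from the remaining ones,
    # so that a uniform intensity shift of d dB cancels out, cf. equation (3).
    normalize_intensity <- function(s) s[-1] - s[1]

    # Applied row-wise to an n x 192 matrix of spectra it yields an n x 191 matrix:
    # spectra_norm <- t(apply(spectra, 1, normalize_intensity))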
3.2    Segment classification models

We considered two kinds of segmental models. First, we supplied frame information from two frames: the onset at the 10% position and the offset at the 90% position. This is inspired by the research of Nearey and Morrison [22], who indicated that onset + offset data fits well with perceptual experiments. We considered two implementations of this model: a CNN with stacked onset+offset spectral information, as shown in the upper right of Figure 2, and a logistic regression (GLM) model.
   Second, we trained an LSTM (long short-term memory) network [23] on Fourier and LPC spectra (denoted LSTM-A and LSTM-B, respectively), with the architecture shown at the bottom of Figure 2. The LSTM layer used 4 units.
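   A minimal sketch of the segmental LSTM model in the R Keras interface is given below. We assume 21 spectra per segment (as extracted in Section 2), an LSTM layer with 4 units as stated above, and a softmax output over the two words of a pair. Reducing the sequence to the last hidden state, rather than using the time-distributed layer drawn in Figure 2, is a simplification of ours, as are the optimizer and the object names.

    library(keras)

    # Hedged sketch of the segmental LSTM-A/LSTM-B model for one binary word pair.
    # x_train: array of shape (n_segments, 21, 192), one spectrum per frame position.
    segment_lstm <- keras_model_sequential() %>%
      layer_lstm(units = 4, input_shape = c(21, 192)) %>%   # 21 frames x 192 components
      layer_dense(units = 2, activation = "softmax")

    segment_lstm %>% compile(loss = "categorical_crossentropy",
                             optimizer = "adam", metrics = c("accuracy"))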


4    Results

We have opted for a pairwise classification paradigm, where we classify frames (or segments) from one pair of words at a time. This decision is motivated primarily by our desire to elucidate the strengths and weaknesses of the various models, which may be confounded in the multiclass setting. We note that the paradigm has analogues and applications in phonetics (the concept of minimal word pairs [24], the forced A/B choice experimental design [25], or comparing accents by contrasting pairs of vowels in various words [26]).
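   Organisationally, the pairwise paradigm amounts to training one binary classifier per unordered pair of the six words, as sketched below; fit_pair and frames are placeholders of ours for any of the model constructors of Section 3 and for a data frame of extracted frames.

    # The six words give choose(6, 2) = 15 unordered pairs; one binary classifier
    # is trained on the data of each pair.
    words  <- c("all", "dark", "wash", "water", "don't", "oily")
    pairs  <- combn(words, 2)                                  # 2 x 15 matrix
    models <- apply(pairs, 2, function(p) fit_pair(subset(frames, word %in% p)))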
4.1    Summary of accuracy with respect to location

Interpretability is highly desirable for recognition models in phonetics. For frame classification models it is possible to analyze performance based on the position of the frame within the vowel. A summary analysis is presented in Figure 3, from which we draw the following conclusions:
• CNN models are uniformly superior to the other two classes of models (k-NN and GLM).

• There is negligible performance difference among the models CNN-A through CNN-D.

• The formant models (GLM as well as k-NN) in several cases closely follow the convolutional neural networks (e.g. all-dark, dark-wash, or dark-oily). This indicates that formant positions are very often sufficient for classification.

4.2    Accuracy improvement with segmental models

We also trained segmental models in order to gauge the effect of restricting classification to frames rather than segments. The results are summarized in Table 3. Clearly, segmental models can make good use of the extra information, since all six segments are well classified. The most likely explanation is the ability of segmental models to discriminate consonant-vowel/vowel-consonant transitions.

Paradigm                       Model type                         Model     Average accuracy
Frame classification models    convolutional neural network       CNN-A     83.2 %
                               models                             CNN-B     83.2 %
                                                                  CNN-C     82.1 %
                                                                  CNN-D     83.1 %
                               formant-based models               k-NN      74 %
                                                                  GLM       73.9 %
Segmental models               onset+offset models                CNN-A     96.8 %
                                                                  GLM       95.4 %
                               recursive neural network models    LSTM-A    96.9 %
                                                                  LSTM-B    96.2 %

Table 3: Average accuracy of the frame and segmental models. Note that for frame classification the input to CNN-A is a 1-D tensor consisting of a single spectrum, whereas in segmental classification we joined the onset and offset spectra to form a 2-D tensor, as shown in the upper right of Figure 2.

4.3    Multi-class classification

For ASR applications, especially for systems based on hidden Markov models [27], it is desirable to obtain multi-class likelihoods. The passage from pairwise models to multi-class ones can be achieved by any of several pairwise-coupling techniques [28, 29, 30]. We performed this coupling using the second method suggested in [29], which is widely used in machine learning [31]. The multi-class models were composed of 6 pairwise models trained on frames from the first syllables of all possible pairs of the words “all”, “dark”, “don’t”, and “oily”. We used those four words because they are the prototypes of the four phonemes /ao/, /aa/, /ow/, /oy/ that account for most TIMIT annotations of our dataset (see Tables 1 and 2). The likelihood plots are shown in Figure 4.
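   For reference, the sketch below implements our reading of the second coupling method of [29] (the formulation also used in LIBSVM [31]): given a k x k matrix r of pairwise estimates, with r[i, j] approximating P(class i | class i or j) and r[i, j] + r[j, i] = 1, the class probabilities p solve a small constrained least-squares problem via its KKT system. The function name and the direct call to solve() are ours; this illustrates the technique rather than reproducing the authors' code.

    # Hedged sketch of pairwise coupling (second method of Wu, Lin and Weng [29]).
    couple_pairwise <- function(r) {
      k <- nrow(r)
      Q <- matrix(0, k, k)
      for (i in 1:k) {
        Q[i, i] <- sum(r[-i, i]^2)                    # Q_ii = sum_{j != i} r_ji^2
        for (j in (1:k)[-i]) Q[i, j] <- -r[j, i] * r[i, j]
      }
      # Solve the KKT system [Q 1; 1' 0] (p, b) = (0, 1) of the quadratic program
      A   <- rbind(cbind(Q, 1), c(rep(1, k), 0))
      sol <- solve(A, c(rep(0, k), 1))
      p   <- sol[1:k]
      p / sum(p)                                      # multi-class probability estimates
    }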
   It is hard to argue which model predicted likelihoods that are more correct, since there is no ground truth. Subjectively, one may feel that the CNN models are too certain about their predictions, especially since their predictions are considerably less smooth than those of the formant-based GLM model. The small size of the dataset may be a possible cause.

5    Conclusion

Our experiments indicate that it is feasible to construct CNN models for frame classification from medium-sized phonetic corpora. The resulting models showed performance superior to formant-based models even without extensive parameter optimization, which is in accordance with the findings of a recent study on Korean vowels [32]. It is also possible to make CNN models intensity-invariant, akin to formant-based models, without a noticeable loss of performance. The only downside we observed (Section 4.3) was that the likelihoods predicted by the CNN models lacked temporal smoothness compared to the formant model.

Acknowledgment

Work on this paper was partially supported by grant VEGA 2/0144/18. We are thankful to Peter Tarábek for his suggestion to use the stacked input architecture for the onset+offset CNN models, which proved much more accurate than the previously investigated branched architecture.
[Figure 3 shows fifteen panels, one per pair of words (all-dark, all-don't, all-oily, all-wash, all-water, dark-don't, dark-oily, dark-wash, dark-water, don't-oily, don't-wash, don't-water, oily-wash, oily-water, wash-water), plotting average accuracy against internal percentage for the models GLM, CNN-A, CNN-B, CNN-C, CNN-D and k-NN.]
Figure 3: Average accuracy of various classification models based on the position of the frame within a vowel, conditioned on pairs of words. The dotted line indicates the 50% accuracy rate.
[Figure 4 shows, for each of the models CNN-A, CNN-D and GLM, six panels (oily, wash, water, all, dark, don't) plotting likelihood against internal percentage.]
Figure 4: Frame likelihoods produced by the multiclass models for speaker MREB0.

References

[1] Peterson, G. E., Barney, H. L.: Control methods used in a study of the vowels, J. Acoust. Soc. Am. 24 (1952) 175–184
[2] Steinberg, J. C., Potter, R. K.: Toward the specification of speech, J. Acoust. Soc. Am. 22 (1950)
[3] Hillenbrand, J., Getty, L. A., Clark, M. J., Wheeler, K.: Acoustic characteristics of American English vowels, J. Acoust. Soc. Am. 97 (5) (1995) 3099–3111
[4] Morrison, G. S., Assmann, P. F.: Vowel inherent spectral change, Springer (2012)
[5] Abdel-Hamid, O., Mohamed, A.-R., Jiang, H., Deng, L., Penn, G., Yu, D.: Convolutional neural networks for speech recognition, IEEE/ACM Transactions on Audio, Speech, and Language Processing 22 (10) (2014) 1533–1545
[6] Tóth, L.: Convolutional deep rectifier neural nets for phone recognition, INTERSPEECH (2013)
[7] Hestness, J. et al.: Deep learning scaling is predictable, empirically, https://arxiv.org/abs/1712.00409
[8] Zahorian, S. A., Jagharghi, A. J.: Spectral-shape features versus formants as acoustic correlates for vowels, J. Acoust. Soc. Am. 94 (4) (1993)
[9] Garofolo, J. S. et al.: TIMIT Acoustic-phonetic continuous speech corpus LDC93S1. Web download. Philadelphia: Linguistic Data Consortium (1993)
[10] Nearey, T. M., Assmann, P. F.: Probabilistic 'sliding-template' models for indirect vowel normalization, in Experimental Approaches to Phonology, eds. M. J. Solé, P. S. Beddor, and M. Ohala, Oxford University Press (2007) 246–269
[11] Barreda, S., Nearey, T. M.: The direct and indirect roles of fundamental frequency in vowel perception, J. Acoust. Soc. Am. 131 (1) (2012) 466–477
[12] Šuch, O., Beňuš, Š., Tinajová, A.: A new method to combine probability estimates from pairwise binary classifiers, ITAT 2015 (2015)
[13] Graves, A., Jaitly, N.: Towards end-to-end speech recognition with recurrent neural networks, International Conference on Machine Learning (2014)
[14] Chan, W., Jaitly, N., Le, Q. V., Vinyals, O.: Listen, attend and spell: a neural network for large vocabulary conversational speech recognition, ICASSP 2016 (2016)
[15] Park, D. S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E. D., Le, Q. V.: SpecAugment: a simple data augmentation method for automatic speech recognition, arXiv:1904.08779 (2019)
[16] Barreda, S.: phonTools R package, version 0.2-2.1 (2015)
[17] R software for statistical computing, https://www.r-project.org
[18] Chollet, F. et al.: Keras, https://keras.io (2015)
[19] James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction to Statistical Learning, Springer (2013)
[20] Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning – Data Mining, Inference, and Prediction, Springer, New York (2009)
[21] LeCun, Y., Bengio, Y.: Convolutional networks for images, speech, and time-series, in M. A. Arbib (ed.), The Handbook of Brain Theory and Neural Networks, MIT Press (1995)
[22] Morrison, G. S., Nearey, T. M.: Testing theories of spectral change, J. Acoust. Soc. Am. 122 (2007) EL15–EL22
[23] Hochreiter, S., Schmidhuber, J.: Long short-term memory, Neural Computation 9 (8) (1997) 1735–1780
[24] Ladefoged, P.: Vowels and Consonants, Blackwell (2001)
[25] Jeffress, L. A.: Masking, in J. V. Tobias (ed.), Foundations of Modern Auditory Theory, Academic Press (1970)
[26] Huckvale, M.: ACCDIST: a metric for comparing speakers' accents, in Proc. Int. Conf. on Spoken Language Processing, Jeju Island, Korea (2004)
[27] Rabiner, L. R.: A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE 77 (2) (1989) 257–286
[28] Hastie, T., Tibshirani, R.: Classification by pairwise coupling, The Annals of Statistics 26 (2) (1998) 451–471
[29] Wu, T.-F., Lin, C.-J., Weng, R. C.: Probability estimates for multi-class classification by pairwise coupling, Journal of Machine Learning Research 5 (2004) 975–1005
[30] Šuch, O., Barreda, S.: Bayes covariant multi-class classification, Pattern Recognition Letters 84 (2016) 99–106
[31] Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines, ACM Transactions on Intelligent Systems and Technology 2 (3) (2011)
[32] Kim, H., Lee, W. S., Yoo, J., Park, M., Ahn, K. H.: Origin of the higher difficulty in the recognition of vowels compared to handwritten digits in deep neural networks, J. Korean Physical Soc. 74 (1) (2019) 12–18