=Paper=
{{Paper
|id=Vol-2473/paper30
|storemode=property
|title=A Comparison of Formant and CNN Models for Vowel Frame Recognition
|pdfUrl=https://ceur-ws.org/Vol-2473/paper30.pdf
|volume=Vol-2473
|authors=Ondrej Šuch,Santiago Barreda,Anton Mojsej
|dblpUrl=https://dblp.org/rec/conf/itat/SuchBM19
}}
==A Comparison of Formant and CNN Models for Vowel Frame Recognition==
Ondrej Šuch (1,2), Santiago Barreda (3), Anton Mojsej (2)

1 Mathematical Institute, Slovak Academy of Sciences, 955 01 Banská Bystrica, Slovakia, ondrejs@savbb.sk
2 Žilinská univerzita v Žiline, Univerzitná 8215/1, 010 26 Žilina
3 University of California, Davis, CA 95616, USA

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract: Models for the classification of vowels are of continuing interest both in phonetics and for the development of automatic speech recognition (ASR) systems. Phonetics researchers favor linear classifiers based on formants, whereas ASR systems have adopted deep neural networks in recent years. In our work we compare the performance of several kinds of convolutional neural networks (CNN) with a linear classifier on the task of classifying short vowel frames from the TIMIT corpus.

Our primary hypothesis was that the CNN models would prove significantly more precise than linear models, including during consonant-vowel and vowel-consonant transitions, while obviating the inherent difficulties of formant tracking. We confirmed the hypothesis, although the improvement was modest.

Our secondary goal was to investigate the possibility of mitigating the loudness sensitivity of CNN models, and to determine whether the mitigation would have a deleterious effect on their classification performance. Our experiments indicate that the loudness-invariant CNN architecture performs as well as traditional spectrum-based convolutional models.

Keywords: MNIST, convolutional network, pairwise coupling, one-on-one classification, binary classification, dropout

1 Introduction

The difference in a single English vowel may convey nearly a dozen different meanings (e.g. the hVd examples in [1]). The models that people use to differentiate vowels have been under study for many decades, both in phonetics and in automated speech recognition (ASR) systems. Recognition models can be broadly divided between two paradigms based on their use of the dynamic information in vowel sounds.

First, one may idealize a vowel as a stationary sound. This is a reasonable approximation, since speakers can stretch almost any vowel to many times its ordinary duration without much effort. This view was adopted, for instance, in classical studies of English vowels in [2, 1] or [3].

Second, one may view a vowel as a dynamic entity evolving in time, both to accommodate its acoustic surroundings and to express inherent vowel dynamics. This view fits much better the acoustics observed in conversational speech, where higher articulation rates lead speakers to shorten stationary vowel centers to the point of nearly eliminating them. This view was already hinted at in [2], reiterated in [3], and thoroughly examined in [4].

A model belonging to the first paradigm is sometimes called a frame recognition model, since it can be trained on extracted vowel windows (frames) and can categorize a vowel given a previously unseen sound window (frame). We will call a model belonging to the second paradigm a segmental model.

There are two principal methodologies used to create frame recognition models. In phonetics, the typical approach is to create a linear model based on the first three formants, which are known to be key determinants of vowel quality. The second methodology is a powerful class of classification models that has garnered much attention in automated speech recognition systems, namely convolutional neural networks (CNN) [5, 6].

The goal of this work is to compare, both qualitatively and quantitatively, classification results obtained by the classical phonetic approach based on formants and by the newer method based on convolutional neural networks, with an eye on the suitability of adopting CNNs for phonetic analyses. A key potential problem is the known dependence of deep neural network models on large amounts of data [7].
While large amounts of data are commonly used in ASR research, phonetics datasets are usually much smaller.

2 Dataset

The performance of classifiers is easiest to investigate when they are faced with a difficult recognition problem. A well-known pair of vowels that repeatedly shows up in listening experiments as difficult to distinguish is the aa/ao pair [1, 3, 8]. Therefore we looked for vowels that could be confused with this pair of vowels.

We opted to use the TIMIT corpus [9], which has been the focus of many automated speech recognition studies. Compared to single-syllable laboratory recordings, the corpus has the advantage of being closer to natural speech, and it comes with annotations dividing speech into individual phonemes.

We used vowel segments of male speakers from the SA1 and SA2 sentences of the TIMIT corpus (we excluded female speakers due to the well-known problem of F1 determination for voices with high fundamental frequency, as well as due to the separate models likely required for different sexes [10, 11]). The segments were the first vocalic segments in the words "dark", "wash", "water", "all" from SA1 sentences and "don't", "oily" from SA2 sentences. We excluded segments from those instances of the word "oily" which were annotated with three vocalic segments. Tables 1 and 2 indicate the phonetic annotation of the chosen segments by the TIMIT authors. The tables indicate that the words "all", "dark", "don't" and "oily" are prototypes for the segments /ao/, /aa/, /ow/ and /oy/, respectively.

Table 1: TIMIT annotations of the first syllables in selected words in SA1 sentences.

word    aa   ae  ah   ao  ax  eh  er  ow  uh
all     24    0   0  850   0   0   0   2   0
dark   858    0  10    6   0   2   0   0   0
wash   408    2   8  454   0   0   2   0   2
water  104    0  34  724   4   0   0   0  10

Table 2: TIMIT annotations of the first syllables in selected words in SA2 sentences.

word    ah  ao  ax  ix   ow   oy  uh
don't   10   2   2   2  856    0   4
oily     0  90   0   0   66  674   0

Figure 1: Time evolution of the medians of the first three formants (solid colors). Data of one speaker (MREB0) is indicated in broken lines.

In this work we used the classes implied by the words rather than the classes provided by the TIMIT annotators. The primary reason is that the word-based classes are less ambiguous, and possibly more reliable [12]. This approach also matches the current trend in ASR to directly output lexical labels [13, 14, 15]. Our decision makes little difference for the words "all", "dark", "don't", "oily", because these are typically annotated with a single TIMIT phoneme label. But it may affect classification results for the phonemes from the words "wash", "water", which were labelled mostly with /aa/ and /ao/ annotations in the TIMIT corpus.

The analysis window was set to 24 ms (384 samples). For each vowel, altogether 21 frames were extracted, equally spaced within the segment duration as indicated by the TIMIT segmentation boundaries. Thus there were altogether 109410 samples, divided among the TRAIN and TEST datasets in a ratio of approximately 2:1. Extraction of formants and LPC spectra was performed using the phonTools R package [16]. We normalized features for the CNN, LSTM and k-NN models, as we describe in detail in the next section.
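To make the frame extraction concrete, the sketch below shows one possible way to cut a vowel segment into 21 equally spaced 24 ms (384-sample) frames and compute a 192-point Fourier log-periodogram for each, using base R. The function names, the assumption of a 16 kHz waveform vector, the Hann windowing and the dB scaling are our assumptions and are not taken from the paper; formant and LPC extraction with phonTools is not reproduced here.

<pre>
# Hann window helper (kept explicit to avoid extra dependencies).
hann_window <- function(n) 0.5 - 0.5 * cos(2 * pi * (0:(n - 1)) / (n - 1))

# Sketch: cut one vowel segment into 21 equally spaced 24 ms (384-sample)
# frames and compute a 192-point Fourier log-periodogram for each frame.
# `samples` is assumed to be a numeric vector holding the 16 kHz waveform of
# the segment; windowing and scaling details are assumptions, not the paper's.
frame_log_periodograms <- function(samples, n_frames = 21, win_len = 384) {
  half <- win_len / 2
  centers <- round(seq(half, length(samples) - half, length.out = n_frames))
  t(sapply(centers, function(ctr) {
    frame <- samples[(ctr - half + 1):(ctr + half)]
    mag   <- abs(fft(frame * hann_window(win_len)))[1:half]   # 192 spectral bins
    10 * log10(mag^2 + .Machine$double.eps)                   # log-magnitude in dB
  }))
}
</pre>

Each row of the returned matrix then corresponds to one frame's 192-point spectrum; whether the authors applied a window function or a different scaling is not stated in the paper, so these details should be treated as placeholders.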
3 Classification models

Model creation was done using the R software environment [17] and the Keras machine learning library [18].

3.1 Frame classification models

We considered three types of frame classification models. The first was the logistic generalized linear model (GLM) [19, 20], of the kind commonly used in phonetic studies of vowel quality. The second was a convolutional neural network [21], a powerful classifier used in image processing that is also commonly applied in deep learning ASR systems. We expected that a linear model may have a problem capturing the consonant-vowel and vowel-consonant transitions at the starts and ends of the segments. Therefore we also included the k-nearest neighbor (k-NN) non-parametric classifier [19, 20], which should be able to learn nonlinear boundaries.
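As an illustration of the formant-based baseline, a minimal sketch of a pairwise logistic GLM on the first three formants is shown below. The data frame and column names (frame_data, f1, f2, f3, word) are hypothetical, and the paper does not state whether formant values were transformed (e.g. log-scaled) before fitting.

<pre>
# Sketch of the formant-based pairwise classifier (hypothetical column names).
# `frame_data` is assumed to be a data frame with one row per frame, columns
# f1, f2, f3 (formant frequencies in Hz) and word (a factor with two levels,
# e.g. "all" vs. "dark").
pair    <- droplevels(subset(frame_data, word %in% c("all", "dark")))
glm_fit <- glm(word ~ f1 + f2 + f3, family = binomial, data = pair)

# Predicted probability of the second factor level for a set of frames.
p_hat <- predict(glm_fit, newdata = pair, type = "response")

# A k-NN counterpart on the same (normalized) features could be fit analogously,
# e.g. class::knn(train_features, test_features, cl = train_labels, k = 15),
# where the choice k = 15 is purely illustrative.
</pre>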
Neural network models. Currently, the search for neural network architectures is an active research field. We used ad hoc designs as shown in Figure 2.

Figure 2: Schematic layout of the three deep neural network architectures considered in the paper (frame CNN model, onset+offset CNN model, and segmental LSTM model).

For convolutional networks we used the architectures shown in the upper part of Figure 2. All convolutional layers were 1-D with a kernel of length 3; we used 5 neurons on the first layer and 10 on the second. ReLU nonlinearity was used in both the convolutional and dense layers. The dense layer had 32 neurons. Finally, we distinguish several variations of the model based on the provided spectral input:

• CNN-A. The input was a 192-point Fourier log-periodogram.

• CNN-B. The input was an LPC spectrum of order 19, sampled at 192 equally spaced frequencies.

• CNN-C. The input was as in CNN-B, but intensity-normalized per equation (2) below.

• CNN-D. The input was as in CNN-C, but limited to components below 3.5 kHz.

The purpose of the variations was to try providing the CNN with data that is closer to the data that formant models use (e.g. the third formant F3 almost never exceeds 3.5 kHz).

Finally, let us explain the rationale behind the intensity-invariant models CNN-C and CNN-D. If the sound source is further away from the listener, the sound is obviously quieter. In the context of frame classification, there should be no change in prediction if the sound envelope is uniformly shifted by an equal constant. Admittedly, the prediction may change if more context were available, because then a decrease of intensity may constitute a phonological contrast (e.g. syllabic stress). Let us note that any model based on formants satisfies this invariance property by virtue of ignoring formant amplitudes. To achieve an equivalent property with a CNN, one can transform the input tensor. Thus instead of supplying the input vector s representing log-magnitudes of spectral components

s = (s_1, s_2, ..., s_M)    (1)

we provide as input the vector N(s) defined as follows:

N(s) := (s_2 − s_1, s_3 − s_1, ..., s_M − s_1)    (2)

The key feature of this transform is that sounds differing purely in intensity, say by d decibels, map to the same input vector, since

N(s_1 + d, s_2 + d, ..., s_M + d) = N(s_1, s_2, ..., s_M)    (3)

Moreover, the topographic distribution of the input vector components, which is important for a CNN, is maintained, and only a very low frequency component is missing, which should not be relevant for vowel identification.
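The following sketch shows what a frame CNN with the stated layer sizes might look like in the R interface to Keras, together with the intensity normalization of equation (2). The loss, optimizer and other training settings are not given in the paper and are assumptions made here for illustration.

<pre>
library(keras)

# Intensity normalization per equation (2): differences against the first bin,
# so that a uniform shift of the log-spectrum (a change in loudness) cancels out.
normalize_intensity <- function(s) s[-1] - s[1]

# Frame CNN (CNN-A/B style): two 1-D convolutional layers with kernel length 3
# and 5/10 filters, a dense layer with 32 ReLU units, and a softmax output for
# one pairwise (two-class) problem. For CNN-C/D the input would be
# normalize_intensity() applied to each spectrum (length 191 instead of 192).
model <- keras_model_sequential() %>%
  layer_conv_1d(filters = 5, kernel_size = 3, activation = "relu",
                input_shape = c(192, 1)) %>%
  layer_conv_1d(filters = 10, kernel_size = 3, activation = "relu") %>%
  layer_flatten() %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 2, activation = "softmax")

model %>% compile(
  loss      = "categorical_crossentropy",  # assumption: not stated in the paper
  optimizer = "adam",                      # assumption
  metrics   = "accuracy"
)
</pre>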
3.2 Segment classification models

We considered two kinds of segmental models. First, we supplied frame information from two frames: the onset at the 10% position and the offset at the 90% position. This is inspired by the research of Nearey and Morrison [22], who indicated that onset+offset data fits well with perceptual experiments. We considered two implementations of this model: a CNN with stacked onset+offset spectral information, as shown in the upper right of Figure 2, and a logistic regression (GLM) model.

Second, we trained an LSTM (long short-term memory) network [23] with Fourier and LPC spectra (denoted LSTM-A and LSTM-B, respectively) with the architecture shown at the bottom of Figure 2. The LSTM layer used 4 units.
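A sketch of the segmental LSTM classifier in the R interface to Keras is given below. It assumes that the 21 per-frame spectra of a segment are stacked into a (21, 192) input tensor; the training configuration is again an assumption rather than a detail taken from the paper.

<pre>
# Segmental LSTM model (LSTM-A/B style): the 21 frame spectra of one vowel
# segment form a (21, 192) sequence; a 4-unit LSTM layer feeds a softmax
# output for one pairwise problem.
lstm_model <- keras_model_sequential() %>%
  layer_lstm(units = 4, input_shape = c(21, 192)) %>%
  layer_dense(units = 2, activation = "softmax")

lstm_model %>% compile(
  loss      = "categorical_crossentropy",  # assumption
  optimizer = "adam",                      # assumption
  metrics   = "accuracy"
)

# The onset+offset CNN variant instead stacks the 10% and 90% frame spectra,
# e.g. as a two-channel 1-D input with input_shape = c(192, 2); this is our
# reading of the "stacked" architecture in Figure 2, not a confirmed detail.
</pre>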
4 Results

We have opted for the pairwise classification paradigm, where we classify frames (or segments) from a pair of words at a time. This decision is motivated primarily by our desire to elucidate the strengths and weaknesses of the various models, which may be confounded in the multiclass setting. We note that the paradigm has analogues and applications in phonetics (the concept of minimal word pairs [24], the forced A/B choice experimental design [25], or comparing accents by contrasting pairs of vowels in various words [26]).

4.1 Summary of accuracy with respect to location

Interpretability is highly desirable for recognition models in phonetics. For frame classification models it is possible to analyze their performance based on the position of the frame within the vowel. A summary analysis is presented in Figure 3, from which we draw the following conclusions:

• CNN models are uniformly superior to the other two classes of models (k-NN and GLM).

• There is negligible performance difference among models CNN-A through CNN-D.

• Formant models (GLM as well as k-NN) in several cases closely follow the convolutional neural networks (e.g. all-dark, dark-wash, or dark-oily). This indicates that very often formant positions are sufficient for classification.

Figure 3: Average accuracy of various classification models (GLM, k-NN, CNN-A through CNN-D) based on the position of the frame within a vowel, conditioned on pairs of words. The dotted line indicates the 50% accuracy rate.

Table 3: Average accuracy of frame and segmental models. Note that for frame classification the input to CNN-A is a 1-D tensor consisting of a spectrum, whereas in segmental classification we joined the onset and offset spectra to form a 2-D tensor as shown in the upper right of Figure 2.

Paradigm                      Model type                         Model    Average accuracy
Frame classification models   convolutional neural network       CNN-A    83.2%
                              models                             CNN-B    83.2%
                                                                 CNN-C    82.1%
                                                                 CNN-D    83.1%
                              formant-based models               k-NN     74%
                                                                 GLM      73.9%
Segmental models              onset+offset models                CNN-A    96.8%
                                                                 GLM      95.4%
                              recurrent neural network models    LSTM-A   96.9%
                                                                 LSTM-B   96.2%

4.2 Accuracy improvement with segmental models

We also trained segmental models in order to gauge the effect of restricting classification to frames rather than segments. The results are summarized in Table 3. Clearly, segmental models can make good use of the extra information, since all six segments are well classified. The most likely explanation is the ability of segmental models to discriminate consonant-vowel/vowel-consonant transitions.

4.3 Multi-class classification

For ASR applications, especially for systems based on hidden Markov models [27], it is desirable to obtain multi-class likelihoods. The passage from pairwise models to multi-class ones can be achieved by any of several pairwise-coupling techniques [28, 29, 30]. We performed this coupling using the second method suggested in [29], which is widely used in machine learning [31]. The multi-class models were composed of 6 pairwise models trained on frames from the first syllables of all possible pairs of the words "all", "dark", "don't", and "oily". We used those four words because they are the prototypes of the four phonemes /ao/, /aa/, /ow/, /oy/ that account for most of the TIMIT annotations of our dataset (see Tables 1 and 2). The likelihood plots are shown in Figure 4.
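For readers unfamiliar with pairwise coupling, the sketch below implements our reading of the second method of Wu, Lin and Weng [29] (the fixed-point iteration over a quadratic objective); it should be treated as an illustrative reconstruction rather than the exact code used in our experiments.

<pre>
# Sketch of pairwise coupling, following our reading of the second method of
# Wu, Lin and Weng [29]. Input r is a k x k matrix of pairwise estimates with
# r[i, j] ~ P(class i | class i or j) and r[i, j] + r[j, i] = 1 (diagonal unused).
# Returns multi-class probabilities p with sum(p) = 1.
couple_pairwise <- function(r, max_iter = 100, tol = 1e-8) {
  k <- nrow(r)
  Q <- -r * t(r)                                   # Q[i, j] = -r[i, j] * r[j, i] for i != j
  diag(Q) <- sapply(seq_len(k), function(i) sum(r[-i, i]^2))  # Q[i, i] = sum_{j != i} r[j, i]^2
  p <- rep(1 / k, k)                               # start from the uniform distribution
  for (iter in seq_len(max_iter)) {
    p_old <- p
    for (i in seq_len(k)) {
      pQp  <- as.numeric(p %*% Q %*% p)
      p[i] <- (-sum(Q[i, -i] * p[-i]) + pQp) / Q[i, i]
      p    <- p / sum(p)                           # renormalize after each update
    }
    if (max(abs(p - p_old)) < tol) break
  }
  p
}

# Usage: call once per frame with the 4 x 4 matrix of pairwise likelihoods for
# the word classes "all", "dark", "don't", "oily" to obtain the four coupled
# class probabilities plotted in Figure 4.
</pre>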
It is hard to argue which model predicted likelihoods that are more correct, since there is no ground truth. Subjectively, one may feel that the CNN models are too certain about their predictions, especially since their predictions are considerably less smooth than those of the formant-based GLM model. The small size of the dataset may be a possible cause.

Figure 4: Frame likelihoods produced by multiclass models (CNN-A, CNN-D, and GLM) for speaker MREB0.

5 Conclusion

Our experiments indicate that it is feasible to construct CNN models for frame classification on medium-size phonetic corpora. The resulting models showed performance superior to formant-based models even without extensive parameter optimization, which is in accordance with the findings of a recent study on Korean vowels [32]. It is also possible to make CNN models intensity-invariant, akin to formant-based models, without noticeable loss of performance. The only downside we observed (Section 4.3) was that the likelihoods predicted by the CNN models lacked temporal smoothness compared to the formant model.

Acknowledgment

Work on this paper was partially supported by grant VEGA 2/0144/18. We are thankful to Peter Tarábek for his suggestion to use the stacked input architecture for the onset+offset CNN models, which proved much more accurate than the previously investigated branched architecture.

References

[1] Peterson, G.E., Barney, H.L.: Control methods used in a study of the vowels, J. Acoust. Soc. Am. 24 (1952) 175–184
[2] Steinberg, J.C., Potter, R.K.: Toward the specification of speech, J. Acoust. Soc. Am. 22 (1950)
[3] Hillenbrand, J., Getty, L.A., Clark, M.J., Wheeler, K.: Acoustic characteristics of American English vowels, J. Acoust. Soc. Am. 97(5) (1995) 3099–3111
[4] Morrison, G.S., Assmann, P.F.: Vowel inherent spectral change, Springer (2012)
[5] Abdel-Hamid, O., Mohamed, A.-R., Jiang, H., Deng, L., Penn, G., Yu, D.: Convolutional neural networks for speech recognition, IEEE/ACM Transactions on Audio, Speech, and Language Processing 22(10) (2014) 1533–1545
[6] Tóth, L.: Convolutional deep rectifier neural nets for phone recognition, INTERSPEECH (2013)
[7] Hestness, J. et al.: Deep learning scaling is predictable, empirically, https://arxiv.org/abs/1712.00409
[8] Zahorian, S.A., Jagharghi, A.J.: Spectral-shape features versus formants as acoustic correlates for vowels, J. Acoust. Soc. Am. 94(4) (1993)
[9] Garofolo, J.S. et al.: TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1. Web download. Philadelphia: Linguistic Data Consortium (1993)
[10] Nearey, T.M., Assmann, P.F.: Probabilistic 'sliding-template' models for indirect vowel normalization, in Experimental Approaches to Phonology, eds. M.J. Solé, P.S. Beddor, M. Ohala, Oxford University Press (2007) 246–269
[11] Barreda, S., Nearey, T.M.: The direct and indirect roles of fundamental frequency in vowel perception, J. Acoust. Soc. Am. 131(1) (2012) 466–477
[12] Šuch, O., Beňuš, Š., Tinajová, A.: A new method to combine probability estimates from pairwise binary classifiers, ITAT 2015 (2015)
[13] Graves, A., Jaitly, N.: Towards end-to-end speech recognition with recurrent neural networks, International Conference on Machine Learning (2014)
[14] Chan, W., Jaitly, N., Le, Q.V., Vinyals, O.: Listen, attend and spell: a neural network for large vocabulary conversational speech recognition, ICASSP 2016 (2016)
[15] Park, D.S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E.D., Le, Q.V.: SpecAugment: a simple data augmentation method for automatic speech recognition, arXiv:1904.08779 (2019)
[16] Barreda, S.: phonTools R package, version 0.2-2.1 (2015)
[17] R software for statistical computing, https://www.r-project.org
[18] Chollet, F. et al.: Keras, https://keras.io (2015)
[19] James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction to Statistical Learning, Springer (2013)
[20] Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York (2009)
[21] LeCun, Y., Bengio, Y.: Convolutional networks for images, speech, and time-series, in M.A. Arbib (ed.), The Handbook of Brain Theory and Neural Networks, MIT Press (1995)
[22] Morrison, G.S., Nearey, T.M.: Testing theories of spectral change, J. Acoust. Soc. Am. 122 (2007) EL15–EL22
[23] Hochreiter, S., Schmidhuber, J.: Long short-term memory, Neural Computation 9(8) (1997) 1735–1780
[24] Ladefoged, P.: Vowels and Consonants, Blackwell (2001)
[25] Jeffress, L.A.: Masking, in J.V. Tobias (ed.), Foundations of Modern Auditory Theory, Academic Press (1970)
[26] Huckvale, M.: ACCDIST: a metric for comparing speakers' accents, in Proc. Int. Conf. on Spoken Language Processing, Jeju Island, Korea (2004)
[27] Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE 77(2) (1989) 257–286
[28] Hastie, T., Tibshirani, R.: Classification by pairwise coupling, The Annals of Statistics 26(2) (1998) 451–471
[29] Wu, T.-F., Lin, C.-J., Weng, R.C.: Probability estimates for multi-class classification by pairwise coupling, Journal of Machine Learning Research 5 (2004) 975–1005
[30] Šuch, O., Barreda, S.: Bayes covariant multi-class classification, Pattern Recognition Letters 84 (2016) 99–106
[31] Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines, ACM Transactions on Intelligent Systems and Technology 2(3) (2011)
[32] Kim, H., Lee, W.S., Yoo, J., Park, M., Ahn, K.H.: Origin of the higher difficulty in the recognition of vowels compared to handwritten digits in deep neural networks, J. Korean Physical Soc. 74(1) (2019) 12–18