Phone Recognition Experiments on ArtiPhon with KALDI

Piero Cosi
Istituto di Scienze e Tecnologie della Cognizione
Consiglio Nazionale delle Ricerche
Sede Secondaria di Padova – Italy
piero.cosi@pd.istc.cnr.it

Abstract

English. In this work we present the results obtained so far in different recognition experiments working on the audio-only part of the ArtiPhon corpus used for the EVALITA 2016 speech-mismatch ArtiPhon task.

Italiano. In this work we present the results obtained so far in different phone recognition experiments using exclusively the audio part of the ArtiPhon corpus adopted for the EVALITA 2016 ArtiPhon task.

1 Introduction

In the last few years, automatic speech recognition (ASR) technology has achieved remarkable results, mainly thanks to increased training data and computational resources. However, an ASR system trained on thousands of hours of annotated speech can still perform poorly when training and testing conditions differ (e.g., different acoustic environments). This is usually referred to as the mismatch problem.

In the ArtiPhon task, participants had to build a speaker-dependent phone recognition system to be evaluated on mismatched speech rates. While the training data consist of read speech in which the speaker was required to keep a constant speech rate, the test data range from slow and hyper-articulated speech to fast and hypo-articulated speech.

The training dataset contains simultaneous recordings of audio and vocal-tract (i.e., articulatory) movements recorded with an electromagnetic articulograph (Canevari et al., 2015). Participants were encouraged to use the training articulatory data to increase the generalization performance of their recognition system. However, we decided not to use them, mainly for the sake of time, but also because we wanted to compare the results with those obtained in the past on different adult and children audio-only speech corpora (Cosi & Hosom, 2000; Cosi & Pellom, 2005; Cosi, 2008; Cosi, 2009; Cosi et al., 2014; Cosi et al., 2015).

2 Data

We received the ArtiPhon (Canevari et al., 2015) training data from the Istituto Italiano di Tecnologia - Center for Translational Neurophysiology of Speech and Communication (CTNSC) late in July 2016, while the test material was released at the end of September 2016. The ArtiPhon dataset contains audio and articulatory data recorded from three different speakers in citation condition. In particular, for the EVALITA 2016 ArtiPhon - Articulatory Phone Recognition task, only one speaker (cnz, 666 utterances) was considered.

The audio was sampled at 22050 Hz, while the articulatory data were acquired with the NDI (Northern Digital Instruments, Canada) Wave Speech electromagnetic articulograph at a 400 Hz sampling rate.

The following subdirectories are available:
- wav_1.0.0: each file contains an audio recording;
- lab_1.0.0: each file contains phonetic labels automatically computed using HTK;
- ema_1.0.0: each file contains 21 channels, i.e. the coordinates in 3D space (xul yul zul xll yll zll xui yui zui xli yli zli xtb ytb ztb xtm ytm ztm xtt ytt ztt).

Head movement correction was performed automatically. First an adaptive median filter with a window from 10 ms to 50 ms, and then a smooth elliptic low-pass filter with a 20 Hz cutoff frequency, were applied to each channel.
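Although we did not use the articulatory channels in our experiments, this post-processing is easy to picture. The Python/SciPy sketch below is a rough illustration, not the CTNSC implementation: a fixed-window median filter stands in for the adaptive 10-50 ms one, and the elliptic filter's order, ripple and attenuation values are assumptions, since only the 20 Hz cutoff is reported.

```python
import numpy as np
from scipy.signal import medfilt, ellip, filtfilt

FS_EMA = 400  # EMA sampling rate (Hz) reported for the NDI Wave data

def clean_ema_channel(x, median_win_ms=30, cutoff_hz=20.0):
    """De-spike and low-pass one EMA coordinate channel."""
    # Median filtering: convert the window from ms to an odd sample count.
    # A single 30 ms window is a hypothetical stand-in for the adaptive
    # 10-50 ms window described in the text.
    k = int(round(median_win_ms * 1e-3 * FS_EMA))
    k += (k % 2 == 0)  # kernel size must be odd
    x = medfilt(x, kernel_size=k)

    # Elliptic low-pass at the reported 20 Hz cutoff; order (4), passband
    # ripple (0.1 dB) and stopband attenuation (60 dB) are assumed values.
    b, a = ellip(4, 0.1, 60.0, cutoff_hz, btype="low", fs=FS_EMA)
    return filtfilt(b, a, x)  # zero-phase filtering, no group delay

# Usage on an (n_samples, 21) EMA matrix, one call per channel:
# ema_clean = np.column_stack([clean_ema_channel(ema[:, c])
#                              for c in range(ema.shape[1])])
```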
Unfortunately, we discovered that the audio data were completely saturated in both the training and the test sets. This forced us to run experiments using both the full set of phonemes and a smaller, reduced set, in order to make the various phone recognition experiments more effective and reliable.

3 ASR

Deep Neural Networks (DNNs) have proven to be an effective alternative to HMM-Gaussian Mixture Model (HMM-GMM) based ASR (Bourlard and Morgan, 1994; Hinton et al., 2012), obtaining good performance with context-dependent hybrid DNN-HMM systems (Mohamed et al., 2012; Dahl et al., 2012). DNNs are indeed the latest hot topic in speech recognition, and new systems such as KALDI (Povey et al., 2011) have demonstrated the effectiveness of easily incorporating DNN techniques (Bengio, 2009) to improve recognition performance in almost all recognition tasks.

DNNs have already been applied to different Italian adult and children speech corpora, obtaining quite promising results (Cosi et al., 2015; Serizel & Giuliani, 2014; Serizel & Giuliani, 2016).

In this work, the KALDI ASR engine adapted to Italian was adopted as the target ASR system to be evaluated on the ArtiPhon data set. In the end, we decided not to use the articulatory data available in the ArtiPhon data set, because we wanted to compare the final results of this task with those obtained in the past on different audio-only corpora that were not characterized by the above-cited speech mismatch problem.

4 The EVALITA 2016 - ArtiPhon Task

A speaker-dependent experiment characterized by a training/test speech-type mismatch was prepared using the ArtiPhon task training and test material. A second, speaker-independent experiment was also set up by decoding the ArtiPhon test data with an ASR acoustic model previously trained on APASCI (Angelini et al., 1994), thus having in this case both speech-type and speaker mismatch. For both experiments, we used the KALDI ASR engine, starting from the TIMIT recipe, which was adapted to the ArtiPhon Italian data set.

Deciding when a phone should be considered incorrectly recognized was another evaluation issue. In this work, as illustrated in Table 1, two phone sets, with 60 and 29 phones respectively, were selected for the experiments, even if the larger set is far from realistic given the degraded quality of the audio signal.

Table 1: 60- and 29-phones sets (SAMPA).

Considering that, in unstressed position, the oppositions /e/ - /E/ and /o/ - /O/ are often neutralized in Italian, it was decided to merge these pairs of phonemes. Since occurrences of the /E/ and /O/ phonemes were very rare in the test set, this simplification had no influence on the test results.

Moreover, the acoustic differences between stressed (a1, e1, E1, i1, o1, O1, u1) and unstressed (a, e, E, i, o, O, u) vowels in Italian are subtle and mostly related to duration. Furthermore, most Italian speakers pronounce vowels according to their regional influences rather than a "correct" standard pronunciation, if any exists, and this sort of inaccuracy is quite common. For these reasons, recognition outputs have been evaluated using the full 60-phones ArtiPhon set as well as a more realistic, reduced 29-phones set, which does not count confusions between stressed and unstressed vowels, between geminate and single phones, or between the /ng/ and /nf/ allophones and the /n/ phoneme.
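To make the reduction concrete, the following sketch encodes the merging rules just described. Since Table 1 is not reproduced in the running text, part of the notation is assumed: stress marked by a trailing "1" and geminates by doubling the base symbol; only the /E/, /O/, /ng/ and /nf/ rules are taken directly from the text.

```python
def reduce_phone(p):
    """Map one SAMPA label from the 60-phones set to the 29-phones set."""
    p = p.rstrip("1")                 # drop stress marks: "a1" -> "a"
    if p in ("E", "O"):               # open/closed mid vowels neutralized
        p = p.lower()                 # "E" -> "e", "O" -> "o"
    if p in ("ng", "nf"):             # nasal allophones collapse to /n/
        p = "n"
    if len(p) == 2 and p[0] == p[1]:  # geminate -> single phone
        p = p[0]                      # (doubling notation is assumed)
    return p

# e.g. reduce_phone("E1") == "e"; reduce_phone("tt") == "t"
```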
The results of the EVALITA 2016 ArtiPhon speaker-dependent experiment with the 60-phones and 29-phones sets are summarized in Tables 2a and 2b respectively, for all the KALDI ASR engines of the TIMIT recipe.

Table 2a: Results for the EVALITA 2016 ArtiPhon speaker-dependent task in the 60-phones case.

Table 2b: Results for the EVALITA 2016 ArtiPhon speaker-dependent task in the 29-phones case.

The results of the EVALITA 2016 ArtiPhon speaker-independent experiment, using the acoustic models trained on APASCI with the 29-phones set, are summarized in Table 3.

Table 3: Results for the EVALITA 2016 ArtiPhon speaker-independent task in the 29-phones case.

All the systems are built on top of MFCC, LDA, MLLT and fMLLR features with CMN (MFCC: Mel-Frequency Cepstral Coefficients; LDA: Linear Discriminant Analysis; MLLT: Maximum Likelihood Linear Transform; fMLLR: feature-space Maximum Likelihood Linear Regression; CMN: Cepstral Mean Normalization; see Rath et al., 2013, for all acronym references), obtained from auxiliary GMM (Gaussian Mixture Model) models. These 40-dimensional features are first stored to disk in order to simplify the training scripts. Moreover, MMI, BMMI, MPE and sMBR discriminative training (MMI: Maximum Mutual Information; BMMI: Boosted MMI; MPE: Minimum Phone Error; sMBR: State-level Minimum Bayes Risk) are all supported (Rath et al., 2013).

KALDI currently also contains two parallel implementations for DNN (Deep Neural Network) training: "DNN Hybrid (Dan's)" (Kaldi-WEBb; Zhang et al., 2014; Povey et al., 2015) and "DNN Hybrid (Karel's)" (Kaldi-WEBa; Vesely et al., 2013). Both are DNNs whose last (output) layer is a softmax layer with an output dimension equal to the number of context-dependent states in the system (typically several thousand). The neural network is trained to predict the posterior probability of each context-dependent state. During decoding, the output probabilities are divided by the prior probability of each state to form a "pseudo-likelihood" that is used in place of the state emission probabilities in the HMM (see Cosi et al., 2015, for a more detailed description).
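In log space this prior division is a simple subtraction, as in the schematic NumPy sketch below (an illustration of the hybrid decoding score, not Kaldi's actual decoder code):

```python
import numpy as np

def pseudo_log_likelihoods(log_posteriors, state_priors, eps=1e-10):
    """Hybrid DNN-HMM emission scores: log p(x|s) = log p(s|x) - log p(s),
    up to a constant that does not affect decoding.

    log_posteriors: (n_frames, n_states) log-softmax DNN outputs.
    state_priors:   (n_states,) relative frequencies of the context-dependent
                    states, typically estimated from the training alignments.
    """
    return log_posteriors - np.log(state_priors + eps)
```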
The Phone Error Rate (PER) was used to score the recognition process. The PER, defined as the sum of the deletion (DEL), substitution (SUB) and insertion (INS) percentages of phonemes in the ASR output with respect to a reference transcription, was computed with the NIST SCLITE software (sctk-WEB); a minimal sketch of this scoring closes this section.

The results shown in the tables refer to the various training and decoding experiments (see Rath et al., 2013, for all acronym references):

- MonoPhone (mono);
- Deltas + Delta-Deltas (tri1);
- LDA + MLLT (tri2);
- LDA + MLLT + SAT (tri3);
- SGMM2 (sgmm2_4);
- MMI + SGMM2 (sgmm2_4_mmi_b0.1-4);
- Dan's Hybrid DNN (tri4-nnet);
- system combination, i.e. Dan's DNN + SGMM (combine_2_1-4);
- Karel's Hybrid DNN (dnn4_pretrain-dbn_dnn);
- system combination, i.e. Karel's DNN + sMBR (dnn4_pretrain-dbn_dnn_1-6).

In the tables, SAT refers to Speaker Adapted Training, i.e. training on fMLLR-adapted features. It can be done on top of either LDA+MLLT or delta and delta-delta features. If no transforms are supplied in the alignment directory, the training estimates the transforms itself before building the tree (and, in any case, it estimates transforms a number of times during training). SGMM2 refers instead to Subspace Gaussian Mixture Model training (Povey, 2009; Povey et al., 2011). This training would normally be called on top of fMLLR features obtained from a conventional system, but it also works on top of any type of speaker-independent features (based on deltas + delta-deltas or LDA+MLLT).
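As announced above, the following is a minimal re-implementation of the alignment-based scoring that SCLITE performs: hypothesis and reference phone sequences are aligned by Levenshtein distance, and the total of substitutions, deletions and insertions is reported relative to the number of reference phones. SCLITE itself offers far more elaborate alignment and reporting options; this sketch only shows the metric.

```python
def phone_error_rate(ref, hyp):
    """PER in percent for two lists of phone labels (reference, hypothesis)."""
    n, m = len(ref), len(hyp)
    # d[i][j] = minimum edit cost of aligning ref[:i] with hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                        # i deletions
    for j in range(m + 1):
        d[0][j] = j                        # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # match/substitution
                d[i - 1][j] + 1,                               # deletion
                d[i][j - 1] + 1,                               # insertion
            )
    return 100.0 * d[n][m] / max(n, 1)

# e.g. phone_error_rate("a b a k o".split(), "a p a o".split()) == 40.0
# (one substitution and one deletion over five reference phones)
```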
5 Conclusions

As expected, due to the degraded, clipped quality of the training and test audio signals, the 60-phones set is far from realistic for obtaining optimal recognition performance, even in the speaker-dependent case (ArtiPhon training and test material).

On the contrary, if the reduced 29-phones set is used, the phone recognition performance is quite good, and more than sufficient to build an effective ASR system if a language model were incorporated.

Moreover, in the speaker-independent case as well (APASCI training material and ArtiPhon test material), the performance is not too bad even under these speech-type and speaker mismatch conditions, thus confirming the effectiveness and the good quality of the system trained on the APASCI material.

In these experiments, the DNN results do not surpass those of the classic systems. We can hypothesize that this is due partly to the low quality of the signal, and partly to the size of the corpus, which is probably not sufficient for the network to learn all the variability it has to model. Moreover, the DNN architecture was not specifically tuned to the ArtiPhon data; instead, the default KALDI architecture used in previous, more complex speaker-independent adult and children speech ASR experiments was simply chosen.

References

Angelini, B., Brugnara, F., Falavigna, D., Giuliani, D., Gretter, R., & Omologo, M., 1994. Speaker Independent Continuous Speech Recognition Using an Acoustic-Phonetic Italian Corpus. In Proceedings of ICSLP 1994, Yokohama, Japan, 1391-1394.

Bengio, Y., 2009. Learning Deep Architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127.

Bourlard, H.A. & Morgan, N., 1994. Connectionist Speech Recognition: A Hybrid Approach, volume 247. Springer.

Canevari, C., Badino, L., & Fadiga, L., 2015. A New Italian Dataset of Parallel Acoustic and Articulatory Data. In Proceedings of INTERSPEECH 2015, Dresden, Germany, 2152-2156.

Cosi, P. & Hosom, J.P., 2000. High Performance General Purpose Phonetic Recognition for Italian. In Proceedings of ICSLP 2000, Beijing, 527-530.

Cosi, P. & Pellom, B., 2005. Italian Children's Speech Recognition for Advanced Interactive Literacy Tutors. In Proceedings of INTERSPEECH 2005, Lisbon, Portugal, 2201-2204.

Cosi, P., 2008. Recent Advances in Sonic Italian Children's Speech Recognition for Interactive Literacy Tutors. In Proceedings of the 1st Workshop on Child, Computer and Interaction (WOCCI 2008), Chania, Greece.

Cosi, P., 2009. On the Development of Matched and Mismatched Italian Children's Speech Recognition Systems. In Proceedings of INTERSPEECH 2009, Brighton, UK, 540-543.

Cosi, P., Nicolao, M., Paci, G., Sommavilla, G., & Tesser, F., 2014. Comparing Open Source ASR Toolkits on Italian Children Speech. In Proceedings of the Workshop on Child, Computer and Interaction (WOCCI 2014), Satellite Event of INTERSPEECH 2014, Singapore, September 19, 2014.

Cosi, P., Paci, G., Sommavilla, G., & Tesser, F., 2015. KALDI: Yet Another ASR Toolkit? Experiments on Italian Children Speech. In Il farsi e il disfarsi del linguaggio. L'emergere, il mutamento e la patologia della struttura sonora del linguaggio, 2015.

Dahl, G.E., Yu, D., Deng, L., & Acero, A., 2012. Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30-42.

Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., & Kingsbury, B., 2012. Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine, 29(6):82-97.

Kaldi-WEBa - Karel's DNN implementation: http://KALDI.sourceforge.net/dnn1.html

Kaldi-WEBb - Dan's DNN implementation: http://KALDI.sourceforge.net/dnn2.html

Mohamed, A., Dahl, G.E., & Hinton, G., 2012. Acoustic Modeling Using Deep Belief Networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):14-22.

Povey, D., 2009. Subspace Gaussian Mixture Models for Speech Recognition. Tech. Rep. MSR-TR-2009-64, Microsoft Research.

Povey, D., Burget, L., Agarwal, M., Akyazi, P., Feng, K., Ghoshal, A., Glembek, O., Goel, N.K., Karafiát, M., Rastrow, A., Rose, R.C., Schwarz, P., & Thomas, S., 2011. The Subspace Gaussian Mixture Model - A Structured Model for Speech Recognition. Computer Speech & Language, 25(2):404-439.

Povey, D., Ghoshal, A., et al., 2011. The KALDI Speech Recognition Toolkit. In Proceedings of ASRU 2011 (IEEE Catalog No.: CFP11SRW-USB).

Povey, D., Zhang, X., & Khudanpur, S., 2015. Parallel Training of DNNs with Natural Gradient and Parameter Averaging. In Proceedings of ICLR 2015, International Conference on Learning Representations (arXiv:1410.7455).

Rath, S.P., Povey, D., Vesely, K., & Cernocky, J., 2013. Improved Feature Processing for Deep Neural Networks. In Proceedings of INTERSPEECH 2013, 109-113.

sctk-WEB - Speech Recognition Scoring Toolkit: https://www.nist.gov/itl/iad/mig/tools

Serizel, R. & Giuliani, D., 2014. Deep Neural Network Adaptation for Children's and Adults' Speech Recognition. In Proceedings of CLiC-it 2014, First Italian Conference on Computational Linguistics, Pisa, Italy.

Serizel, R. & Giuliani, D., 2016. Deep-Neural Network Approaches to Speech Recognition in Heterogeneous Groups of Speakers Including Children. Natural Language Engineering, April 2016.

Vesely, K., Ghoshal, A., Burget, L., & Povey, D., 2013. Sequence-Discriminative Training of Deep Neural Networks. In Proceedings of INTERSPEECH 2013, 2345-2349.

Zhang, X., Trmal, J., Povey, D., & Khudanpur, S., 2014. Improving Deep Neural Network Acoustic Models Using Generalized Maxout Networks. In Proceedings of ICASSP 2014, 215-219.