Phone Recognition Experiments on ArtiPhon with KALDI

Piero Cosi
Istituto di Scienze e Tecnologie della Cognizione
Consiglio Nazionale delle Ricerche
Sede Secondaria di Padova – Italy
piero.cosi@pd.istc.cnr.it

Abstract

English. In this work we present the results obtained so far in different recognition experiments working on the audio-only part of the ArtiPhon corpus used for the EVALITA 2016 speech-mismatch ArtiPhon task.

Italiano. In this work we present the results obtained so far in different phone recognition experiments using exclusively the audio part of the ArtiPhon corpus adopted for the EVALITA 2016 ArtiPhon task.

1 Introduction

In the last few years, automatic speech recognition (ASR) technology has achieved remarkable results, mainly thanks to increased training data and computational resources. However, an ASR system trained on thousands of hours of annotated speech can still perform poorly when training and testing conditions differ (e.g., different acoustic environments). This is usually referred to as the mismatch problem.

In the ArtiPhon task, participants had to build a speaker-dependent phone recognition system to be evaluated on mismatched speech rates. While the training data consist of read speech in which the speaker was required to keep a constant speech rate, the test data range from slow and hyper-articulated speech to fast and hypo-articulated speech.

The training dataset contains simultaneous recordings of audio and vocal-tract (i.e., articulatory) movements recorded with an electromagnetic articulograph (Canevari et al., 2015). Participants were encouraged to use the training articulatory data to increase the generalization performance of their recognition system. However, we decided not to use them, mainly for the sake of time, but also because we wanted to compare the results with those obtained in the past on different adult and children audio-only speech corpora (Cosi & Hosom, 2000; Cosi & Pellom, 2005; Cosi, 2008; Cosi, 2009; Cosi et al., 2014; Cosi et al., 2015).

2 Data

We received the ArtiPhon (Canevari et al., 2015) training data from the Istituto Italiano di Tecnologia - Center for Translational Neurophysiology of Speech and Communication (CTNSC) late in July 2016, while the test material was released at the end of September 2016. The ArtiPhon dataset contains audio and articulatory data recorded from three different speakers in citation condition. In particular, for the EVALITA 2016 ArtiPhon - Articulatory Phone Recognition task, only one speaker (cnz, 666 utterances) was considered.

The audio was sampled at 22050 Hz, while the articulatory data were acquired with the NDI (Northern Digital Instruments, Canada) Wave Speech electromagnetic articulograph at a 400 Hz sampling rate.

The following subdirectories are available:
- wav_1.0.0: each file contains an audio recording;
- lab_1.0.0: each file contains phonetic labels automatically computed using HTK;
- ema_1.0.0: each file contains 21 channels, i.e. the coordinates in 3D space (xul yul zul xll yll zll xui yui zui xli yli zli xtb ytb ztb xtm ytm ztm xtt ytt ztt).

Head movement correction was performed automatically. First an adaptive median filter with a window from 10 ms to 50 ms, and then a smooth elliptic low-pass filter with a 20 Hz cutoff frequency, were applied to each channel.
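Although we did not use the articulatory channels in our experiments, this post-processing is easy to picture. The Python/SciPy sketch below is a rough illustration, not the CTNSC implementation: a fixed-window median filter stands in for the adaptive 10-50 ms one, and the elliptic filter's order, ripple and attenuation values are assumptions, since only the 20 Hz cutoff is reported.

```python
import numpy as np
from scipy.signal import medfilt, ellip, filtfilt

FS_EMA = 400  # EMA sampling rate (Hz) reported for the NDI Wave data

def clean_ema_channel(x, median_win_ms=30, cutoff_hz=20.0):
    """De-spike and low-pass one EMA coordinate channel."""
    # Median filtering: convert the window from ms to an odd sample count.
    # A single 30 ms window is a hypothetical stand-in for the adaptive
    # 10-50 ms window described in the text.
    k = int(round(median_win_ms * 1e-3 * FS_EMA))
    k += (k % 2 == 0)  # kernel size must be odd
    x = medfilt(x, kernel_size=k)

    # Elliptic low-pass at the reported 20 Hz cutoff; order (4), passband
    # ripple (0.1 dB) and stopband attenuation (60 dB) are assumed values.
    b, a = ellip(4, 0.1, 60.0, cutoff_hz, btype="low", fs=FS_EMA)
    return filtfilt(b, a, x)  # zero-phase filtering, no group delay

# Usage on an (n_samples, 21) EMA matrix, one call per channel:
# ema_clean = np.column_stack([clean_ema_channel(ema[:, c])
#                              for c in range(ema.shape[1])])
```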
Unfortunately, we discovered that the audio data were completely saturated in both the training and the test sets. This forced us to run experiments using both the full set of phonemes and a smaller, reduced set, in order to make the various phone recognition experiments more effective and reliable.

3 ASR

Deep Neural Networks (DNNs) have proven to be an effective alternative to HMM-Gaussian Mixture Model (HMM-GMM) based ASR (Bourlard and Morgan, 1994; Hinton et al., 2012), obtaining good performance with context-dependent hybrid DNN-HMM systems (Mohamed et al., 2012; Dahl et al., 2012). DNNs are indeed the latest hot topic in speech recognition, and new systems such as KALDI (Povey et al., 2011) have demonstrated the effectiveness of easily incorporating DNN techniques (Bengio, 2009) to improve recognition performance in almost all recognition tasks.

DNNs have already been applied to different Italian adult and children speech corpora, obtaining quite promising results (Cosi et al., 2015; Serizel & Giuliani, 2014; Serizel & Giuliani, 2016).

In this work, the KALDI ASR engine adapted to Italian was adopted as the target ASR system to be evaluated on the ArtiPhon data set. In the end, we decided not to use the articulatory data available in the ArtiPhon data set, because we wanted to compare the final results of this task with those obtained in the past on different audio-only corpora that were not characterized by the above-cited speech mismatch problem.

4 The EVALITA 2016 - ArtiPhon Task

A speaker-dependent experiment characterized by a training/test speech-type mismatch was prepared using the ArtiPhon task training and test material. A second, speaker-independent experiment was also set up by decoding the ArtiPhon test data with an ASR acoustic model previously trained on APASCI (Angelini et al., 1994), thus having in this case both speech-type and speaker mismatch. For both experiments, we used the KALDI ASR engine, starting from the TIMIT recipe, which was adapted to the ArtiPhon Italian data set.

Deciding when a phone should be considered incorrectly recognized was another evaluation issue. In this work, as illustrated in Table 1, two phone sets, with 60 and 29 phones respectively, were selected for the experiments, even if the larger set is far from realistic given the degraded quality of the audio signal.

Table 1: 60- and 29-phones sets (SAMPA).

Considering that, in unstressed position, the oppositions /e/ - /E/ and /o/ - /O/ are often neutralized in Italian, it was decided to merge these pairs of phonemes. Since occurrences of the /E/ and /O/ phonemes were very rare in the test set, this simplification had no influence on the test results.

Moreover, the acoustic differences between stressed (a1, e1, E1, i1, o1, O1, u1) and unstressed (a, e, E, i, o, O, u) vowels in Italian are subtle and mostly related to duration. Furthermore, most Italian speakers pronounce vowels according to their regional influences rather than a "correct" standard pronunciation, if any exists, and this sort of inaccuracy is quite common. For these reasons, recognition outputs have been evaluated using the full 60-phones ArtiPhon set as well as a more realistic, reduced 29-phones set, which does not count confusions between stressed and unstressed vowels, between geminate and single phones, or between the /ng/ and /nf/ allophones and the /n/ phoneme.
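To make the reduction concrete, the following sketch encodes the merging rules just described. Since Table 1 is not reproduced in the running text, part of the notation is assumed: stress marked by a trailing "1" and geminates by doubling the base symbol; only the /E/, /O/, /ng/ and /nf/ rules are taken directly from the text.

```python
def reduce_phone(p):
    """Map one SAMPA label from the 60-phones set to the 29-phones set."""
    p = p.rstrip("1")                 # drop stress marks: "a1" -> "a"
    if p in ("E", "O"):               # open/closed mid vowels neutralized
        p = p.lower()                 # "E" -> "e", "O" -> "o"
    if p in ("ng", "nf"):             # nasal allophones collapse to /n/
        p = "n"
    if len(p) == 2 and p[0] == p[1]:  # geminate -> single phone
        p = p[0]                      # (doubling notation is assumed)
    return p

# e.g. reduce_phone("E1") == "e"; reduce_phone("tt") == "t"
```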
The results of the EVALITA 2016 ArtiPhon speaker-dependent experiment with the 60-phones and 29-phones sets are summarized in Tables 2a and 2b respectively, for all the KALDI ASR engines of the TIMIT recipe.

Table 2a: Results for the EVALITA 2016 ArtiPhon speaker-dependent task in the 60-phones case.

Table 2b: Results for the EVALITA 2016 ArtiPhon speaker-dependent task in the 29-phones case.

The results of the EVALITA 2016 ArtiPhon speaker-independent experiment, using the acoustic models trained on APASCI with the 29-phones set, are summarized in Table 3.

Table 3: Results for the EVALITA 2016 ArtiPhon speaker-independent task in the 29-phones case.

All the systems are built on top of MFCC, LDA, MLLT and fMLLR features with CMN (MFCC: Mel-Frequency Cepstral Coefficients; LDA: Linear Discriminant Analysis; MLLT: Maximum Likelihood Linear Transform; fMLLR: feature-space Maximum Likelihood Linear Regression; CMN: Cepstral Mean Normalization; see Rath et al., 2013, for all acronym references), obtained from auxiliary GMM (Gaussian Mixture Model) models. These 40-dimensional features are first stored to disk in order to simplify the training scripts. Moreover, MMI, BMMI, MPE and sMBR discriminative training (MMI: Maximum Mutual Information; BMMI: Boosted MMI; MPE: Minimum Phone Error; sMBR: State-level Minimum Bayes Risk) are all supported (Rath et al., 2013).

KALDI currently also contains two parallel implementations for DNN (Deep Neural Network) training: "DNN Hybrid (Dan's)" (Kaldi-WEBb; Zhang et al., 2014; Povey et al., 2015) and "DNN Hybrid (Karel's)" (Kaldi-WEBa; Vesely et al., 2013). Both are DNNs whose last (output) layer is a softmax layer with an output dimension equal to the number of context-dependent states in the system (typically several thousand). The neural network is trained to predict the posterior probability of each context-dependent state. During decoding, the output probabilities are divided by the prior probability of each state to form a "pseudo-likelihood" that is used in place of the state emission probabilities in the HMM (see Cosi et al., 2015, for a more detailed description).
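In log space this prior division is a simple subtraction, as in the schematic NumPy sketch below (an illustration of the hybrid decoding score, not Kaldi's actual decoder code):

```python
import numpy as np

def pseudo_log_likelihoods(log_posteriors, state_priors, eps=1e-10):
    """Hybrid DNN-HMM emission scores: log p(x|s) = log p(s|x) - log p(s),
    up to a constant that does not affect decoding.

    log_posteriors: (n_frames, n_states) log-softmax DNN outputs.
    state_priors:   (n_states,) relative frequencies of the context-dependent
                    states, typically estimated from the training alignments.
    """
    return log_posteriors - np.log(state_priors + eps)
```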
The Phone Error Rate (PER) was used to score the recognition process. The PER, defined as the sum of the deletion (DEL), substitution (SUB) and insertion (INS) percentages of phonemes in the ASR output with respect to a reference transcription, was computed with the NIST SCLITE software (sctk-WEB); a minimal sketch of this scoring closes this section.

The results shown in the tables refer to the various training and decoding experiments (see Rath et al., 2013, for all acronym references):

- MonoPhone (mono);
- Deltas + Delta-Deltas (tri1);
- LDA + MLLT (tri2);
- LDA + MLLT + SAT (tri3);
- SGMM2 (sgmm2_4);
- MMI + SGMM2 (sgmm2_4_mmi_b0.1-4);
- Dan's Hybrid DNN (tri4-nnet);
- system combination, i.e. Dan's DNN + SGMM (combine_2_1-4);
- Karel's Hybrid DNN (dnn4_pretrain-dbn_dnn);
- system combination, i.e. Karel's DNN + sMBR (dnn4_pretrain-dbn_dnn_1-6).

In the tables, SAT refers to Speaker Adapted Training, i.e. training on fMLLR-adapted features. It can be done on top of either LDA+MLLT or delta and delta-delta features. If no transforms are supplied in the alignment directory, the training estimates the transforms itself before building the tree (and, in any case, it estimates transforms a number of times during training). SGMM2 refers instead to Subspace Gaussian Mixture Model training (Povey, 2009; Povey et al., 2011). This training would normally be called on top of fMLLR features obtained from a conventional system, but it also works on top of any type of speaker-independent features (based on deltas + delta-deltas or LDA+MLLT).
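As announced above, the following is a minimal re-implementation of the alignment-based scoring that SCLITE performs: hypothesis and reference phone sequences are aligned by Levenshtein distance, and the total of substitutions, deletions and insertions is reported relative to the number of reference phones. SCLITE itself offers far more elaborate alignment and reporting options; this sketch only shows the metric.

```python
def phone_error_rate(ref, hyp):
    """PER in percent for two lists of phone labels (reference, hypothesis)."""
    n, m = len(ref), len(hyp)
    # d[i][j] = minimum edit cost of aligning ref[:i] with hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                        # i deletions
    for j in range(m + 1):
        d[0][j] = j                        # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # match/substitution
                d[i - 1][j] + 1,                               # deletion
                d[i][j - 1] + 1,                               # insertion
            )
    return 100.0 * d[n][m] / max(n, 1)

# e.g. phone_error_rate("a b a k o".split(), "a p a o".split()) == 40.0
# (one substitution and one deletion over five reference phones)
```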
5 Conclusions

As expected, due to the degraded, clipped quality of the training and test audio signals, the 60-phones set is far from realistic for obtaining optimal recognition performance, even in the speaker-dependent case (ArtiPhon training and test material).

On the contrary, if the reduced 29-phones set is used, the phone recognition performance is quite good, and more than sufficient to build an effective ASR system if a language model were incorporated.

Moreover, in the speaker-independent case as well (APASCI training material and ArtiPhon test material), the performance is not too bad even under these speech-type and speaker mismatch conditions, thus confirming the effectiveness and the good quality of the system trained on the APASCI material.

In these experiments, the DNN results do not surpass those of the classic systems. We can hypothesize that this is due partly to the low quality of the signal, and partly to the size of the corpus, which is probably not sufficient for the network to learn all the variability it has to model. Moreover, the DNN architecture was not specifically tuned to the ArtiPhon data; instead, the default KALDI architecture used in previous, more complex speaker-independent adult and children speech ASR experiments was simply chosen.

References

Angelini, B., Brugnara, F., Falavigna, D., Giuliani, D., Gretter, R., & Omologo, M., 1994. Speaker Independent Continuous Speech Recognition Using an Acoustic-Phonetic Italian Corpus. In Proceedings of ICSLP 1994, Yokohama, Japan, 1391-1394.

Bengio, Y., 2009. Learning Deep Architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127.

Bourlard, H.A. & Morgan, N., 1994. Connectionist Speech Recognition: A Hybrid Approach, volume 247. Springer.

Canevari, C., Badino, L., & Fadiga, L., 2015. A New Italian Dataset of Parallel Acoustic and Articulatory Data. In Proceedings of INTERSPEECH 2015, Dresden, Germany, 2152-2156.

Cosi, P. & Hosom, J.P., 2000. High Performance General Purpose Phonetic Recognition for Italian. In Proceedings of ICSLP 2000, Beijing, 527-530.

Cosi, P. & Pellom, B., 2005. Italian Children's Speech Recognition for Advanced Interactive Literacy Tutors. In Proceedings of INTERSPEECH 2005, Lisbon, Portugal, 2201-2204.

Cosi, P., 2008. Recent Advances in Sonic Italian Children's Speech Recognition for Interactive Literacy Tutors. In Proceedings of the 1st Workshop on Child, Computer and Interaction (WOCCI 2008), Chania, Greece.

Cosi, P., 2009. On the Development of Matched and Mismatched Italian Children's Speech Recognition Systems. In Proceedings of INTERSPEECH 2009, Brighton, UK, 540-543.

Cosi, P., Nicolao, M., Paci, G., Sommavilla, G., & Tesser, F., 2014. Comparing Open Source ASR Toolkits on Italian Children Speech. In Proceedings of the Workshop on Child, Computer and Interaction (WOCCI 2014), Satellite Event of INTERSPEECH 2014, Singapore, September 19, 2014.

Cosi, P., Paci, G., Sommavilla, G., & Tesser, F., 2015. KALDI: Yet Another ASR Toolkit? Experiments on Italian Children Speech. In Il farsi e il disfarsi del linguaggio. L'emergere, il mutamento e la patologia della struttura sonora del linguaggio, 2015.

Dahl, G.E., Yu, D., Deng, L., & Acero, A., 2012. Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30-42.

Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., & Kingsbury, B., 2012. Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine, 29(6):82-97.

Kaldi-WEBa - Karel's DNN implementation: http://KALDI.sourceforge.net/dnn1.html

Kaldi-WEBb - Dan's DNN implementation: http://KALDI.sourceforge.net/dnn2.html

Mohamed, A., Dahl, G.E., & Hinton, G., 2012. Acoustic Modeling Using Deep Belief Networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):14-22.

Povey, D., 2009. Subspace Gaussian Mixture Models for Speech Recognition. Tech. Rep. MSR-TR-2009-64, Microsoft Research.

Povey, D., Burget, L., Agarwal, M., Akyazi, P., Feng, K., Ghoshal, A., Glembek, O., Goel, N.K., Karafiát, M., Rastrow, A., Rose, R.C., Schwarz, P., & Thomas, S., 2011. The Subspace Gaussian Mixture Model - A Structured Model for Speech Recognition. Computer Speech & Language, 25(2):404-439.

Povey, D., Ghoshal, A., et al., 2011. The KALDI Speech Recognition Toolkit. In Proceedings of ASRU 2011 (IEEE Catalog No.: CFP11SRW-USB).

Povey, D., Zhang, X., & Khudanpur, S., 2015. Parallel Training of DNNs with Natural Gradient and Parameter Averaging. In Proceedings of ICLR 2015, International Conference on Learning Representations (arXiv:1410.7455).

Rath, S.P., Povey, D., Vesely, K., & Cernocky, J., 2013. Improved Feature Processing for Deep Neural Networks. In Proceedings of INTERSPEECH 2013, 109-113.

sctk-WEB - Speech Recognition Scoring Toolkit: https://www.nist.gov/itl/iad/mig/tools

Serizel, R. & Giuliani, D., 2014. Deep Neural Network Adaptation for Children's and Adults' Speech Recognition. In Proceedings of CLiC-it 2014, First Italian Conference on Computational Linguistics, Pisa, Italy.

Serizel, R. & Giuliani, D., 2016. Deep-Neural Network Approaches to Speech Recognition in Heterogeneous Groups of Speakers Including Children. Natural Language Engineering, April 2016.

Vesely, K., Ghoshal, A., Burget, L., & Povey, D., 2013. Sequence-Discriminative Training of Deep Neural Networks. In Proceedings of INTERSPEECH 2013, 2345-2349.

Zhang, X., Trmal, J., Povey, D., & Khudanpur, S., 2014. Improving Deep Neural Network Acoustic Models Using Generalized Maxout Networks. In Proceedings of ICASSP 2014, 215-219.