<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Phone Recognition Experiments on ArtiPhon with KALDI</article-title>
      </title-group>
      <abstract>
<p>In this work we present the results obtained so far in different phone recognition experiments on the audio-only part of the ArtiPhon corpus used for the EVALITA 2016 speech-mismatch ArtiPhon task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>In the last few years, automatic speech
recognition (ASR) technology has achieved
remarkable results, mainly thanks to increased
training data and computational resources.
However, ASR systems trained on thousands of hours of
annotated speech can still perform poorly when training
and testing conditions differ (e.g.,
different acoustic environments). This is usually
referred to as the mismatch problem.</p>
      <p>In the ArtiPhon task, participants have to
build a speaker-dependent phone recognition
system that is evaluated on mismatched
speech rates. While the training data consist of read
speech where the speaker was required to keep a
constant speech rate, the test data range from
slow and hyper-articulated speech to fast and
hypo-articulated speech.</p>
      <p>
        The training dataset contains simultaneous
recordings of audio and vocal tract (i.e.,
articulatory) movements, acquired with an electromagnetic
articulograph
        <xref ref-type="bibr" rid="ref4">(Canevari et al., 2015)</xref>
        .
      </p>
      <p>
        Participants were encouraged to use the
training articulatory data to increase the
generalization performance of their recognition system.
However, we decided not to use them, mainly for
the sake of time, but also because we wanted to
compare the results with those obtained in the
past on different adult and children speech
audio-only corpora
        <xref ref-type="bibr" rid="ref10 ref5 ref6 ref7 ref8 ref9">(Cosi &amp; Hosom, 2000; Cosi &amp;
Pellom, 2005; Cosi, 2008; Cosi, 2009; Cosi et al.,
2014; Cosi et al., 2015)</xref>
        .
      </p>
    </sec>
    <sec id="sec-2">
      <title>Data</title>
      <p>
        We received the ArtiPhon
        <xref ref-type="bibr" rid="ref4">(Canevari et al., 2015)</xref>
training data from the Istituto Italiano di
Tecnologia - Center for Translational Neurophysiology
of Speech and Communication (CTNSC) late in
July 2016, while the test material was released at
the end of September 2016. The ArtiPhon dataset
contains the audio and articulatory data recorded
from three different speakers in citation
condition. In particular, for the EVALITA 2016
ArtiPhon - Articulatory Phone Recognition task,
only one speaker (cnz, 666 utterances) was
considered.
      </p>
      <p>The audio was sampled at 22050 Hz, while the
articulatory data were acquired with the
NDI (Northern Digital Instruments, Canada) Wave
Speech electromagnetic articulograph at a 400 Hz
sampling rate.</p>
      <p>Four subdirectories are available:
- wav_1.0.0: each file contains an audio recording
- lab_1.0.0: each file contains phonetic labels automatically computed using HTK
- ema_1.0.0: each file contains 21 channels: coordinates in 3D space (xul yul zul xll yll zll xui yui zui xli yli zli xtb ytb ztb xtm ytm ztm xtt ytt ztt)</p>
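      <p>For illustration, assuming each EMA file can be loaded as a plain numeric matrix with one row per sample (the exact file format is not detailed here), the 21 channels can be grouped per sensor as in the following sketch; the loading step and the sensor naming are assumptions of this example, not part of the corpus documentation.</p>
      <preformat>
import numpy as np

# Assumed sensor order, matching the channel names listed above:
# ul/ll presumably upper/lower lip, ui/li upper/lower incisor,
# tb/tm/tt tongue back/mid/tip; each sensor has x, y, z coordinates.
SENSORS = ["ul", "ll", "ui", "li", "tb", "tm", "tt"]

def split_ema_channels(ema):
    """Reshape a (n_samples, 21) EMA matrix into a dict of (n_samples, 3) arrays."""
    ema = np.asarray(ema)
    assert ema.shape[1] == 21, "expected 21 channels (7 sensors x 3 coordinates)"
    return {name: ema[:, 3 * i:3 * i + 3] for i, name in enumerate(SENSORS)}
      </preformat>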
      <p>Head movement correction was performed
automatically. First, an adaptive median filter with a
window from 10 ms to 50 ms and then a
smooth elliptic low-pass filter with a 20 Hz cutoff
frequency were applied to each channel.</p>
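      <p>A minimal sketch of this kind of post-processing with scipy is given below; it uses a fixed median window rather than the adaptive one described above, and the filter order and ripple values are illustrative assumptions.</p>
      <preformat>
import numpy as np
from scipy.signal import medfilt, ellip, filtfilt

FS_EMA = 400.0  # EMA sampling rate (Hz)

def smooth_channel(x, median_ms=30.0, cutoff_hz=20.0):
    """Median-filter then low-pass one EMA channel (illustrative parameters)."""
    # Median filter: window expressed in samples, forced to an odd length.
    win = int(round(median_ms * 1e-3 * FS_EMA)) | 1
    x = medfilt(np.asarray(x, dtype=float), kernel_size=win)
    # 4th-order elliptic low-pass, 20 Hz cutoff, applied forward-backward.
    b, a = ellip(4, 0.5, 40.0, cutoff_hz / (FS_EMA / 2.0))
    return filtfilt(b, a, x)
      </preformat>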
      <p>Unfortunately, we discovered that the audio
data were completely saturated in both the
training and the test set. This forced us to run
experiments both with the full set of
phonemes and with a smaller, reduced set, in order
to make the various phone recognition experiments
more effective and reliable.</p>
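      <p>As a point of reference, the degree of saturation can be quantified as the fraction of samples at (or very close to) full scale; a minimal sketch, where the threshold and the example file name are illustrative assumptions:</p>
      <preformat>
import numpy as np
import soundfile as sf  # assumed available; any wav reader works

def clipping_ratio(path, margin=0.999):
    """Fraction of samples at or beyond `margin` of full scale."""
    x, _sr = sf.read(path)          # float samples in [-1, 1]
    clipped = np.abs(x) >= margin
    return float(np.mean(clipped))

# Example (hypothetical file name):
# print(clipping_ratio("wav_1.0.0/cnz_0001.wav"))
      </preformat>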
      <p>
DNN-based ASR has proven to be an effective
alternative to HMM - Gaussian Mixture Model (GMM)
based ASR (HMM-GMM)
        <xref ref-type="bibr" rid="ref12 ref3">(Bourlard and
Morgan, 1994; Hinton et al., 2012)</xref>
        , obtaining good
performance with context-dependent hybrid
DNN-HMM systems
        <xref ref-type="bibr" rid="ref15 ref11">(Mohamed et al., 2012; Dahl et al.,
2012)</xref>
        .
      </p>
      <p>
        Deep Neural Networks (DNNs) are indeed the
latest hot topic in speech recognition and new
systems such as KALDI
        <xref ref-type="bibr" rid="ref17 ref18">(Povey et al., 2011)</xref>
        demonstrated the effectiveness of easily
incorporating DNN
techniques
        <xref ref-type="bibr" rid="ref2">(Bengio, 2009)</xref>
        in order to improve the
recognition performance in almost all
recognition tasks.
      </p>
      <p>
        DNNs have already been applied to different
adult and children Italian speech corpora,
obtaining quite promising results
        <xref ref-type="bibr" rid="ref10 ref21 ref22">(Cosi et al., 2015;
Serizel &amp; Giuliani, 2014; Serizel &amp; Giuliani, 2016)</xref>
        .
      </p>
      <p>In this work, the KALDI ASR engine adapted
to Italian was adopted as the target ASR system
to be evaluated on the ArtiPhon data set.</p>
      <p>In the end, we decided not to use the
articulatory data available in the ArtiPhon data set,
because we wanted to compare the final results of
this task with those obtained in the past on
different audio-only corpora, which were not
characterized by the above-mentioned speech-mismatch
problem.</p>
      <p>
        For the EVALITA 2016 ArtiPhon task, a speaker-dependent
experiment characterized by a training/test speech-type
mismatch was prepared using the ArtiPhon
training and test material. A second, speaker-independent
experiment was also set up by decoding the ArtiPhon test
data with an ASR acoustic
model previously trained on APASCI
        <xref ref-type="bibr" rid="ref1">(Angelini et al., 1994)</xref>
        , thus
having in this case both speech-type and speaker
mismatch.
      </p>
      <p>For both experiments, we used the KALDI
ASR engine, and we started from the TIMIT
recipe, which was adapted to the ArtiPhon Italian
data set.</p>
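      <p>To give an idea of the adaptation involved, the ArtiPhon material has to be mapped to the Kaldi data-directory format (wav.scp, text, utt2spk) expected by the TIMIT-style recipe scripts; a minimal sketch, where the directory layout, utterance naming and single-speaker id are assumptions of this example:</p>
      <preformat>
import os

def write_kaldi_data_dir(wav_dir, transcripts, out_dir, speaker="cnz"):
    """Write minimal wav.scp, text and utt2spk files for a single speaker.

    `transcripts` maps utterance ids to space-separated phone strings.
    """
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "wav.scp"), "w") as wav_scp, \
         open(os.path.join(out_dir, "text"), "w") as text, \
         open(os.path.join(out_dir, "utt2spk"), "w") as utt2spk:
        for utt_id in sorted(transcripts):
            wav_path = os.path.join(wav_dir, utt_id + ".wav")
            wav_scp.write(f"{utt_id} {wav_path}\n")
            text.write(f"{utt_id} {transcripts[utt_id]}\n")
            utt2spk.write(f"{utt_id} {speaker}\n")
      </preformat>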
      <p>Deciding when a phone should be considered
incorrectly recognized was another evaluation
issue. In this work, as illustrated in Table 1, two
sets of phones, with 29 and 60 phones
respectively, were selected for the experiments, even
though the second set is far from realistic given
the degraded quality of the audio signal.</p>
      <p>Considering that, in unstressed position, the
oppositions /e/ - /E/ and /o/ - /O/ are often
neutralized in the Italian language, it was decided to
merge these pairs of phonemes. Since the
occurrences of the /E/ and /O/ phonemes were very rare
in the test set, this simplification had no
influence on the test results.</p>
      <p>Moreover, the acoustic differences between
stressed (a1, e1, E1, i1, o1, O1, u1) and
unstressed vowels (a, e, E, i, o, O, u) in Italian are
subtle and mostly related to their duration.
Furthermore, most Italian speakers pronounce
vowels according to their regional influences
rather than a “correct” standard pronunciation, if
any, and this sort of inaccuracy is quite
common. For these reasons, recognition outputs have
been evaluated using the full 60-phone ArtiPhon
set as well as a more realistic reduced 29-phone
set, which does not count errors between
stressed and unstressed vowels, geminates vs.
single phones, and the /ng/ and /nf/ allophones vs. the
/n/ phoneme.</p>
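      <p>A minimal sketch of the kind of mapping used when scoring against the reduced set is shown below; the phone symbols and the geminate notation are illustrative assumptions, not the complete 60-to-29 table.</p>
      <preformat>
# Illustrative reduction rules: stressed vowels map to their unstressed
# counterparts, /E/ and /O/ merge with /e/ and /o/, and the /ng/ and /nf/
# allophones map to /n/ (only a few example entries shown).
REDUCE = {
    "a1": "a", "e1": "e", "E1": "e", "i1": "i",
    "o1": "o", "O1": "o", "u1": "u",
    "E": "e", "O": "o",
    "ng": "n", "nf": "n",
}

def reduce_phones(phones):
    """Map a sequence of 60-set phone labels onto the reduced set."""
    out = []
    for p in phones:
        p = REDUCE.get(p, p)
        # Collapse geminates written as doubled symbols, e.g. "tt" to "t"
        # (assumed notation for this sketch).
        if len(p) == 2 and p[0] == p[1]:
            p = p[0]
        out.append(p)
    return out
      </preformat>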
      <p>The results of the EVALITA 2016
ArtiPhon speaker-dependent experiment with the
60-phone and 29-phone sets are summarized in
Tables 2a and 2b respectively, for all the KALDI
ASR engines, as in the TIMIT recipe.</p>
      <p>The results of the EVALITA 2016 ArtiPhon
speaker-independent experiment, using the
acoustic models trained on APASCI with the
29-phone set, are summarized in Table 3.</p>
      <p>
        All the systems are built on top of MFCC,
LDA, MLLT and fMLLR features with CMN (MFCC: Mel-Frequency
Cepstral Coefficients; LDA: Linear Discriminant Analysis;
MLLT: Maximum Likelihood Linear Transform; fMLLR:
feature-space Maximum Likelihood Linear Regression;
CMN: Cepstral Mean Normalization) - see
        <xref ref-type="bibr" rid="ref20">(Rath et al., 2013)</xref>
        for further details - obtained from auxiliary GMM
(Gaussian Mixture Model) models. At first, these
40-dimensional features are all stored to disk in
order to simplify the training scripts.
      </p>
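      <p>Conceptually, these 40-dimensional features are obtained by splicing neighbouring MFCC frames and applying the LDA+MLLT (and, per speaker, fMLLR) linear transforms estimated by the auxiliary GMM system; a minimal numpy sketch with illustrative dimensions (13-dimensional MFCCs, a context of 4 frames on each side, transform matrices assumed to be already estimated):</p>
      <preformat>
import numpy as np

def splice(frames, context=4):
    """Stack each frame with `context` frames of left and right context."""
    n, d = frames.shape
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + n] for i in range(2 * context + 1)])

def transform_features(mfcc, lda_mllt, fmllr=None):
    """MFCC (n, 13) -> spliced (n, 117) -> LDA+MLLT (n, 40) -> fMLLR (n, 40)."""
    x = splice(mfcc)                  # (n, 117) with context=4
    x = x @ lda_mllt.T                # lda_mllt: (40, 117)
    if fmllr is not None:             # fmllr: (40, 41) per-speaker affine transform
        x = np.hstack([x, np.ones((len(x), 1))]) @ fmllr.T
    return x
      </preformat>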
      <p>
        Moreover, MMI, BMMI, MPE and sMBR
training (MMI: Maximum Mutual Information; BMMI:
Boosted MMI; MPE: Minimum Phone Error; sMBR:
State-level Minimum Bayes Risk) are all supported - see
        <xref ref-type="bibr" rid="ref20">(Rath et al.,
2013)</xref>
        for further details.
      </p>
      <p>
        KALDI currently also contains two parallel
implementations for DNN (Deep Neural
Network) training: “DNN Hybrid (Dan’s)”
(Kaldi, WEB-b),
        <xref ref-type="bibr" rid="ref24">(Zhang et al., 2014)</xref>
        ,
        <xref ref-type="bibr" rid="ref19">(Povey et al.,
2015)</xref>
        and “DNN Hybrid (Karel's)” (Kaldi,
WEB-a),
        <xref ref-type="bibr" rid="ref23">(Vesely et al., 2013)</xref>
        (see Table 3). Both
of them are DNNs where the last (output) layer
is a softmax layer whose output dimension
equals the number of context-dependent states
in the system (typically several thousand). The
neural net is trained to predict the posterior
probability of each context-dependent state.
During decoding, the output probabilities are
divided by the prior probability of each state to
form a “pseudo-likelihood” that is used in
place of the state emission probabilities in the
HMM
        <xref ref-type="bibr" rid="ref10">(see Cosi et al. 2015, for a more detailed
description)</xref>
        .
      </p>
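      <p>A minimal numpy sketch of this posterior-to-pseudo-likelihood conversion, carried out in the log domain and assuming the state priors have been estimated from the training alignments:</p>
      <preformat>
import numpy as np

def pseudo_log_likelihoods(log_posteriors, state_priors, floor=1e-20):
    """Convert DNN log-posteriors (n_frames, n_states) into pseudo-log-likelihoods.

    The per-state priors (estimated from the training alignments) are divided
    out, which in the log domain is a subtraction.
    """
    log_priors = np.log(np.maximum(state_priors, floor))
    return log_posteriors - log_priors  # used in place of HMM emission scores
      </preformat>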
      <p>The Phone Error Rate (PER) was used
for scoring the recognition
process. The PER, which is defined as the sum of
the deletion (DEL), substitution (SUB) and
insertion (INS) percentages of phonemes in the
ASR output with respect to a reference
transcription, was computed with the
NIST SCLITE software (sctk-WEB).</p>
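      <p>For reference, the same figure can be reproduced from a standard Levenshtein alignment between the reference and hypothesis phone sequences; a minimal sketch, equivalent in spirit to the SCLITE computation:</p>
      <preformat>
def phone_error_rate(ref, hyp):
    """PER = (substitutions + deletions + insertions) / len(ref), via edit distance."""
    n, m = len(ref), len(hyp)
    # dist[i][j]: edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution / match
    return dist[n][m] / max(n, 1)

# Example: phone_error_rate("a b a k a".split(), "a p a k a a".split())
      </preformat>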
      <p>
        The results shown in Table 3 refer to the
various training and decoding experiments - see
        <xref ref-type="bibr" rid="ref20">(Rath et al., 2013)</xref>
        for the acronym definitions:
- MonoPhone (mono);
- Deltas + Delta-Deltas (tri1);
- LDA + MLLT (tri2);
- LDA + MLLT + SAT (tri3);
- SGMM2 (sgmm2_4);
- MMI + SGMM2 (sgmm2_4_mmi_b0.14);
- Dan’s Hybrid DNN (tri4-nnet);
- system combination, that is Dan’s DNN + SGMM (combine_2_1-4);
- Karel’s Hybrid DNN (dnn4_pretrain-dbn_dnn);
- system combination, that is Karel’s DNN + sMBR (dnn4_pretrain-dbn_dnn_1-6).
      </p>
      <p>In the tables, SAT refers to Speaker
Adapted Training, i.e. training on
fMLLR-adapted features. It can be done on top of
either LDA+MLLT or delta and delta-delta
features.</p>
      <p>
        If no transforms are supplied in the
alignment directory, the training script estimates the
transforms itself before building the tree (and, in any case,
it re-estimates the transforms a number of times
during training). SGMM2 refers instead to
Subspace Gaussian Mixture Model training
        <xref ref-type="bibr" rid="ref16 ref17 ref18">(Povey, 2009; Povey, et al. 2011)</xref>
        . This
training would normally be called on top of fMLLR
features obtained from a conventional system,
but it also works on top of any type of
speaker-independent features (based on
deltas+delta-deltas or LDA+MLLT).
      </p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>As expected, due to the degraded, clipped
quality of the training and test audio signal, the
60-phone set is far from realistic for
obtaining optimum recognition performance,
even in the speaker-dependent case (ArtiPhon
training and test material).</p>
      <p>On the contrary, if the reduced 29-phone
set is used, the phone recognition performance
is quite good and more than sufficient to build
an effective ASR system, should a language model
be incorporated.</p>
      <p>Moreover, in the speaker-independent
case (APASCI training material and ArtiPhon
test material) the performance is not too bad,
even under these speech-type and speaker
mismatch conditions, thus confirming the
effectiveness and the good quality of the system
trained on the APASCI material.</p>
      <p>In these experiments, the DNN results do
not surpass those of the classic systems. We
can hypothesize that this is due partially to
the low quality of the signal, and also to the
size of the corpus, which is probably not
sufficient for the system to learn all the
parameters characterizing the network. Moreover, the
DNN architecture was not specifically tuned to
the ArtiPhon data; instead, the default
KALDI architecture used in previous, more
complex speaker-independent adult and
children speech ASR experiments was simply
adopted.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Angelini</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brugnara</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Falavigna</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giuliani</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gretter</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Omologo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <year>1994</year>
          .
          <article-title>Speaker Independent Continuous Speech Recognition Using an Acoustic-Phonetic Italian Corpus</article-title>
          .
          <source>In Proc. of ICSLP</source>
          , Yokohama, Japan, Sept.
          <year>1994</year>
          ,
          <fpage>1391</fpage>
          -
          <lpage>1394</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <year>2009</year>
          .
          <article-title>Learning Deep Architectures for AI, in Foundations and Trends in Machine Learning</article-title>
          , Vol.
          <volume>2</volume>
          , No.
          <volume>1</volume>
          (
          <issue>2009</issue>
          )
          <fpage>1</fpage>
          -
          <lpage>127</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Bourlard</surname>
            <given-names>H.A.</given-names>
          </string-name>
          &amp; Morgan N.,
          <year>1994</year>
          .
          <article-title>Connectionist Speech Recognition: a Hybrid Approach</article-title>
          , volume
          <volume>247</volume>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Canevari</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Badino</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fadiga</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <year>2015</year>
          .
          <article-title>A new Italian dataset of parallel acoustic and articulatory data</article-title>
          ,
          <source>Proceedings of INTERSPEECH</source>
          , Dresden, Germany,
          <year>2015</year>
          ,
          <fpage>2152</fpage>
          -
          <lpage>2156</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Cosi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Hosom</surname>
            ,
            <given-names>J. P.</given-names>
          </string-name>
          ,
          <year>2000</year>
          ,
          <article-title>High Performance General Purpose Phonetic Recognition for Italian</article-title>
          ,
          <source>in Proceedings of ICSLP</source>
          <year>2000</year>
          , Beijing,
          <fpage>527</fpage>
          -
          <lpage>530</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Cosi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Pellom</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <year>2005</year>
          .
          <article-title>Italian Children's Speech Recognition For Advanced Interactive Literacy Tutors</article-title>
          ,
          <source>in Proceedings of INTERSPEECH</source>
          <year>2005</year>
          , Lisbon, Portugal,
          <fpage>2201</fpage>
          -
          <lpage>2204</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Cosi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <year>2008</year>
          .
          <article-title>Recent Advances in Sonic Italian Children's Speech Recognition for Interactive Literacy Tutors</article-title>
          ,
          <source>in Proceedings of 1st Workshop On Child, Computer and Interaction (WOCCI2008)</source>
          , Chania, Greece,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Cosi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <year>2009</year>
          .
          <article-title>On the Development of Matched and Mismatched Italian Children's Speech Recognition Systems</article-title>
          ,
          <source>in Proceedings of INTERSPEECH</source>
          <year>2009</year>
          ,
          Brighton
          , UK,
          <fpage>540</fpage>
          -
          <lpage>543</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Cosi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nicolao</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paci</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sommavilla</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Tesser</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <year>2014</year>
          .
          <article-title>Comparing Open Source ASR Toolkits on Italian Children Speech</article-title>
          , in Proceedings of Workshop On Child, Computer and Interaction (WOCCI-
          <year>2014</year>
          ),
          <source>Satellite Event of INTERSPEECH</source>
          <year>2014</year>
          , Singapore,
          <year>September 19</year>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Cosi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paci</surname>
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sommavilla</surname>
            <given-names>G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Tesser</surname>
            <given-names>F.</given-names>
          </string-name>
          ,
          <year>2015</year>
          .
          <article-title>KALDI: Yet another ASR Toolkit? Experiments on Italian Children Speech</article-title>
          .
          <source>In Il farsi e il disfarsi del linguaggio. L'emergere, il mutamento e la patologia della struttura sonora del linguaggio</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Dahl</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Acero</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <year>2012</year>
          .
          <article-title>Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition</article-title>
          .
          <source>IEEE Transactions on Audio, Speech, and Language Processing</source>
          , Jan.
          <year>2012</year>
          ,
          <volume>20</volume>
          (
          <issue>1</issue>
          ):
          <fpage>30</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dahl</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohamed</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jaitly</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Senior</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanhoucke</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sainath</surname>
            ,
            <given-names>T.N.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Kingsbury</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <year>2012</year>
          .
          <article-title>Deep Neural Networks for Acoustic Modeling in Speech Recognition</article-title>
          .
          <source>IEEE Signal Processing Magazine</source>
          , Nov.
          <year>2012</year>
          ,
          <volume>29</volume>
          (
          <issue>6</issue>
          ):
          <fpage>82</fpage>
          -
          <lpage>97</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          Kaldi-WEBa - Karel's DNN implementation: http://KALDI.sourceforge.net/dnn1.html
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          Kaldi-WEBb - Dan's DNN implementation: http://KALDI.sourceforge.net/dnn2.html.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Mohamed</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dahl</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <year>2012</year>
          .
          <article-title>Acoustic Modeling Using Deep Belief Networks</article-title>
          .
          <source>IEEE Transactions on Audio, Speech, and Language Processing</source>
          , Jan.
          <year>2012</year>
          ,
          <volume>20</volume>
          (
          <issue>1</issue>
          ):
          <fpage>14</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Povey</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , (
          <year>2009</year>
          ).
          <article-title>Subspace Gaussian Mixture Models for Speech Recognition</article-title>
          ,
          <source>Tech. Rep. MSR-TR-2009-64</source>
          , Microsoft Research,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Povey</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burget</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Akyazi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghoshal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glembek</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goel</surname>
            ,
            <given-names>N.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karafiát</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rastrow</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rose</surname>
            ,
            <given-names>R.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwarz</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , (
          <year>2011</year>
          ).
          <article-title>The Subspace Gaussian Mixture Mode - A Structured Model for Speech Recognition</article-title>
          ,
          <source>Computer Speech &amp; Language</source>
          , vol.
          <volume>25</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>404</fpage>
          -
          <lpage>439</lpage>
          ,
          <year>April 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Povey</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghoshal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          et al.,
          <year>2011</year>
          .
          <article-title>The KALDI Speech Recognition Toolkit</article-title>
          ,
          <source>in Proceedings of ASRU</source>
          ,
          <year>2011</year>
          (IEEE Catalog No.: CFP11SRWUSB).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Povey</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Khudanpur</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <year>2014</year>
          .
          <article-title>Parallel Training of DNNs with Natural Gradient and Parameter Averaging</article-title>
          ,
          <source>in Proceedings of ICLR 2015, International Conference on Learning Representations (arXiv:1410.7455).</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Rath</surname>
            ,
            <given-names>S. P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Povey</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vesely</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Cernocky</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <year>2013</year>
          .
          <article-title>Improved feature processing for Deep Neural Networks</article-title>
          ,
          <source>in Proceedings of INTERSPEECH</source>
          <year>2013</year>
          ,
          <fpage>109</fpage>
          -
          <lpage>113</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Serizel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giuliani</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Deep neural network adaptation for children's and adults' speech recognition</article-title>
          .
          <source>In Proceedings of ClicIt</source>
          <year>2014</year>
          , 1st Italian Conference on Computational Linguistics, Pisa, Italy,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Serizel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giuliani</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Deep-neural network approaches to speech recognition in heterogeneous groups of speakers including children</article-title>
          ,
          <source>in Natural Language Engineering</source>
          ,
          <year>April 2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Vesely</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghoshal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burget</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Povey</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <year>2013</year>
          .
          <article-title>Sequence-discriminative training of deep neural networks</article-title>
          ,
          <source>in Proceedings of INTERSPEECH</source>
          <year>2013</year>
          ,
          <fpage>2345</fpage>
          -
          <lpage>2349</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trmal</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Povey</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Khudanpur</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <year>2014</year>
          .
          <article-title>Improving Deep Neural Network Acoustic Models Using Generalized Maxout Networks</article-title>
          ,
          <source>in Proceedings of ICASSP</source>
          <year>2014</year>
          ,
          <fpage>215</fpage>
          -
          <lpage>219</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>