ININ submission to Zero Cost ASR task at MediaEval 2016

Tejas Godambe, Naresh Kumar, Pavan Kumar, Veera Raghavendra, Aravind Ganapathiraju
Interactive Intelligence India Private Limited, Hyderabad, India
{tejas.godambe, naresh.kumar, pavan.kumar, veera.raghavendra, aravind.ganapathiraju}@inin.com

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

ABSTRACT
This paper details the experiments conducted to train the best-performing Vietnamese speech recognition system we could using public domain data only, as part of the Zero Cost ASR task at MediaEval 2016. We explored audio pre-processing, the use of the speaker's pitch information, data perturbation, and subspace Gaussian mixture model (SGMM) acoustic modeling, which is known to estimate robust parameters when the amount of data is small, as well as unsupervised adaptation, lattice rescoring with an RNN language model, and system combination using the ROVER technique.

1. INTRODUCTION
The goal of the Zero Cost ASR task is to bring researchers together on the topic of training ASR systems using only data available in the public domain. In particular, this year's task consisted of developing an LVCSR system for Vietnamese, a language that is rare enough yet has sufficient public data to work with. More details on the task can be found in [1].
Section 2 outlines the steps followed for building the final system. Section 3 describes each experiment we conducted in detail and discusses the loss or gain in accuracy that came with it. We conclude the paper in Section 4.

2. APPROACH
We used the Kaldi ASR toolkit [2] for building the system. As no lexicon was provided, graphemes were used as phonemes, giving 96 unique phonemes. The following steps were followed for the development of the final system (rough sketches of steps 1, 2 and 5 are given after this list).
1. Truncate long silences in the training data to 0.3 sec (sketched below).
2. Augment the data with speed-perturbed versions of itself (speed factors 0.9 and 1.1) [3] (sketched below).
3. Extract MFCCs along with pitch information [4].
4. Build an SGMM acoustic model [5].
5. Construct a 5-gram language model (LM) from the training text (sketched below).
6. Perform unsupervised adaptation, i.e. decode the test utterances with the above system and add them to the training data along with their approximate hypothesized transcriptions. Three copies of the test data (speed factors 0.9, 1.0, 1.1) were added.
7. Generate lattices and rescore them with an RNN-based language model [6].
8. Perform the final decoding.
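The paper does not name the tool used for silence truncation (step 1). The following is a minimal Python sketch, assuming a simple frame-energy detector, the soundfile library for WAV I/O, a -35 dB threshold, and hypothetical file names; the actual recipe may instead have used a VAD or existing alignments.

```python
# Minimal sketch (our assumptions, not the authors' exact recipe): cap any
# intra-utterance silence longer than 0.3 s using a frame-energy threshold.
import numpy as np
import soundfile as sf  # assumed WAV I/O library


def truncate_silence(in_wav, out_wav, max_sil=0.3, frame=0.025, hop=0.010,
                     thresh_db=-35.0):
    x, sr = sf.read(in_wav)
    if x.ndim > 1:                        # mix down to mono if needed
        x = x.mean(axis=1)
    flen, fhop = int(frame * sr), int(hop * sr)
    n_frames = max(1, 1 + (len(x) - flen) // fhop)
    # frame-level log energy in dB
    energy = np.array([10 * np.log10(np.mean(x[i * fhop:i * fhop + flen] ** 2) + 1e-12)
                       for i in range(n_frames)])
    voiced = energy > thresh_db
    keep = np.ones(len(x), dtype=bool)
    max_sil_frames = int(max_sil / hop)
    run_start = None
    for i, v in enumerate(np.append(voiced, True)):  # sentinel flushes the last run
        if not v and run_start is None:
            run_start = i
        elif v and run_start is not None:
            if i - run_start > max_sil_frames:       # keep only 0.3 s of the pause
                keep[(run_start + max_sil_frames) * fhop:i * fhop] = False
            run_start = None
    sf.write(out_wav, x[keep], sr)


truncate_silence("utt001.wav", "utt001_trunc.wav")   # hypothetical file names
```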
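Step 2 follows the audio augmentation of [3]. The sketch below shows one way to generate the 0.9x and 1.1x copies by shelling out to SoX's speed effect, which is the same effect Kaldi's standard speed-perturbation scripts wrap; the directory names and the subprocess wrapper are our assumptions.

```python
# Sketch of the speed-perturbation step (our wrapper; Kaldi ships equivalent scripts).
import subprocess
from pathlib import Path


def perturb_speed(wav_dir, out_dir, factors=(0.9, 1.1)):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for wav in sorted(Path(wav_dir).glob("*.wav")):
        for f in factors:
            target = out / f"sp{f}-{wav.name}"
            # SoX "speed" changes tempo and pitch together, as in [3]
            subprocess.run(["sox", str(wav), str(target), "speed", str(f)],
                           check=True)


perturb_speed("data/train_wavs", "data/train_wavs_sp")  # hypothetical directories
```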
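The LM toolkit is not named in the paper. Assuming SRILM as one common choice, a 5-gram LM over the training text (step 5) could be estimated roughly as follows; the paths and the smoothing choice are also assumptions.

```python
# Sketch only: estimate a 5-gram LM with SRILM's ngram-count (toolkit assumed).
import subprocess

subprocess.run(
    ["ngram-count",
     "-order", "5",
     "-text", "data/train_text.txt",   # hypothetical path to the training transcripts
     "-wbdiscount", "-interpolate",    # Witten-Bell smoothing, a safe choice for small corpora
     "-lm", "lm/5gram.arpa"],          # hypothetical output path
    check=True,
)
```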
3. RESULTS AND DISCUSSION

3.1 Preliminary Analysis
The sequence of experiments performed, and the gain or loss in WER incurred with each of them, is detailed below. Table 1 shows the WER and the word error rate reduction (WERR) achieved for each individual experiment. The WER was calculated on a very small local dev set comprising only 21 utterances.

Table 1: Sequence of experiments performed with individual WER and WERR

Row  Experiment                             WER (%)   WERR (%)
1    Training the tri-phone model           37.0      -
2    Truncating silence in training data    27.4      37.0 - 27.4 = 9.6
3    Truncating silence in test data        50.3      27.4 - 50.3 = -22.9
4    Using SGMM model                       18.1      27.4 - 18.1 = 9.3
5    Using DNN model                        23.5      18.1 - 23.5 = -5.4
6    Using position-independent phones      19.1      18.1 - 19.1 = -1.0
7    Unsupervised adaptation                16.1      18.1 - 16.1 = 2.0
8    Audio augmentation 1                   17.0      18.1 - 17.0 = 1.1
9    Audio augmentation 2                   17.3      18.1 - 17.3 = 0.8
10   Using pitch information                16.9      18.1 - 16.9 = 1.2
11   Using 5-gram LM                        16.1      18.1 - 16.1 = 2.0
12   Using 7-gram LM                        16.6      18.1 - 16.6 = 1.5
13   Combined system                        13.8      -
14   Rescoring lattices using RNN LM        13.5      13.8 - 13.5 = 0.3
15   ROVER                                  13.5      13.5 - 13.5 = 0.0

1. Using tri-phone model: We first trained a tri-phone model with 2000 senones and 20k Gaussians in total to see whether we could replicate the baseline result. This gave a WER of 37.0%.
2. Truncating silence in training data: Preliminary observation of a few wave files showed the presence of long silences, which usually corrupt the acoustic model. A WERR of 9.6% was achieved when the tri-phone model was trained after truncating long silences to 0.3 sec in the training data. Henceforth, all experiments used the training data with truncated silences. This also reduced the size of the training data from around 13 hours to around 7 hours.
3. Truncating silence in test data: Inspired by the above gain, we truncated long silences to 0.3 sec in the test data too, before decoding. But, surprisingly, this increased the WER to 50.3%. Hence, truncating silences in the test data was avoided in later experiments.
4. Using SGMM model: The SGMM is known to estimate robust parameters and perform better than a simple tri-phone model, especially when the training data is small. A WERR of 9.3% was achieved upon migrating from the tri-phone model to the SGMM.
5. Using DNN model: DNNs are the state of the art, but they have been observed to yield poorer or merely comparable results relative to SGMMs when the training data is small. We trained a basic DNN with 429 nodes in the input layer (5 context frames), three hidden layers of sizes 512:256:512 with the 256-node layer acting as a bottleneck, and 930 output nodes, optimized with stochastic gradient descent to minimize the cross-entropy (a layer-size sketch is given after this list). This increased the WER to 23.5%. Though DNNs could perhaps have been made to outperform SGMMs with proper regularization, because of time constraints we stuck to the SGMM acoustic model.
6. Using position-independent phones: This experiment tested how position-independent phones fare against position-dependent phones. Not so surprisingly, it degraded the WER by 1%, so position-dependent phones were used in further experiments.
7. Unsupervised adaptation: In unsupervised adaptation, we folded the test data, comprising 332 utterances, together with its approximate hypotheses (obtained by decoding with the SGMM from the previous run) into the training data, and re-trained the SGMM acoustic model. This gave a 2.0% WERR.
8. Audio augmentation 1: Inspired by [3], the speed of the original training data was perturbed by factors of 0.9 and 1.1, and these perturbed copies were added to the original training data. This helped achieve a 1.1% WERR.
9. Audio augmentation 2: Here, four perturbed copies with speed factors 0.8, 0.9, 1.1 and 1.2 were added to the original training data. This gave a 0.8% WERR, less than the 1.1% achieved in the previous experiment. Hence, for the final system, we augmented the original data with perturbed copies of speed factors 0.9 and 1.1 only.
10. Using pitch information: The confused words in the hypotheses seemed to be acoustically close, as many confusion pairs differed by just one phone. For some words, the confusions appeared to arise from different tonal manifestations of the same phone. This gave us the idea of using pitch information along with traditional MFCCs, as explained in [4]. This gave a 1.2% WERR and helped eliminate a few recurring confusions.
11. Using 5-gram LM: Next, higher-order N-grams were tried in order to put more constraints on the hypothesis and consequently improve the WER. Using a 5-gram LM instead of the trigram LM helped achieve a 2.0% WERR.
12. Using 7-gram LM: Inspired by the above gain, an even higher-order N-gram model (7-gram) was tried. This gave a 1.5% WERR, which is less than the 2.0% achieved with 5-grams. Hence, the 5-gram LM was used in the final system.
13. Combined system: For the final system we combined everything that had yielded an improvement: truncating silence in the training data, using the SGMM, unsupervised adaptation, data augmentation with speed factors 0.9 and 1.1, using pitch, and using the 5-gram LM. This combined system gave a WER of 13.8%.
14. Rescoring lattices using RNN LM: The motivation behind using the RNN LM [6] was to see how much gain we could achieve by adding further constraints (beyond the 5-gram LM) on the LM side, using a model that captures long-term dependencies in text in a manner distinct from N-grams. The lattices were rescored using the RNN LM, but this gave only a 0.3% improvement. Probably the limited amount of training text prevented the RNN LM from being used to full advantage.
15. Hypothesis combination: ROVER [7] is a well-known technique to combine hypotheses from multiple different systems (a voting sketch is given after this list). The individual systems that had given improvements were combined with the combined system discussed above, but this did not yield better results than the combined system alone.
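The DNN in experiment 5 is described only by its layer sizes. The sketch below simply instantiates those dimensions (429 inputs, 512:256:512 hidden layers with a 256-node bottleneck, 930 outputs) as a plain NumPy forward pass so the shapes are explicit; the activation choice, initialization, and the absence of any training loop are our simplifications, not the authors' setup.

```python
# Shape-only sketch of the bottleneck DNN from experiment 5 (not the authors' code).
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [429, 512, 256, 512, 930]   # 5-frame context in, 930 senone posteriors out

# random weights/biases just to make the shapes concrete
weights = [rng.standard_normal((m, n)) * 0.01
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]


def forward(x):
    h = x
    for i, (w, b) in enumerate(zip(weights, biases)):
        h = h @ w + b
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)         # hidden nonlinearity (assumed)
    e = np.exp(h - h.max())                # softmax over the 930 outputs
    return e / e.sum()


posterior = forward(rng.standard_normal(429))
print(posterior.shape)                     # (930,)
```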
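ROVER (item 15) aligns the word sequences produced by several systems and votes on the most frequent word at each aligned position. The real tool is the rover binary in NIST's SCTK; the toy sketch below only illustrates the voting idea for pre-aligned, equal-length hypotheses with made-up words, and is not a substitute for the actual alignment step.

```python
# Toy illustration of ROVER-style voting (assumes hypotheses are already aligned
# word-by-word; the real ROVER builds a word transition network first).
from collections import Counter


def vote(hypotheses):
    combined = []
    for words in zip(*[h.split() for h in hypotheses]):
        combined.append(Counter(words).most_common(1)[0][0])
    return " ".join(combined)


hyps = [                       # hypothetical single-utterance outputs of three systems
    "xin chao cac ban",
    "xin chau cac ban",
    "xin chao cac banh",
]
print(vote(hyps))              # -> "xin chao cac ban"
```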
3.2 Final Results
In total, the test data comprised 332 utterances drawn from ELSA, forvo.com, rhinospike.com and youtube.com. The WERs achieved by our system on these individual test sets are, in the respective order, 5.7%, 72.5%, 25.3% and 91.4%. The average WER is 51.2%. While our system did well on the data from ELSA and rhinospike.com, it did relatively poorly on the data from forvo.com and youtube.com.
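Note that the reported average (51.2%) is not the simple mean of the four per-set figures (48.7%), which suggests a weighting by set size. As a reminder of the usual convention, the sketch below pools errors and reference words across sets before dividing; the per-set word counts used here are purely hypothetical, since the paper does not report them.

```python
# Generic pooled-WER computation (hypothetical word counts; not the paper's numbers).
def pooled_wer(sets):
    """sets: list of (per_set_wer_percent, num_reference_words)."""
    total_errors = sum(wer / 100.0 * words for wer, words in sets)
    total_words = sum(words for _, words in sets)
    return 100.0 * total_errors / total_words


example = [(5.7, 800), (72.5, 1500), (25.3, 700), (91.4, 1000)]  # made-up word counts
print(round(pooled_wer(example), 1))   # a word-weighted average, not the simple mean
```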
"A post-processing system to yield [3] Ko, Tom, et al. Audio augmentation for speech recognition reduced word error rates: Recognizer output voting error Proceedings of INTERSPEECH. 2015. reduction (ROVER)." Automatic Speech Recognition and [4] Ghahremani, Pegah, et al. A pitch extraction algorithm tuned Understanding, 1997. Proceedings. 1997 IEEE Workshop for automatic speech recognition." 2014 IEEE International on. IEEE, 1997