<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ININ submission to Zero Cost ASR task at MediaEval 2016</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tejas Godambe</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Naresh Kumar</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pavan Kumar</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Veera Raghavendra</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aravind Ganapathiraju</string-name>
        </contrib>
        <aff>India</aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>This paper details the experiments conducted to train the best-performing Vietnamese speech recognition system we could using public domain data only, as part of the Zero Cost task at MediaEval 2016. We explored audio preprocessing, the use of the speaker's pitch information, and data perturbation for building a subspace Gaussian mixture acoustic model, which is known to estimate robust parameters when little training data is available. We also explored unsupervised adaptation, RNN language model based lattice rescoring, and system combination using the ROVER technique.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The goal of the zero cost ASR task is to bring researchers
together on the topic of training ASR systems using only data
available in the public domain. In particular, this year’s task
consisted of developing an LVCSR system for Vietnamese,
a relatively under-resourced language that nevertheless has
enough public data to work with. More details on this task can
be found in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Section 2 outlines the steps followed for building the final
system. Section 3 describes each experiment we conducted in
detail and discusses the gain or loss in accuracy it produced.
We conclude the paper in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>2. APPROACH</title>
      <p>
        We used the Kaldi ASR toolkit [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for building the system.
As no lexicon was provided, graphemes were used as phonemes,
which gave 96 unique phonemes. The steps below were followed
for the development of the final system.
      </p>
      <p>1. Truncate long silences in the training data to 0.3 sec.</p>
      <p>
        2. Augment the data with speed-perturbed versions of itself (speed
factors 0.9 and 1.1) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        3. Extract MFCCs along with pitch information [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        4. Build an SGMM acoustic model [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>5. Construct a 5-gram language model (LM) from the training
text.</p>
      <p>6. Perform unsupervised adaptation, i.e. decode the test
utterances with the above system and add them to the training
data along with their approximate hypothesized
transcriptions. Three copies of the test data (speed factors
0.9, 1.0 and 1.1) were added.</p>
      <p>
        7. Generate lattices and rescore them with an RNN-based
language model [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>8. Do the final decoding.</p>
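      <p>The speed perturbation used for augmentation can be illustrated with a minimal sketch. It resamples a mono signal by linear interpolation so that it plays a given factor faster or slower. This is only an illustration under our own assumptions: the function name speed_perturb is hypothetical, and a real pipeline would normally rely on a dedicated resampler with proper filtering rather than this code.</p>
      <preformat>
```python
def speed_perturb(samples, factor):
    """Resample a mono signal so that it plays `factor` times faster.

    Illustrative only: plain linear interpolation, no anti-aliasing filter.
    `samples` is a list of floats; `factor` is e.g. 0.9 or 1.1.
    """
    if not factor > 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    n_out = int(round(len(samples) / factor))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        # Position of output sample i on the original time axis.
        pos = min(i * factor, last)
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def augment(samples, factors=(0.9, 1.1)):
    """Return the original signal plus one perturbed copy per factor."""
    return [samples] + [speed_perturb(samples, f) for f in factors]
```
      </preformat>
      <p>A factor of 0.9 lengthens the signal and a factor of 1.1 shortens it, so the augmented set holds the original plus one slower and one faster copy of every utterance, each paired with the unchanged transcript.</p>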
    </sec>
    <sec id="sec-3">
      <title>3. RESULTS AND DISCUSSION</title>
    </sec>
    <sec id="sec-4">
      <title>3.1 Preliminary Analysis</title>
      <p>The sequence of experiments performed, and the gain or loss
in WER incurred with each of them, are detailed below. Table 1
shows the WER and the word error rate reduction (WERR)
achieved for each individual experiment. The WER was
calculated on a very small local development set comprising
only 21 utterances.</p>
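      <p>For readers less familiar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions and insertions) between the reference and the hypothesis, divided by the number of reference words. The sketch below shows that standard definition; the function name wer is hypothetical, and this is not the scoring tool used in the task.</p>
      <preformat>
```python
def wer(reference, hypothesis):
    """Word error rate in percent, via word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return 100.0 * prev[-1] / len(ref)
```
      </preformat>
      <p>With only 21 development utterances, a single misrecognized word can shift this figure noticeably, which is worth keeping in mind when comparing the small differences between experiments.</p>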
      <p>1. Using tri-phone model: We first trained a tri-phone model
with 2000 senones and a total of 20k Gaussians to see whether we
could replicate the baseline result. This gave a WER of
37.0%.</p>
      <sec id="sec-4-1">
        <title>2. Truncating silence in training data</title>
        <p>Preliminary observation of a few wave files showed the presence of long
silences, which usually corrupt the acoustic model. A
WERR of 9.6% was achieved when the tri-phone model was
trained after truncating long silences to 0.3 sec in the training
data. Henceforth, for all experiments, we used the training
data with truncated silences. This also reduced the size of the
training data from around 13 hours to around 7 hours.</p>
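        <p>The truncation step can be sketched as follows. How silence was detected is not specified here, so the sketch assumes that per-frame silence flags are already available from some voice activity detector and that the frame shift is 10 ms, so that 0.3 sec corresponds to 30 frames. The function name and both assumptions are ours.</p>
        <preformat>
```python
def truncate_silence(is_silence, max_run=30):
    """Return the indices of frames to keep.

    `is_silence` is a list of booleans, one per frame (assumed 10 ms shift).
    Each run of consecutive silent frames is capped at `max_run` frames;
    speech frames are always kept.
    """
    keep = []
    run = 0
    for idx, silent in enumerate(is_silence):
        run = run + 1 if silent else 0
        if not silent or max_run >= run:
            keep.append(idx)
    return keep
```
        </preformat>
        <p>For example, a one-second pause (100 frames) between two speech segments is reduced to its first 30 frames, while pauses shorter than 0.3 sec are left untouched.</p>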
      </sec>
      <sec id="sec-4-2">
        <title>3. Truncating silence in test data</title>
        <p>Inspired by the above gain, we truncated long silences to 0.3 sec in the test data too,
before decoding. Surprisingly, this increased the WER to
50.3%. Hence, truncating silences in the test data was avoided
in subsequent experiments.</p>
        <p>4. Using SGMM model: The SGMM is known to estimate
robust parameters and to perform better than a simple tri-phone
model, especially when the training data is small. A
WERR of 9.4% was achieved upon migrating from the tri-phone
model to the SGMM.</p>
        <p>5. Using DNN model: DNNs are the state of the art, but they have
been observed to yield poorer or merely comparable results to
SGMMs when the training data is small. We trained
a basic DNN with 429 nodes in the input layer (5
context frames), three hidden layers of sizes 512:256:512 with the
256-node layer being the bottleneck, and 930 output nodes,
optimized using stochastic gradient descent to minimize the
cross-entropy. This increased the WER to 23.5%.
Though DNNs could have been made to perform better than
SGMMs using proper regularization, because of time
constraints we stayed with the SGMM acoustic model.</p>
      </sec>
      <sec id="sec-4-3">
        <title>6. Using position-independent phones</title>
        <p>This experiment was to see how position-independent phones fare
against position-dependent phones. Not so surprisingly,
this step degraded the WER by 1%.
So, position-dependent phones were used for further
experiments.</p>
        <p>7. Unsupervised adaptation: In unsupervised adaptation, we
folded the test data, comprising 332 utterances with their
approximate hypotheses (obtained by decoding with the SGMM
in the previous run), into the training data and re-trained the
SGMM acoustic model. This gave a 2.0% WERR.</p>
        <p>
          8. Audio augmentation 1: Inspired by [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], the speed of the
original training data was perturbed by factors of 0.9 and 1.1,
and these perturbed copies were added to the original
training data. This yielded a 1.1% WERR.
        </p>
        <p>9. Audio augmentation 2: Here, four perturbed copies with
speed factors 0.8, 0.9, 1.1 and 1.2 were added to the
original training data. This gave a 0.8% WERR, which is less
than the 1.1% achieved in the previous experiment. Hence, for
the final system, we augmented the original data with perturbed
copies of speed factors 0.9 and 1.1 only.</p>
        <p>
          10. Using pitch information: The confused words in the
hypotheses seemed to be acoustically close, as many
confusion pairs differed by just one phone. For some words,
the confusions appeared to arise from
different tonal manifestations of the same phone. This suggested
using pitch information along with traditional
MFCCs, as explained in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. This gave a 1.2% WERR and
helped to eliminate a few recurring confusions.
        </p>
        <p>11. Using 5-gram LM: Next, higher-order N-grams were tried in
order to put more constraints on the hypothesis and
consequently improve the WER. Using a 5-gram LM instead
of a trigram LM yielded a 2.0% WERR.</p>
        <p>12. Using 7-gram LM: Inspired by the above gain, we also
experimented with an even higher-order 7-gram LM. This gave
a 1.5% WERR, which is less than the 2.0% achieved with 5-grams.</p>
        <p>Hence, in the final system, the 5-gram LM was used.</p>
        <p>13. Combined system: For the final system we combined all the
techniques that gave an improvement, namely truncating silence in the
training data, using the SGMM, unsupervised adaptation, data
augmentation with speed factors 0.9 and 1.1, using pitch, and
using the 5-gram LM. This combined system gave a WER of 13.8%.</p>
        <p>
          14. Rescoring lattices using RNN-LM: The motivation behind
using an RNN LM [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] was to see how much gain we could
achieve by putting more constraints (apart from the 5-gram
LM) from the LM side, using a model which captures
long-term dependencies in text differently from
N-grams. The lattices were rescored using the RNN LM, but
it gave only a 0.3% improvement. The limited amount of
training text probably prevented us from getting the full advantage of the RNN-LM.
        </p>
        <p>
          15. Hypothesis combination: ROVER [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] is a well-known
technique for combining hypotheses from multiple different
systems. The individual systems which had given improvements
were combined with the combined system discussed above,
but this did not yield better results than the combined system alone.
        </p>
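        <p>To make the idea behind hypothesis combination concrete, the sketch below shows only the voting stage in a heavily simplified form. It assumes the hypotheses have already been aligned slot by slot, with a placeholder token marking an empty slot, and it picks the most frequent word in each slot. Full ROVER first builds that alignment as a word transition network and can also weigh confidence scores; neither is shown. The function name vote and the "@" null token are our own illustrative choices.</p>
        <preformat>
```python
from collections import Counter


def vote(aligned_hyps, null="@"):
    """Majority vote over pre-aligned hypotheses.

    `aligned_hyps` is a list of equal-length word lists, one per system;
    `null` marks a slot where a system produced no word.
    Ties go to the system listed first.
    """
    n_slots = len(aligned_hyps[0])
    if any(len(hyp) != n_slots for hyp in aligned_hyps):
        raise ValueError("hypotheses must be aligned to the same length")
    combined = []
    for slot in zip(*aligned_hyps):
        counts = Counter(slot)
        # max() returns the first maximal element, so ties favour
        # the earliest system in the list.
        best = max(slot, key=lambda word: counts[word])
        if best != null:
            combined.append(best)
    return combined
```
        </preformat>
        <p>Voting of this kind helps most when the systems make different errors. When the individual systems are close variants of one another, as here, their errors tend to coincide and the vote has little to correct.</p>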
      </sec>
    </sec>
    <sec id="sec-5">
      <title>3.2 Final Results</title>
      <p>In total, the test data comprised 332 utterances, drawn
from ELSA, forvo.com, rhinospike.com and
youtube.com. The WERs achieved by our system on these
individual test sets, in the same order, are 5.7%, 72.5%,
25.3% and 91.4%. The average WER is 51.2%. While our system did
well on data from ELSA and rhinospike.com, it performed relatively poorly
on data from forvo.com and youtube.com.</p>
      <p>Table 1: WER and WERR achieved for each individual experiment (rows 1 to 15).</p>
    </sec>
    <sec id="sec-6">
      <title>4. CONCLUSION</title>
      <p>In this task, we confronted the real-world problem of building
an ASR system from public domain data containing noise and
having imperfect transcripts. The data was also inherently small in
size, so the problem of noisy acoustics and imperfect transcripts
was compounded by the low-resource setting. In our system, we
looked at different aspects of ASR system building, such as
audio pre-processing, data perturbation, using pitch information,
acoustic modeling, language modeling using higher-order
N-grams, unsupervised adaptation, lattice rescoring, and system
combination. Each of the above techniques contributed its share
toward bringing down the WER of the final system.</p>
    </sec>
    <sec id="sec-7">
      <title>5. ACKNOWLEDGEMENTS</title>
      <p>We thank the event and task organizers for their prompt
responses to our queries related to the task.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Szoke</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anguera</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <year>2016</year>
          ,
          <article-title>Zero cost speech recognition task at MediaEval 2016</article-title>
          , In Working Notes,
          <source>Proceedings of the MediaEval 2016 Workshop</source>
          , Hilversum, Netherlands,
          20-21 Oct. 2016.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Povey</surname>
          </string-name>
          , et al.
          <year>2011</year>
          .
          <article-title>The Kaldi speech recognition toolkit</article-title>
          .
          <source>In Proceedings of ASRU</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Ko</surname>
            ,
            <given-names>Tom</given-names>
          </string-name>
          , et al.
          <article-title>Audio augmentation for speech recognition</article-title>
          .
          <source>Proceedings of INTERSPEECH</source>
          .
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Ghahremani</surname>
            ,
            <given-names>Pegah</given-names>
          </string-name>
          , et al.
          <article-title>A pitch extraction algorithm tuned for automatic speech recognition</article-title>
          . 2014 IEEE International
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Povey</surname>
            ,
            <given-names>Daniel</given-names>
          </string-name>
          , et al.
          <article-title>Subspace Gaussian mixture models for speech recognition</article-title>
          .
          <source>2010 IEEE International Conference on Acoustics, Speech and Signal Processing</source>
          . IEEE,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>Tomas</given-names>
          </string-name>
          , et al.
          <article-title>RNNLM - Recurrent neural network language modeling toolkit</article-title>
          .
          <source>Proc. of the 2011 ASRU Workshop</source>
          .
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Fiscus</surname>
            ,
            <given-names>Jonathan G.</given-names>
          </string-name>
          .
          <article-title>A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER)</article-title>
          .
          <source>Proceedings of the 1997 IEEE Workshop on Automatic Speech Recognition and Understanding</source>
          ,
          <year>1997</year>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>