<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The NNI Vietnamese Speech Recognition System for MediaEval 2016</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lei Wang</string-name>
          <email>wangl@i2r.a-star.edu.sg</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chongjia Ni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cheung-Chi Leung</string-name>
          <email>ccleung@i2r.a-star.edu.sg</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Changhuai You</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lei Xie</string-name>
          <email>lxie@nwpu.edu.cn</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haihua Xu</string-name>
          <email>haihuaxu@ntu.edu.sg</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiong Xiao</string-name>
          <email>xiaoxiong@ntu.edu.sg</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tin Lay Nwe</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eng Siong Chng</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bin Ma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haizhou Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Infocomm Research (I2R), A*STAR</institution>
          ,
          <country country="SG">Singapore</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Nanyang Technological University (NTU)</institution>
          ,
          <country country="SG">Singapore</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>National University of Singapore (NUS)</institution>
          ,
          <country country="SG">Singapore</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Northwestern Polytechnical University (NWPU)</institution>
          ,
          <addr-line>Xi'an</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>This paper provides an overall description of the Vietnamese speech recognition system developed by the joint team for MediaEval 2016. The submitted system consists of three sub-systems and adopts different deep neural network-based techniques, such as fMLLR-transformed bottleneck features and sequence training. Besides the acoustic modeling techniques, speech data augmentation was also examined to develop a more robust acoustic model. The I2R team collected a number of text resources from the Internet and made them available to the other participants in the task. The crawled web text was used to train a 5-gram language model. The submitted system obtained token error rates (TER) of 15.1, 23.0 and 50.5 on the Devel local set, Devel set and Test set, respectively.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The zero-cost speech recognition task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] at MediaEval 2016
aims to build Vietnamese automatic speech recognition (ASR)
systems using publicly available multimedia resources (such as
texts, audio, videos, and dictionaries). About 10 hours of transcribed
speech data were provided by the task organizers. The provided
data came from three different sources and were recorded in different
environments.
      </p>
      <p>Our submitted system consists of three sub-systems: 1) a
DNN-HMM system with MFCC; 2) a DNN-HMM system with data
augmentation; and 3) a DNN-HMM system with bottleneck features (BNFs).
To build the acoustic models, tonal information was incorporated in
the front-end processing, and bottleneck features were
used. Traditional GMM-HMM models were used to align the
speech frames to the phonetic transcription, and Deep Neural
Network (DNN) models were trained with the cross-entropy
criterion, followed by sequence training based on state-level
minimum Bayes risk (sMBR). To improve the robustness of the
acoustic model, data augmentation was attempted. To build a
language model (LM), web page data were crawled from the
Internet. Other publicly available text resources were also
used, and we made them accessible to all participants.</p>
    </sec>
    <sec id="sec-2">
      <title>2. DATA CONTRIBUTION</title>
      <p>The I2R team collected and contributed the following text
resources:
- 890 thousand URLs of Vietnamese web pages, crawled
using a large number and variety of keywords and phrases;
- an XML dump of Vietnamese Wikipedia's articles [2]
and its cleaned text;
- 4 Vietnamese word lists [3] of different sizes.</p>
      <p>The above text data were made available to other participants
in the task. The corresponding data pack, i2r-data-pack, is also
available for download at
https://github.com/viet-asr/i2r-datapack-v1.0.</p>
    </sec>
    <sec id="sec-3">
      <title>3. APPROACHES</title>
      <p>This section describes the acoustic modeling of the three
sub-systems, as well as the text data and lexicon used for language
modeling. The three sub-systems share the same lexicon and language
model in decoding, described in Section 3.4. The hypotheses of the
three sub-systems were fused into a single output for the final
submission using the ROVER algorithm [4].</p>
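      <p>The ROVER fusion step can be illustrated with a toy sketch. This is not the NIST ROVER tool: the real algorithm first aligns the hypotheses into a word transition network by iterative dynamic programming; here the hypotheses are assumed to be pre-aligned with a null token, and only the majority-voting stage is shown.</p>
      <preformat>
```python
from collections import Counter

def rover_vote(aligned_hyps, null="@"):
    """Word-level majority vote over pre-aligned hypotheses.

    Each hypothesis is an equal-length word sequence in which
    insertions/deletions have already been aligned using `null`.
    """
    fused = []
    for slot in zip(*aligned_hyps):
        word, _ = Counter(slot).most_common(1)[0]
        if word != null:  # a winning null token means "output nothing"
            fused.append(word)
    return fused

hyps = [["xin", "chao", "@", "ban"],
        ["xin", "chao", "cac", "ban"],
        ["xin", "chao", "cac", "ban"]]
print(rover_vote(hyps))  # ['xin', 'chao', 'cac', 'ban']
```
      </preformat>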
    </sec>
    <sec id="sec-4">
      <title>3.1 DNN-HMM System with MFCC</title>
      <p>
        We used 56-dimensional acoustic features, consisting of
13-dimensional MFCC, 1-dimensional F0, and their derived deltas,
acceleration, and third-order deltas, as the input of a DNN-HMM
hybrid system [5]. The acoustic model treats the 94 graphemes
discovered in the training transcription as
monophones. The context-dependent triphones were modeled by
801 senones. The final model was trained with the cross-entropy
criterion and sMBR [
        <xref ref-type="bibr" rid="ref2">6</xref>
        ], on top of a GMM-HMM model trained
using maximum mutual information (MMI) [
        <xref ref-type="bibr" rid="ref3">7</xref>
        ]. The DNN
structure consists of 5 layers with 1024 nodes per layer. The
training corpus, provided by the organizers, totals about 10 hours.
      </p>
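      <p>The 56 dimensions arise as 14 base dimensions (13 MFCC + 1 F0) times four blocks: statics, deltas, acceleration, and third-order deltas. A minimal numpy sketch of stacking regression-based deltas is shown below; the exact delta window and front end used by the authors are not specified, so a common +/-2-frame window is assumed.</p>
      <preformat>
```python
import numpy as np

def deltas(feats, window=2):
    """Regression-based delta features over a +/-`window` frame span
    (an assumed, Kaldi-style default; edge frames are replicated)."""
    num = np.zeros_like(feats)
    denom = 2 * sum(t * t for t in range(1, window + 1))
    padded = np.pad(feats, ((window, window), (0, 0)), mode="edge")
    for t in range(1, window + 1):
        num += t * (padded[window + t : padded.shape[0] - window + t]
                    - padded[window - t : padded.shape[0] - window - t])
    return num / denom

# 14 base dims per frame: 13 MFCC + 1 F0 (illustrative random values)
base = np.random.randn(100, 14)
d1 = deltas(base)            # deltas
d2 = deltas(d1)              # acceleration
d3 = deltas(d2)              # third-order deltas
feats56 = np.hstack([base, d1, d2, d3])
print(feats56.shape)  # (100, 56)
```
      </preformat>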
    </sec>
    <sec id="sec-5">
      <title>3.2 DNN-HMM System with Data Augmentation</title>
      <p>Among the 10 hours of Vietnamese training data, some
utterances are relatively clean, some have been filtered through
denoising algorithms, and most are contaminated by
different kinds of background noise. To improve the robustness of
our recognition system against noisy speech, we augmented the
training set by corrupting each original utterance with
noise and by applying speech enhancement to each original
utterance. After data augmentation, the total amount of training
data was three times the original.</p>
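      <p>The noise-corruption half of the augmentation can be sketched as follows. The SNR levels and noise-segment selection actually used are not stated in the paper, so the values here are purely illustrative.</p>
      <preformat>
```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Corrupt `speech` with `noise` scaled to a target SNR in dB.
    The noise segment is tiled/cropped to the speech length."""
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # 1 s of "speech" at 16 kHz
noise = rng.standard_normal(4000)    # a shorter extracted noise segment
noisy = add_noise(clean, noise, snr_db=10.0)
print(noisy.shape)  # (16000,)
```
      </preformat>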
      <p>
        Different kinds of background noise were extracted from the
training utterances using a voice activity detection algorithm.
Representative noise segments were selected and added at random
to the original training utterances. Speech enhancement comprises
two main estimation modules: estimation of the speech and
estimation of the noise. We used a modified version of the
log-spectral-amplitude (LSA) minimum mean square error (MMSE)
algorithm as the speech estimator [
        <xref ref-type="bibr" rid="ref4">8</xref>
        ]. The quality of the estimated speech with
a given speech estimator depends heavily on the accuracy of the
estimated noise statistics. To improve performance in
non-stationary background noise conditions, we adopted
minimum searching with the speech presence probability (SPP)
for noise estimation [
        <xref ref-type="bibr" rid="ref5">9</xref>
        ].
      </p>
      <p>With the augmented training data, we trained another
DNN-HMM hybrid system with the same network structure (i.e., the
same numbers of hidden layers, hidden units per layer, and tied
states) as the system in Section 3.1. During recognition, we used
the original development/test utterances for decoding.</p>
    </sec>
    <sec id="sec-7">
      <title>3.3 DNN-HMM System with BNFs</title>
      <p>
        Another DNN-HMM hybrid system used bottleneck features
(BNFs) and fMLLR features as its input. This type of BNF-based
system [
        <xref ref-type="bibr" rid="ref6 ref7">10-11</xref>
        ] is commonly used under limited-training-data
conditions. For bottleneck feature extraction, 13-dimensional
MFCC and 2-dimensional F0-related features were extracted.
Nine adjacent frames of features were then concatenated and
an LDA+MLLT+fMLLR transform was applied. MLLT makes the
features better modeled by diagonal-covariance Gaussians.
The resulting 40-dimensional fMLLR features were used for BNF
extraction by a DNN consisting of 6 hidden layers with 1024
nodes in each non-bottleneck layer, yielding 42-dimensional BNFs.
      </p>
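      <p>The front-end dimensionality flow (nine spliced frames of 15-dimensional features projected down to 40 dimensions) can be sketched as below; a random matrix stands in for the trained LDA+MLLT+fMLLR transform, which in practice is estimated from data.</p>
      <preformat>
```python
import numpy as np

def splice(feats, context=4):
    """Concatenate each frame with its +/-`context` neighbours
    (9 frames total for context=4), edge-padding at the boundaries."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    n = feats.shape[0]
    return np.hstack([padded[i : i + n] for i in range(2 * context + 1)])

# 15 dims per frame: 13 MFCC + 2 F0-related features
frames = np.random.randn(200, 15)
spliced = splice(frames)             # 9 x 15 = 135 dims per frame
# Stand-in for the trained LDA+MLLT+fMLLR projection to 40 dims
transform = np.random.randn(135, 40)
fmllr40 = spliced @ transform
print(spliced.shape, fmllr40.shape)  # (200, 135) (200, 40)
```
      </preformat>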
      <p>The 42-dimensional BNFs and the 40-dimensional fMLLR
features were then concatenated to form 82-dimensional features,
and an fMLLR transform was applied again (reducing to 60 dimensions)
to normalize inter-speaker variability. The final 60-dimensional
features were used as the input of another DNN. This DNN contains
6 layers of 1024 nodes each, and its output layer contains
2073 senones. The final model was trained with the cross-entropy
criterion and sMBR, on top of a GMM-HMM model trained using
MMI.</p>
    </sec>
    <sec id="sec-8">
      <title>3.4 Lexicon and Language Model</title>
      <p>The grapheme-based lexicon contains about 11,000
Vietnamese syllables and English words that occur in the
training transcription and in the 74,000-entry Vietnamese word
list in i2r-data-pack.</p>
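      <p>In its simplest form, a grapheme-based lexicon maps each syllable or word to its character sequence. The sketch below illustrates the idea; it is not the authors' actual lexicon-building procedure.</p>
      <preformat>
```python
def grapheme_lexicon(words):
    """Map each word/syllable to its grapheme (character) sequence,
    the simplest form of a grapheme-based pronunciation lexicon."""
    return {w: list(w.lower()) for w in sorted(set(words))}

lex = grapheme_lexicon(["xin", "chao", "cac", "ban", "xin"])
print(lex["xin"])  # ['x', 'i', 'n']
```
      </preformat>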
      <p>A 5-gram LM was trained using the following 4 data sources:
1) 7GB of text extracted from the list of web pages in
i2r-data-pack;
2) 750MB of text from Wikipedia's articles in
i2r-data-pack;
3) 90MB of Vietnamese-English subtitles
released by BUT;
4) the transcription of the training utterances.</p>
      <p>The final LM was obtained by linear interpolation of four
LMs, each trained on one of the above data sources. The
interpolation weights were optimized on the transcript of the
development set (Devel local). In our preliminary systems, both
perplexity and TER on the Devel local set were reduced when the
web data were included in the language model training.</p>
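      <p>Interpolation weights for a linear mixture of LMs are conventionally estimated by EM on held-out data (as implemented, for example, by SRILM's compute-best-mix; the actual tool used here is not stated). The sketch below assumes a matrix of per-word probabilities from each component LM.</p>
      <preformat>
```python
import numpy as np

def em_interp_weights(prob_matrix, iters=50):
    """EM estimation of linear-interpolation weights.

    `prob_matrix[i, k]` is the probability that component LM k
    assigns to held-out word i; the returned weights maximize the
    held-out likelihood of the interpolated model.
    """
    n, k = prob_matrix.shape
    w = np.full(k, 1.0 / k)                   # start from uniform weights
    for _ in range(iters):
        mix = prob_matrix * w                 # per-component numerators
        post = mix / mix.sum(axis=1, keepdims=True)  # posteriors
        w = post.mean(axis=0)                 # re-estimate weights
    return w

# toy held-out probabilities from 4 component LMs
probs = np.array([[0.10, 0.02, 0.05, 0.30],
                  [0.20, 0.01, 0.02, 0.25],
                  [0.05, 0.03, 0.01, 0.40]])
w = em_interp_weights(probs)
print(w.round(3))  # four non-negative weights summing to 1
```
      </preformat>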
    </sec>
    <sec id="sec-9">
      <title>4. RESULTS AND DISCUSSION</title>
      <p>Table 1 summarizes the ASR performance of each system on
the 3 test sets. We observed that the performance on the Devel
local set is not consistent with that on the other 2 sets, which
we attribute to the small amount of data (~13 minutes)
in the Devel local set. Moreover, since the Devel local set is a
small subset of the Devel set, our analysis focuses on Devel and
Test. Note, however, that most system configurations were
tuned on the Devel local set, to avoid frequent uploads of our
results to the leaderboard.</p>
      <p>The BNF-based system clearly performs best
among the sub-systems, probably owing to the more
robust bottleneck features and to speaker normalization by fMLLR.
The data augmentation technique improves system
performance by 1.5% and 2.2% relative on the Devel and Test sets,
respectively. Data augmentation provides varied training
speech with noisy backgrounds, which improves the
robustness of the acoustic model. Moreover, we believe the
resulting acoustic model is more robust to unseen data, e.g.,
the surprise data in the Test set.</p>
      <p>The fused system has the best overall performance, which
can be attributed to the complementarity of the 3 sub-systems.</p>
    </sec>
    <sec id="sec-10">
      <title>5. CONCLUSION</title>
      <p>This work describes the acoustic modeling of the 3 sub-systems
and the approach to language modeling under the limited-training-data
condition. We relied on the provided training corpus to build the
acoustic models, and collected web text data to build the LM.</p>
      <p>We reported the ASR performance achieved by the deadline of
the task. In future work, we will examine data augmentation on the
BNF-based system, and will further investigate using the speech data
contributed by other participants.</p>
      <p>[2] https://dumps.wikimedia.org/viwiki/20160501/viwiki-20160501-pages-meta-current.xml.bz2
[3] http://www.informatik.uni-leipzig.de/~duc/software/misc/wordlist.html
[4] J. Fiscus, “A post-processing system to yield reduced word
error rates: Recognizer Output Voting Error Reduction
(ROVER),” in Proc. ASRU 1997, 1997, pp. 347-354.
[5] G. E. Dahl, D. Yu, L. Deng, and A. Acero,
“Context-dependent pre-trained deep neural networks for
large-vocabulary speech recognition,” IEEE Trans. Audio, Speech,
and Language Processing, vol. 20, no. 1, pp. 30-42, Jan. 2012.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Igor</given-names>
            <surname>Szoke</surname>
          </string-name>
          and Xavier Anguera, “Zero-Cost
          <source>Speech Recognition Task at Mediaeval</source>
          <year>2016</year>
          ,” in Proc. MediaEval 2016 Workshop, Hilversum, Netherlands, Oct.
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Veselý</surname>
          </string-name>
          , Karel, Arnab Ghoshal, Lukás Burget, and Daniel Povey, “
          <article-title>Sequence-discriminative training of deep neural networks,”</article-title>
          <source>in Proc. Interspeech</source>
          <year>2013</year>
          , pp.
          <fpage>2345</fpage>
          -
          <lpage>2349</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          , “
          <article-title>Discriminative training for large vocabulary speech recognition,”</article-title>
          <source>Ph.D. dissertation</source>
          , Cambridge University Engineering Dept,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Gemello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Mana</surname>
          </string-name>
          , and R. D. Mori, “
          <article-title>Automatic Speech Recognition with a Modified Ephraim-Malah Rule,” in IEEE Signal Processing Letters</article-title>
          , vol.
          <volume>13</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>56</fpage>
          -
          <lpage>59</lpage>
          , Jan.
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rangachari</surname>
          </string-name>
          and
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Loizou</surname>
          </string-name>
          , “
          <article-title>A noise-estimation algorithm for highly non-stationary environments,” Speech Communication</article-title>
          , vol.
          <volume>48</volume>
          , pp.
          <fpage>220</fpage>
          -
          <lpage>231</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Chongjia</surname>
            <given-names>Ni</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cheung-Chi</surname>
            <given-names>Leung</given-names>
          </string-name>
          , Lei Wang,
          <string-name>
            <given-names>Nancy F.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and Bin Ma, “
          <article-title>Unsupervised data selection and word-morph mixed language model for Tamil low-resource keyword search,”</article-title>
          <source>in Proc. ICASSP</source>
          <year>2015</year>
          , Brisbane, Australia,
          <year>April 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Chongjia</surname>
            <given-names>Ni</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cheung-Chi</surname>
            <given-names>Leung</given-names>
          </string-name>
          , Lei Wang, Haibo Liu,
          <string-name>
            <given-names>Feng</given-names>
            <surname>Rao</surname>
          </string-name>
          , Li Lu,
          <string-name>
            <given-names>Nancy F.</given-names>
            <surname>Chen</surname>
          </string-name>
          , Bin Ma, and
          <string-name>
            <given-names>Haizhou</given-names>
            <surname>Li</surname>
          </string-name>
          , “
          <article-title>Cross-lingual deep neural network based submodular unbiased data selection for low-resource keyword search,”</article-title>
          <source>in Proc. ICASSP</source>
          <year>2016</year>
          , Shanghai, China,
          <year>March 2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>