The NNI Vietnamese Speech Recognition System for MediaEval 2016

Lei Wang1, Chongjia Ni1, Cheung-Chi Leung1, Changhuai You1, Lei Xie2, Haihua Xu3, Xiong Xiao3, Tin Lay Nwe1, Eng Siong Chng3, Bin Ma1, Haizhou Li1,4

1 Institute for Infocomm Research (I2R), A*STAR, Singapore
2 Northwestern Polytechnical University (NWPU), Xi'an, China
3 Nanyang Technological University (NTU), Singapore
4 National University of Singapore (NUS), Singapore

{wangl,ccleung}@i2r.a-star.edu.sg, lxie@nwpu.edu.cn, {haihuaxu,xiaoxiong}@ntu.edu.sg

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

ABSTRACT
This paper provides an overall description of the Vietnamese speech recognition system developed by the joint team for MediaEval 2016. The submitted system consisted of 3 sub-systems and adopted different deep neural network-based techniques, such as fMLLR-transformed bottleneck features and sequence training. Besides the acoustic modeling techniques, speech data augmentation was also examined to develop a more robust acoustic model. The I2R team collected a number of text resources from the Internet and made them available to the other participants in the task. The web text crawled from the Internet was used to train a 5-gram language model. The submitted system obtained token error rates (TER) of 15.1%, 23.0% and 50.5% on the Devel local set, the Devel set and the Test set, respectively.

1. INTRODUCTION
The zero-cost speech recognition task [1] at MediaEval 2016 aims to build Vietnamese automatic speech recognition (ASR) systems using publicly available multimedia resources (such as texts, audios, videos, and dictionaries). About 10 hours of transcribed speech data were provided by the task organizers. The provided data came from 3 different sources and were recorded in different environments.

Our submitted system consists of 3 sub-systems: 1) a DNN-HMM system with MFCC; 2) a DNN-HMM system with data augmentation; 3) a DNN-HMM system with bottleneck features (BNFs). To build the acoustic models, tonal information was involved in the front-end processing, and bottleneck features were used. Traditional GMM-HMM models were used to align the speech frames to the phonetic transcription, and deep neural network (DNN) models were trained using the cross-entropy criterion, followed by sequence training based on state-level minimum Bayes risk (sMBR). To improve the robustness of our acoustic model, data augmentation was attempted. To build a language model (LM), web page data were crawled from the Internet. Other publicly available text resources were also involved, and we made them accessible to all participants.

2. DATA CONTRIBUTION
The I2R team collected and contributed a number of text resources, listed below:
- 890 thousand URLs of Vietnamese web pages crawled using a large number and variety of keywords and phrases;
- An XML dump of Vietnamese Wikipedia's articles [2] and its cleaned text;
- 4 Vietnamese word lists [3] of different sizes.
The above text data were made available to the other participants in the task. The corresponding data pack, i2r-data-pack, is also available for download at https://github.com/viet-asr/i2r-data-pack-v1.0.

3. APPROACHES
This section describes the acoustic modeling of the 3 sub-systems, as well as the text data and lexicon used for language modeling. The 3 sub-systems share the same lexicon and language model in decoding, which are described in Section 3.4. The hypotheses of the 3 sub-systems were fused into a single output for the final submission using the ROVER algorithm [4].

3.1 DNN-HMM System with MFCC
We used 56-dimensional acoustic features, consisting of 13-dimensional MFCCs, 1-dimensional F0, and their derived deltas, accelerations and third-order deltas, as the input of a DNN-HMM hybrid system [5]. The acoustic model considers 94 graphemes, which were discovered from the training transcription, as monophones. The context-dependent triphones were modeled by 801 senones. The final model was trained with the cross-entropy criterion and sMBR [6], on top of a GMM-HMM model trained using maximum mutual information (MMI) [7]. The DNN structure consists of 5 layers with 1024 nodes per layer. The total duration of the training corpus, provided by the organizers, is about 10 hours.
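For illustration, the following is a minimal sketch of how such a 56-dimensional front-end (13 MFCCs + 1 F0, each with first-, second- and third-order dynamics) could be assembled. librosa and its YIN pitch tracker are stand-ins chosen here for brevity; the actual system was presumably built with a standard ASR toolkit front-end, and all window/shift settings below are assumptions.

import numpy as np
import librosa

def extract_features(wav_path, sr=16000):
    y, _ = librosa.load(wav_path, sr=sr)
    hop = int(0.010 * sr)                          # assumed 10 ms frame shift
    # 13-dimensional MFCCs with an assumed 25 ms analysis window.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr), hop_length=hop)
    # 1-dimensional F0 track carrying the tonal information.
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr, hop_length=hop)
    T = min(mfcc.shape[1], len(f0))                # align frame counts
    base = np.vstack([mfcc[:, :T], f0[np.newaxis, :T]])      # 14 x T statics
    # Deltas, accelerations and third-order deltas: 14 * 4 = 56 dims.
    feats = np.vstack([base] +
                      [librosa.feature.delta(base, order=k)
                       for k in (1, 2, 3)])
    return feats.T                                 # T x 56, DNN input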
3.2 DNN-HMM System with Data Augmentation
Among the 10 hours of Vietnamese training data, some utterances are relatively clean, some have been filtered through denoising algorithms, and most are contaminated by different kinds of background noise. To improve the robustness of our recognition system against noisy speech, we augmented the training set by corrupting each original utterance with noise and by applying speech enhancement to each original utterance. After data augmentation, the total amount of training data was tripled.

Different kinds of background noise were extracted from the training utterances using a voice activity detection algorithm. Representative noise segments were selected and randomly added to the original training utterances. Speech enhancement includes two main estimation modules: the estimation of speech and the estimation of noise. We used a modified version of the log-spectral-amplitude (LSA) minimum mean square error (MMSE) algorithm as the speech estimator [8]. The quality of the estimated speech heavily depends on the accuracy of the estimation of the noise statistics. To improve the performance in non-stationary background noise conditions, we adopted minimum searching with the speech presence probability (SPP) for noise estimation [9].

With the augmented training data, we trained another DNN-HMM hybrid system with the same network structure (i.e. the same numbers of hidden layers, hidden units per layer, and tied states) as the system in Section 3.1. During recognition, we used the original development/test utterances for decoding.
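The noise-corruption half of the augmentation can be sketched as follows. The energy-based VAD is a simple placeholder for the actual (unspecified) VAD algorithm, and the SNR range is an assumption; only the overall recipe (harvest non-speech segments, mix them into utterances at a random SNR) follows the description above.

import numpy as np

def energy_vad(x, frame_len=400, hop=160, threshold_db=-40.0):
    """Boolean per-frame speech/non-speech decision (heuristic placeholder)."""
    frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len, hop)]
    energy_db = [10 * np.log10(np.mean(f ** 2) + 1e-12) for f in frames]
    return np.array(energy_db) > threshold_db

def extract_noise(x, frame_len=400, hop=160):
    """Concatenate the frames labelled as non-speech into a noise pool."""
    speech = energy_vad(x, frame_len, hop)
    noise = [x[i * hop:i * hop + frame_len]
             for i, s in enumerate(speech) if not s]
    return np.concatenate(noise) if noise else np.zeros(0)

def add_noise(speech, noise, snr_db, rng):
    """Mix a random chunk of the (non-empty) noise pool into speech at snr_db."""
    start = rng.integers(0, max(1, len(noise) - len(speech)))
    n = np.resize(noise[start:], len(speech))      # loop noise if too short
    p_s = np.mean(speech ** 2)
    p_n = np.mean(n ** 2) + 1e-12
    scale = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10.0)))
    return speech + scale * n

# Example: corrupt one utterance at a random SNR between 5 and 20 dB.
rng = np.random.default_rng(0)
# corrupted = add_noise(utt, noise_pool, snr_db=rng.uniform(5, 20), rng=rng)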
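The enhancement half rests on the classical Ephraim-Malah LSA gain, G = xi/(1+xi) * exp(0.5 * E1(v)) with v = xi*gamma/(1+xi), where gamma and xi are the a-posteriori and a-priori SNRs. Below is a compact sketch of that core computation with decision-directed a-priori SNR estimation; as a simplifying assumption, the SPP minimum-tracking noise estimator of [9] is replaced by an estimate from the first few (assumed noise-only) frames, and all STFT settings are placeholders.

import numpy as np
from scipy.signal import stft, istft
from scipy.special import exp1

def lsa_mmse_enhance(x, sr=16000, alpha=0.98, n_noise_frames=10):
    f, t, X = stft(x, fs=sr, nperseg=400, noverlap=240)
    # Crude noise PSD estimate from leading frames (stand-in for SPP tracking).
    noise_psd = np.mean(np.abs(X[:, :n_noise_frames]) ** 2, axis=1)
    A_prev = np.abs(X[:, 0])
    out = np.zeros_like(X)
    for i in range(X.shape[1]):
        gamma = np.abs(X[:, i]) ** 2 / (noise_psd + 1e-12)   # a-posteriori SNR
        xi = alpha * A_prev ** 2 / (noise_psd + 1e-12) \
             + (1 - alpha) * np.maximum(gamma - 1.0, 0.0)    # decision-directed
        v = xi * gamma / (1.0 + xi)
        gain = xi / (1.0 + xi) * np.exp(0.5 * exp1(np.maximum(v, 1e-8)))
        gain = np.minimum(gain, 1.0)   # guard against blow-up in silence
        out[:, i] = gain * X[:, i]
        A_prev = np.abs(out[:, i])
    _, y = istft(out, fs=sr, nperseg=400, noverlap=240)
    return y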
3.3 DNN-HMM System with BNFs
Another DNN-HMM hybrid system used bottleneck features (BNFs) and fMLLR features as its input. This type of BNF-based system [10-11] is commonly used in the limited-training-data condition. For bottleneck feature extraction, 13-dimensional MFCCs and 2-dimensional F0-related features were extracted. Nine adjacent frames of features were then concatenated and transformed by LDA+MLLT+fMLLR; MLLT makes the features better modeled by diagonal-covariance Gaussians. The resultant 40-dimensional fMLLR features were used for BNF extraction from a DNN consisting of 6 hidden layers, with 1024 nodes in each non-bottleneck layer, yielding 42-dimensional BNFs (a schematic sketch of this extractor is given after Section 3.4).

The 42-dimensional BNFs and the 40-dimensional fMLLR features were then concatenated to form 82-dimensional features, and an fMLLR transform was applied again to normalize inter-speaker variability and reduce the dimensionality to 60. The final 60-dimensional features were used as the input of another DNN. This DNN contains 6 layers of 1024 nodes each, and its output layer contains 2073 senones. The final model was trained with the cross-entropy criterion and sMBR, on top of a GMM-HMM model trained using MMI.

3.4 Lexicon and Language Model
The grapheme-based lexicon contains about 11,000 Vietnamese syllables and English words, which occur in the training transcription and in the 74,000-entry Vietnamese word list in i2r-data-pack.

A 5-gram LM was trained using the following 4 data sources:
1) 7GB of text extracted from the list of web pages in i2r-data-pack;
2) 750MB of text from Wikipedia's articles in i2r-data-pack;
3) 90MB of Vietnamese-English subtitles released by BUT;
4) the transcription of the training utterances.
The final LM was obtained by linear interpolation of four LMs, each of which was trained on one of the above data sources. The interpolation weights were optimized using the transcript of the development data set (Devel local). In our preliminary systems, both perplexity and TER on the Devel local set were reduced when the web data were included in the language model training.
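As a schematic of the bottleneck extractor of Section 3.3, the PyTorch sketch below fixes only what the paper states (6 hidden layers, 1024 units per non-bottleneck layer, a 42-dimensional bottleneck); the input splicing width, sigmoid activations, and the senone count of this auxiliary network are assumptions, and the actual system was presumably trained with a standard ASR toolkit.

import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    def __init__(self, in_dim=40 * 9, n_senones=801, bn_dim=42):
        super().__init__()
        layers = []
        for i in range(5):                         # 5 wide hidden layers
            layers += [nn.Linear(in_dim if i == 0 else 1024, 1024),
                       nn.Sigmoid()]
        self.front = nn.Sequential(*layers)
        self.bottleneck = nn.Linear(1024, bn_dim)  # 6th (bottleneck) layer
        self.out = nn.Linear(bn_dim, n_senones)

    def forward(self, x):
        bnf = self.bottleneck(self.front(x))       # 42-dim bottleneck output
        return self.out(torch.sigmoid(bnf)), bnf   # senone logits + BNF

At feature-extraction time only the bottleneck output is kept; as described above, it is concatenated with the 40-dimensional fMLLR features (82 dims) and transformed by fMLLR again to 60 dims before training the final 6-layer, 2073-senone DNN.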
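The interpolation weights of Section 3.4 can be estimated by the standard EM procedure that maximizes the likelihood of the development text, given each component LM's per-word probabilities on that text (such probabilities can be dumped by common LM toolkits; SRILM's compute-best-mix script implements the same procedure). A minimal sketch:

import numpy as np

def best_mix(probs, n_iter=50):
    """probs: (n_words, n_lms) matrix of the probability each component LM
    assigns to every word of the dev text. Returns mixture weights lam,
    so that the interpolated LM is P(w|h) = sum_i lam[i] * P_i(w|h)."""
    n_words, n_lms = probs.shape
    lam = np.full(n_lms, 1.0 / n_lms)              # uniform initialization
    for _ in range(n_iter):
        # E-step: posterior that each component generated each word.
        post = lam * probs                          # (n_words, n_lms)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: new weights are the average posterior mass per component.
        lam = post.mean(axis=0)
    return lam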
4. RESULTS AND DISCUSSION

Table 1: ASR performance of the different sub-systems and the fused system. All systems share the same 11K lexicon and 5-gram LM.

                          TER (%)
System                    Devel local   Devel   Test
MFCC-DNN-HMM              17.4          26.9    55.1
+ Data Augmentation       18.4          26.5    53.9
BNF-DNN-HMM               18.5          25.5    50.9
Fusion of 3 systems       15.1          23.0    50.5

Table 1 summarizes the ASR performance of each system on the 3 different test sets. We observed that the performance on the Devel local set is not consistent with that on the other 2 sets. We believe this is because of the small amount of data (~13 minutes) in the Devel local set. Moreover, since the Devel local set is a small subset of the Devel set, our analysis focuses on Devel and Test. Note, however, that most of the system configurations were tuned on the Devel local set, to avoid frequent uploads of our results to the leaderboard.

The BNF-based system clearly has the best performance among the sub-systems, probably due to the contribution of more robust bottleneck features and speaker normalization by fMLLR. The data augmentation technique improves the system performance by a relative 1.5% and 2.2% on the Devel and Test sets, respectively. Data augmentation provides a variety of training speech data with noisy backgrounds, which improves the robustness of the acoustic model. Moreover, we believe that the resultant acoustic model is more robust against unseen data, e.g. the surprise data in the Test set.

The fused system has the best overall performance, which can be attributed to the complementarity of the 3 sub-systems.

5. CONCLUSION
This work describes the acoustic modeling of the 3 sub-systems and the approach to language modeling under the limited-training-data condition. We relied on the provided training corpus to build the acoustic models, and effort was made to collect web text data to build an LM.

We reported the ASR performance achieved by the deadline of the task. In future work, we will examine data augmentation on the BNF-based system, and we will further investigate using the speech data contributed by other participants.

REFERENCES
[1] I. Szoke and X. Anguera, "Zero-cost speech recognition task at MediaEval 2016," in Proc. MediaEval 2016 Workshop, Hilversum, Netherlands, Oct. 2016.
[2] https://dumps.wikimedia.org/viwiki/20160501/viwiki-20160501-pages-meta-current.xml.bz2
[3] http://www.informatik.uni-leipzig.de/~duc/software/misc/wordlist.html
[4] J. Fiscus, "A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER)," in Proc. ASRU 1997, 1997, pp. 347-354.
[5] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Trans. Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30-42, Jan. 2012.
[6] K. Veselý, A. Ghoshal, L. Burget, and D. Povey, "Sequence-discriminative training of deep neural networks," in Proc. Interspeech 2013, 2013, pp. 2345-2349.
[7] D. Povey, "Discriminative training for large vocabulary speech recognition," Ph.D. dissertation, Cambridge University Engineering Dept., 2003.
[8] R. Gemello, F. Mana, and R. De Mori, "Automatic speech recognition with a modified Ephraim-Malah rule," IEEE Signal Processing Letters, vol. 13, no. 1, pp. 56-59, Jan. 2006.
[9] S. Rangachari and P. C. Loizou, "A noise-estimation algorithm for highly non-stationary environments," Speech Communication, vol. 48, pp. 220-231, 2006.
[10] C. Ni, C.-C. Leung, L. Wang, N. F. Chen, and B. Ma, "Unsupervised data selection and word-morph mixed language model for Tamil low-resource keyword search," in Proc. ICASSP 2015, Brisbane, Australia, April 2015.
[11] C. Ni, C.-C. Leung, L. Wang, H. Liu, F. Rao, L. Lu, N. F. Chen, B. Ma, and H. Li, "Cross-lingual deep neural network based submodular unbiased data selection for low-resource keyword search," in Proc. ICASSP 2016, Shanghai, China, March 2016.