The NNI Vietnamese Speech Recognition System for MediaEval 2016

Lei Wang1, Chongjia Ni1, Cheung-Chi Leung1, Changhuai You1, Lei Xie2, Haihua Xu3, Xiong Xiao3, Tin Lay Nwe1, Eng Siong Chng3, Bin Ma1, Haizhou Li1,4

1 Institute for Infocomm Research (I2R), A*STAR, Singapore
2 Northwestern Polytechnical University (NWPU), Xi'an, China
3 Nanyang Technological University (NTU), Singapore
4 National University of Singapore (NUS), Singapore

{wangl,ccleung}@i2r.a-star.edu.sg, lxie@nwpu.edu.cn, {haihuaxu,xiaoxiong}@ntu.edu.sg

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

ABSTRACT
This paper provides an overall description of the Vietnamese speech recognition system developed by the joint team for MediaEval 2016. The submitted system consisted of 3 sub-systems and adopted different deep neural network-based techniques, such as fMLLR-transformed bottleneck features and sequence training. Besides the acoustic modeling techniques, speech data augmentation was also examined to develop a more robust acoustic model. The I2R team collected a number of text resources from the Internet and made them available to the other participants in the task. The web text crawled from the Internet was used to train a 5-gram language model. The submitted system obtained token error rates (TER) of 15.1%, 23.0% and 50.5% on the Devel local set, the Devel set and the Test set, respectively.

1. INTRODUCTION
The zero-cost speech recognition task [1] at MediaEval 2016 aims to build Vietnamese automatic speech recognition (ASR) systems using publicly available multimedia resources (such as texts, audios, videos, and dictionaries). About 10 hours of transcribed speech data were provided by the task organizers. The provided data came from 3 different sources and were recorded in different environments.

Our submitted system consists of 3 sub-systems: 1) a DNN-HMM system with MFCC; 2) a DNN-HMM system with data augmentation; 3) a DNN-HMM system with bottleneck features (BNFs). To build the acoustic models, tonal information was involved in the front-end processing, and bottleneck features were used. Traditional GMM-HMM models were used to align the speech frames to the phonetic transcription, and deep neural network (DNN) models were trained using the cross-entropy criterion, followed by sequence training based on state-level minimum Bayes risk (sMBR). To improve the robustness of our acoustic model, data augmentation was attempted. To build a language model (LM), web page data were crawled from the Internet. Other publicly available text resources were also involved, and we made them accessible to all participants.

2. DATA CONTRIBUTION
The I2R team collected and contributed a number of text resources, listed below:
- 890 thousand URLs of Vietnamese web pages crawled using a large number and variety of keywords and phrases;
- An XML dump of Vietnamese Wikipedia's articles [2] and its cleaned text;
- 4 Vietnamese word lists [3] of different sizes.
The above text data were made available to the other participants in the task. The corresponding data pack, i2r-data-pack, is also available for download at https://github.com/viet-asr/i2r-data-pack-v1.0.

3. APPROACHES
This section describes the acoustic modeling of the 3 sub-systems, as well as the text data and lexicon used for language modeling. The 3 sub-systems share the same lexicon and language model in decoding, which are described in Section 3.4. The hypotheses of the 3 sub-systems were fused into a single output for the final submission using the ROVER algorithm [4].

3.1 DNN-HMM System with MFCC
We used 56-dimensional acoustic features, consisting of 13-dimensional MFCCs, 1-dimensional F0, and their derived deltas, accelerations and third-order deltas, as the input of a DNN-HMM hybrid system [5]. The acoustic model considers 94 graphemes, which were discovered from the training transcription, as monophones. The context-dependent triphones were modeled by 801 senones. The final model was trained with the cross-entropy criterion and sMBR [6], on top of a GMM-HMM model trained using maximum mutual information (MMI) [7]. The DNN structure consists of 5 layers with 1024 nodes per layer. The total duration of the training corpus, provided by the organizers, is about 10 hours.
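For illustration, the following is a minimal sketch of how such a 56-dimensional front-end (13 MFCCs + 1 F0, each with first-, second- and third-order dynamics) could be assembled. librosa and its YIN pitch tracker are stand-ins chosen here for brevity; the actual system was presumably built with a standard ASR toolkit front-end, and all window/shift settings below are assumptions.

import numpy as np
import librosa

def extract_features(wav_path, sr=16000):
    y, _ = librosa.load(wav_path, sr=sr)
    hop = int(0.010 * sr)                          # assumed 10 ms frame shift
    # 13-dimensional MFCCs with an assumed 25 ms analysis window.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr), hop_length=hop)
    # 1-dimensional F0 track carrying the tonal information.
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr, hop_length=hop)
    T = min(mfcc.shape[1], len(f0))                # align frame counts
    base = np.vstack([mfcc[:, :T], f0[np.newaxis, :T]])      # 14 x T statics
    # Deltas, accelerations and third-order deltas: 14 * 4 = 56 dims.
    feats = np.vstack([base] +
                      [librosa.feature.delta(base, order=k)
                       for k in (1, 2, 3)])
    return feats.T                                 # T x 56, DNN input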
3.2 DNN-HMM System with Data Augmentation
Among the 10 hours of Vietnamese training data, some utterances are relatively clean, some have been filtered through denoising algorithms, and most are contaminated by different kinds of background noise. To improve the robustness of our recognition system against noisy speech, we augmented the training set by corrupting each original utterance with noise and by applying speech enhancement to each original utterance. After data augmentation, the total amount of training data was tripled.

Different kinds of background noise were extracted from the training utterances using a voice activity detection algorithm. Representative noise segments were selected and randomly added to the original training utterances. Speech enhancement includes two main estimation modules: the estimation of speech and the estimation of noise. We used a modified version of the log-spectral-amplitude (LSA) minimum mean square error (MMSE) algorithm as the speech estimator [8]. The quality of the estimated speech heavily depends on the accuracy of the estimation of the noise statistics. To improve the performance in non-stationary background noise conditions, we adopted minimum searching with the speech presence probability (SPP) for noise estimation [9].

With the augmented training data, we trained another DNN-HMM hybrid system with the same network structure (i.e. the same numbers of hidden layers, hidden units per layer, and tied states) as the system in Section 3.1. During recognition, we used the original development/test utterances for decoding.
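The noise-corruption half of the augmentation can be sketched as follows. The energy-based VAD is a simple placeholder for the actual (unspecified) VAD algorithm, and the SNR range is an assumption; only the overall recipe (harvest non-speech segments, mix them into utterances at a random SNR) follows the description above.

import numpy as np

def energy_vad(x, frame_len=400, hop=160, threshold_db=-40.0):
    """Boolean per-frame speech/non-speech decision (heuristic placeholder)."""
    frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len, hop)]
    energy_db = [10 * np.log10(np.mean(f ** 2) + 1e-12) for f in frames]
    return np.array(energy_db) > threshold_db

def extract_noise(x, frame_len=400, hop=160):
    """Concatenate the frames labelled as non-speech into a noise pool."""
    speech = energy_vad(x, frame_len, hop)
    noise = [x[i * hop:i * hop + frame_len]
             for i, s in enumerate(speech) if not s]
    return np.concatenate(noise) if noise else np.zeros(0)

def add_noise(speech, noise, snr_db, rng):
    """Mix a random chunk of the (non-empty) noise pool into speech at snr_db."""
    start = rng.integers(0, max(1, len(noise) - len(speech)))
    n = np.resize(noise[start:], len(speech))      # loop noise if too short
    p_s = np.mean(speech ** 2)
    p_n = np.mean(n ** 2) + 1e-12
    scale = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10.0)))
    return speech + scale * n

# Example: corrupt one utterance at a random SNR between 5 and 20 dB.
rng = np.random.default_rng(0)
# corrupted = add_noise(utt, noise_pool, snr_db=rng.uniform(5, 20), rng=rng)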
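The enhancement half rests on the classical Ephraim-Malah LSA gain, G = xi/(1+xi) * exp(0.5 * E1(v)) with v = xi*gamma/(1+xi), where gamma and xi are the a-posteriori and a-priori SNRs. Below is a compact sketch of that core computation with decision-directed a-priori SNR estimation; as a simplifying assumption, the SPP minimum-tracking noise estimator of [9] is replaced by an estimate from the first few (assumed noise-only) frames, and all STFT settings are placeholders.

import numpy as np
from scipy.signal import stft, istft
from scipy.special import exp1

def lsa_mmse_enhance(x, sr=16000, alpha=0.98, n_noise_frames=10):
    f, t, X = stft(x, fs=sr, nperseg=400, noverlap=240)
    # Crude noise PSD estimate from leading frames (stand-in for SPP tracking).
    noise_psd = np.mean(np.abs(X[:, :n_noise_frames]) ** 2, axis=1)
    A_prev = np.abs(X[:, 0])
    out = np.zeros_like(X)
    for i in range(X.shape[1]):
        gamma = np.abs(X[:, i]) ** 2 / (noise_psd + 1e-12)   # a-posteriori SNR
        xi = alpha * A_prev ** 2 / (noise_psd + 1e-12) \
             + (1 - alpha) * np.maximum(gamma - 1.0, 0.0)    # decision-directed
        v = xi * gamma / (1.0 + xi)
        gain = xi / (1.0 + xi) * np.exp(0.5 * exp1(np.maximum(v, 1e-8)))
        gain = np.minimum(gain, 1.0)   # guard against blow-up in silence
        out[:, i] = gain * X[:, i]
        A_prev = np.abs(out[:, i])
    _, y = istft(out, fs=sr, nperseg=400, noverlap=240)
    return y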
3.3 DNN-HMM System with BNFs
Another DNN-HMM hybrid system used bottleneck features (BNFs) and fMLLR features as its input. This type of BNF-based system [10-11] is commonly used in the limited-training-data condition. For bottleneck feature extraction, 13-dimensional MFCCs and 2-dimensional F0-related features were extracted. Nine adjacent frames of features were then concatenated and transformed by LDA+MLLT+fMLLR; MLLT makes the features better modeled by diagonal-covariance Gaussians. The resultant 40-dimensional fMLLR features were used for BNF extraction from a DNN consisting of 6 hidden layers, with 1024 nodes in each non-bottleneck layer, yielding 42-dimensional BNFs (a schematic sketch of this extractor is given after Section 3.4).

The 42-dimensional BNFs and the 40-dimensional fMLLR features were then concatenated to form 82-dimensional features, and an fMLLR transform was applied again to normalize inter-speaker variability and reduce the dimensionality to 60. The final 60-dimensional features were used as the input of another DNN. This DNN contains 6 layers of 1024 nodes each, and its output layer contains 2073 senones. The final model was trained with the cross-entropy criterion and sMBR, on top of a GMM-HMM model trained using MMI.

3.4 Lexicon and Language Model
The grapheme-based lexicon contains about 11,000 Vietnamese syllables and English words, which occur in the training transcription and in the 74,000-entry Vietnamese word list in i2r-data-pack.

A 5-gram LM was trained using the following 4 data sources:
1) 7GB of text extracted from the list of web pages in i2r-data-pack;
2) 750MB of text from Wikipedia's articles in i2r-data-pack;
3) 90MB of Vietnamese-English subtitles released by BUT;
4) the transcription of the training utterances.
The final LM was obtained by linear interpolation of four LMs, each of which was trained on one of the above data sources. The interpolation weights were optimized using the transcript of the development data set (Devel local). In our preliminary systems, both perplexity and TER on the Devel local set were reduced when the web data were included in the language model training.
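As a schematic of the bottleneck extractor of Section 3.3, the PyTorch sketch below fixes only what the paper states (6 hidden layers, 1024 units per non-bottleneck layer, a 42-dimensional bottleneck); the input splicing width, sigmoid activations, and the senone count of this auxiliary network are assumptions, and the actual system was presumably trained with a standard ASR toolkit.

import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    def __init__(self, in_dim=40 * 9, n_senones=801, bn_dim=42):
        super().__init__()
        layers = []
        for i in range(5):                         # 5 wide hidden layers
            layers += [nn.Linear(in_dim if i == 0 else 1024, 1024),
                       nn.Sigmoid()]
        self.front = nn.Sequential(*layers)
        self.bottleneck = nn.Linear(1024, bn_dim)  # 6th (bottleneck) layer
        self.out = nn.Linear(bn_dim, n_senones)

    def forward(self, x):
        bnf = self.bottleneck(self.front(x))       # 42-dim bottleneck output
        return self.out(torch.sigmoid(bnf)), bnf   # senone logits + BNF

At feature-extraction time only the bottleneck output is kept; as described above, it is concatenated with the 40-dimensional fMLLR features (82 dims) and transformed by fMLLR again to 60 dims before training the final 6-layer, 2073-senone DNN.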
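The interpolation weights of Section 3.4 can be estimated by the standard EM procedure that maximizes the likelihood of the development text, given each component LM's per-word probabilities on that text (such probabilities can be dumped by common LM toolkits; SRILM's compute-best-mix script implements the same procedure). A minimal sketch:

import numpy as np

def best_mix(probs, n_iter=50):
    """probs: (n_words, n_lms) matrix of the probability each component LM
    assigns to every word of the dev text. Returns mixture weights lam,
    so that the interpolated LM is P(w|h) = sum_i lam[i] * P_i(w|h)."""
    n_words, n_lms = probs.shape
    lam = np.full(n_lms, 1.0 / n_lms)              # uniform initialization
    for _ in range(n_iter):
        # E-step: posterior that each component generated each word.
        post = lam * probs                          # (n_words, n_lms)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: new weights are the average posterior mass per component.
        lam = post.mean(axis=0)
    return lam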
4. RESULTS AND DISCUSSION

Table 1: ASR performance of the different sub-systems and the fused system. All systems share the same 11K lexicon and 5-gram LM.

                          TER (%)
System                    Devel local   Devel   Test
MFCC-DNN-HMM              17.4          26.9    55.1
+ Data Augmentation       18.4          26.5    53.9
BNF-DNN-HMM               18.5          25.5    50.9
Fusion of 3 systems       15.1          23.0    50.5

Table 1 summarizes the ASR performance of each system on the 3 different test sets. We observed that the performance on the Devel local set is not consistent with that on the other 2 sets. We believe this is because of the small amount of data (~13 minutes) in the Devel local set. Moreover, since the Devel local set is a small subset of the Devel set, our analysis focuses on Devel and Test. Note, however, that most of the system configurations were tuned on the Devel local set, to avoid frequent uploads of our results to the leaderboard.

The BNF-based system clearly has the best performance among the sub-systems, probably due to the contribution of more robust bottleneck features and speaker normalization by fMLLR. The data augmentation technique improves the system performance by a relative 1.5% and 2.2% on the Devel and Test sets, respectively. Data augmentation provides a variety of training speech data with noisy backgrounds, which improves the robustness of the acoustic model. Moreover, we believe that the resultant acoustic model is more robust against unseen data, e.g. the surprise data in the Test set.

The fused system has the best overall performance, which can be attributed to the complementarity of the 3 sub-systems.

5. CONCLUSION
This work describes the acoustic modeling of the 3 sub-systems and the approach to language modeling under the limited-training-data condition. We relied on the provided training corpus to build the acoustic models, and effort was made to collect web text data to build an LM.

We reported the ASR performance achieved by the deadline of the task. In future work, we will examine data augmentation on the BNF-based system, and we will further investigate using the speech data contributed by other participants.

REFERENCES
[1] I. Szoke and X. Anguera, "Zero-cost speech recognition task at MediaEval 2016," in Proc. MediaEval 2016 Workshop, Hilversum, Netherlands, Oct. 2016.
[2] https://dumps.wikimedia.org/viwiki/20160501/viwiki-20160501-pages-meta-current.xml.bz2
[3] http://www.informatik.uni-leipzig.de/~duc/software/misc/wordlist.html
[4] J. Fiscus, "A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER)," in Proc. ASRU 1997, 1997, pp. 347-354.
[5] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Trans. Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30-42, Jan. 2012.
[6] K. Veselý, A. Ghoshal, L. Burget, and D. Povey, "Sequence-discriminative training of deep neural networks," in Proc. Interspeech 2013, 2013, pp. 2345-2349.
[7] D. Povey, "Discriminative training for large vocabulary speech recognition," Ph.D. dissertation, Cambridge University Engineering Dept., 2003.
[8] R. Gemello, F. Mana, and R. De Mori, "Automatic speech recognition with a modified Ephraim-Malah rule," IEEE Signal Processing Letters, vol. 13, no. 1, pp. 56-59, Jan. 2006.
[9] S. Rangachari and P. C. Loizou, "A noise-estimation algorithm for highly non-stationary environments," Speech Communication, vol. 48, pp. 220-231, 2006.
[10] C. Ni, C.-C. Leung, L. Wang, N. F. Chen, and B. Ma, "Unsupervised data selection and word-morph mixed language model for Tamil low-resource keyword search," in Proc. ICASSP 2015, Brisbane, Australia, April 2015.
[11] C. Ni, C.-C. Leung, L. Wang, H. Liu, F. Rao, L. Lu, N. F. Chen, B. Ma, and H. Li, "Cross-lingual deep neural network based submodular unbiased data selection for low-resource keyword search," in Proc. ICASSP 2016, Shanghai, China, March 2016.