ININ submission to Zero Cost ASR task at MediaEval 2016

Tejas Godambe, Naresh Kumar, Pavan Kumar, Veera Raghavendra, Aravind Ganapathiraju
Interactive Intelligence India Private Limited, Hyderabad, India
{tejas.godambe, naresh.kumar, pavan.kumar, veera.raghavendra, aravind.ganapathiraju}@inin.com

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

ABSTRACT
This paper details the experiments conducted to train the best-performing Vietnamese speech recognition system we could using public domain data only, as part of the Zero Cost ASR task at MediaEval 2016. We explored audio pre-processing, the use of the speaker's pitch information, data perturbation, and subspace Gaussian mixture model (SGMM) acoustic modeling, which is known to estimate robust parameters when the amount of data is small, as well as unsupervised adaptation, lattice rescoring with an RNN language model, and system combination using the ROVER technique.

1. INTRODUCTION
The goal of the Zero Cost ASR task is to bring researchers together on the topic of training ASR systems using only data available in the public domain. In particular, this year's task consisted of developing an LVCSR system for Vietnamese, a language that is rare enough yet has sufficient public data to work with. More details on the task can be found in [1].
Section 2 outlines the steps followed for building the final system. Section 3 describes each experiment we conducted in detail and discusses the loss or gain in accuracy that came with it. We conclude the paper in Section 4.

2. APPROACH
We used the Kaldi ASR toolkit [2] for building the system. As no lexicon was provided, graphemes were used as phonemes, giving 96 unique phonemes. The following steps were followed for the development of the final system (rough sketches of steps 1, 2 and 5 are given after this list).
1. Truncate long silences in the training data to 0.3 sec (sketched below).
2. Augment the data with speed-perturbed versions of itself (speed factors 0.9 and 1.1) [3] (sketched below).
3. Extract MFCCs along with pitch information [4].
4. Build an SGMM acoustic model [5].
5. Construct a 5-gram language model (LM) from the training text (sketched below).
6. Perform unsupervised adaptation, i.e. decode the test utterances with the above system and add them to the training data along with their approximate hypothesized transcriptions. Three copies of the test data (speed factors 0.9, 1.0, 1.1) were added.
7. Generate lattices and rescore them with an RNN-based language model [6].
8. Perform the final decoding.
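The paper does not name the tool used for silence truncation (step 1). The following is a minimal Python sketch, assuming a simple frame-energy detector, the soundfile library for WAV I/O, a -35 dB threshold, and hypothetical file names; the actual recipe may instead have used a VAD or existing alignments.

```python
# Minimal sketch (our assumptions, not the authors' exact recipe): cap any
# intra-utterance silence longer than 0.3 s using a frame-energy threshold.
import numpy as np
import soundfile as sf  # assumed WAV I/O library


def truncate_silence(in_wav, out_wav, max_sil=0.3, frame=0.025, hop=0.010,
                     thresh_db=-35.0):
    x, sr = sf.read(in_wav)
    if x.ndim > 1:                        # mix down to mono if needed
        x = x.mean(axis=1)
    flen, fhop = int(frame * sr), int(hop * sr)
    n_frames = max(1, 1 + (len(x) - flen) // fhop)
    # frame-level log energy in dB
    energy = np.array([10 * np.log10(np.mean(x[i * fhop:i * fhop + flen] ** 2) + 1e-12)
                       for i in range(n_frames)])
    voiced = energy > thresh_db
    keep = np.ones(len(x), dtype=bool)
    max_sil_frames = int(max_sil / hop)
    run_start = None
    for i, v in enumerate(np.append(voiced, True)):  # sentinel flushes the last run
        if not v and run_start is None:
            run_start = i
        elif v and run_start is not None:
            if i - run_start > max_sil_frames:       # keep only 0.3 s of the pause
                keep[(run_start + max_sil_frames) * fhop:i * fhop] = False
            run_start = None
    sf.write(out_wav, x[keep], sr)


truncate_silence("utt001.wav", "utt001_trunc.wav")   # hypothetical file names
```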
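Step 2 follows the audio augmentation of [3]. The sketch below shows one way to generate the 0.9x and 1.1x copies by shelling out to SoX's speed effect, which is the same effect Kaldi's standard speed-perturbation scripts wrap; the directory names and the subprocess wrapper are our assumptions.

```python
# Sketch of the speed-perturbation step (our wrapper; Kaldi ships equivalent scripts).
import subprocess
from pathlib import Path


def perturb_speed(wav_dir, out_dir, factors=(0.9, 1.1)):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for wav in sorted(Path(wav_dir).glob("*.wav")):
        for f in factors:
            target = out / f"sp{f}-{wav.name}"
            # SoX "speed" changes tempo and pitch together, as in [3]
            subprocess.run(["sox", str(wav), str(target), "speed", str(f)],
                           check=True)


perturb_speed("data/train_wavs", "data/train_wavs_sp")  # hypothetical directories
```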
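The LM toolkit is not named in the paper. Assuming SRILM as one common choice, a 5-gram LM over the training text (step 5) could be estimated roughly as follows; the paths and the smoothing choice are also assumptions.

```python
# Sketch only: estimate a 5-gram LM with SRILM's ngram-count (toolkit assumed).
import subprocess

subprocess.run(
    ["ngram-count",
     "-order", "5",
     "-text", "data/train_text.txt",   # hypothetical path to the training transcripts
     "-wbdiscount", "-interpolate",    # Witten-Bell smoothing, a safe choice for small corpora
     "-lm", "lm/5gram.arpa"],          # hypothetical output path
    check=True,
)
```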
3. RESULTS AND DISCUSSION

3.1 Preliminary Analysis
The sequence of experiments performed, and the gain or loss in WER incurred with each of them, is detailed below. Table 1 shows the WER and the word error rate reduction (WERR) achieved for each individual experiment. The WER was calculated on a very small local dev set comprising only 21 utterances.

Table 1: Sequence of experiments performed with individual WER and WERR

Row  Experiment                             WER (%)   WERR (%)
1    Training the tri-phone model           37.0      -
2    Truncating silence in training data    27.4      37.0 - 27.4 = 9.6
3    Truncating silence in test data        50.3      27.4 - 50.3 = -22.9
4    Using SGMM model                       18.1      27.4 - 18.1 = 9.3
5    Using DNN model                        23.5      18.1 - 23.5 = -5.4
6    Using position-independent phones      19.1      18.1 - 19.1 = -1.0
7    Unsupervised adaptation                16.1      18.1 - 16.1 = 2.0
8    Audio augmentation 1                   17.0      18.1 - 17.0 = 1.1
9    Audio augmentation 2                   17.3      18.1 - 17.3 = 0.8
10   Using pitch information                16.9      18.1 - 16.9 = 1.2
11   Using 5-gram LM                        16.1      18.1 - 16.1 = 2.0
12   Using 7-gram LM                        16.6      18.1 - 16.6 = 1.5
13   Combined system                        13.8      -
14   Rescoring lattices using RNN LM        13.5      13.8 - 13.5 = 0.3
15   ROVER                                  13.5      13.5 - 13.5 = 0.0

1. Using tri-phone model: We first trained a tri-phone model with 2000 senones and 20k Gaussians in total to see whether we could replicate the baseline result. This gave a WER of 37.0%.
2. Truncating silence in training data: Preliminary observation of a few wave files showed the presence of long silences, which usually corrupt the acoustic model. A WERR of 9.6% was achieved when the tri-phone model was trained after truncating long silences to 0.3 sec in the training data. Henceforth, all experiments used the training data with truncated silences. This also reduced the size of the training data from around 13 hours to around 7 hours.
3. Truncating silence in test data: Inspired by the above gain, we truncated long silences to 0.3 sec in the test data too, before decoding. But, surprisingly, this increased the WER to 50.3%. Hence, truncating silences in the test data was avoided in later experiments.
4. Using SGMM model: The SGMM is known to estimate robust parameters and perform better than a simple tri-phone model, especially when the training data is small. A WERR of 9.3% was achieved upon migrating from the tri-phone model to the SGMM.
5. Using DNN model: DNNs are the state of the art, but they have been observed to yield poorer or merely comparable results relative to SGMMs when the training data is small. We trained a basic DNN with 429 nodes in the input layer (5 context frames), three hidden layers of sizes 512:256:512 with the 256-node layer acting as a bottleneck, and 930 output nodes, optimized with stochastic gradient descent to minimize the cross-entropy (a layer-size sketch is given after this list). This increased the WER to 23.5%. Though DNNs could perhaps have been made to outperform SGMMs with proper regularization, because of time constraints we stuck to the SGMM acoustic model.
6. Using position-independent phones: This experiment tested how position-independent phones fare against position-dependent phones. Not so surprisingly, it degraded the WER by 1%, so position-dependent phones were used in further experiments.
7. Unsupervised adaptation: In unsupervised adaptation, we folded the test data, comprising 332 utterances, together with its approximate hypotheses (obtained by decoding with the SGMM from the previous run) into the training data, and re-trained the SGMM acoustic model. This gave a 2.0% WERR.
8. Audio augmentation 1: Inspired by [3], the speed of the original training data was perturbed by factors of 0.9 and 1.1, and these perturbed copies were added to the original training data. This helped achieve a 1.1% WERR.
9. Audio augmentation 2: Here, four perturbed copies with speed factors 0.8, 0.9, 1.1 and 1.2 were added to the original training data. This gave a 0.8% WERR, less than the 1.1% achieved in the previous experiment. Hence, for the final system, we augmented the original data with perturbed copies of speed factors 0.9 and 1.1 only.
10. Using pitch information: The confused words in the hypotheses seemed to be acoustically close, as many confusion pairs differed by just one phone. For some words, the confusions appeared to arise from different tonal manifestations of the same phone. This gave us the idea of using pitch information along with traditional MFCCs, as explained in [4]. This gave a 1.2% WERR and helped eliminate a few recurring confusions.
11. Using 5-gram LM: Next, higher-order N-grams were tried in order to put more constraints on the hypothesis and consequently improve the WER. Using a 5-gram LM instead of the trigram LM helped achieve a 2.0% WERR.
12. Using 7-gram LM: Inspired by the above gain, an even higher-order N-gram model (7-gram) was tried. This gave a 1.5% WERR, which is less than the 2.0% achieved with 5-grams. Hence, the 5-gram LM was used in the final system.
13. Combined system: For the final system we combined everything that had yielded an improvement: truncating silence in the training data, using the SGMM, unsupervised adaptation, data augmentation with speed factors 0.9 and 1.1, using pitch, and using the 5-gram LM. This combined system gave a WER of 13.8%.
14. Rescoring lattices using RNN LM: The motivation behind using the RNN LM [6] was to see how much gain we could achieve by adding further constraints (beyond the 5-gram LM) on the LM side, using a model that captures long-term dependencies in text in a manner distinct from N-grams. The lattices were rescored using the RNN LM, but this gave only a 0.3% improvement. Probably the limited amount of training text prevented the RNN LM from being used to full advantage.
15. Hypothesis combination: ROVER [7] is a well-known technique to combine hypotheses from multiple different systems (a voting sketch is given after this list). The individual systems that had given improvements were combined with the combined system discussed above, but this did not yield better results than the combined system alone.
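The DNN in experiment 5 is described only by its layer sizes. The sketch below simply instantiates those dimensions (429 inputs, 512:256:512 hidden layers with a 256-node bottleneck, 930 outputs) as a plain NumPy forward pass so the shapes are explicit; the activation choice, initialization, and the absence of any training loop are our simplifications, not the authors' setup.

```python
# Shape-only sketch of the bottleneck DNN from experiment 5 (not the authors' code).
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [429, 512, 256, 512, 930]   # 5-frame context in, 930 senone posteriors out

# random weights/biases just to make the shapes concrete
weights = [rng.standard_normal((m, n)) * 0.01
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]


def forward(x):
    h = x
    for i, (w, b) in enumerate(zip(weights, biases)):
        h = h @ w + b
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)         # hidden nonlinearity (assumed)
    e = np.exp(h - h.max())                # softmax over the 930 outputs
    return e / e.sum()


posterior = forward(rng.standard_normal(429))
print(posterior.shape)                     # (930,)
```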
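ROVER (item 15) aligns the word sequences produced by several systems and votes on the most frequent word at each aligned position. The real tool is the rover binary in NIST's SCTK; the toy sketch below only illustrates the voting idea for pre-aligned, equal-length hypotheses with made-up words, and is not a substitute for the actual alignment step.

```python
# Toy illustration of ROVER-style voting (assumes hypotheses are already aligned
# word-by-word; the real ROVER builds a word transition network first).
from collections import Counter


def vote(hypotheses):
    combined = []
    for words in zip(*[h.split() for h in hypotheses]):
        combined.append(Counter(words).most_common(1)[0][0])
    return " ".join(combined)


hyps = [                       # hypothetical single-utterance outputs of three systems
    "xin chao cac ban",
    "xin chau cac ban",
    "xin chao cac banh",
]
print(vote(hyps))              # -> "xin chao cac ban"
```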
3.2 Final Results
In total, the test data comprised 332 utterances drawn from ELSA, forvo.com, rhinospike.com and youtube.com. The WERs achieved by our system on these individual test sets are, in the respective order, 5.7%, 72.5%, 25.3% and 91.4%. The average WER is 51.2%. While our system did well on the data from ELSA and rhinospike.com, it did relatively poorly on the data from forvo.com and youtube.com.
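Note that the reported average (51.2%) is not the simple mean of the four per-set figures (48.7%), which suggests a weighting by set size. As a reminder of the usual convention, the sketch below pools errors and reference words across sets before dividing; the per-set word counts used here are purely hypothetical, since the paper does not report them.

```python
# Generic pooled-WER computation (hypothetical word counts; not the paper's numbers).
def pooled_wer(sets):
    """sets: list of (per_set_wer_percent, num_reference_words)."""
    total_errors = sum(wer / 100.0 * words for wer, words in sets)
    total_words = sum(words for _, words in sets)
    return 100.0 * total_errors / total_words


example = [(5.7, 800), (72.5, 1500), (25.3, 700), (91.4, 1000)]  # made-up word counts
print(round(pooled_wer(example), 1))   # a word-weighted average, not the simple mean
```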
"A post-processing system to yield [3] Ko, Tom, et al. Audio augmentation for speech recognition reduced word error rates: Recognizer output voting error Proceedings of INTERSPEECH. 2015. reduction (ROVER)." Automatic Speech Recognition and [4] Ghahremani, Pegah, et al. A pitch extraction algorithm tuned Understanding, 1997. Proceedings. 1997 IEEE Workshop for automatic speech recognition." 2014 IEEE International on. IEEE, 1997