BUT Zero-Cost Speech Recognition 2016 System Description

Miroslav Skácel, Martin Karafiát, Lucas Ondel, Albert Uchytil, Igor Szöke
BUT Speech@FIT, Brno University of Technology, Czech Republic
{iskacel, karafiat, iondel, xuchyt03, szoke}@fit.vutbr.cz

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, October 20-21, 2016, Hilversum, Netherlands.

ABSTRACT
This paper describes our work on developing speech recognizers for Vietnamese. It focuses on procedures for careful preparation of the provided data, with particular attention to the analysis of the textual transcriptions. Methods for filtering out defective data to improve the performance of the final system are proposed and described in detail. We also propose a cleaning procedure for the other textual data used for language modeling. Several architectures are investigated to address the goals of both sub-tasks, and the achieved results are discussed.

1. INTRODUCTION
For the Zero-Cost 2016 Speech Recognition task, we developed one Large Vocabulary Continuous Speech Recognition (LVCSR) system and one subword system for the on-time submission, and two more LVCSR systems for the late submission. The LVCSR systems were based on our previous experience from the Babel Program [1][2]. We present two types of LVCSR: the first uses a Gaussian Mixture Model (GMM), a Hidden Markov Model (HMM) and a Deep Neural Network (DNN), following Babel 2014 [1]; the second adopts the Bidirectional Long Short-Term Memory (BLSTM) approach from Babel 2016 [2]. Our goal was to modify and apply the existing Babel LVCSR systems to this year's target language, Vietnamese. For the subword sub-task, we exploited an acoustic unit discovery model. See [3] for more details on each of the sub-tasks.

2. DATA PREPARATION
The given mix of audio data, transcripts and additional texts was preprocessed and given an elementary cleaning before training our systems.

2.1 Audio
For the BLSTM system, the original 16 kHz audio was used. For the GMM/DNN based system, the original audio was downsampled from 16 kHz to 8 kHz to fit our training scripts. We also used the information about the audio length when processing the transcription texts later.

2.2 Transcriptions
All symbols other than letters of the Vietnamese alphabet (punctuation marks, brackets, etc.) were removed from the transcriptions, and the text was converted to uppercase.

Numerals composed of digits were expanded to their textual form. We took the textual transcriptions of the basic numerals (0, 1, 2, ..., 100, 1000, ...) in Vietnamese. The procedure for composing a number in Vietnamese is simple compared to some other languages and follows very logical rules, so the textual form could be created iteratively for every number composed of digits.

Each audio file came with a transcription in a simple text file. There was no information about the alignment of the transcription, so it could only be matched to the whole audio. Audio files longer than 1 minute were therefore discarded from the first iteration of LVCSR training (described in Section 3.1) due to high memory demands. We then used the alignment obtained during training to split the transcriptions into smaller segments: at every detected silence longer than 0.5 s, the segment was divided, and if a segment still lasted longer than 15 seconds, it was additionally split at the first detected silence regardless of the silence duration. This allowed us to utilize the whole training data set with acceptable memory demands during training.

In the next step, we focused on defective audio and improper transcription texts. The average log-likelihood of speech frames was calculated by accumulating the log-likelihoods produced by the first-iteration system over all speech frames and dividing by the number of such frames in the given audio. The same was done for silence frames. The average log-likelihood of speech was very low when the audio contained only silence/noise, when the transcription did not correspond to the audio, or when a part of the transcription was missing. Therefore, we discarded defective files from further training using an ad-hoc threshold set to -100. After this cleaning, 92% of the training data remained.
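As an illustration of the numeral expansion, the following minimal Python sketch spells out numbers below one thousand. The digit and scale words are standard Vietnamese, but the composition rules are simplified (e.g. the alternations "mốt" and "lăm" for a final 1 and 5 are handled only in the most common cases), so this is a sketch of the iterative procedure rather than the exact implementation we used.

    # Minimal sketch of iterative Vietnamese numeral expansion (n < 1000).
    DIGITS = ["không", "một", "hai", "ba", "bốn",
              "năm", "sáu", "bảy", "tám", "chín"]

    def spell_under_100(n):
        if n < 10:
            return DIGITS[n]
        tens, units = divmod(n, 10)
        words = ["mười"] if tens == 1 else [DIGITS[tens], "mươi"]
        if units == 1 and tens > 1:
            words.append("mốt")          # 21 -> "hai mươi mốt"
        elif units == 5:
            words.append("lăm")          # 15 -> "mười lăm"
        elif units > 0:
            words.append(DIGITS[units])
        return " ".join(words)

    def spell_under_1000(n):
        if n < 100:
            return spell_under_100(n)
        hundreds, rest = divmod(n, 100)
        words = [DIGITS[hundreds], "trăm"]
        if 0 < rest < 10:
            words += ["linh", DIGITS[rest]]   # 105 -> "một trăm linh năm"
        elif rest >= 10:
            words.append(spell_under_100(rest))
        return " ".join(words)

    print(spell_under_1000(215))   # "hai trăm mười lăm"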
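The transcription splitting rule can be sketched as follows; the representation of the aligned silences as (start, end) pairs and the choice of cutting in the middle of each silence are our illustrative assumptions.

    # Sketch of the splitting rule: cut at every silence longer than
    # 0.5 s, and force a cut at the first silence once the running
    # segment exceeds 15 s, regardless of the silence duration.
    MIN_SIL = 0.5
    MAX_SEG = 15.0

    def split_points(silences, duration):
        # silences: aligned (start, end) times in seconds, sorted
        cuts, seg_start = [], 0.0
        for sil_start, sil_end in silences:
            long_silence = (sil_end - sil_start) > MIN_SIL
            segment_too_long = (sil_start - seg_start) > MAX_SEG
            if long_silence or segment_too_long:
                cut = (sil_start + sil_end) / 2.0  # cut mid-silence
                cuts.append(cut)
                seg_start = cut
        return [0.0] + cuts + [duration]

    bounds = split_points([(3.0, 3.2), (7.5, 8.4)], duration=10.0)
    segments = list(zip(bounds[:-1], bounds[1:]))  # [(0.0, 7.95), (7.95, 10.0)]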
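The defective-file filter then reduces to averaging the per-frame log-likelihoods over the speech frames of each file and thresholding. In the sketch below, the per-frame (is_speech, log-likelihood) pairs stand in for the alignment output of the first-iteration system; only the -100 threshold comes from the text.

    # Sketch of the data cleaning filter: average the log-likelihood over
    # the speech frames of a file and discard the file if it falls below
    # an ad-hoc threshold.
    THRESHOLD = -100.0

    def avg_speech_loglik(frames):
        # frames: list of (is_speech, loglik) pairs for one audio file
        speech = [ll for is_speech, ll in frames if is_speech]
        return sum(speech) / len(speech) if speech else float("-inf")

    def keep_file(frames):
        # low averages indicate silence/noise-only audio, a mismatched
        # transcription, or a partially missing transcription
        return avg_speech_loglik(frames) > THRESHOLD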
2.3 Language Models (LMs)
Three LMs were created for this sub-task.

The first LM, used by the on-time submitted LVCSR system, was trained on the text taken from the training set transcriptions.

The second LM was trained on Vietnamese subtitles. We used the provided wordlist to create the set of Vietnamese letters and filtered out words from other languages. Again, punctuation and quotation marks, brackets and other symbols were eliminated from the text, and numerals composed of digits were transformed to their textual notation in the same way as before. Sentences comprising fewer than 3 words were discarded as well. The text was converted to uppercase.

For the third LM, we were provided with a set of URLs pointing to websites in Vietnamese. We extracted the inner text from all of the HTML tags. However, the data contained a lot of unusable text. We first removed the lines containing any special characters or digits. After that, we created a wordlist from the already cleaned-up data and filtered the text according to it. Duplicate lines were removed. In total, we obtained about 460k sentences to train this third LM.

These three LMs were combined by linear interpolation (denoted as LM tune in Table 1).
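The cleaning of the subtitle and web texts can be condensed into one Python sketch. The Vietnamese letter set below is written out for illustration (in practice it would be derived from the provided wordlist), and spell_number stands for the numeral expansion of Section 2.2.

    # Condensed sketch of the LM text cleaning: expand digits, strip
    # punctuation/symbols, drop sentences with foreign letters or fewer
    # than 3 words, uppercase, and deduplicate.
    import re

    VIET = set("aăâbcdđeêghiklmnoôơpqrstuưvxy"
               "áàảãạắằẳẵặấầẩẫậéèẻẽẹếềểễệíìỉĩị"
               "óòỏõọốồổỗộớờởỡợúùủũụứừửữựýỳỷỹỵ")

    def clean_corpus(lines, spell_number):
        seen, out = set(), []
        for line in lines:
            line = re.sub(r"\d+", lambda m: spell_number(int(m.group())), line)
            line = re.sub(r"[^\w\s]", " ", line)   # punctuation, quotes, brackets
            words = line.lower().split()
            if len(words) < 3:
                continue                            # sentence too short
            if any(ch not in VIET for w in words for ch in w):
                continue                            # word from another language
            sentence = " ".join(words).upper()
            if sentence not in seen:                # remove duplicate lines
                seen.add(sentence)
                out.append(sentence)
        return out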
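The linear combination is standard LM interpolation; in LaTeX notation, with weights that would typically be tuned on the development data (the actual weight values are not reported here):

    P(w \mid h) = \lambda_1 P_{\text{trans}}(w \mid h)
                + \lambda_2 P_{\text{subs}}(w \mid h)
                + \lambda_3 P_{\text{web}}(w \mid h),
    \qquad \sum_i \lambda_i = 1, \quad \lambda_i \ge 0,

where h is the word history and the three component models are the transcription, subtitle and web LMs described above.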
3. LVCSR SYSTEMS
We developed two different LVCSR systems for the first sub-task, based on the GMM/DNN and BLSTM architectures.

3.1 GMM/DNN
The automatic speech recognition (ASR) system developed for Babel [1] focuses on languages with a limited amount of training data. This architecture uses a Stacked Bottle-Neck Neural Network (SBN NN) for feature extraction, which outperforms standard Bottle-Neck features. It contains two consecutive NNs. The first one has four hidden layers with 1500 units each, except for the bottle-neck layer: the BN layer is the third hidden layer and has 80 neurons. Its outputs over a context of 21 frames are downsampled and taken as the input to the second NN, which has the same structure but a bottle-neck layer of 30 neurons. The second network outputs the SBN features that are used to train the GMM-HMM system.

This HMM-based speech recognition system works with tied-state triphones and uses the standard maximum likelihood training technique. Word transcriptions are obtained using a 3-gram LM built from the cleaned training texts.

To perform speaker adaptation, we trained a GMM system on the NN input features. A Discrete Cosine Transform (DCT) is applied to decorrelate the Mel-filterbank (FBANK) features. The speaker-independent GMM-HMM system is obtained by single-pass retraining on these FBANKs. Finally, a Constrained Maximum Likelihood Linear Regression (CMLLR) transform is estimated for each speaker.

We trained the systems in an iterative manner: in the first iteration, a simple monophone model was trained to obtain the alignment of the text to the audio; in the second iteration, we obtained the final full system.

3.2 BLSTM
The ASR system developed for Babel 2016 [2] focuses on model training using BLSTM networks. The BLSTM system does not outperform the classical architecture, but it is more stable during training. The BLSTM network architecture consists of 3 hidden layers in both directions, with 512 memory units in each layer and 300 neurons in the projection layer.

The transcriptions were not cleaned and were taken as-is to train this system. The system was built in Kaldi and is denoted as Babel Kaldi BLSTM in Table 1.

System                                      | Devel: all (ELSA / Forvo / RhinoSpike) | Test: all (ELSA / Forvo / RhinoSpike / YouTube)
--------------------------------------------|----------------------------------------|------------------------------------------------
P-BUT - Babel Kaldi BLSTM 16kHz             | 17.9 (6.4 / 58.1 / 15.8)               | 48.0 (4.9 / 55.7 / 35.4 / 87.2)
L-BUT - Babel Kaldi BLSTM 16kHz - LM tune   | 17.6 (6.2 / 56.4 / 16.9)               | 46.3 (4.6 / 52.6 / 32.2 / 84.7)
L-BUT - Babel GMM/DNN 8kHz                  | 36.1 (29.7 / 68.5 / 23.4)              | 55.7 (28.0 / 59.3 / 44.9 / 81.4)

Table 1: Results of the LVCSR systems as the overall score and the single test subsets (shown in parentheses), in the WER metric [%]. The system labeled P was submitted on-time; L denotes late-submission systems.

4. SUBWORD SYSTEM
The acoustic unit discovery (AUD) model presented in [4] aims at segmenting and clustering unlabeled speech data into phone-like categories. It is similar to a phone-loop model in which each phone-like unit is modeled by an HMM. This phone-loop model is fully Bayesian in the sense that:

• it incorporates a prior distribution over the parameters of the HMMs;

• it has a prior distribution over the units, modeled by a Dirichlet process [5].

Informally, the Dirichlet process prior can be seen as a standard Dirichlet distribution prior for a Bayesian mixture with an infinite number of components. However, we assume that our N data samples have been generated by only M components (M ≤ N) of the infinite mixture. Hence, the model is no longer restricted to a fixed number of components but can instead learn its complexity (i.e. the number of units used, M) from the training data. The priors over the GMM weights and over the Gaussian means and (diagonal) covariance matrices are a Dirichlet and a Normal-Gamma density, respectively, and were initialized as described in [6]. See [4] for the variational Bayesian treatment of this model.

System                 | Devel: all (ELSA / Forvo / RhinoSpike) | Test: all (ELSA / Forvo / RhinoSpike / YouTube)
-----------------------|----------------------------------------|------------------------------------------------
P-BUT AUD phone-loop   | 5.08 (6.45 / 8.76 / 14.19)             | 4.56 (5.52 / 9.59 / 18.49 / 7.59)

Table 2: Results of the subword system as the overall score and the single test subsets (shown in parentheses), in the NMI metric.
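To make the topology of Section 3.1 concrete, here is a toy PyTorch sketch of the stacked bottle-neck extractor. The input dimensionality, the number of output targets, the sigmoid nonlinearity and the exact downsampling of the 21 frames (every fifth frame is taken here) are not specified above and are illustrative assumptions.

    # Toy sketch of the two-stage Stacked Bottle-Neck (SBN) extractor.
    import torch
    import torch.nn as nn

    def bn_network(in_dim, bn_dim, out_dim, hidden=1500):
        # four hidden layers, the third one being the bottle-neck
        return nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
            nn.Linear(hidden, bn_dim), nn.Sigmoid(),   # bottle-neck layer
            nn.Linear(bn_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, out_dim),                # assumed state targets
        )

    first_nn = bn_network(in_dim=368, bn_dim=80, out_dim=3000)   # assumed dims
    second_nn = bn_network(in_dim=5 * 80, bn_dim=30, out_dim=3000)

    def bn_features(net, x):
        # forward only up to the bottle-neck activation (first 6 modules)
        for layer in list(net.children())[:6]:
            x = layer(x)
        return x

    frames = torch.randn(21, 368)          # 21 frames of input features
    bn1 = bn_features(first_nn, frames)    # (21, 80) first-stage BN outputs
    stacked = bn1[::5].reshape(1, -1)      # assumed downsampling: every 5th frame
    sbn = bn_features(second_nn, stacked)  # (1, 30) final SBN features

In the full recipe, the 30-dimensional SBN features would then serve as the input features for the GMM-HMM training.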
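Similarly, the BLSTM topology of Section 3.2 (3 bidirectional layers, 512 memory cells, a 300-unit projection) maps directly onto PyTorch's projected LSTM; the input feature size and the output layer below are assumptions, not values from the text.

    # Toy sketch of the BLSTM acoustic model topology.
    import torch
    import torch.nn as nn

    blstm = nn.LSTM(input_size=40, hidden_size=512, num_layers=3,
                    bidirectional=True, proj_size=300, batch_first=True)
    output_layer = nn.Linear(2 * 300, 3000)   # assumed triphone-state targets

    x = torch.randn(1, 100, 40)    # (batch, frames, assumed 40-dim features)
    h, _ = blstm(x)                # (1, 100, 600): both directions projected to 300
    logits = output_layer(h)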
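As a rough generative sketch of the AUD phone-loop of Section 4 with its Dirichlet process prior, in our own stick-breaking notation (not taken from [4]):

    \pi \sim \mathrm{GEM}(\gamma), \qquad
    c_i \mid \pi \sim \mathrm{Categorical}(\pi), \qquad
    \theta_k \sim H, \qquad
    x_i \sim p(\cdot \mid \theta_{c_i}),

where GEM(γ) is the stick-breaking prior over the infinite vector of unit weights, H is the base measure over the HMM/GMM parameters (the Dirichlet and Normal-Gamma priors mentioned above), and c_i assigns segment i to a unit. Although the mixture has infinitely many components, the N observed samples occupy only M ≤ N of them.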
5. CONCLUSION
The primary on-time system, based on a BLSTM, using the original 16 kHz audio and trained on the original transcriptions, reached an overall 48% WER on the test set. On the ELSA test subset, the WER was 4.9%, which is a particularly good result; this subset probably matches the training set very closely. On the contrary, the unseen YouTube test subset yielded 87.2% WER, the worst score among the test subsets.

The improved, late-submitted BLSTM system using the combination of LMs showed an overall 1.7% WER improvement on the test data. The improvement on every single test subset is nearly 3% WER (except the ELSA subset).

The late GMM/DNN system using the 8 kHz audio and the cleaned texts ended up with the worst overall score, 55.7% WER on the test set. Compared to our best BLSTM system, the overall score is 9.4% WER worse. On the single databases seen during training, the results degraded as follows: by 23.4% WER for ELSA, by 6.7% for Forvo, and by 12.6% for RhinoSpike. The only improvement, of 3.3%, was on the YouTube subset. We conclude that the GMM/DNN system is more robust on unseen data.

6. REFERENCES
[1] Martin Karafiát, František Grézl, Mirko Hannemann, and Jan Černocký. BUT Neural Network Features for Spontaneous Vietnamese in BABEL. In Proceedings of ICASSP 2014, pages 5659-5663. IEEE Signal Processing Society, 2014.
[2] Martin Karafiát, Murali Karthick Baskar, Pavel Matějka, Karel Veselý, František Grézl, and Jan "Honza" Černocký. Multilingual BLSTM and Speaker-Specific Vector Adaptation in 2016 BUT Babel System. Accepted at SLT 2016, 2016.
[3] Igor Szöke and Xavier Anguera. Zero-Cost Speech Recognition Task at MediaEval 2016. In Working Notes Proceedings of the MediaEval 2016 Workshop, Hilversum, Netherlands, October 20-21, 2016.
[4] Lucas Ondel, Lukáš Burget, and Jan Černocký. Variational Inference for Acoustic Unit Discovery. In Procedia Computer Science, volume 2016, pages 80-86. Elsevier Science, 2016.
[5] Charles E. Antoniak. Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems. Annals of Statistics, 2(6), November 1974.
[6] Chia-ying Lee and James Glass. A Nonparametric Bayesian Approach to Acoustic Model Discovery. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL '12, pages 40-49, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.