BUT Zero-Cost Speech Recognition 2016 System Description

Miroslav Skácel, Martin Karafiát, Lucas Ondel, Albert Uchytil, Igor Szöke
BUT Speech@FIT
Brno University of Technology, Czech Republic
{iskacel, karafiat, iondel, xuchyt03, szoke}@fit.vutbr.cz


ABSTRACT

This paper describes our work on developing speech recognizers for Vietnamese. It focuses on procedures for the careful preparation of the provided data, in particular on the analysis of the textual transcriptions. Methods to filter out defective data and thereby improve the performance of the final system are proposed and described in detail. We also propose a cleaning procedure for other textual data used for language modeling. Several architectures are investigated to reach the goals of both sub-tasks, and the achieved results are discussed.
1. INTRODUCTION

For the Zero-Cost 2016 Speech Recognition task, we developed one Large Vocabulary Continuous Speech Recognition (LVCSR) system and one subword system for the on-time submission, and two more LVCSR systems for the late submission. The LVCSR systems were based on our previous experience from the Babel Program [1][2]. We present two types of LVCSR: the first uses a Gaussian Mixture Model (GMM), Hidden Markov Model (HMM) and Deep Neural Network (DNN) from Babel 2014 [1]; the second adopts the Bidirectional Long Short-Term Memory (BLSTM) approach from Babel 2016 [2]. Our goal was to modify and apply the existing Babel LVCSR systems to this year's target language, Vietnamese. For the subword sub-task, we exploited an acoustic unit discovery model. See [3] for more details on each of the sub-tasks.
2. DATA PREPARATION

The given mix of audio data, transcripts and additional texts was preprocessed and cleaned at an elementary level before training our systems.
2.1 Audio

For the BLSTM system, the original 16kHz audio was used. For the GMM/DNN based system, the original audio was downsampled from 16kHz to 8kHz to fit our training scripts. We also used the information about the audio length to process the transcription texts later.
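The resampling tool is not specified in the paper; purely for illustration, here is a minimal sketch of such a 16 kHz to 8 kHz conversion using scipy (the library choice and file names are our assumptions):

```python
# Illustrative sketch of the 16 kHz -> 8 kHz downsampling step; the tool
# actually used is not stated in the paper. Assumes mono 16-bit WAV input.
from scipy.io import wavfile
from scipy.signal import resample_poly

rate, audio = wavfile.read("utterance_16k.wav")   # hypothetical file name
assert rate == 16000
# Polyphase resampling by the rational factor 1/2 (with anti-aliasing).
audio_8k = resample_poly(audio, up=1, down=2)
wavfile.write("utterance_8k.wav", 8000, audio_8k.astype(audio.dtype))
```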
2.2 Transcriptions

All symbols other than letters of the Vietnamese alphabet (punctuation marks, brackets, etc.) were removed from the transcriptions and the text was converted to uppercase.

Numerals composed of digits were expanded to their textual form. We took the textual transcriptions of the basic numerals (0, 1, 2, ..., 100, 1000, ...) in Vietnamese. The procedure to compose a number in Vietnamese is simple compared to some other languages and follows very logical rules. Thus, the textual translation was created iteratively for every number composed of digits.
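As an illustration of these rules, a toy sketch of the digit-to-text expansion for the range 0-99, including the common "mốt"/"lăm" irregularities (the full procedure in the paper also covered hundreds, thousands, etc.):

```python
# Toy sketch of Vietnamese digit-to-text expansion, restricted to 0-99;
# the procedure described in the paper covered larger numerals as well.
DIGITS = ["không", "một", "hai", "ba", "bốn",
          "năm", "sáu", "bảy", "tám", "chín"]

def expand(n):
    if n < 10:
        return DIGITS[n]
    tens, units = divmod(n, 10)
    words = ["mười"] if tens == 1 else [DIGITS[tens], "mươi"]
    if units == 1 and tens > 1:
        words.append("mốt")        # "một" becomes "mốt" after a tens word
    elif units == 5:
        words.append("lăm")        # "năm" becomes "lăm" after a tens word
    elif units != 0:
        words.append(DIGITS[units])
    return " ".join(words)

print(expand(21))   # hai mươi mốt
print(expand(15))   # mười lăm
```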
A transcription was a simple text file for each audio file. There was no information about transcription alignment, so a transcription could match the whole audio. Audio files longer than 1 minute were discarded from the first iteration of LVCSR training (described in 3.1) due to high memory demands. We used the alignment obtained during training to split transcriptions into smaller segments: at every detected silence longer than 0.5 s, the segment was divided, and if a segment lasted longer than 15 seconds, it was also split at the first possible detected silence regardless of its duration. This allowed us to utilize the whole training data set with acceptable memory demands during training.
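Schematically, the splitting rule can be sketched as follows, assuming the forced alignment yields a list of (start, end, is_silence) intervals in seconds (the data structure is our assumption):

```python
# Sketch of the segment-splitting rule: cut at silences longer than 0.5 s,
# and once a running segment exceeds 15 s, cut at the first silence found
# regardless of its duration. `alignment` is an assumed list of
# (start, end, is_silence) tuples from the first-iteration alignment.
def split_segments(alignment, sil_gap=0.5, max_len=15.0):
    segments = []
    seg_start = last_end = alignment[0][0]
    for start, end, is_silence in alignment:
        too_long = start - seg_start > max_len
        if is_silence and (end - start > sil_gap or too_long):
            segments.append((seg_start, start))
            seg_start = end
        last_end = end
    if last_end > seg_start:
        segments.append((seg_start, last_end))   # trailing segment
    return segments
```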
In the next step, we focused on defective audio and improper transcription texts. The average log-likelihood of speech frames was calculated by accumulating the log-likelihoods from the first-iteration system over all speech frames and dividing by the number of such frames in the given audio. The same was done for silence frames. The log-likelihood of speech was very low when the audio contained only silence/noise, when the transcription did not strongly correspond to the audio, or when a part of the transcription was missing. Therefore, we discarded defective files from further training using an ad-hoc threshold set to -100. After cleaning, 92 % of the training data remained.
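The filtering criterion amounts to a simple per-file decision; a minimal sketch, assuming per-frame log-likelihoods and a speech/silence mask from the first-iteration alignment:

```python
# Sketch of the defective-file filter: a file is kept only if the average
# log-likelihood of its speech frames exceeds the ad-hoc threshold of -100.
# `frame_logliks` and `is_speech` are assumed per-frame numpy arrays.
import numpy as np

THRESHOLD = -100.0

def keep_file(frame_logliks, is_speech):
    speech = frame_logliks[is_speech]
    if speech.size == 0:
        return False                      # no speech frames at all
    return speech.mean() > THRESHOLD
```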
Copyright is held by the author/owner(s).
MediaEval 2016 Workshop, October 20-21, 2016, Hilversum, Netherlands

2.3 Language Models (LMs)

Three LMs were created for this sub-task. The first LM, for the on-time submitted LVCSR system, was trained on text taken from the training set transcriptions.

The second LM was trained on Vietnamese subtitles. We used the provided wordlist to create a set of Vietnamese letters in order to filter out words from other languages. Again, punctuation and quotation marks, brackets and other symbols were eliminated from the text, and numerals composed of digits were transformed to textual notation in the same way as before. Sentences comprising fewer than 3 words were discarded as well. The text was converted to uppercase.
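A minimal sketch of this sentence-level cleaning, assuming digit expansion (e.g. a function like expand() above) has already been applied and that viet_letters is the letter set derived from the provided wordlist:

```python
# Sketch of the subtitle cleaning: strip punctuation and symbols, drop
# words containing characters outside the Vietnamese letter set, discard
# sentences shorter than 3 words, and uppercase the survivors.
import re

def clean_sentence(line, viet_letters):
    line = re.sub(r"[^\w\s]", " ", line)          # punctuation/symbols out
    words = [w for w in line.split()
             if set(w.lower()) <= viet_letters]   # foreign words out
    if len(words) < 3:
        return None                               # too short, discard
    return " ".join(words).upper()
```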
System                                       Devel: all (ELSA / Forvo / RhinoSpike)   Test: all (ELSA / Forvo / RhinoSpike / YouTube)
P-BUT - Babel Kaldi BLSTM 16kHz              17.9 (6.4 / 58.1 / 15.8)                 48.0 (4.9 / 55.7 / 35.4 / 87.2)
L-BUT - Babel Kaldi BLSTM 16kHz - LM tune    17.6 (6.2 / 56.4 / 16.9)                 46.3 (4.6 / 52.6 / 32.2 / 84.7)
L-BUT - Babel GMM/DNN 8kHz                   36.1 (29.7 / 68.5 / 23.4)                55.7 (28.0 / 59.3 / 44.9 / 81.4)

Table 1: Results of the LVCSR systems, overall score and single test subsets (shown in parentheses), in the WER metric [%]. The system labeled P was submitted on-time; L denotes late submission systems.

System                     Devel: all (ELSA / Forvo / RhinoSpike)   Test: all (ELSA / Forvo / RhinoSpike / YouTube)
P-BUT AUD phone-loop       5.08 (6.45 / 8.76 / 14.19)               4.56 (5.52 / 9.59 / 18.49 / 7.59)

Table 2: Results of the subword system, overall score and single test subsets (shown in parentheses), in the NMI metric.


We were also provided with a set of URLs pointing to websites in Vietnamese. We extracted the inner text from all of the HTML tags. However, the data contained a lot of unusable text, so we first removed the lines containing any special characters or digits. After that, we created a wordlist from the already cleaned data and filtered the text according to it. Duplicate lines were removed. In total, we obtained about 460k sentences to create our third LM.

These three LMs were combined together linearly (denoted as LM tune in Table 1).
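Linear combination here means interpolating the component models' probabilities, P(w|h) = Σ_i λ_i P_i(w|h) with Σ_i λ_i = 1; the interpolation weights are not reported in the paper and would typically be tuned on held-out data. A schematic sketch:

```python
# Schematic linear interpolation of several LMs. The weights (summing to
# one) are assumed to be tuned on held-out data; `lms` is an assumed list
# of objects exposing a prob(word, history) method.
def interpolated_prob(word, history, lms, lambdas):
    assert abs(sum(lambdas) - 1.0) < 1e-6
    return sum(lam * lm.prob(word, history)
               for lam, lm in zip(lambdas, lms))
```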
3. LVCSR SYSTEMS

We developed two different LVCSR systems for the first sub-task, with GMM/DNN and BLSTM architectures.
3.1 GMM/DNN

The automatic speech recognition (ASR) system developed for Babel [1] focuses on languages with a limited amount of training data. This architecture uses a Stacked Bottle-Neck Neural Network (SBN NN) for feature extraction, which outperforms standard Bottle-Neck features. It contains two consecutive NNs. The first one has four hidden layers with 1500 units each, except for the bottle-neck (BN) layer, which is the third hidden layer and has 80 neurons. Its bottle-neck outputs over 21 consecutive frames are downsampled and taken as the input to the second NN. This NN has the same structure, with a bottle-neck layer of 30 neurons, and outputs the SBN features that are used to train the GMM-HMM system.
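A hedged PyTorch sketch of this topology is given below; only the layer sizes (1500/1500/80/1500 and the 30-unit second bottle-neck) come from the text, while the input/output dimensions, activation functions and the exact 21-frame downsampling (here: every 5th frame) are our assumptions:

```python
# Sketch of the Stacked Bottle-Neck (SBN) feature extractor; values
# marked "assumed" are not given in the paper.
import torch.nn as nn

class BNNet(nn.Module):
    """Four hidden layers of 1500 units; the third hidden layer is the
    bottle-neck whose linear activations are taken as features."""
    def __init__(self, in_dim, bn_dim, n_targets):
        super().__init__()
        self.pre = nn.Sequential(
            nn.Linear(in_dim, 1500), nn.Sigmoid(),
            nn.Linear(1500, 1500), nn.Sigmoid(),
            nn.Linear(1500, bn_dim),          # bottle-neck layer
        )
        self.post = nn.Sequential(
            nn.Sigmoid(), nn.Linear(bn_dim, 1500), nn.Sigmoid(),
            nn.Linear(1500, n_targets),       # trained to classify states
        )

    def forward(self, x):
        bn = self.pre(x)                      # BN features
        return bn, self.post(bn)

first = BNNet(in_dim=440, bn_dim=80, n_targets=3000)    # in/out assumed
# 21 consecutive 80-dim BN frames, downsampled (assumed: every 5th frame,
# i.e. 5 frames), are stacked as input to the second network.
second = BNNet(in_dim=80 * 5, bn_dim=30, n_targets=3000)
```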
This HMM-based speech recognition system works with tied-state triphones and uses the standard maximum likelihood technique for training. Word transcriptions are obtained using a 3-gram LM built from the cleaned training texts.
To perform speaker adaptation, we trained a GMM system on the NN input features. A Discrete Cosine Transform (DCT) is applied to decorrelate the Mel-filterbank features (FBANK). The speaker-independent GMM-HMM system is obtained by single-pass retraining using these FBANKs. Finally, a Constrained Maximum Likelihood Linear Regression (CMLLR) transform is estimated for each speaker.
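For reference, CMLLR (also known as fMLLR) estimates one affine feature transform per speaker by maximum likelihood under the current model; in the standard textbook formulation (not taken from the paper):

```latex
% Standard CMLLR/fMLLR formulation: an affine per-speaker feature
% transform, estimated by maximizing the likelihood of that speaker's
% data under the model \lambda (T = number of frames).
\hat{\mathbf{x}}_t = \mathbf{A}^{(s)} \mathbf{x}_t + \mathbf{b}^{(s)},
\qquad
(\mathbf{A}^{(s)}, \mathbf{b}^{(s)}) = \arg\max_{\mathbf{A}, \mathbf{b}}
  \sum_{t=1}^{T} \log p(\mathbf{A}\mathbf{x}_t + \mathbf{b} \mid \lambda)
  + T \log \lvert \det \mathbf{A} \rvert
```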
We trained the systems in an iterative manner: in the first iteration, a simple monophone model was trained to obtain the alignment of the text to the audio; in the second iteration, we obtained the final full system.
3.2 BLSTM

The ASR system developed for Babel 2016 [2] focuses on model training using BLSTM networks. The BLSTM system does not outperform the classical architecture but is more stable during training. The BLSTM network architecture consists of 3 hidden layers in both directions, with 512 memory units in each layer and 300 neurons in the projection layer.

The transcriptions were not cleaned and were taken as-is to train this system. The system was built in Kaldi and is denoted as Babel Kaldi BLSTM in Table 1.
4. SUBWORD SYSTEM

The acoustic unit discovery (AUD) model presented in [4] aims at segmenting and clustering unlabeled speech data into phone-like categories. It is similar to a phone-loop model in which each phone-like unit is modeled by an HMM. This phone-loop model is fully Bayesian in the sense that:

• it incorporates a prior distribution over the parameters of the HMMs,

• it has a prior distribution over the units, modeled by a Dirichlet process [5].

Informally, the Dirichlet process prior can be seen as a standard Dirichlet distribution prior for a Bayesian mixture with an infinite number of components. However, we assume that our N data samples have been generated with only M components (M ≤ N) from the infinite mixture. Hence, the model is no longer restricted to a fixed number of components but can instead learn its complexity (i.e. the number of units used, M) from the training data, as sketched below. The priors over the GMM weights and over the Gaussian means and (diagonal) covariance matrices are a Dirichlet and a Normal-Gamma density, respectively, and were initialized as described in [6]. See [4] for the Variational Bayesian treatment of this model.
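To make the infinite-mixture intuition concrete, the DP prior over the unit weights can be written via the standard stick-breaking construction (a textbook formulation; see [4][5] for the exact model used):

```latex
% Stick-breaking construction of the Dirichlet process prior over the
% acoustic units: with concentration parameter \gamma, the prior weight
% of unit i is
v_i \sim \mathrm{Beta}(1, \gamma), \qquad
\pi_i = v_i \prod_{j=1}^{i-1} (1 - v_j), \qquad i = 1, 2, \dots
```

Since the weights π_i decay rapidly, only a finite number M of units receives noticeable posterior mass, which is what allows the model to choose its own complexity.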
5. CONCLUSION

The primary on-time system, based on BLSTM, using the original 16kHz audio and trained on the original transcriptions, resulted in an overall 48 % WER on the test set. On the ELSA test subset, the WER reached 4.9 %, which is a particularly good result; this subset probably fits the training set very well. On the contrary, the unseen YouTube test subset resulted in 87.2 % WER, the worst score among the test subsets.

The improved, late-submitted BLSTM system using the combination of LMs showed an overall 1.7 % absolute WER improvement on the test data. The improvement on every single test subset is nearly 3 % WER (except for the ELSA subset).

The late GMM/DNN system, using 8kHz audio and the cleaned texts, ended up with the worst overall score, 55.7 % WER on the test set. Compared to our best BLSTM system, the WER is higher by 9.4 % absolute overall. For the single databases seen during training, the WER degradations are: 23.4 % for ELSA, 6.7 % for Forvo and 12.6 % for RhinoSpike. The only improvement, of 3.3 % WER, was on the YouTube subset. We conclude that the GMM/DNN system is more robust on unseen data.
6.   REFERENCES
[1] Martin Karafiát, František Grézl, Mirko Hannemann,
    and Jan Černocký. BUT Neural Network Features for
    Spontaneous Vietnamese in BABEL. In Proceedings of
    ICASSP 2014, pages 5659–5663. IEEE Signal
    Processing Society, 2014.
[2] Martin Karafiát, Murali Karthick Baskar, Pavel
    Matějka, Karel Veselý, František Grézl, and
    Jan "Honza" Černocký. Multilingual BLSTM and
    Speaker-Specific Vector Adaptation in 2016 BUT Babel
    System. Accepted at SLT 2016, 2016.
[3] Igor Szöke and Xavier Anguera. Zero-Cost Speech
    Recognition Task at MediaEval 2016. In Working Notes
    Proceedings of the MediaEval 2016 Workshop,
    Hilversum, Netherlands, October 20-21 2016.
[4] Lucas Ondel, Lukáš Burget, and Jan Černocký.
    Variational Inference for Acoustic Unit Discovery. In
    Procedia Computer Science, volume 81, pages 80–86.
    Elsevier Science, 2016.
[5] Charles E. Antoniak. Mixtures of Dirichlet Processes
    with Applications to Bayesian Nonparametric
    Problems. Annals of Statistics, 2(6), November 1974.
[6] Chia-ying Lee and James Glass. A Nonparametric
    Bayesian Approach to Acoustic Model Discovery. In
    Proceedings of the 50th Annual Meeting of the
    Association for Computational Linguistics: Long
    Papers - Volume 1, ACL ’12, pages 40–49, Stroudsburg,
    PA, USA, 2012. Association for Computational
    Linguistics.