<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BUT Zero-Cost Speech Recognition 2016 System Description</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <addr-line>Miroslav Skácel, Martin Karafiát, Lucas Ondel, Albert Uchytil, Igor Szöke</addr-line>
          <institution>BUT</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>This paper describes our work on developing speech recognizers for Vietnamese. It focuses on procedures for precise preparation of the provided data, with particular attention to analysis of the textual transcriptions. Methods to filter out defective data and thus improve the performance of the final system are proposed and described in detail. We also propose cleaning of other textual data used for language modeling. Several architectures are investigated to reach the goals of both sub-tasks. The achieved results are discussed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        For the Zero-Cost 2016 Speech Recognition task, we
developed one Large Vocabulary Continuous Speech Recognition
(LVCSR) system and one subword system for on-time
submission, and two more LVCSR systems for late submission.
The LVCSR systems were based on our previous experience from
the Babel Program [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We present two types of LVCSRs.
The first type uses a Gaussian Mixture Model (GMM), Hidden
Markov Model (HMM) and Deep Neural Network (DNN)
from Babel 2014 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The second one adopts the Bidirectional
Long Short-Term Memory (BLSTM) approach from Babel
2016 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Our goal was to modify and apply the existing
Babel LVCSR systems to this year's target language,
Vietnamese. For the subword sub-task, we exploited an acoustic
unit discovery model. See [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for more details on each of the
sub-tasks.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. DATA PREPARATION</title>
      <p>The given mix of audio data, transcripts and additional
texts was preprocessed and cleaned at a basic level before
the training of our systems.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1. Audio</title>
      <p>For the BLSTM system, the original 16kHz audio was used.
For the GMM/DNN based system, the original audio was
downsampled from 16kHz to 8kHz to fit our training scripts. We
also used the information about the audio length to process the
transcription texts later.</p>
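      <p>The 16kHz-to-8kHz conversion amounts to halving the sample rate. A minimal Python sketch of the idea on a synthetic signal (an assumption for illustration only: the paper does not name its resampling tool, and a real pipeline would low-pass filter before decimating to avoid aliasing):</p>

```python
import math

RATE_IN, RATE_OUT = 16000, 8000

# one second of a synthetic 440 Hz tone sampled at 16 kHz
samples = [int(10000 * math.sin(2 * math.pi * 440 * t / RATE_IN))
           for t in range(RATE_IN)]

# naive decimation by 2: keep every other sample to reach 8 kHz
down = samples[::2]

print(len(samples), len(down))  # 16000 8000
```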
    </sec>
    <sec id="sec-4">
      <title>2.2. Transcriptions</title>
      <p>All symbols other than letters of the Vietnamese alphabet
(punctuation marks, brackets, etc.) were removed from the
transcriptions, and the text was converted to uppercase.</p>
      <p>Numerals written as digits were expanded to their
textual form. We took the textual transcriptions of the basic numerals
(0, 1, 2, ..., 100, 1000, ...) in Vietnamese. The
procedure to compose a number in Vietnamese is simple
compared to some other languages and follows very logical rules.
Thus, the textual form was iteratively created for
every number written in digits.</p>
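      <p>The compositional rule can be sketched as follows. This is a simplified illustration covering only regular forms below 1000; irregular Vietnamese forms (e.g. "mốt", "lăm", "linh") and the paper's actual rule set are omitted:</p>

```python
# basic numeral transcriptions for digits 0-9
UNITS = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]

def number_to_words(n: int) -> str:
    """Compose a Vietnamese reading for 0 <= n < 1000 (regular forms only)."""
    if n < 10:
        return UNITS[n]
    if n < 100:
        tens, unit = divmod(n, 10)
        head = "mười" if tens == 1 else UNITS[tens] + " mươi"
        return head if unit == 0 else head + " " + UNITS[unit]
    hundreds, rest = divmod(n, 100)
    head = UNITS[hundreds] + " trăm"
    return head if rest == 0 else head + " " + number_to_words(rest)

print(number_to_words(42))   # -> "bốn mươi hai"
print(number_to_words(100))  # -> "một trăm"
```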
      <p>The transcription was a simple text file for each of the audio files.
There was no information about transcription alignment, so
each transcript was assumed to match the whole audio. Audio files longer than 1
minute were discarded from the first iteration of LVCSR
training (described in 3.1) due to high memory demands. We
used the alignment obtained during training to split the
transcriptions into smaller segments. The segment was divided
at every detected silence longer than 0.5 s. If a segment still
lasted longer than 15 seconds, it was also split at the first
detected silence regardless of its duration. This allowed us
to utilize the whole training data set with acceptable
memory demands during training.</p>
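      <p>The two splitting rules above can be sketched as follows, assuming the aligner yields a list of silence intervals (start, end) in seconds per file; the function name and midpoint-cut convention are illustrative, not from the paper:</p>

```python
def split_points(silences, total_dur, min_sil=0.5, max_seg=15.0):
    """Return cut times for one audio file.

    Rule 1: cut in the middle of every silence longer than min_sil.
    Rule 2: if a resulting segment still exceeds max_seg, cut at the
    first silence inside it regardless of that silence's duration.
    """
    cuts = [(s + e) / 2 for s, e in silences if e - s > min_sil]
    bounds = [0.0] + sorted(cuts) + [total_dur]
    extra = []
    for left, right in zip(bounds, bounds[1:]):
        if right - left > max_seg:
            for s, e in silences:
                if left < s and e < right:   # first silence inside the long segment
                    extra.append((s + e) / 2)
                    break
    return sorted(cuts + extra)

sils = [(3.0, 3.6), (10.0, 10.2), (20.0, 20.7)]
print(split_points(sils, 25.0))
```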
      <p>In the next step, we focused on defective audio and
improper transcription texts. The average log-likelihood of
speech frames was calculated by accumulating the log-likelihoods
from the first-iteration system over all speech frames and
dividing by the number of speech frames in the given audio. The same
was done for silence frames. The log-likelihood of speech was
very low when the audio contained silence/noise only, when the
transcription did not correspond to the audio, or when
a part of the transcription was missing. Therefore, we
discarded defective files from further training using an ad-hoc
threshold set to -100. After cleaning, 92 % of the training data remained.</p>
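      <p>The cleaning criterion reduces to a per-file average and a threshold test. A minimal sketch (the function names and the example frame scores are made up for illustration; only the threshold of -100 comes from the text):</p>

```python
def average_loglik(frame_logliks):
    """Accumulated log-likelihood divided by the number of frames."""
    return sum(frame_logliks) / len(frame_logliks)

def keep_file(speech_frame_logliks, threshold=-100.0):
    """Keep a file only if its average speech log-likelihood clears the threshold."""
    return average_loglik(speech_frame_logliks) >= threshold

print(keep_file([-80.0, -90.0]))    # True: average -85 is above the threshold
print(keep_file([-150.0, -120.0]))  # False: average -135 suggests a bad transcript
```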
    </sec>
    <sec id="sec-5">
      <title>2.3. Language Models (LMs)</title>
      <p>Three LMs were created for this sub-task. The first LM,
for the on-time submitted LVCSR system, was trained on text
taken from the training set transcriptions.</p>
      <p>The second LM was trained on Vietnamese subtitles. We
used the provided wordlist to create a set of Vietnamese letters
in order to filter out words from other languages. Again,
punctuation and quotation marks, brackets and other symbols were
eliminated from the text. Numerals written as digits were
transformed to their textual notation in the same way as
previously. Sentences comprising fewer than 3 words were
discarded as well. The text was converted to uppercase.</p>
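      <p>A condensed sketch of this cleaning pipeline. Assumptions: the letter set here is only the base Vietnamese alphabet without diacritics (the paper derives the real set from the provided wordlist), and for brevity whole lines with out-of-set characters are dropped rather than individual words:</p>

```python
import re

# illustrative base alphabet only (no F, J, W, Z; diacritics omitted)
LETTERS = set("ABCDEGHIKLMNOPQRSTUVXY")

def clean_line(line):
    line = re.sub(r"[^\w\s]", " ", line)        # strip punctuation, brackets, quotes
    line = re.sub(r"\s+", " ", line).strip().upper()
    words = line.split()
    if len(words) < 3:                          # discard sentences under 3 words
        return None
    if any(set(w) - LETTERS for w in words):    # drop lines with out-of-set chars
        return None
    return line

print(clean_line("ha noi, viet nam!"))  # "HA NOI VIET NAM"
print(clean_line("ok go"))              # None (fewer than 3 words)
```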
      <p>We were provided with a set of URLs which pointed to
websites in Vietnamese. We extracted the inner text from
all of the HTML tags. However, the data contained a lot of
unusable text. We first removed the lines containing any special
characters or digits. After that, we created a wordlist
from the already cleaned data and filtered the text according to
it. Duplicate lines were removed. In total, we obtained
about 460k sentences to create our third LM.</p>
      <p>[Table 1: results of the systems P-BUT Babel Kaldi BLSTM 16kHz, L-BUT Babel Kaldi BLSTM 16kHz with LM tune, L-BUT Babel GMM/DNN 8kHz, and P-BUT AUD phone-loop, on the Devel set (ELSA / Forvo / RhinoSpike) and the Test set (ELSA / Forvo / RhinoSpike / YouTube).]</p>
      <p>These three LMs were combined by linear interpolation
(denoted as LM tune in Table 1).</p>
    </sec>
    <sec id="sec-6">
      <title>3. LVCSR SYSTEMS</title>
      <p>We developed two different LVCSR systems for the first
sub-task: GMM/DNN and BLSTM architectures.</p>
      <p>3.1. GMM/DNN</p>
      <p>
        The automatic speech recognition (ASR) system
developed for Babel [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] focuses on languages with a limited amount
of training data. This architecture uses a Stacked Bottle-Neck
Neural Network (SBN NN) for feature extraction, which
outperforms standard Bottle-Neck features. It contains two
consecutive NNs. The first one has four hidden layers with 1500
units each, except for the bottle-neck layer; the BN layer is the
third hidden layer, with 80 neurons. Its output over 21 frames is
downsampled and taken as the input to the second NN. This
NN has the same structure, but its bottle-neck layer consists of
30 neurons. It outputs the SBN features that are used to train the
GMM-HMM system.
      </p>
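      <p>The tensor shapes implied by the layer sizes above can be checked with a toy forward pass. This is only a shape sketch under stated assumptions: the weights are random (not trained), the 440-dim input and the tiling used in place of real 21-frame context stacking are illustrative, not from the paper:</p>

```python
import numpy as np

def mlp(dims, x):
    """Random-weight MLP forward pass; returns the activation at every layer."""
    acts = [x]
    for d_in, d_out in zip(dims, dims[1:]):
        W = np.random.randn(d_in, d_out) * 0.01
        acts.append(np.tanh(acts[-1] @ W))
    return acts

x = np.random.randn(1, 440)                # illustrative input feature dimension
# first NN: four hidden layers of 1500, except the third (bottleneck) with 80
acts1 = mlp([440, 1500, 1500, 80, 1500], x)
bn1 = acts1[3]                             # 80-dim bottleneck features
# 21 bottleneck frames are stacked/downsampled and fed to the second NN
stacked = np.tile(bn1, (1, 21))            # placeholder for 21-frame stacking
acts2 = mlp([21 * 80, 1500, 1500, 30, 1500], stacked)
sbn = acts2[3]                             # 30-dim SBN features for GMM-HMM training
print(bn1.shape, sbn.shape)
```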
      <p>This HMM-based speech recognition system works with
tied-state triphones and uses the standard maximum likelihood
technique for training. Word transcriptions are obtained using a
3-gram LM trained on the cleaned training texts.</p>
      <p>To perform speaker adaptation, we trained the GMM
system on the NN input features. A Discrete Cosine
Transform (DCT) follows to decorrelate the Mel-filterbank features
(FBANK). The speaker-independent GMM-HMM system
is obtained by single-pass retraining using these FBANKs.
Finally, a Constrained Maximum Likelihood Linear Regression
(CMLLR) transform is estimated for each speaker.</p>
      <p>We trained the systems in an iterative manner. In the first
iteration, a simple monophone model was trained to get the
alignment of the text to the audio. In the second iteration,
we obtained the final full system.</p>
    </sec>
    <sec id="sec-7">
      <title>3.2. BLSTM</title>
      <p>
        The ASR system developed for Babel 2016 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] focuses on
model training using BLSTM networks. The BLSTM
system does not outperform the classical architecture but is
more stable during training. The BLSTM network
architecture consists of 3 hidden layers in each direction, with
512 memory units in each layer and 300 neurons
in the projection layer.
      </p>
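      <p>For a sense of scale, a rough parameter count for these layer sizes, using the common LSTM-with-projection (LSTMP) parameterization. Assumptions: the 40-dim input is illustrative, biases and peepholes are omitted, and the toolkit's exact parameterization may differ:</p>

```python
def lstmp_params(n_in, n_cell, n_proj):
    """Weights of one LSTMP direction: four gate matrices plus the projection."""
    gates = 4 * n_cell * (n_in + n_proj)   # input/forget/cell/output gates
    proj = n_cell * n_proj                 # cell-to-projection matrix
    return gates + proj

n_in, n_cell, n_proj = 40, 512, 300       # input dim is an illustrative assumption
total = 0
for layer in range(3):                    # 3 layers, forward and backward directions
    layer_in = n_in if layer == 0 else 2 * n_proj   # upper layers see both directions
    total += 2 * lstmp_params(layer_in, n_cell, n_proj)
print(total)
```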
      <p>The transcriptions were not cleaned and were taken as is
to train this system. The system was created in Kaldi and
it is denoted as Babel Kaldi BLSTM in Table 1.</p>
    </sec>
    <sec id="sec-8">
      <title>4. SUBWORD SYSTEM</title>
      <p>
        The acoustic unit discovery (AUD) model presented in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
aims at segmenting and clustering unlabeled speech data
into phone-like categories. It is similar to a phone-loop
model in which each phone-like unit is modeled by an HMM.
This phone-loop model is fully Bayesian in the sense that
it incorporates a prior distribution over the parameters
of the HMMs, and it has a prior distribution over the units,
modeled by a Dirichlet process [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Informally, the Dirichlet process prior can be seen as a
standard Dirichlet distribution prior for a Bayesian mixture
with an infinite number of components. However, we assume
that our N data samples have been generated with only M
components (M ≪ N) of the infinite mixture. Hence,
the model is no longer restricted to a fixed number of
components but instead can learn its complexity (i.e. the
number of units used, M) from the training data. The
priors over the GMM weights and over the Gaussian means and
(diagonal) covariance matrices are a Dirichlet and a Normal-Gamma
density, respectively, and were initialized as described in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
See [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for the Variational Bayesian treatment of this model.
      </p>
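      <p>The behavior that only a few of the infinitely many units receive appreciable weight can be illustrated with the standard stick-breaking construction of the Dirichlet process. This is a generic illustration, not the paper's inference; the concentration parameter, truncation level and weight threshold are assumptions:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, truncation = 1.0, 100                       # DP concentration, truncation level

v = rng.beta(1.0, alpha, size=truncation)          # stick-breaking fractions
remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
weights = v * remaining                            # unit weights, summing to at most 1

# number of "effective" units M with non-negligible weight
M = int(np.sum(weights > 1e-3))
print(M, truncation)                               # M is much smaller than the truncation
```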
    </sec>
    <sec id="sec-9">
      <title>5. CONCLUSION</title>
      <p>The primary on-time system, based on BLSTM using the
original 16kHz audio and trained on the original transcriptions,
resulted in an overall 48 % WER on the test set. For the
ELSA test subset, the WER reached 4.9 %, which is a
particularly good result; this subset probably fits the training
set very closely. On the contrary, the unseen YouTube
test subset resulted in 87.2 % WER, the worst score
of all test subsets.</p>
      <p>The improved, late-submitted BLSTM system using the
combination of LMs showed an overall 1.7 % WER
improvement on the test data. The improvement on every single
test subset (except the ELSA subset) is nearly 3 % WER.</p>
      <p>The late GMM/DNN system using 8kHz audio and cleaned
texts ended up with the worst overall score, 55.7 % WER on
the test set. Compared to our best BLSTM system, the
WER is worse by 9.4 % absolute overall. The results for the
individual databases (seen during training) are as follows: a
degradation of 23.4 % for ELSA, of 6.7 % for Forvo, and
of 12.6 % for RhinoSpike. The only improvement, of 3.3 %,
was on the YouTube subset. The conclusion
is that the GMM/DNN system is more robust on unseen data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Karafiát</surname>
          </string-name>
          , Frantisek Grezl, Mirko Hannemann, and
          <string-name>
            <given-names>Jan</given-names>
            <surname>Cernocky</surname>
          </string-name>
          .
          <article-title>BUT Neural Network Features for Spontaneous Vietnamese in BABEL</article-title>
          .
          <source>In Proceedings of ICASSP 2014</source>
          , pages
          <fpage>5659</fpage>
          –
          <lpage>5663</lpage>
          .
          <source>IEEE Signal Processing Society</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Karafiát</surname>
          </string-name>
          , Murali Karthick Baskar, Pavel Matejka, Karel Vesely, Frantisek Grezl, and Jan "Honza" Cernocky.
          <article-title>Multilingual BLSTM and Speaker-Specific Vector Adaptation in 2016 BUT Babel System</article-title>
          .
          <source>Accepted at SLT 2016</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Igor</given-names>
            <surname>Szöke</surname>
          </string-name>
          and Xavier Anguera.
          <article-title>Zero-Cost Speech Recognition Task at Mediaeval 2016</article-title>
          .
          <source>In Working Notes Proceedings of the Mediaeval 2016 Workshop</source>
          , Hilversum, Netherlands, October 20–21,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Lucas</given-names>
            <surname>Ondel</surname>
          </string-name>
          , Lukas Burget, and
          <string-name>
            <given-names>Jan</given-names>
            <surname>Cernocky</surname>
          </string-name>
          .
          <article-title>Variational Inference for Acoustic Unit Discovery</article-title>
          .
          <source>In Procedia Computer Science</source>
          , volume
          <volume>2016</volume>
          , pages
          <fpage>80</fpage>
          –
          <lpage>86</lpage>
          . Elsevier Science
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Charles E.</given-names>
            <surname>Antoniak</surname>
          </string-name>
          .
          <article-title>Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems</article-title>
          .
          <source>Annals of Statistics</source>
          ,
          <volume>2</volume>
          (
          <issue>6</issue>
          ),
          <year>November 1974</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Chia-ying</given-names>
            <surname>Lee</surname>
          </string-name>
          and
          <string-name>
            <given-names>James</given-names>
            <surname>Glass</surname>
          </string-name>
          .
          <article-title>A Nonparametric Bayesian Approach to Acoustic Model Discovery</article-title>
          .
          <source>In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL '12</source>
          , pages
          <fpage>40</fpage>
          –
          <lpage>49</lpage>
          , Stroudsburg, PA, USA,
          <year>2012</year>
          . Association for Computational Linguistics
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>