<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LTL-UDE at Low-Resource Speech-to-Text Shared Task: Investigating Mozilla DeepSpeech in a low-resource setting</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aashish Agarwal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Torsten Zesch</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Language Technology Lab, University of Duisburg-Essen</institution>
          , Duisburg,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We describe our system participating in the SwissText/KONVENS shared task on low-resource speech-to-text (Plüss et al., 2020). We train an end-to-end neural model based on Mozilla DeepSpeech. We examine various methods to improve over the baseline results: transfer learning from standard German and English, data augmentation, and post-processing. Our best system achieves a somewhat disappointing WER of 58.9% on the held-out test set, indicating that it is currently challenging to obtain good results with this approach in a low-resource setting.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Recently, end-to-end models like DeepSpeech
have been introduced as an alternative to
traditional HMM-DNN based systems like Kaldi
        <xref ref-type="bibr" rid="ref14">(Povey et al., 2011)</xref>
        . However, they are relatively
data hungry, i.e. they require large amounts of
annotated data to work well. For example, the
original DeepSpeech implementation from Baidu
        <xref ref-type="bibr" rid="ref7">(Hannun et al., 2014)</xref>
        was trained on 7,380 hours
of data, DeepSpeech2
        <xref ref-type="bibr" rid="ref2">(Amodei et al., 2015)</xref>
        was
trained on 11,940 hours of data and DeepSpeech3
        <xref ref-type="bibr" rid="ref5">(Battenberg et al., 2017)</xref>
        was trained on about
10,000 hours of data. Such large datasets are
usually only available for languages like English or
Mandarin, but even for major languages like
German much less data is available and consequently
DeepSpeech models do not perform well
        <xref ref-type="bibr" rid="ref1">(Agarwal and Zesch, 2019)</xref>
        .
      </p>
      <p>In this paper, we examine how well DeepSpeech
performs in a truly low-resource setting like Swiss
German, where less than 100 hours of annotated
data are available. Previous speech recognition
systems for Swiss German
        <xref ref-type="bibr" rid="ref6 ref17">(Garner et al., 2014;
Stadtschnitzer and Schmidt, 2018)</xref>
        are based on
Kaldi.
      </p>
      <p>[Table 1: Training datasets by language with size in hours. Swiss German: SwissText Shared Task, ArchiMob; German: Voxforge, TUDA-De, M-AILABS, MCV_v4; English: LibriSpeech, MCV.]</p>
    </sec>
    <sec id="sec-2">
      <title>2 Model Training</title>
      <p>We used DeepSpeech version 0.6.0
(https://github.com/mozilla/DeepSpeech/releases/tag/v0.6.0)
for all experiments.</p>
      <sec id="sec-2-1">
        <title>2.1 Datasets</title>
        <p>To train the Swiss German DeepSpeech model, we
utilized the publicly available datasets
shown in Table 1.</p>
        <p>For Swiss German we used the official data
provided by the shared task
        <xref ref-type="bibr" rid="ref13">(Plüss et al., 2020)</xref>
        . The corpus contains 70 hours of spoken Swiss
German (predominantly in the Bernese dialect)
and some Standard German speech from the
parliament of the canton of Bern
(https://swisstext-and-konvens-2020.org/low-resource-speech-to-text/).
We additionally use the ArchiMob
        <xref ref-type="bibr" rid="ref16">(Samardžić et al., 2016)</xref>
        corpus, which represents German linguistic varieties
spoken within the territory of Switzerland and
contains long samples of transcribed speech in Swiss
German. The corpus contains 57 hours and is
available under a Creative Commons 4.0 licence
(https://www.spur.uzh.ch/en/departments/research/textgroup/ArchiMob.html).</p>
        <p>
          As the amount of data is probably not sufficient
to train a good model, we experiment with
transfer learning from standard German. Publicly
available datasets include Voxforge
(http://www.voxforge.org/home/forums/other-languages/german/open-speech-data-corpus-for-german),
TUDA-De
          <xref ref-type="bibr" rid="ref10">(Milde and Köhn, 2018)</xref>
          , M-AILABS
(https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/), and Mozilla
Common Voice
          <xref ref-type="bibr" rid="ref3">(Ardila et al., 2019)</xref>
          . Together
these datasets add almost 1,000 hours of
additional training data (although in the wrong German
dialect). The datasets also do not contain
political speeches and are thus a less than ideal starting
point for transfer learning.
        </p>
        <p>
          As there has been previous work on
transfer learning from models trained on a different
language
          <xref ref-type="bibr" rid="ref4 ref9">(Kunze et al., 2017; Bansal et al., 2018)</xref>
          ,
we also consider English corpora: LibriSpeech
          <xref ref-type="bibr" rid="ref11">(Panayotov et al., 2015)</xref>
          and Mozilla Common
Voice (https://voice.mozilla.org/en). These are among the largest and most widely
used open-source corpora. LibriSpeech consists
of 16 kHz read English speech derived from
audiobooks from the LibriVox project and has been
carefully segmented and aligned (http://www.openslr.org/12/).
The Mozilla Common Voice project, on the other hand, employs
crowdsourcing to collect data on its portal.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Server &amp; Runtime</title>
        <p>We trained and tested our models on a compute
server with 56 Intel(R) Xeon(R) Gold 5120
CPUs @ 2.20 GHz and 3 Nvidia Quadro RTX 6000
GPUs with 24 GB of RAM each. Typical training time
with augmentation is 1.5 hours for the SwissText dataset,
12 hours for German, and 30 hours for English.
Without augmentation, training time
was approximately 10% lower.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3 Preprocessing</title>
        <p>We cleaned the data by keeping only the set
of characters allowed by the shared task. We
converted all transcriptions to lower case and
ensured that all audio clips are in wav format. The
resulting samples were split into training (70%),
validation (15%), and test (15%) data. The
preprocessing scripts are available on GitHub
(https://github.com/AASHISHAG/deepspeech-swiss-german).</p>
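        <p>For illustration, the following is a minimal sketch of such a preprocessing pipeline (not the authors' released script), assuming a list of (audio path, transcript) pairs and the DeepSpeech CSV manifest format with the columns wav_filename, wav_filesize, and transcript; the allowed character set shown is a placeholder.</p>
        <preformat>
# Minimal preprocessing sketch: keep only allowed characters, lowercase the
# transcripts, and write DeepSpeech-style CSV manifests with a 70/15/15 split.
# ALLOWED and the (audio_path, transcript) input pairs are placeholders.
import csv
import os
import random

ALLOWED = set("abcdefghijklmnopqrstuvwxyzäöü' ")  # placeholder character set

def clean(transcript):
    """Lowercase and drop every character outside the allowed set."""
    return "".join(c for c in transcript.lower() if c in ALLOWED)

def write_manifest(rows, path):
    """DeepSpeech manifests list wav_filename, wav_filesize and transcript."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_path, text in rows:
            writer.writerow([wav_path, os.path.getsize(wav_path), clean(text)])

def split_and_write(samples, out_dir):
    """Shuffle and split into 70% train, 15% dev and 15% test manifests."""
    random.shuffle(samples)
    n = len(samples)
    train = samples[:int(0.7 * n)]
    dev = samples[int(0.7 * n):int(0.85 * n)]
    test = samples[int(0.85 * n):]
    for name, rows in [("train", train), ("dev", dev), ("test", test)]:
        write_manifest(rows, os.path.join(out_dir, name + ".csv"))
        </preformat>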
        <p>[Table 2: Hyperparameter values (batch size, dropout, learning rate) and the decoder weights α and β for English and German.]</p>
        <p>
          For the acoustic model, we use the best
hyperparameters reported by
          <xref ref-type="bibr" rid="ref1">Agarwal and Zesch
(2019)</xref>
          , listed in Table 2.
        </p>
        <p>
          We use a probabilistic 3-gram language model
based on KenLM
          <xref ref-type="bibr" rid="ref8">(Heafield, 2011)</xref>
          , trained on
the German-English part of Europarl
(https://www.statmt.org/europarl/) as well as
the corpus used to train the TUDA-De language
model
          <xref ref-type="bibr" rid="ref15">(Radeck-Arneth et al., 2015)</xref>
          . For German,
we searched for a good set of decoder weights α and β and obtained the
best results with the values listed in Table 2.
For English, we took the values of α and β
from the DeepSpeech release page
(https://github.com/mozilla/DeepSpeech/releases/tag/v0.6.0).
        </p>
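        <p>As a rough sketch of how such a 3-gram model can be built, the snippet below assumes the KenLM tools lmplz and build_binary are installed and that corpus.txt stands in for the concatenated, lowercased LM training text; DeepSpeech 0.6 additionally expects a trie generated from the binary model and the alphabet with its generate_trie tool, which is not shown here.</p>
        <preformat>
# Sketch of building a 3-gram KenLM language model (assumes the KenLM
# command-line tools lmplz and build_binary are on the PATH; corpus.txt is a
# placeholder for the concatenated, lowercased training text).
import subprocess

def build_lm(corpus_txt, arpa_path, binary_path):
    # Estimate a 3-gram model in ARPA format from the text corpus.
    with open(corpus_txt, "rb") as src, open(arpa_path, "wb") as dst:
        subprocess.run(["lmplz", "-o", "3"], stdin=src, stdout=dst, check=True)
    # Convert the ARPA file to KenLM's binary format for faster loading.
    subprocess.run(["build_binary", arpa_path, binary_path], check=True)

build_lm("corpus.txt", "lm.arpa", "lm.binary")
        </preformat>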
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Experiments</title>
      <p>As the baseline, we train DeepSpeech with
the setup described above, using only the Swiss
German data provided by the shared task. This
model achieves a WER of 71.5%. As expected,
DeepSpeech is not able to learn a suitable
model from this amount of training data alone.</p>
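      <p>To make the setup concrete, the following sketch shows roughly how such a baseline run is launched with the version 0.6 training script; all paths and hyperparameter values are placeholders, and flag names may differ between DeepSpeech releases.</p>
      <preformat>
# Rough sketch of a baseline DeepSpeech 0.6-style training run. All paths and
# hyperparameter values are placeholders; flag names may differ between releases.
import subprocess

subprocess.run([
    "python", "DeepSpeech.py",
    "--train_files", "data/train.csv",     # manifests from the preprocessing step
    "--dev_files", "data/dev.csv",
    "--test_files", "data/test.csv",
    "--alphabet_config_path", "data/alphabet.txt",
    "--lm_binary_path", "data/lm.binary",  # KenLM model from Section 2.3
    "--lm_trie_path", "data/trie",
    "--checkpoint_dir", "checkpoints/swiss",
    "--export_dir", "models/swiss",
    "--epochs", "30",
    "--learning_rate", "0.0001",
    "--dropout_rate", "0.25",
], check=True)
      </preformat>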
      <p>We try to improve over those results using data
augmentation and transfer learning as discussed in
the remainder of this section.</p>
      <sec id="sec-3-1">
        <title>3.1 Data Augmentation</title>
        <p>Augmentation is a useful technique for improving the
generalization of machine learning models. Inspired
by Park et al.
          <xref ref-type="bibr" rid="ref12">(2019)</xref>
          , Mozilla DeepSpeech
implements several augmentation techniques such as
frequency masking, time masking, speed scaling,
and pitch scaling. We used all of these augmentation
approaches with their default hyperparameters
(https://deepspeech.readthedocs.io/en/v0.7.0/TRAINING.html#training-with-augmentation).
Augmentation actually increases the model error
from 71.5% to 74.3%. However, we also test the
impact of augmentation on our transfer learning
results discussed below.</p>
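        <p>For illustration, the sketch below applies SpecAugment-style frequency and time masking to a spectrogram; this is not DeepSpeech's own implementation, and the mask widths are arbitrary placeholders.</p>
        <preformat>
# Illustrative SpecAugment-style masking (Park et al., 2019) on a spectrogram.
# Not DeepSpeech's implementation; mask widths are arbitrary placeholders.
import numpy as np

def freq_mask(spec, max_width=8):
    """Zero out a random band of frequency bins; spec has shape (freq, time)."""
    out = spec.copy()
    width = np.random.randint(0, max_width + 1)
    start = np.random.randint(0, max(1, out.shape[0] - width))
    out[start:start + width, :] = 0.0
    return out

def time_mask(spec, max_width=20):
    """Zero out a random span of time frames."""
    out = spec.copy()
    width = np.random.randint(0, max_width + 1)
    start = np.random.randint(0, max(1, out.shape[1] - width))
    out[:, start:start + width] = 0.0
    return out

# Example: augment a dummy 80-bin x 300-frame spectrogram.
augmented = time_mask(freq_mask(np.random.rand(80, 300)))
        </preformat>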
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Transfer Learning</title>
        <p>As discussed above, end-to-end training
of automated speech recognition systems requires
massive amounts of data. As we only have 70 hours of training
data available from the shared task, we experiment
with transferring models from different starting
points. Table 3 gives an overview of the results.
Transferring from about 2,500 hours of English
data gives about the same results as starting from
about 1,000 hours of German data, even though
standard German is closer to Swiss German than
English. However, the best results are achieved when
starting with English, transferring to German, and
then transferring to Swiss German. Data augmentation in
this case improves results slightly, yielding a final WER of
61.5%.</p>
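        <p>The sketch below only illustrates how the training stages are chained from English over standard German to Swiss German by reusing a checkpoint directory; paths are placeholders, and transferring across languages with different alphabets required the dedicated transfer-learning options of the DeepSpeech code base (e.g. dropping and re-initializing the output layer), which are omitted here.</p>
        <preformat>
# Sketch of the staged transfer setup (English, then German, then Swiss German).
# Paths are placeholders; each stage continues from the weights left in the
# shared checkpoint directory by the previous stage. Cross-alphabet transfer
# needs DeepSpeech's transfer-learning options, which are not shown here.
import subprocess

def train_stage(train_csv, dev_csv, checkpoint_dir):
    """Run one training stage, reusing whatever weights are in checkpoint_dir."""
    subprocess.run([
        "python", "DeepSpeech.py",
        "--train_files", train_csv,
        "--dev_files", dev_csv,
        "--checkpoint_dir", checkpoint_dir,
    ], check=True)

for stage in ["english", "german", "swiss_german"]:
    train_stage("data/" + stage + "/train.csv",
                "data/" + stage + "/dev.csv",
                "checkpoints/transfer")
        </preformat>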
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Error Analysis</title>
      <p>When analyzing the errors made by DeepSpeech,
one issue stands out: truncated output. Quite a lot
of output texts are much shorter than the source
transcript. Table 4 shows some examples. Model
performance is seriously impacted when the
output sentences are not long enough. It is therefore
informative to only look at output text that is
about the same length as the original transcript.
Figure 1 displays the distribution of samples by the
ratio of output length to reference length in
characters. The figure shows that almost all DeepSpeech
outputs are shorter than the original. If we only look at the
samples that are about the expected length
(with a ratio higher than 0.75, which is still about
half of all samples), we find that WER improves
from 61.5% to 47.7%. This means that when the
model outputs a string of approximately the
correct length, it is actually much better than the
results in Table 3 indicate.</p>
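      <p>The length-ratio analysis can be reproduced with a few lines of code; the sketch below assumes decoded (reference, hypothesis) pairs from the validation set and uses the jiwer package to compute WER.</p>
      <preformat>
# Sketch of the length-ratio filtering: keep only hypotheses whose character
# length is at least 0.75 of the reference length, then recompute WER.
# Assumes (reference, hypothesis) string pairs from decoding the validation set.
import jiwer

def filtered_wer(pairs, min_ratio=0.75):
    """Return (WER over kept pairs, fraction of pairs kept)."""
    pairs = list(pairs)
    kept = [(ref, hyp) for ref, hyp in pairs
            if ref and len(hyp) / len(ref) >= min_ratio]
    refs = [ref for ref, _ in kept]
    hyps = [hyp for _, hyp in kept]
    return jiwer.wer(refs, hyps), len(kept) / max(1, len(pairs))
      </preformat>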
      <p>The length of the output is partly controlled by
the model’s hyperparameters. We want to find a
sequence c that maximizes the combined objective
function:</p>
      <p>
        Q(c) = log(P(c|x)) + α · log(P_lm(c)) + β · word_count(c),
where α and β control the trade-off between the
acoustic model, the language model constraint,
and the length of the sentence. The term P_lm
denotes the probability of the sequence c according
to the language model. The weight α balances
the relative contributions of the CTC network and
the language model, and the weight β controls
the number of words in the recognized transcription
        <xref ref-type="bibr" rid="ref2 ref7">(Hannun et al., 2014; Amodei et al., 2015)</xref>
        .
      </p>
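      <p>A toy version of this objective makes the roles of α and β explicit; in the sketch below, acoustic_logprob and lm_logprob stand in for the CTC network's log P(c|x) and the KenLM log-probability of c, and the example values are arbitrary.</p>
      <preformat>
# Toy illustration of the decoding objective Q(c): alpha weights the language
# model against the acoustic model, beta rewards longer transcriptions.
# acoustic_logprob and lm_logprob are placeholder inputs.
def q_score(acoustic_logprob, lm_logprob, transcript, alpha, beta):
    """Q(c) = log P(c|x) + alpha * log P_lm(c) + beta * word_count(c)."""
    return acoustic_logprob + alpha * lm_logprob + beta * len(transcript.split())

# A larger beta favours longer hypotheses, which counteracts truncated output;
# a larger alpha gives the language model more influence during decoding.
print(q_score(-42.0, -12.5, "das ist ein beispiel", alpha=0.75, beta=1.85))
      </preformat>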
      <p>By optimizing α and β, i.e. changing the relative
weight of the acoustic model and the language model,
we can improve the model a bit, as shown by the
optimized model examples in Table 4. However, we
were not able to eliminate the problem altogether.
Consequently, WER only improves from 61.5% to
57.1% (with augmentation). As the model with the
optimized hyperparameters and without
augmentation is still a bit better, we submitted that one to
the shared task. It achieved a WER of 58.9% on
the held-out test set.</p>
    </sec>
    <sec id="sec-5">
      <title>5 Summary</title>
      <p>The baseline system trained only on the
Swiss German data yields a rather high word error rate of
71.5%. Data augmentation strategies implemented
in DeepSpeech did not result in consistent
improvements. Transfer learning has a much higher
impact, reducing the word error rate by over 10
percentage points when transferring an English model to
German and finally transferring to Swiss German.
The best model yields a WER of 56.6% on our test
set (58.9% in the public ranking based on the
hidden test set of the shared task). When analyzing
the results, the model seems to suffer from
truncated output, which we can somewhat mitigate by
hyperparameter tuning. Overall, the results show
that training an end-to-end neural speech
recognition system with DeepSpeech in a low-resource
setting remains challenging.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Aashish</given-names>
            <surname>Agarwal</surname>
          </string-name>
          and
          <string-name>
            <given-names>Torsten</given-names>
            <surname>Zesch</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>German end-to-end speech recognition based on deepspeech</article-title>
          .
          <source>In Proceedings of the 15th Conference on Natural Language Processing (KONVENS</source>
          <year>2019</year>
          ), pages
          <fpage>111</fpage>
          -
          <lpage>119</lpage>
          , Erlangen, Germany. GSCL.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Dario</given-names>
            <surname>Amodei</surname>
          </string-name>
          , Rishita Anubhai, Eric Battenberg, and
          <string-name>
            <given-names>Carl</given-names>
            <surname>Case</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Deep speech 2: End-to-end speech recognition in english and mandarin</article-title>
          . CoRR, abs/1512.02595.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Rosana</given-names>
            <surname>Ardila</surname>
          </string-name>
          , Megan Branson, Kelly Davis,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Henretty</surname>
          </string-name>
          , Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and
          <string-name>
            <given-names>Gregor</given-names>
            <surname>Weber</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Common voice: A massively multilingual speech corpus</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Sameer</given-names>
            <surname>Bansal</surname>
          </string-name>
          , Herman Kamper, Karen Livescu, Adam Lopez, and
          <string-name>
            <given-names>Sharon</given-names>
            <surname>Goldwater</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Pretraining on high-resource speech recognition improves low-resource speech-to-text translation</article-title>
          .
          <source>CoRR</source>
          , abs/
          <year>1809</year>
          .01431.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Eric</given-names>
            <surname>Battenberg</surname>
          </string-name>
          , Jitong Chen, Rewon Child, Adam Coates, Yashesh Gaur,
          <string-name>
            <given-names>Yi</given-names>
            <surname>Li</surname>
          </string-name>
          , Hairong Liu, Sanjeev Satheesh, David Seetapun,
          <string-name>
            <given-names>Anuroop</given-names>
            <surname>Sriram</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Zhenyao</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Exploring neural transducers for end-to-end speech recognition</article-title>
          .
          <source>CoRR, abs/1707</source>
          .07413.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Philip N.</given-names>
            <surname>Garner</surname>
          </string-name>
          , David Imseng, and Thomas Meyer.
          <year>2014</year>
          .
          <article-title>Automatic speech recognition and translation of a swiss german dialect: Walliserdeutsch.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Awni Y.</given-names>
            <surname>Hannun</surname>
          </string-name>
          , Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and
          <string-name>
            <given-names>Andrew Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Deep speech: Scaling up end-to-end speech recognition</article-title>
          .
          <source>CoRR, abs/1412</source>
          .5567.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Kenneth</given-names>
            <surname>Heafield</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>KenLM: Faster and smaller language model queries</article-title>
          .
          <source>In Proceedings of the Sixth Workshop on Statistical Machine Translation</source>
          , pages
          <fpage>187</fpage>
          -
          <lpage>197</lpage>
          , Edinburgh, Scotland.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Julius</given-names>
            <surname>Kunze</surname>
          </string-name>
          , Louis Kirsch, Ilia Kurenkov, Andreas Krug, Jens Johannsmeier, and
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Stober</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Transfer learning for speech recognition on a budget</article-title>
          .
          <source>CoRR, abs/1706</source>
          .00290.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Milde</surname>
          </string-name>
          and
          <string-name>
            <given-names>Arne</given-names>
            <surname>Köhn</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Open source automatic speech recognition for german</article-title>
          .
          <source>CoRR</source>
          , abs/
          <year>1807</year>
          .10311.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>V.</given-names>
            <surname>Panayotov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Khudanpur</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Librispeech: An asr corpus based on public domain audio books</article-title>
          .
          <source>In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , pages
          <fpage>5206</fpage>
          -
          <lpage>5210</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Daniel S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>William</given-names>
            <surname>Chan</surname>
          </string-name>
          , Yu Zhang, ChungCheng Chiu, Barret Zoph,
          <string-name>
            <given-names>Ekin D.</given-names>
            <surname>Cubuk</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Quoc V.</given-names>
            <surname>Le</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Specaugment: A simple data augmentation method for automatic speech recognition</article-title>
          .
          <source>Interspeech</source>
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Michel</given-names>
            <surname>Plüss</surname>
          </string-name>
          , Lukas Neukom, and
          <string-name>
            <given-names>Manfred</given-names>
            <surname>Vogel</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Germeval 2020 task 4: Low-resource speech-to-text</article-title>
          .
          <source>In preparation.</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Povey</surname>
          </string-name>
          , Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and
          <string-name>
            <given-names>Karel</given-names>
            <surname>Vesely</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>The kaldi speech recognition toolkit</article-title>
          .
          <source>In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Stephan</given-names>
            <surname>Radeck-Arneth</surname>
          </string-name>
          , Benjamin Milde, Arvid Lange, Evandro Gouvêa, Stefan Radomski, Max Mühlhäuser, and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Biemann</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Open source german distant speech recognition: Corpus and acoustic model</article-title>
          .
          <source>In Text, Speech, and Dialogue</source>
          , pages
          <fpage>480</fpage>
          -
          <lpage>488</lpage>
          , Cham.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Tanja</given-names>
            <surname>Samardžić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yves</given-names>
            <surname>Scherrer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Elvira</given-names>
            <surname>Glaser</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>ArchiMob - a corpus of spoken swiss German</article-title>
          .
          <source>In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)</source>
          , pages
          <fpage>4061</fpage>
          -
          <lpage>4066</lpage>
          , Portorož, Slovenia. European Language Resources Association (ELRA).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Michael</given-names>
            <surname>Stadtschnitzer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christoph</given-names>
            <surname>Schmidt</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Data-driven pronunciation modeling of swiss German dialectal speech for automatic speech recognition</article-title>
          .
          <source>In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC</source>
          <year>2018</year>
          ), Miyazaki, Japan. European Language Resources Association (ELRA).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>