                 LTL-UDE at Low-Resource Speech-to-Text Shared Task:
                Investigating Mozilla DeepSpeech in a low-resource setting

                                              Aashish Agarwal and Torsten Zesch
                                                   Language Technology Lab
                                                  University of Duisburg-Essen
                                                      Duisburg, Germany



                         Abstract

We describe our system participating in the SwissText/KONVENS shared task on low-resource speech-to-text (Plüss et al., 2020). We train an end-to-end neural model based on Mozilla DeepSpeech. We examine various methods to improve over the baseline results: transfer learning from standard German and English, data augmentation, and post-processing. Our best system achieves a somewhat disappointing WER of 58.9% on the held-out test set, indicating that it is currently challenging to obtain good results with this approach in a low-resource setting.

    Language       Dataset                 Size [h]
    Swiss German   SwissText Shared Task         70
                   ArchiMob                      57
    German         Voxforge                      57
                   TUDA-De                      184
                   M-AILABS                     233
                   MCV_v4                       454
    English        LibriSpeech                1,000
                   MCV                        1,488

    Table 1: Dataset overview

1 Introduction

Recently, end-to-end models like DeepSpeech[1] have been introduced as an alternative to traditional HMM-DNN based models like Kaldi (Povey et al., 2011). However, they are relatively data hungry, i.e. they require large amounts of annotated data to work well. For example, the original DeepSpeech implementation from Baidu (Hannun et al., 2014) was trained on 7,380 hours of data, DeepSpeech2 (Amodei et al., 2015) was trained on 11,940 hours of data, and DeepSpeech3 (Battenberg et al., 2017) was trained on about 10,000 hours of data. Such large datasets are usually only available for languages like English or Mandarin, but even for major languages like German much less data is available, and consequently DeepSpeech models do not perform well (Agarwal and Zesch, 2019).

In this paper, we examine how well DeepSpeech performs in a truly low-resource setting like Swiss German, where less than 100 hours of annotated data are available. Previous speech recognition systems for Swiss German (Garner et al., 2014; Stadtschnitzer and Schmidt, 2018) are based on Kaldi.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

[1] https://github.com/mozilla/DeepSpeech

2 Model Training

We used DeepSpeech version 0.6.0 for all experiments.[2]

[2] https://github.com/mozilla/DeepSpeech/releases/tag/v0.6.0

2.1 Datasets

To train the Swiss German DeepSpeech model, we utilized the publicly available datasets shown in Table 1.

For Swiss German we used the official data provided by the shared task (Plüss et al., 2020). The corpus contains 70 hours of spoken Swiss German (predominantly in the Bernese dialect) and some Standard German speech from the parliament of the canton of Bern.[3] We additionally use the ArchiMob corpus (Samardžić et al., 2016), which represents German linguistic varieties spoken within the territory of Switzerland and contains long samples of transcribed text in Swiss German. The corpus contains 57 hours and is available under Creative Commons licence 4.0.[4]

[3] https://swisstext-and-konvens-2020.org/low-resource-speech-to-text/
[4] https://www.spur.uzh.ch/en/departments/research/textgroup/ArchiMob.html
As the amount of data is probably not sufficient to train a good model, we will experiment with transfer learning from standard German. Publicly available datasets include Voxforge[5], TUDA-De (Milde and Köhn, 2018), M-AILabs[6], and Mozilla Common Voice (Ardila et al., 2019). Together those datasets add almost 1,000 hours of additional training data (although in the wrong German dialect). The datasets also do not contain political speeches and are thus a less than ideal starting point for transfer learning.

As there has been previous work on transferring models starting from a different language (Kunze et al., 2017; Bansal et al., 2018), we also consider English corpora: LibriSpeech (Panayotov et al., 2015) and Mozilla Common Voice.[7] These are among the largest and most widely used open-source corpora. LibriSpeech consists of 16 kHz read English speech derived from audiobooks from the LibriVox project and has been carefully segmented and aligned.[8] The Mozilla Common Voice project, on the other hand, employs crowdsourcing to collect data on its portal.

[5] http://www.voxforge.org/home/forums/other-languages/german/open-speech-data-corpus-for-german
[6] https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/
[7] https://voice.mozilla.org/en
[8] http://www.openslr.org/12/
2.2 Server & Runtime

We trained and tested our models on a compute server with 56 Intel(R) Xeon(R) Gold 5120 CPUs @ 2.20 GHz and 3 Nvidia Quadro RTX 6000 GPUs with 24 GB of RAM each. Typical training time with augmentation is 1.5 hours for the SwissText dataset, 12 hours for German, and 30 hours for English. Without augmentation, training time was approximately 10% lower.
2.3 Preprocessing

We cleaned the data by keeping only the set of characters allowed by the shared task. We converted all transcriptions to lower case and further ensured that all audio clips are in wav format. The resulting samples were split into training (70%), validation (15%), and test (15%) sets. The preprocessing scripts are available on GitHub.[9]

[9] https://github.com/AASHISHAG/deepspeech-swiss-german
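To make the preprocessing concrete, the following minimal Python sketch lower-cases transcripts, filters them against an allowed character set, and performs the 70/15/15 split. The character set and file name are illustrative assumptions, not the shared task's exact specification (see the GitHub scripts for the actual pipeline).

    import random

    # Illustrative allowed character set (an assumption; the shared task
    # defines the authoritative list).
    ALLOWED = set("abcdefghijklmnopqrstuvwxyzäöü ")

    def clean(transcript: str) -> str:
        """Lower-case the transcript and drop characters outside the allowed set."""
        return "".join(ch for ch in transcript.lower() if ch in ALLOWED)

    def split_samples(samples, seed=42):
        """Shuffle and split into 70% training, 15% validation, 15% test."""
        samples = list(samples)
        random.Random(seed).shuffle(samples)
        n_train = int(0.70 * len(samples))
        n_dev = int(0.15 * len(samples))
        return (samples[:n_train],
                samples[n_train:n_train + n_dev],
                samples[n_train + n_dev:])

    # Each sample pairs a wav clip with its cleaned transcript.
    samples = [("clip_0001.wav", clean("Ich habe diese nicht gefunden!"))]
    train, dev, test = split_samples(samples)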
2.4 Hyperparameters

For the acoustic model, we use the best hyperparameters as reported by Agarwal and Zesch (2019), listed in Table 2.

We use a probabilistic 3-gram language model based on KenLM (Heafield, 2011), trained on the German-English part of Europarl[10] as well as the corpus used to train the TUDA-De language model (Radeck-Arneth et al., 2015). For German, we searched for a good set of α and β values and obtained the best results with the ones listed in Table 2. For English, we took the values of α and β from the DeepSpeech release page.[11]

    Hyperparameter    Value
    Batch Size           24
    Dropout            0.25
    Learning Rate    0.0001

    English
    α                  0.75
    β                  1.85

    German
    α                  0.40
    β                  1.10

    Table 2: Hyperparameters used in the experiments

[10] https://www.statmt.org/europarl/
[11] https://github.com/mozilla/DeepSpeech/releases/tag/v0.6.0
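The search for good German values of α and β can be pictured as a simple grid search over the decoder weights. The sketch below is hypothetical: dev_wer() stands in for re-decoding the development set with the given weights and measuring the word error rate.

    import itertools

    def dev_wer(alpha: float, beta: float) -> float:
        # Placeholder objective so the sketch runs stand-alone; in practice
        # this would re-decode the dev set with the given weights and return
        # the measured WER. The synthetic minimum sits at the German values
        # from Table 2 (alpha = 0.40, beta = 1.10).
        return (alpha - 0.40) ** 2 + (beta - 1.10) ** 2

    # Coarse grid over plausible decoder weights.
    alphas = [round(0.05 * i, 2) for i in range(41)]  # 0.00 .. 2.00
    betas = [round(0.05 * i, 2) for i in range(61)]   # 0.00 .. 3.00
    best_alpha, best_beta = min(itertools.product(alphas, betas),
                                key=lambda ab: dev_wer(*ab))
    print(f"best alpha={best_alpha:.2f}, beta={best_beta:.2f}")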
3 Experiments

As the baseline model, we train DeepSpeech with the setup described above, using only the Swiss German data provided by the shared task. The model achieves a WER of 71.5%. As expected, DeepSpeech is not able to simply train a suitable model based on this amount of training data.

We try to improve over those results using data augmentation and transfer learning, as discussed in the remainder of this section.

3.1 Data Augmentation

Augmentation is a useful technique for better generalization of machine learning models. Inspired by Park et al. (2019), Mozilla DeepSpeech implements several augmentation techniques such as frequency masking, time masking, speed scaling, and pitch scaling. We used all augmentation approaches with their default hyperparameters.[12] Augmentation actually increases the model error from 71.5% to 74.3%. However, we further test the impact of augmentation in our transfer learning experiments discussed below.

[12] https://deepspeech.readthedocs.io/en/v0.7.0/TRAINING.html#training-with-augmentation
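To illustrate the masking-based augmentations named above, the sketch below applies SpecAugment-style frequency and time masks to a spectrogram. DeepSpeech's own implementation differs in detail, so this is a conceptual example only.

    import numpy as np

    def mask_spectrogram(spec, n_freq_masks=2, n_time_masks=2,
                         max_width=20, rng=None):
        """Zero out random frequency bands and time spans of a
        (time_steps, freq_bins) spectrogram."""
        rng = rng or np.random.default_rng()
        out = spec.copy()
        t, f = out.shape
        for _ in range(n_freq_masks):  # frequency masking
            w = int(rng.integers(1, max_width + 1))
            f0 = int(rng.integers(0, max(1, f - w)))
            out[:, f0:f0 + w] = 0.0
        for _ in range(n_time_masks):  # time masking
            w = int(rng.integers(1, max_width + 1))
            t0 = int(rng.integers(0, max(1, t - w)))
            out[t0:t0 + w, :] = 0.0
        return out

    augmented = mask_spectrogram(np.random.rand(200, 80))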
3.2 Transfer Learning

As discussed above, end-to-end training of automated speech recognition systems requires massive amounts of data. As we only have 70 hours of training data available from the shared task, we experiment with transferring the model from different starting points. Table 3 gives an overview of the results. Transferring from about 2,500 hours of English data gives about the same results as starting from about 1,000 hours of German data, even though Standard German is closer to Swiss German than English. However, the best results are achieved when starting with English, transferring to German, and then transferring to Swiss German. Data augmentation in this case improves results a bit, for a final WER of 61.5%.

    Train                      Test        WER w/o aug.   WER w/ aug.
    SwissText                  SwissText           71.5          74.3
    Swiss → Swiss              SwissText           70.7          69.0
    German → Swiss             SwissText           63.5          63.1
    English → Swiss            SwissText           64.1          64.4
    English → German → Swiss   SwissText           61.0          61.5

    Table 3: Transfer learning results (on public data)
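Conceptually, each transfer step reuses all weights of the source-language model except the final, alphabet-dependent output layer, which is re-initialized for the target alphabet before fine-tuning. The TensorFlow sketch below illustrates the idea; the checkpoint path and the output-layer name are assumptions for illustration, not the exact DeepSpeech variable names.

    import tensorflow as tf

    SOURCE_CKPT = "checkpoints/german/best_dev_checkpoint"  # hypothetical path
    OUTPUT_LAYER_PREFIX = "layer_6"  # assumed name of the alphabet-dependent layer

    # Read the source-language checkpoint and keep every variable except the
    # output layer; the kept weights would initialize the Swiss German model
    # before fine-tuning on the 70 hours of shared-task data.
    reader = tf.train.load_checkpoint(SOURCE_CKPT)
    transferable = {
        name: reader.get_tensor(name)
        for name in reader.get_variable_to_shape_map()
        if not name.startswith(OUTPUT_LAYER_PREFIX)
    }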
4 Error Analysis

When analyzing the errors made by DeepSpeech, one issue stands out: truncated output. Quite a lot of output texts are much shorter than the source transcript; Table 4 shows some examples. The performance of the model is seriously impacted by not producing sufficiently long output sentences. It is thus informative to look only at output texts that are about the same length as the original transcript. Figure 1 displays the distribution of the ratio of output sample length to source sample length (in characters). The figure shows that almost all DeepSpeech outputs are shorter than the original. If we only look at the samples that are about the expected size (with a ratio higher than 0.75, which still covers about half of all samples), WER improves from 61.5% to 47.7%. This means that when the model outputs a string of approximately the correct length, it is actually much better than the results in Table 3 indicate.

    WER    Example
           src: ich habe diese nicht gefunden
    1.00   def: es handelt
    0.80   opt: der handel nicht

           src: lohnobergrenze für staatsbetriebe
    1.00   def: der gefürsteten
    0.67   opt: bergen für staatsbetrieben

           src: er ist ein erfahrener grossrat
    1.00   def: die songs
    0.53   opt: die so ein grossrat

           src: ich überlege mir jetzt folgendes
    1.00   def: es bereitet
    0.40   opt: ich überlege ich jetzt wenn

           src: sie sehen die gleichstellung ist leider noch gar nicht erreicht
    0.90   def: die stellung scheidungen
    0.60   opt: die stellung schleid und noch gar nicht ein

    Table 4: Examples of truncated output with default (def) hyperparameters that improve when optimized (opt)

    [Figure 1: Distribution of output sample length to original sample length (in characters)]
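The length-ratio analysis can be reproduced along the following lines. The sketch assumes (reference, hypothesis) transcript pairs and computes a corpus-level WER over the pairs whose character-length ratio exceeds the threshold.

    def word_errors(ref, hyp):
        """Word-level edit distance between reference and hypothesis."""
        r, h = ref.split(), hyp.split()
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i
        for j in range(len(h) + 1):
            d[0][j] = j
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                sub = 0 if r[i - 1] == h[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                              d[i - 1][j - 1] + sub)
        return d[len(r)][len(h)]

    def filtered_wer(pairs, min_ratio=0.75):
        """Corpus WER over pairs with output/reference length ratio >= min_ratio."""
        kept = [(r, h) for r, h in pairs if len(h) / max(1, len(r)) >= min_ratio]
        errors = sum(word_errors(r, h) for r, h in kept)
        words = sum(len(r.split()) for r, _ in kept)
        return errors / max(1, words)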
The length of the output is partly controlled by the model's hyperparameters. We want to find a sequence c that maximizes the combined objective function

    Q(c) = log(P(c|x)) + α log(P_lm(c)) + β wordcount(c)

where α and β control the trade-off between the acoustic model, the language model constraint, and the length of the sentence. The term P_lm denotes the probability of the sequence c according to the language model. The weight α balances the relative contributions of the CTC network and the language model, and the weight β controls the number of words in the recognized transcription (Hannun et al., 2014; Amodei et al., 2015).
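Written out in code, the objective combines the three terms directly. The function below is a direct transcription of the formula; the acoustic log-probability and the language-model probability are assumed to come from the CTC network and the KenLM model, respectively.

    import math

    def combined_score(acoustic_logprob, lm_prob, candidate, alpha, beta):
        """Q(c) = log P(c|x) + alpha * log P_lm(c) + beta * wordcount(c)."""
        return (acoustic_logprob
                + alpha * math.log(lm_prob)
                + beta * len(candidate.split()))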
By changing the relative weight of the acoustic model and the language model, i.e. by optimizing α and β, we can improve the model a bit, as the optimized (opt) examples in Table 4 show. However, we were not able to eliminate the problem altogether.
Consequently, WER only improves from 61.5% to 57.1% (with augmentation); Table 5 shows the transfer learning results with optimized hyperparameters. As the model with optimized hyperparameters but without augmentation is still a bit better, we submitted that one to the shared task. It achieved a WER of 58.9% on the held-out test set.

    Train                      Test        WER w/o aug.   WER w/ aug.
    SwissText                  SwissText           70.2          69.6
    Swiss → Swiss              SwissText           67.9          68.6
    German → Swiss             SwissText           59.4          59.5
    English → Swiss            SwissText           60.1          59.1
    English → German → Swiss   SwissText           56.6          57.1

    Table 5: Transfer learning results (on public data) with optimized hyperparameters
5 Summary

The baseline system trained only on the Swiss German data yields a quite high word error rate of 71.5%. The data augmentation strategies implemented in DeepSpeech did not result in consistent improvements. Transfer learning has a much higher impact, reducing the word error rate by over 10 percentage points when transferring an English model to German and finally to Swiss German. The best model yields a WER of 56.6% on our test set (58.9% in the public ranking based on the hidden test set of the shared task). When analyzing the results, the model seems to suffer from truncated output, which we can somewhat mitigate by hyperparameter tuning. Overall, the results show that training an end-to-end neural speech recognition system with DeepSpeech in a low-resource setting remains challenging.
References

Aashish Agarwal and Torsten Zesch. 2019. German end-to-end speech recognition based on DeepSpeech. In Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019), pages 111–119, Erlangen, Germany. GSCL.

Dario Amodei, Rishita Anubhai, Eric Battenberg, and Carl Case. 2015. Deep Speech 2: End-to-end speech recognition in English and Mandarin. CoRR, abs/1512.02595.

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. 2019. Common Voice: A massively-multilingual speech corpus.

Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, and Sharon Goldwater. 2018. Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. CoRR, abs/1809.01431.

Eric Battenberg, Jitong Chen, Rewon Child, Adam Coates, Yashesh Gaur, Yi Li, Hairong Liu, Sanjeev Satheesh, David Seetapun, Anuroop Sriram, and Zhenyao Zhu. 2017. Exploring neural transducers for end-to-end speech recognition. CoRR, abs/1707.07413.

Philip N. Garner, David Imseng, and Thomas Meyer. 2014. Automatic speech recognition and translation of a Swiss German dialect: Walliserdeutsch.

Awni Y. Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. 2014. Deep Speech: Scaling up end-to-end speech recognition. CoRR, abs/1412.5567.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland.

Julius Kunze, Louis Kirsch, Ilia Kurenkov, Andreas Krug, Jens Johannsmeier, and Sebastian Stober. 2017. Transfer learning for speech recognition on a budget. CoRR, abs/1706.00290.

Benjamin Milde and Arne Köhn. 2018. Open source automatic speech recognition for German. CoRR, abs/1807.10311.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. LibriSpeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210.

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. Interspeech 2019.

Michel Plüss, Lukas Neukom, and Manfred Vogel. 2020. GermEval 2020 Task 4: Low-resource speech-to-text. In preparation.

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely. 2011. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.

Stephan Radeck-Arneth, Benjamin Milde, Arvid Lange, Evandro Gouvêa, Stefan Radomski, Max Mühlhäuser, and Chris Biemann. 2015. Open source German distant speech recognition: Corpus and acoustic model. In Text, Speech, and Dialogue, pages 480–488, Cham.

Tanja Samardžić, Yves Scherrer, and Elvira Glaser. 2016. ArchiMob - a corpus of spoken Swiss German. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4061–4066, Portorož, Slovenia. European Language Resources Association (ELRA).

Michael Stadtschnitzer and Christoph Schmidt. 2018. Data-driven pronunciation modeling of Swiss German dialectal speech for automatic speech recognition. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).