=Paper=
{{Paper
|id=Vol-2624/germeval-task4-paper2
|storemode=property
|title=LTL-UDE at Low-Resource Speech-to-Text Shared Task: Investigating Mozilla DeepSpeech in a low-resource setting
|pdfUrl=https://ceur-ws.org/Vol-2624/germeval-task4-paper2.pdf
|volume=Vol-2624
|authors=Aashish Agarwal,Torsten Zesch
|dblpUrl=https://dblp.org/rec/conf/swisstext/AgarwalZ20
}}
==LTL-UDE at Low-Resource Speech-to-Text Shared Task: Investigating Mozilla DeepSpeech in a low-resource setting==
Aashish Agarwal and Torsten Zesch
Language Technology Lab
University of Duisburg-Essen
Duisburg, Germany
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

We describe our system participating in the SwissText/KONVENS shared task on low-resource speech-to-text (Plüss et al., 2020). We train an end-to-end neural model based on Mozilla DeepSpeech. We examine various methods to improve over the baseline results: transfer learning from standard German and English, data augmentation, and post-processing. Our best system achieves a somewhat disappointing WER of 58.9% on the held-out test set, indicating that it is currently challenging to obtain good results with this approach in a low-resource setting.

1 Introduction

Recently, end-to-end models like DeepSpeech (https://github.com/mozilla/DeepSpeech) have been introduced as an alternative to traditional HMM-DNN based models like Kaldi (Povey et al., 2011). However, they are relatively data-hungry, i.e., they require large amounts of annotated data to work well. For example, the original DeepSpeech implementation from Baidu (Hannun et al., 2014) was trained on 7,380 hours of data, DeepSpeech2 (Amodei et al., 2015) was trained on 11,940 hours of data, and DeepSpeech3 (Battenberg et al., 2017) was trained on about 10,000 hours of data. Such large datasets are usually only available for languages like English or Mandarin, but even for major languages like German much less data is available, and consequently DeepSpeech models do not perform well (Agarwal and Zesch, 2019).

In this paper, we examine how well DeepSpeech performs in a truly low-resource setting like Swiss German, where less than 100 hours of annotated data are available. Previous speech recognition systems for Swiss German (Garner et al., 2014; Stadtschnitzer and Schmidt, 2018) are based on Kaldi.

2 Model Training

We used DeepSpeech version 0.6.0 (https://github.com/mozilla/DeepSpeech/releases/tag/v0.6.0) for all experiments.

2.1 Datasets

To train the Swiss German DeepSpeech model, we utilized the publicly available datasets shown in Table 1.

Language       Dataset                 Size [h]
Swiss German   SwissText Shared Task   70
               ArchiMob                57
German         Voxforge                57
               TUDA-De                 184
               M-AILABS                233
               MCV_v4                  454
English        LibriSpeech             1,000
               MCV                     1,488

Table 1: Dataset overview

For Swiss German we used the official data provided by the shared task (Plüss et al., 2020; https://swisstext-and-konvens-2020.org/low-resource-speech-to-text/). The corpus contains 70 hours of spoken Swiss German (predominantly in the Bernese dialect) and some Standard German speech from the parliament of the canton of Bern. We additionally use the ArchiMob corpus (Samardžić et al., 2016), which represents German linguistic varieties spoken within the territory of Switzerland and contains long samples of transcribed text in Swiss German. The corpus contains 57 hours and is
available under Creative Commons licence 4.0 (https://www.spur.uzh.ch/en/departments/research/textgroup/ArchiMob.html).

As the amount of data is probably not sufficient to train a good model, we experiment with transfer learning from standard German. Publicly available datasets include Voxforge (http://www.voxforge.org/home/forums/other-languages/german/open-speech-data-corpus-for-german), TUDA-De (Milde and Köhn, 2018), M-AILABS (https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/), and Mozilla Common Voice (Ardila et al., 2019). Together, these datasets add almost 1,000 hours of additional training data (although in the wrong German dialect). The datasets also do not contain political speeches and are thus a less than ideal starting point for transfer learning.

As there has been previous work on transferring models starting from a different language (Kunze et al., 2017; Bansal et al., 2018), we also consider English corpora: LibriSpeech (Panayotov et al., 2015; http://www.openslr.org/12/) and Mozilla Common Voice (https://voice.mozilla.org/en). These are among the largest and most widely used open-source corpora. LibriSpeech consists of 16kHz read English speech derived from audiobooks from the LibriVox project and has been carefully segmented and aligned. The Mozilla Common Voice project, on the other hand, employs crowdsourcing to collect data on its portal.

2.2 Server & Runtime

We trained and tested our models on a compute server with 56 Intel(R) Xeon(R) Gold 5120 CPUs @ 2.20GHz and 3 Nvidia Quadro RTX 6000 GPUs with 24GB of RAM each. Typical training time with augmentation is 1.5 hours for the SwissText dataset, 12 hours for German, and 30 hours for English. Without augmentation, training time was approximately 10% less.

2.3 Preprocessing

We cleaned the data by using only the set of characters allowed by the shared task. We converted all transcriptions to lower case and further ensured that all audio clips are in wav format. The resulting samples were split into training (70%), validation (15%), and test data (15%). The preprocessing scripts are available on GitHub (https://github.com/AASHISHAG/deepspeech-swiss-german).
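The following is a minimal sketch of this preprocessing step, assuming a hypothetical allowed character set and the three-column CSV layout (wav_filename, wav_filesize, transcript) that DeepSpeech's training scripts consume; the actual scripts in the repository linked above are the authoritative version.

    import csv
    import random

    # Hypothetical character set; the shared task defines the exact allowed characters.
    ALLOWED = set("abcdefghijklmnopqrstuvwxyzäöüß '")

    def clean_transcript(text: str) -> str:
        """Lower-case the transcript and drop all characters outside the allowed set."""
        return "".join(c for c in text.lower() if c in ALLOWED)

    def split_samples(samples, train=0.70, dev=0.15, seed=42):
        """Shuffle and split samples into train/validation/test portions (70/15/15)."""
        random.Random(seed).shuffle(samples)
        n_train = int(len(samples) * train)
        n_dev = int(len(samples) * dev)
        return (samples[:n_train],
                samples[n_train:n_train + n_dev],
                samples[n_train + n_dev:])

    def write_csv(path, rows):
        """Write (wav_filename, wav_filesize, transcript) rows in DeepSpeech CSV format."""
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["wav_filename", "wav_filesize", "transcript"])
            writer.writerows(rows)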
Hyperparameter    Value
Batch Size        24
Dropout           0.25
Learning Rate     0.0001
English
  α               0.75
  β               1.85
German
  α               0.40
  β               1.10

Table 2: Hyperparameters used in the experiments

2.4 Hyperparameters

For the acoustic model, we use the best hyperparameters as reported by Agarwal and Zesch (2019), listed in Table 2.

We use a probabilistic 3-gram language model based on KenLM (Heafield, 2011), trained on the German-English part of Europarl (https://www.statmt.org/europarl/) as well as the corpus used to train the TUDA-De language model (Radeck-Arneth et al., 2015). For German, we searched for a good set of values for α and β and got the best results with the ones listed in Table 2. For English, we took the values of α and β from the DeepSpeech release page (https://github.com/mozilla/DeepSpeech/releases/tag/v0.6.0).
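A 3-gram KenLM model of this kind could be built roughly as follows; the corpus path is hypothetical, and lmplz/build_binary are KenLM's standard estimation and binarization tools (DeepSpeech 0.6 additionally expects a trie built with its generate_trie tool).

    import subprocess

    # Hypothetical input: Europarl + TUDA-De text, one sentence per line.
    CORPUS, ARPA, BINARY = "lm_corpus.txt", "lm.arpa", "lm.binary"

    # Estimate a 3-gram language model with KenLM's lmplz
    # (reads the corpus from stdin, writes the ARPA file to stdout).
    with open(CORPUS) as src, open(ARPA, "w") as dst:
        subprocess.run(["lmplz", "-o", "3"], stdin=src, stdout=dst, check=True)

    # Convert the ARPA file into KenLM's compact binary format for fast queries.
    subprocess.run(["build_binary", ARPA, BINARY], check=True)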
3 Experiments

As the baseline model, we train DeepSpeech with the setup described above, using only the Swiss German data provided by the shared task. The model achieved a WER of 71.5%. As expected, DeepSpeech is not able to simply train a suitable model based on this amount of training data. We try to improve over this result using data augmentation and transfer learning, as discussed in the remainder of this section.

3.1 Data Augmentation

Augmentation is a useful technique for better generalization of machine learning models. Inspired by Park et al. (2019), Mozilla DeepSpeech implements several augmentation techniques like frequency masking, time masking, speed scaling, and pitch scaling. We used all the augmentation approaches with their default hyperparameters, as documented in the DeepSpeech training documentation (https://deepspeech.readthedocs.io/en/v0.7.0/TRAINING.html#training-with-augmentation). Augmentation actually increases model error from 71.5% to 74.3%. However, we further test the impact of augmentation in our transfer learning experiments discussed below.
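DeepSpeech applies these augmentations internally during feature extraction; a minimal numpy sketch of the two masking operations (in the spirit of SpecAugment, with hypothetical mask widths) looks like this:

    import numpy as np

    def freq_mask(spec: np.ndarray, max_width: int = 20) -> np.ndarray:
        """Zero out a random band of frequency channels (rows) of a spectrogram."""
        width = np.random.randint(0, max_width + 1)
        start = np.random.randint(0, max(1, spec.shape[0] - width))
        out = spec.copy()
        out[start:start + width, :] = 0.0
        return out

    def time_mask(spec: np.ndarray, max_width: int = 100) -> np.ndarray:
        """Zero out a random span of time steps (columns) of a spectrogram."""
        width = np.random.randint(0, max_width + 1)
        start = np.random.randint(0, max(1, spec.shape[1] - width))
        out = spec.copy()
        out[:, start:start + width] = 0.0
        return out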
3.2 Transfer Learning

As we have discussed above, end-to-end training of automated speech recognition systems requires massive amounts of data. As we only have 70 hours of training data available from the shared task, we experiment with transferring the model from different starting points. Table 3 gives an overview of the results.

Train                      Test        WER w/o aug.   WER w/ aug.
SwissText                  SwissText   71.5           74.3
Swiss → Swiss              SwissText   70.7           69.0
German → Swiss             SwissText   63.5           63.1
English → Swiss            SwissText   64.1           64.4
English → German → Swiss   SwissText   61.0           61.5

Table 3: Transfer learning results (on public data)

Transferring from about 2,500 hours of English data gives about the same results as starting from about 1,000 hours of German data, even though standard German is closer to Swiss German than English. However, the best results are achieved when starting with English, transferring to German, and then transferring to Swiss German. Data augmentation in this case improves results a bit, for a final WER of 61.5%.
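A fine-tuning run of this kind could be launched roughly as follows. This is a sketch only: the paths are hypothetical, the flag names follow the DeepSpeech 0.6-era training script, the hyperparameter values are those from Table 2, and transferring across languages with differing alphabets needed support beyond plain checkpoint loading.

    import subprocess

    # Illustrative fine-tuning run: continue training from an existing checkpoint
    # directory (here, a standard German model) on the Swiss German CSVs
    # produced by the preprocessing step.
    subprocess.run([
        "python3", "DeepSpeech.py",
        "--checkpoint_dir", "checkpoints/german",   # start from the source model
        "--train_files", "swiss/train.csv",
        "--dev_files", "swiss/dev.csv",
        "--test_files", "swiss/test.csv",
        "--alphabet_config_path", "alphabet.txt",
        "--lm_binary_path", "lm.binary",
        "--lm_trie_path", "trie",
        "--train_batch_size", "24",
        "--dropout_rate", "0.25",
        "--learning_rate", "0.0001",
    ], check=True)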
4 Error Analysis

When analyzing the errors made by DeepSpeech, one issue stands out: truncated output. Quite a lot of output texts are much shorter than the source transcript. Table 4 shows some examples. The performance of the model is seriously impacted by not producing long enough output sentences.

WER    Example
       src: ich habe diese nicht gefunden
1.00   def: es handelt
0.80   opt: der handel nicht
       src: lohnobergrenze für staatsbetriebe
1.00   def: der gefürsteten
0.67   opt: bergen für staatsbetrieben
       src: er ist ein erfahrener grossrat
1.00   def: die songs
0.53   opt: die so ein grossrat
       src: ich überlege mir jetzt folgendes
1.00   def: es bereitet
0.40   opt: ich überlege ich jetzt wenn
       src: sie sehen die gleichstellung ist leider noch gar nicht erreicht
0.90   def: die stellung scheidungen
0.60   opt: die stellung schleid und noch gar nicht ein

Table 4: Examples of truncated output with default (def) hyperparameters that improve when optimized (opt)

It might be informative to only look at output text that is about the same length as the original transcript. Figure 1 displays the distribution of samples with a certain ratio of output sample length to source sample length in characters.

[Figure 1: Distribution of output sample length to original sample length (in characters)]

The figure shows that almost all DeepSpeech outputs are shorter than the original. If we only look at the samples that are about the expected size (with a ratio higher than 0.75, which is still about half of all samples), we find that WER improves from 61.5% to 47.7%. This means that when the model outputs a string of approximately the correct length, it is actually much better than the results in Table 3 indicate.
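This analysis amounts to filtering hypothesis/reference pairs by character-length ratio and recomputing WER on the remainder; a self-contained sketch (with a standard word-level edit-distance WER, not the shared task's official scorer) could look like this:

    def wer(ref: str, hyp: str) -> float:
        """Word error rate: word-level Levenshtein distance over reference length."""
        r, h = ref.split(), hyp.split()
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i
        for j in range(len(h) + 1):
            d[0][j] = j
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                cost = 0 if r[i - 1] == h[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(r)][len(h)] / max(1, len(r))

    def mean_wer_above_ratio(pairs, threshold=0.75):
        """Average WER over (reference, hypothesis) pairs whose character-length
        ratio len(hyp)/len(ref) exceeds the threshold (0.75 in our analysis)."""
        kept = [(ref, hyp) for ref, hyp in pairs
                if len(hyp) / max(1, len(ref)) > threshold]
        return sum(wer(ref, hyp) for ref, hyp in kept) / max(1, len(kept))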
The length of the output is partly controlled by the model's hyperparameters. We want to find a sequence c that maximizes the combined objective function

    Q(c) = log(P(c|x)) + α · log(P_lm(c)) + β · wordcount(c)

where α and β control the trade-off between the acoustic model, the language model constraint, and the length of the sentence. The term P_lm denotes the probability of the sequence c according to the language model. The weight α constrains the relative contributions of the CTC network and the language model, and the weight β controls the influence of the number of words in the recognized transcription (Hannun et al., 2014; Amodei et al., 2015).
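For a single candidate, this objective reduces to a weighted sum; the sketch below scores one hypothesis under made-up inputs (in practice the beam-search decoder applies these terms incrementally while expanding candidates):

    import math

    def combined_score(log_p_acoustic: float, p_lm: float, n_words: int,
                       alpha: float, beta: float) -> float:
        """Q(c) = log P(c|x) + alpha * log P_lm(c) + beta * wordcount(c)."""
        return log_p_acoustic + alpha * math.log(p_lm) + beta * n_words

    # Example with the German decoder weights from Table 2 (alpha=0.40, beta=1.10)
    # and hypothetical acoustic/language-model scores for a 4-word candidate.
    print(combined_score(log_p_acoustic=-12.3, p_lm=1e-6, n_words=4,
                         alpha=0.40, beta=1.10))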
By changing the relative weight of the acoustic model and the language model, i.e., by optimizing α and β, we can improve the model a bit, as shown in the optimized model examples in Table 4. However, we were not able to eliminate the problem altogether.
Train                      Test        WER w/o aug.   WER w/ aug.
SwissText                  SwissText   70.2           69.6
Swiss → Swiss              SwissText   67.9           68.6
German → Swiss             SwissText   59.4           59.5
English → Swiss            SwissText   60.1           59.1
English → German → Swiss   SwissText   56.6           57.1

Table 5: Transfer learning results (on public data) with optimized hyperparameters

Consequently, WER only improves from 61.5% to 57.1% (with augmentation), as shown in Table 5. As the model with the optimized hyperparameters and without augmentation is still a bit better, we submitted that one to the shared task. It achieved a WER of 58.9% on the held-out test set.
5 Summary

The baseline system trained only on the Swiss German data yields a quite high word error rate of 71.5%. Data augmentation strategies implemented in DeepSpeech did not result in consistent improvements. Transfer learning has a much higher impact, reducing the word error rate by over 10 percentage points when transferring an English model to German and finally transferring to Swiss German. The best model yields a WER of 56.6% on our test set (58.9% in the public ranking based on the hidden test set of the shared task). When analyzing the results, the model seems to suffer from truncated output, which we can somewhat improve by hyperparameter tuning. Overall, the results show that training an end-to-end neural speech recognition system with DeepSpeech in a low-resource setting remains challenging.
References

Aashish Agarwal and Torsten Zesch. 2019. German end-to-end speech recognition based on DeepSpeech. In Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019), pages 111–119, Erlangen, Germany. GSCL.

Dario Amodei, Rishita Anubhai, Eric Battenberg, and Carl Case. 2015. Deep Speech 2: End-to-end speech recognition in English and Mandarin. CoRR, abs/1512.02595.

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. 2019. Common Voice: A massively-multilingual speech corpus.

Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, and Sharon Goldwater. 2018. Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. CoRR, abs/1809.01431.

Eric Battenberg, Jitong Chen, Rewon Child, Adam Coates, Yashesh Gaur, Yi Li, Hairong Liu, Sanjeev Satheesh, David Seetapun, Anuroop Sriram, and Zhenyao Zhu. 2017. Exploring neural transducers for end-to-end speech recognition. CoRR, abs/1707.07413.

Philip N. Garner, David Imseng, and Thomas Meyer. 2014. Automatic speech recognition and translation of a Swiss German dialect: Walliserdeutsch.

Awni Y. Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. 2014. Deep Speech: Scaling up end-to-end speech recognition. CoRR, abs/1412.5567.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland.

Julius Kunze, Louis Kirsch, Ilia Kurenkov, Andreas Krug, Jens Johannsmeier, and Sebastian Stober. 2017. Transfer learning for speech recognition on a budget. CoRR, abs/1706.00290.

Benjamin Milde and Arne Köhn. 2018. Open source automatic speech recognition for German. CoRR, abs/1807.10311.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. LibriSpeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210.

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. Interspeech 2019.

Michel Plüss, Lukas Neukom, and Manfred Vogel. 2020. GermEval 2020 Task 4: Low-resource speech-to-text. In preparation.

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely. 2011. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.

Stephan Radeck-Arneth, Benjamin Milde, Arvid Lange, Evandro Gouvêa, Stefan Radomski, Max Mühlhäuser, and Chris Biemann. 2015. Open source German distant speech recognition: Corpus and acoustic model. In Text, Speech, and Dialogue, pages 480–488, Cham.

Tanja Samardžić, Yves Scherrer, and Elvira Glaser. 2016. ArchiMob – a corpus of spoken Swiss German. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4061–4066, Portorož, Slovenia. European Language Resources Association (ELRA).

Michael Stadtschnitzer and Christoph Schmidt. 2018. Data-driven pronunciation modeling of Swiss German dialectal speech for automatic speech recognition. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).