=Paper=
{{Paper
|id=Vol-2624/germeval-task4-paper3
|storemode=property
|title=ZHAW-InIT at GermEval 2020 Task 4: Low-Resource Speech-to-Text
|pdfUrl=https://ceur-ws.org/Vol-2624/germeval-task4-paper3.pdf
|volume=Vol-2624
|authors=Matthias Büchi,Malgorzata Anna Ulasik,Manuela Hürlimann,Fernando Benites,Pius von Däniken,Mark Cieliebak
|dblpUrl=https://dblp.org/rec/conf/swisstext/BuchiUHBDC20
}}
==ZHAW-InIT at GermEval 2020 Task 4: Low-Resource Speech-to-Text==
Matthias Büchi, Malgorzata Anna Ulasik, Manuela Hürlimann,
Fernando Benites, Pius von Däniken, and Mark Cieliebak
Institute of Applied Information Technology
Zurich University of Applied Sciences
{buec, ulas, hueu, benf, vode, ciel}@zhaw.ch
Abstract

This paper presents the contribution of ZHAW-InIT to Task 4 "Low-Resource STT" at GermEval 2020. The goal of the task is to develop a system for translating Swiss German dialect speech into Standard German text in the domain of parliamentary debates. Our approach is based on Jasper, a CNN Acoustic Model, which we fine-tune on the task data. We enhance the base system with an extended Language Model containing in-domain data and with speed perturbation, and we run further experiments with post-processing. Our submission achieved first place with a final Word Error Rate of 40.29%.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Automatic Speech Recognition (ASR) is defined as mapping audio signals to text. A particular challenge for ASR arises if a language does not have a standardized writing system, as is the case for Swiss German. In German-speaking Switzerland, Swiss German is the default spoken language on most occasions, from formal to informal; however, the language of reading and writing is Standard German ("medial diglossia", Siebenhaar and Wyler (1997)). Swiss German is increasingly used for writing in informal contexts, especially on social media, but users usually write phonetically in their local dialect (Siebenhaar, 2013). The particular dialects of Swiss German differ from each other to such an extent that speakers of one dialect might even have difficulty understanding dialects from some other regions. An indirect consequence is that many dialects are considered low-resource, since there is not enough data per dialect for many natural language processing tasks. Nonetheless, there is enough data to train ASR systems for Standard German, which is spoken by a substantially larger group of native speakers, being an official language also in Germany and Austria. On official occasions, speeches are written down, transcribed, or logged in Standard German. Since the linguistic distance between the Swiss German dialects and the official language, Standard German, is quite large, this poses a task similar to Cross-Linguistic Speech-To-Text (CL-STT; also referred to as speech-to-text translation), which is a difficult interdisciplinary challenge combining STT with elements of Machine Translation (MT) (Bérard et al., 2016). Both fields have a long history of methods and approaches, which are currently converging thanks to the development of deep learning technology. This combination of ASR and MT is indeed needed in the context of Swiss German dialects, as speeches are paraphrased or even translated (see Section 6 for an example).

The Shared Task "Low-Resource STT" at GermEval 2020 aimed exactly at a specific Swiss case of CL-STT: translating Swiss German dialect spoken in an official context to written Standard German.

In our approach, we applied a general character-based ASR system (Li et al., 2019), pre-trained on a large German corpus, and fine-tuned it to the Shared Task data. We further enriched our Language Model with additional publicly available data.

2 Shared Task Description

The goal of this Shared Task was to develop a system for translating Swiss German dialect speech into Standard German text in the domain of parliamentary debates.
A data set of 36'572 utterances with a total duration of 69.8 hours was made available for training the systems, and a 4-hour test set was used for evaluating solutions. The training data consists of a set of recordings of debates held in the parliament of the canton of Bern, with utterances produced by 191 speakers. None of these 191 speakers occur in the test set. The audio recordings contain mostly Swiss German dialect speech, with the majority of the utterances spoken in Bernese dialect; however, there are also some recordings of Standard German speech as well as a few English utterances. Each utterance contains one sentence and has an average duration of 6.9 seconds.

All recordings have been manually transcribed into Standard German, while the alignment between audio and transcripts was performed automatically by the task organizers (Plüss et al., 2020b,a).

The transcript accuracy is measured with the Word Error Rate (WER), which is the standard ASR evaluation metric. It is computed as the sum of the number of insertions, deletions and substitutions between predicted and reference sentences, divided by the number of words in the reference (Zechner and Waibel, 2000). Selecting WER instead of the BLEU score, which is usually applied for the automatic evaluation of translations, is justified by the task organizers with the fact that the Swiss German spoken in the parliament is comparatively close to Standard German and the diversity of the possible correct translations is very limited. Prior to evaluation, the task organizers normalized both ground truth and transcribed utterances by lower-casing them and removing punctuation.
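Written as a formula, with S, D, and I the number of substitutions, deletions, and insertions in the prediction, and N the number of words in the reference:

```latex
\mathrm{WER} = \frac{S + D + I}{N}
```

For instance, a prediction with one substitution and one insertion against a ten-word reference yields a WER of 2/10 = 20%.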
3 Related Work

The most recent developments in both ASR and machine translation involve generalized methods that can be relatively easily ported across the two tasks, such as the encoder-decoder architecture. One of the most prominent, "Listen, Attend and Spell" (LAS) (Chan et al., 2016), uses an encoder-decoder architecture with attention and a pyramidal LSTM for the encoder. Chiu et al. (2018) describe improvements to LAS, such as multi-head attention, scheduled sampling, and label smoothing, which achieved new state-of-the-art performance, although only on proprietary voice search data. Other encoder-decoder models include the Neural Transducer (Jaitly et al., 2016), the Recurrent Neural Aligner (Sak et al., 2017), and models based on the Transformer architecture (Vaswani et al., 2017), as in Dong et al. (2018). Zeghidour et al. (2018) achieved state-of-the-art performance with an end-to-end system based on convolutional neural networks (CNNs). Their system can predict characters directly from raw waveforms, instead of the commonly used log-MEL features.

Li et al. (2019) propose a convolutional network with residual connections, with state-of-the-art results on the LibriSpeech and Wall Street Journal ASR data sets. The network predicts a character at each step (of 20 ms), and a Connectionist Temporal Classification (CTC) loss (Graves et al., 2006) is used for training. Beam search decoding allows the prediction to match a pre-trained Language Model. Preliminary work (Büchi, 2020) showed that this approach was much easier to adapt and train on a large German corpus than hybrid systems trained with Kaldi (Povey et al., 2011), which achieve similar results.

While usually tens of thousands of hours of audio are required for achieving state-of-the-art ASR performance, some approaches target languages where only a few hours of data are available (Samarakoon et al., 2018). The use of pre-training and transfer learning is especially helpful in such challenging setups (Stoian et al., 2019).

Although there are approaches which directly target the speech translation setup (Bérard et al., 2016; Jia et al., 2019b,a), and the Shared Task data consists of translations and paraphrases of the spoken utterances, we decided not to add an additional component dealing specifically with translation to our system, because of the lack of relevant available data.

4 System Description

This section describes the initial system used to establish a base for our experiments. Important concepts as well as parameters crucial for the experiments are explained.

4.1 Reference Text Pre-processing

We normalized all texts before training the Acoustic Models and Language Models. This step was necessary to have a standardized set of possible characters, which in this case were the letters a-z, ä, ö and ü. Normalization was performed in multiple steps, starting by lower-casing the whole
text and splitting it into sentences. All punctuation symbols were removed, except for points and commas, which might be used as decimal points or for ordinal numbers. Numbers were transliterated to words. Common abbreviations and symbols were replaced by their spoken form (e.g. "%" by "Prozent" or "kg" by "Kilogramm"). Letters with diacritics other than ä, ö, and ü were replaced by their counterparts without diacritics. Finally, any remaining unknown symbols were removed without replacement.
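The following sketch illustrates the shape of this pipeline in Python. The replacement table and the digit-by-digit number spelling are illustrative stand-ins rather than our exact rules, and sentence splitting is omitted:

```python
import re
import unicodedata

# Illustrative stand-ins; the real replacement tables were more extensive.
SPOKEN_FORMS = {"%": " prozent ", "kg": " kilogramm "}
DIGITS = {"0": "null", "1": "eins", "2": "zwei", "3": "drei", "4": "vier",
          "5": "fünf", "6": "sechs", "7": "sieben", "8": "acht", "9": "neun"}
ALLOWED = set("abcdefghijklmnopqrstuvwxyzäöü ")

def normalize(text: str) -> str:
    text = text.lower()
    for symbol, spoken in SPOKEN_FORMS.items():
        text = text.replace(symbol, spoken)
    # Toy digit-by-digit transliteration; the real pipeline spelled out full
    # numerals and treated points/commas as decimal or ordinal markers.
    text = re.sub(r"\d+", lambda m: " ".join(DIGITS[d] for d in m.group()), text)
    text = text.replace("ß", "ss")  # assumed mapping, not stated in the paper
    # Replace letters with diacritics other than ä, ö, ü by their base letter.
    chars = []
    for ch in text:
        if ch in "äöü":
            chars.append(ch)
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            chars.append("".join(c for c in decomposed
                                  if not unicodedata.combining(c)))
    # Remove any remaining unknown symbols without replacement.
    cleaned = "".join(ch for ch in "".join(chars) if ch in ALLOWED)
    return " ".join(cleaned.split())

# Example: normalize("Das Gewicht beträgt 5 kg.")
# -> "das gewicht beträgt fünf kilogramm"
```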
4.2 Acoustic Model

An Acoustic Model was used to predict linguistic units based on an audio signal. For this purpose, Jasper (Li et al., 2019), a DNN-based model, was applied. Jasper predicts a probability distribution over all possible characters at every time step, based on mel-filterbank features as input. The input was augmented with SpecAugment (Park et al., 2019).

The model consists of convolutional layers structured in blocks and sub-blocks. A model B × R is defined by the number of blocks B and the number of sub-blocks R. Every sub-block consists of a 1D-convolution, batch normalization, ReLU, and dropout. The input of each block is connected to the last sub-block by a residual connection. We applied the Dense Residual configuration, which is shown in Figure 1, where the output of each block is additionally added to the inputs of all following blocks. One additional block was used for pre-processing and three for post-processing.

[Figure 1: Visualization of the Jasper B × R Dense Residual model, from the Jasper GitHub repository (NVIDIA, 2020). It shows one pre-processing block, three post-processing blocks, and the intermediate blocks with residual connections.]

During training, the CTC loss (Graves et al., 2006) was minimized using the Novograd optimizer introduced in Li et al. (2019).

4.3 Decoding

In order to get transcriptions from the Acoustic Model output, beam search was applied. Beam search tries to find the most probable text sequence given the probabilities of characters over time. Additionally, a Language Model was used to re-rank the beam search hypotheses. A Language Model penalizes words that are not known and assigns a probability to each word given the words preceding it. The weight of the Language Model is controlled with a parameter α. A parameter β is used as a word insertion bonus to prevent a preference for long words. The Language Model we used was a 6-gram model trained with KenLM (Heafield, 2011).

4.4 Pre-training on Standard German

The Acoustic Model requires a large amount of data for training. Therefore, Standard German speech data as listed in Table 1 was used to create a pre-trained model (accessible through https://github.com/german-asr/megs). Based on the given data sets, a combined version was created. Training, development and test splits were kept if given in the original data sets. Otherwise, custom splits were created with a size of 15% for test and validation, but with a maximum of 15000 seconds.

For the size of the model, the configuration 10 × 5 was used. The model was trained with an initial learning rate of 0.015 on batches of size 64 for a total of 100 epochs.

4.5 Fine-Tuning

The pre-trained model was used as a base for fine-tuning on the task-specific data. The first few blocks serve as acoustic feature extraction. Since the acoustic features of Standard German and Swiss German are very close, only the weights of the post-processing blocks as well as the last three or five intermediate blocks were updated, depending on the experiment, as described in Section 5.2. Apart from the frozen blocks, the same hyperparameters were used as for the pre-training. The model was trained for another 100 epochs for fine-tuning (see Figure 2 for the Word Error Rate progression over the 100 epochs).
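In PyTorch terms, this selective update can be sketched as follows. The attribute names (`blocks`, `post_blocks`) are placeholders for however a Jasper implementation exposes its block list, not the actual API we used:

```python
import torch.nn as nn

def prepare_for_finetuning(model: nn.Module, trainable_tail: int = 3) -> None:
    """Freeze all weights, then unfreeze the post-processing blocks and
    the last `trainable_tail` (3 or 5) intermediate blocks."""
    for param in model.parameters():
        param.requires_grad = False  # freeze everything, incl. pre-processing
    # Hypothetical attributes: `blocks` holds the intermediate blocks in
    # order, `post_blocks` the three post-processing blocks.
    for block in list(model.blocks)[-trainable_tail:]:
        for param in block.parameters():
            param.requires_grad = True
    for block in model.post_blocks:
        for param in block.parameters():
            param.requires_grad = True
```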
Table 1: List of speech corpora used for pre-training. We used the original training splits, if available, and removed all identified invalid samples (e.g. containing wrong transcriptions or corrupted audio). This resulted in training data consisting of 536.9 hours of speech.

Name                           Size (h)   Num. of Speakers
TuDa (Milde and Köhn, 2018)    183        179
CV (Ardila et al., 2019)       324        4852
VoxForge (VoxForge, 2019)      32         328
SWC (Baumann et al., 2018)     266        573
M-AILABS (M-AILABS, 2019)      233        -

[Figure 2: Word Error Rate progression on the internal development set.]

4.6 Performance

The acoustic models were trained on an NVIDIA DGX-1 system. Pre-training with about 540 hours of Standard German took approximately 197 hours using two NVIDIA Tesla V100 GPUs, while fine-tuning of the acoustic model (AM-A-5x5-SP) with about 70 hours of Swiss German speech required approximately 21 hours with one V100 GPU. Inference was much faster and took only about two minutes per 4 hours of speech on an NVIDIA Titan X GPU. Applying the language model (LM extended) required some additional computation time; however, this took only a few minutes on a recent system, for training as well as for decoding in combination with the beam search algorithm.

5 Experiments

We describe the experiments we conducted in order to improve the baseline system in Section 5.2, present the results we obtained in Section 5.3, and reflect on them in Section 5.4.

5.1 System Components

The data set provided as part of the Shared Task was split into internal train, development and test sets. The train set consisted of 32'978 utterances, the development set contained 1'778 utterances, and the test set comprised 1'816 utterances. This split approximates 90% training, 5% development, and 5% testing. A single speaker could not occur in different sets, and the utterance lengths were taken into account for splitting.

The experiments consisted in fine-tuning the baseline system with the use of additional text data and, in one case, in applying transcript post-processing.

Acoustic Models
The baseline Acoustic Model (called "AM base" below) was fine-tuned on the internal train set, first on three blocks (model "AM-E 3x5") and in the second version on five blocks (model "AM-E 5x5"). In the last step of Acoustic Model fine-tuning, the baseline model was re-trained on the complete official train set (internal train, development and test sets combined), which resulted in the model called "AM-A 5x5". Additionally, we trained a model on the internal training set without applying any pre-training (model "AM-NOPRE").

Language Models
The language modelling setup is described in Section 4.3. We used two different Language Models (LMs). The basic Language Model ("LM base") consists of corpora 1-3 in Table 2. Since these corpora are from different domains than the task data, we injected additional data to fine-tune the LM: corpus 4 is a collection of 11'576 press releases by the Federal Chancellery (Bundeskanzlei). These were scraped from https://www.bk.admin.ch/bk/de/home/dokumentation/medienmitteilungen.msg-id-.html using a custom script, where consecutive IDs up to the most recent press release were queried and the content was subsequently extracted using XPath. Corpus 5 consists of the internal training set transcripts. The LM trained on all available corpora (1-5) is referred to as "LM extended".
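For reference, a 6-gram KenLM model of the kind described in Section 4.3 can be built and queried along the following lines. This is a sketch with illustrative file paths, assuming KenLM's lmplz binary and Python bindings are installed; it is not our exact tooling:

```python
import subprocess
import kenlm

# Train a 6-gram ARPA model from a normalized corpus, one sentence per line.
with open("lm_corpus.txt", "rb") as src, open("lm_6gram.arpa", "wb") as dst:
    subprocess.run(["lmplz", "-o", "6"], stdin=src, stdout=dst, check=True)

lm = kenlm.Model("lm_6gram.arpa")
# score() returns a log10 probability, usable for re-ranking beam hypotheses.
print(lm.score("das ist ein beispiel", bos=True, eos=True))
```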
Table 2: List of text corpora used for training Language Models. The first three corpora were used for the basic Language Model, while the last two were additionally included in the extended LM.

     Name                                    Num. of Sentences
1    News-Commentary (Bojar et al., 2018)    383'764
2    EuroParl (Koehn, 2005)                  1'920'208
3    Tuda-Text (Milde and Köhn, 2018)        7'776'674
4    Federal Chancellery Press Releases      174'520
5    Training set transcripts                32'977

Article Post-processing
During development we noticed that there was a considerable amount of errors due to incorrectly predicted articles (e.g. "der", "die", "das") (see Section 5.4 for more details). We identified individual definite and indefinite articles in a predicted utterance, removed them, and queried the top 5 predictions of a BERT model (Devlin et al., 2019). If the originally predicted article appeared in the list of suggestions, we kept it. Otherwise, it was replaced by the article scored highest by BERT, making sure not to replace an indefinite article by a definite one or vice versa.
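A sketch of this check using the Hugging Face fill-mask pipeline is shown below. The checkpoint name and the article lists are illustrative stand-ins, not our exact setup:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-german-cased", top_k=5)

DEFINITE = {"der", "die", "das", "dem", "den", "des"}
INDEFINITE = {"ein", "eine", "einem", "einen", "einer", "eines"}

def check_article(words: list[str], i: int) -> str:
    """Re-predict the article at position i; keep it if BERT agrees,
    otherwise take the top-ranked article of the same kind."""
    original = words[i]
    same_kind = DEFINITE if original in DEFINITE else INDEFINITE
    masked = " ".join(words[:i] + [fill.tokenizer.mask_token] + words[i + 1:])
    suggestions = [p["token_str"].strip() for p in fill(masked)]
    if original in suggestions:
        return original
    for suggestion in suggestions:
        if suggestion in same_kind:  # never swap definite <-> indefinite
            return suggestion
    return original
```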
5.2 Experimental Setup

In total, nine experiments were conducted with the goal of investigating the performance of the various models. The details of the experiments are presented in Table 3. The very first experiment ("base") was performed without any fine-tuning or post-processing on the base model, while the second one ("AMext3x5") aimed at evaluating the predictions of the "AM extended 3x5" model without applying any Language Model. In the third experiment we evaluated the model trained only on the internal Swiss German train set, without any pre-training on Standard German ("AMno pretrain"). The next two experiments consisted in introducing and extending the Language Model ("AMch3x5 LMbase" and "AMch3x5 LMext"). Following that, we investigated data augmentation possibilities: in addition to SpecAugment, which is used in all experiments, we applied speed perturbation (Ko et al., 2015) to the Acoustic Model data (model "AMch3x5 sp LMext"). The sixth experiment ("AMch3x5 sp LMext artc") was an attempt to improve the results by performing transcript post-processing; we sought to reduce the number of substitutions resulting from incorrectly predicted articles by applying BERT as described above. In "AMch5x5 sp LMext" we introduced the Acoustic Model "AM extended 5x5", and we replaced it with "AM all 5x5" in the final experiment ("AMall5x5 sp LMext").
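Speed perturbation (Ko et al., 2015) resamples each utterance, e.g. by factors 0.9 and 1.1, changing both tempo and pitch. A minimal sketch using torchaudio's sox bindings, as one possible implementation rather than our exact pipeline:

```python
import torchaudio

def speed_perturb(path: str, factor: float):
    """Return a speed-perturbed copy of an utterance (e.g. factor 0.9 or 1.1)."""
    waveform, sample_rate = torchaudio.load(path)
    # "speed" changes tempo and pitch; "rate" restores the sampling rate.
    effects = [["speed", str(factor)], ["rate", str(sample_rate)]]
    perturbed, _ = torchaudio.sox_effects.apply_effects_tensor(
        waveform, sample_rate, effects)
    return perturbed, sample_rate
```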
5.3 Evaluation

The results of all experiments were evaluated on the internal test set, except for the last one, "AMall5x5 sp LMext", where the internal test set was used for training the models. The five best-performing versions were submitted for evaluation on the public test set of the Shared Task. Table 3 provides an overview of all results.

Eventually, we achieved 40.29% WER on the official test set. Our best performing system is a combination of the baseline Acoustic Model re-trained on 5 blocks with Swiss German data, speed perturbation, and a Language Model fine-tuned on in-domain data from Switzerland.
Table 3: Experiments overview. Note on Acoustic Models: AM-E-3x5-SP stands for AM extended 3x5 with speed perturbation, and AM-A-5x5-SP for AM all 5x5 with speed perturbation.

System Name               Acoustic Model   Language Model   Post-Processing   WER (internal)   WER (official)
base                      AM base          -                -                 92.1%            -
AMext3x5                  AM-E 3x5         -                -                 54.1%            -
AMch3x5 LMbase            AM-E 3x5         LM base          -                 46.46%           -
AMno pretrain LMext       AM-NOPRE SP      LM extended      -                 46.82%           43.52%
AMch3x5 LMext             AM-E 3x5         LM extended      -                 45.52%           42.61%
AMch3x5 sp LMext          AM-E 3x5 SP      LM extended      -                 44.83%           41.76%
AMch3x5 sp LMext artc     AM-E 3x5 SP      LM extended      articles          45.17%           42.2%
AMch5x5 sp LMext          AM-E 5x5 SP      LM extended      -                 44.43%           41.16%
AMall5x5 sp LMext         AM-A 5x5 SP      LM extended      -                 -                40.29%
5.4 Discussion

The two largest performance improvements were achieved by fine-tuning the Acoustic Model on the task-specific data ("AMext3x5" vs "base": WER reduced by 38% absolute) and by using a general-purpose Language Model during decoding ("AMch3x5 LMbase" vs "AMext3x5": WER reduced by 7.64% absolute). Both of these are standard practices in ASR, and hence these improvements are neither surprising nor particularly insightful.

We identified articles as one distinct source of errors: around one sixth of substitution errors were articles; hence, we decided to address these during post-processing (model "AMch3x5 sp LMext artc"). Our method using BERT (see Section 5.2) did not improve performance. There are several reasons for this. First, while some articles were indeed improved with this method, often there was insufficient context to accurately determine the correct article. Domain-specific abbreviations (e.g. party names such as SVP, EVP) also proved difficult. Second, we observed a number of article errors that are due to the non-exact nature of the transcription. These are linguistic or stylistic changes and improvements of the spoken text and can therefore not be addressed by our method, for example: changing a spoken definite article to an indefinite one, using plural instead of singular, transcribing a spoken "es" with "das", or inserting an extra article into a coordinated noun phrase.

Finally, there is also a challenge that relates to the specific language variety in this task: articles in Swiss German are rather difficult to detect, as they usually consist of single phonemes which are assimilated to the following noun. This means that articles may be missed at an earlier stage of processing and will not be present in the output passed to the post-processing.

Our extended Language Model brought a nearly 1% absolute WER improvement ("AMch3x5 LMext" vs "AMch3x5 LMbase"), which is less than we expected. However, this can be explained by the rather small amount of additional data: corpora 4 and 5 (see Table 2) only account for 2% of all sentences passed to the LM. Using more in-domain data in the LM could lead to a larger effect.

Further small improvements were obtained by using speed perturbation ("AMch3x5 sp LMext" vs "AMch3x5 LMext": -0.7% absolute on our internal test set and -0.85% on the task test set) and by retraining five Jasper blocks instead of three ("AMch5x5 sp LMext" vs "AMch3x5 sp LMext": -0.4% absolute on our internal test set and -0.6% absolute on the task test set).

We also note that our performance on the task test set is consistently better than on our internal test set.

6 Training Data Challenges

Before we conclude, we would like to reflect on the properties of the task data and their repercussions for WER results.

Our analysis of the errors and the data showed that properties of the data often lead to an increase in WER, where the ASR model provides an adequate transcription but is "punished" by data artefacts. We identified the following main issues:

• We noticed that transcriptions in the training set are inconsistent with respect to numerals, which are written as either numbers or words, so that transcribing the numeral four as "vier" when the reference transcript has "4" will lead to a substitution error. Since there is no consistency in the writing of numerals (e.g. always using words, always using numbers, using words when smaller than ten, etc.), this leads to errors that we could not prevent.
• Transcripts are polished (e.g. speech disfluencies such as repetitions, hesitations, and false starts are removed) and reformulated so they become more readable, which means they do not exactly represent the spoken text. For example, in training set item 19940.flac, the speaker starts by saying "mer hie enne" (DE: "wir hier drin", EN: "we in here"), but this was transcribed as "wir in diesem Saal" (EN: "we in this chamber"), leading to three errors (two substitutions and one deletion) when the model transcribes the spoken utterance faithfully.

• We also note issues with the segmentation of the audio files, which, according to the task organizers, was performed automatically. This leads to insertion errors (when extra audio is included beyond what is transcribed) or deletion errors (when portions of the audio are missing) that the model cannot mitigate.

Given the observed discrepancies between the speech and the transcripts, additional evaluation measures might be considered. In CL-STT, BLEU scores are used for evaluation. Even though this metric has been criticized, it might fit the setup of this task better, since the paraphrasing might not be unique. Further, measures considering semantics and synonyms (Wieting et al., 2019; Kane et al., 2020) might prove helpful. However, in this specific case of official transcriptions, this would entail re-annotation, the cost of which would be prohibitive.

7 Conclusion

In this paper, we presented our contribution to the Shared Task on Low-Resource STT at GermEval 2020. Our solution consists of a CNN acoustic model based on Jasper (Li et al., 2019) with beam-search decoding and CTC loss. Our most successful model uses Transfer Learning, where we re-train the last five blocks of the Acoustic Model on the task data. Additionally, we use speed perturbation and a Language Model trained on both out-of-domain and in-domain text data. These improvements reduced the WER by over 50% compared to the Standard German baseline system. Our best model achieved a WER of 40.29% on the official task test set, resulting in first place out of three contributions.

References

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. 2019. Common Voice: A Massively-Multilingual Speech Corpus. ArXiv, abs/1912.06670.

Timo Baumann, Arne Köhn, and Felix Hennig. 2018. The Spoken Wikipedia Corpus collection: Harvesting, alignment and an application to hyperlistening. Language Resources and Evaluation.

Alexandre Bérard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text Translation. arXiv preprint arXiv:1612.01744.

Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 Conference on Machine Translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 272–303, Brussels, Belgium. Association for Computational Linguistics.

Matthias Büchi. 2020. Speech Recognition Component for Search-Oriented Conversational Artificial Intelligence. Master's thesis, ZHAW Zurich University of Applied Sciences.

William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. 2016. Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4960–4964. IEEE.

Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, et al. 2018. State-of-the-Art Speech Recognition with Sequence-to-Sequence Models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4774–4778. IEEE.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Linhao Dong, Shuang Xu, and Bo Xu. 2018. Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5884–5888. IEEE.
Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376.

Kenneth Heafield. 2011. KenLM: Faster and Smaller Language Model Queries. In Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland, United Kingdom.

Navdeep Jaitly, Quoc V. Le, Oriol Vinyals, Ilya Sutskever, David Sussillo, and Samy Bengio. 2016. An Online Sequence-to-Sequence Model Using Partial Conditioning. In Advances in Neural Information Processing Systems, pages 5067–5075.

Ye Jia, Melvin Johnson, Wolfgang Macherey, Ron J. Weiss, Yuan Cao, Chung-Cheng Chiu, Naveen Ari, Stella Laurenzo, and Yonghui Wu. 2019a. Leveraging Weakly Supervised Data to Improve End-to-End Speech-to-Text Translation. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7180–7184. IEEE.

Ye Jia, Ron J. Weiss, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson, Zhifeng Chen, and Yonghui Wu. 2019b. Direct Speech-to-Speech Translation with a Sequence-to-Sequence Model. arXiv preprint arXiv:1904.06037.

Hassan Kane, Muhammed Yusuf Kocyigit, Ali Abdalla, Pelkins Ajanoh, and Mohamed Coulibali. 2020. NUBIA: NeUral Based Interchangeability Assessor for Text Generation.

Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. 2015. Audio Augmentation for Speech Recognition. In INTERSPEECH.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation.

Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M. Cohen, Huyen Nguyen, and Ravi Teja Gadde. 2019. Jasper: An End-to-End Convolutional Neural Acoustic Model. arXiv preprint arXiv:1904.03288.

M-AILABS. 2019. M-AILABS Speech Dataset. https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/. Accessed: 2019-12-10.

Benjamin Milde and Arne Köhn. 2018. Open Source Automatic Speech Recognition for German. In Proceedings of ITG 2018.

NVIDIA. 2020. Jasper source code. https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechRecognition/Jasper/images/jasper_dense_residual.png. Accessed: 2020-05-14.

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. ArXiv, abs/1904.08779.

Michel Plüss, Lukas Neukom, and Manfred Vogel. 2020a. Forced Alignment of Swiss German Speech to Standard German Text. In preparation.

Michel Plüss, Lukas Neukom, and Manfred Vogel. 2020b. GermEval 2020 Task 4: Low-Resource Speech-to-Text. In preparation.

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukáš Burget, Ondřej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlíček, Yanmin Qian, Petr Schwarz, Jan Silovský, Georg Stemmer, and Karel Veselý. 2011. The Kaldi Speech Recognition Toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.

Hasim Sak, Matt Shannon, Kanishka Rao, and Françoise Beaufays. 2017. Recurrent Neural Aligner: An Encoder-Decoder Neural Network Model for Sequence to Sequence Mapping. In Interspeech, volume 8, pages 1298–1302.

Lahiru Samarakoon, Brian Mak, and Albert Y. S. Lam. 2018. Domain Adaptation of End-to-End Speech Recognition in Low-Resource Settings. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 382–388. IEEE.

Beat Siebenhaar. 2013. Sprachgeographische Aspekte der Morphologie und Verschriftung in schweizerdeutschen Chats.

Beat Siebenhaar and Alfred Wyler. 1997. Dialekt und Hochsprache in der deutschsprachigen Schweiz. Pro Helvetia.

Mihaela C. Stoian, Sameer Bansal, and Sharon Goldwater. 2019. Analyzing ASR Pretraining for Low-Resource Speech-to-Text Translation. arXiv preprint arXiv:1910.10762.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems, pages 5998–6008.

VoxForge. 2019. VoxForge. http://www.voxforge.org/de. Accessed: 2019-12-10.

John Wieting, Taylor Berg-Kirkpatrick, Kevin Gimpel, and Graham Neubig. 2019. Beyond BLEU: Training Neural Machine Translation with Semantic Similarity. arXiv preprint arXiv:1909.06694.

Klaus Zechner and Alex Waibel. 2000. Minimizing Word Error Rate in Textual Summaries of Spoken Language. In 1st Meeting of the North American Chapter of the Association for Computational Linguistics.

Neil Zeghidour, Qiantong Xu, Vitaliy Liptchinsky, Nicolas Usunier, Gabriel Synnaeve, and Ronan Collobert. 2018. Fully Convolutional Speech Recognition. arXiv preprint arXiv:1812.06864.