ZHAW-InIT at GermEval 2020 Task 4: Low-Resource Speech-to-Text

Matthias Büchi, Malgorzata Anna Ulasik, Manuela Hürlimann, Fernando Benites, Pius von Däniken, and Mark Cieliebak
Institute of Applied Information Technology
Zurich University of Applied Sciences
{buec, ulas, hueu, benf, vode, ciel}@zhaw.ch

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

This paper presents the contribution of ZHAW-InIT to Task 4 "Low-Resource STT" at GermEval 2020. The goal of the task is to develop a system for translating Swiss German dialect speech into Standard German text in the domain of parliamentary debates. Our approach is based on Jasper, a CNN Acoustic Model, which we fine-tune on the task data. We enhance the base system with an extended Language Model containing in-domain data and with speed perturbation, and run further experiments with post-processing. Our submission achieved first place with a final Word Error Rate of 40.29%.

1 Introduction

Automatic Speech Recognition (ASR) is defined as mapping audio signals to text. A particular challenge for ASR arises if a language does not have a standardized writing system, as is the case for Swiss German. In German-speaking Switzerland, Swiss German is the default spoken language on most occasions, from formal to informal; however, the language of reading and writing is Standard German ("medial diglossia", Siebenhaar and Wyler (1997)). Swiss German is increasingly used for writing in informal contexts, especially on social media, but users usually write phonetically in their local dialect (Siebenhaar, 2013). The particular dialects of Swiss German differ from each other to such an extent that speakers of one dialect might even have difficulty understanding dialects from some other regions. An indirect consequence is that many dialects are considered low-resource, since there is not enough data per dialect for many natural language processing tasks. Nonetheless, there is enough data to train ASR systems for Standard German, which is spoken by a substantially larger group of native speakers and is an official language in Germany and Austria as well. On official occasions, speeches are written down, transcribed, or logged in Standard German. Since the linguistic distance between the Swiss German dialects and the official language German is quite large, this poses a task similar to Cross-Linguistic Speech-To-Text (CL-STT; also referred to as speech-to-text translation), a difficult interdisciplinary challenge combining STT with elements of Machine Translation (MT) (Bérard et al., 2016). Both fields have a long history of methods and approaches, which are currently converging thanks to the development of deep learning technology. This combination of ASR and MT is indeed needed in the context of Swiss German dialects, as speeches are paraphrased or even translated (see Section 6 for an example).

The Shared Task "Low-Resource STT" at GermEval 2020 addressed exactly such a Swiss case of CL-STT: translating Swiss German dialect spoken in an official context into written Standard German.

In our approach, we applied a general character-based ASR system (Li et al., 2019), pre-trained it on a large German corpus, and fine-tuned it on the Shared Task data. We further enriched our Language Model with additional publicly available data.
2 Shared Task Description

The goal of this Shared Task was to develop a system for translating Swiss German dialect speech into Standard German text in the domain of parliamentary debates.

A data set of 36'572 utterances with a total duration of 69.8 hours was made available for training the systems, and a 4-hour test set was used for evaluating solutions. The training data consists of recordings of debates held in the parliament of the canton of Bern, with utterances produced by 191 speakers. None of these 191 speakers occur in the test set. The audio recordings contain mostly Swiss German dialect speech, with the majority of the utterances spoken in Bernese dialect; however, there are also some recordings of Standard German speech as well as a few English utterances. Each utterance contains one sentence and has an average duration of 6.9 seconds.

All recordings have been manually transcribed into Standard German, while the alignment between audio and transcripts was performed automatically by the task organizers (Plüss et al., 2020b,a).

The transcript accuracy is measured with the Word Error Rate (WER), the standard ASR evaluation metric. It is computed as the sum of the number of insertions, deletions, and substitutions between predicted and reference sentences, divided by the number of words in the reference (Zechner and Waibel, 2000). Selecting WER instead of the BLEU score, which is usually applied for the automatic evaluation of translations, is justified by the task organizers with the fact that the Swiss German spoken in the parliament is comparatively close to Standard German and the diversity of possible correct translations is very limited. Prior to evaluation, the task organizers normalized both ground truth and transcribed utterances by lower-casing them and removing punctuation.
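To make the metric concrete, the following sketch computes WER via Levenshtein distance over words, mirroring the organizers' lower-casing and punctuation removal. It is an illustrative implementation, not the official evaluation script.

```python
import re

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference length."""
    # Mirror the task normalization: lower-case and strip punctuation.
    normalize = lambda s: re.sub(r"[^\wäöü ]", "", s.lower()).split()
    ref, hyp = normalize(reference), normalize(hypothesis)
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("wir sind in diesem Saal", "wir sind hier im Saal"))  # 0.4 (two substitutions)
```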
3 Related Work

The most recent developments in both ASR and machine translation involve generalized methods that can be ported across the two tasks relatively easily, such as the encoder-decoder architecture. One of the most prominent, "Listen, Attend and Spell" (LAS) (Chan et al., 2016), uses an encoder-decoder architecture with attention and a pyramidal LSTM for the encoder. Chiu et al. (2018) describe improvements to LAS, such as multi-head attention, scheduled sampling, and label smoothing, which achieved new state-of-the-art performance, although only on proprietary voice search data. Other encoder-decoder models include the Neural Transducer (Jaitly et al., 2016), the Recurrent Neural Aligner (Sak et al., 2017), and models based on the Transformer architecture (Vaswani et al., 2017), as in Dong et al. (2018). Zeghidour et al. (2018) achieved state-of-the-art performance with an end-to-end system based on convolutional neural networks (CNN). Their system can predict characters directly from raw waveforms instead of the commonly used log-MEL features.

Li et al. (2019) propose a convolutional network with residual connections that obtains state-of-the-art results on the LibriSpeech and Wall Street Journal ASR data sets. The network predicts a character at each step (of 20 ms), and a Connectionist Temporal Classification (CTC) loss (Graves et al., 2006) is used for training. Beam search decoding allows the predictions to be matched against a pre-trained Language Model. Preliminary work (Büchi, 2020) showed that this approach was much easier to adapt and train on a large German corpus than hybrid systems trained with Kaldi (Povey et al., 2011), which achieve similar results.

While usually tens of thousands of hours of audio are required for achieving state-of-the-art ASR performance, some approaches target languages where only a few hours of data are available (Samarakoon et al., 2018). Pre-training and transfer learning are especially helpful in such challenging setups (Stoian et al., 2019).

Although there are approaches which directly target the speech translation setup (Bérard et al., 2016; Jia et al., 2019b,a), and the Shared Task data consists of translations and paraphrases of the spoken utterances, we decided not to add a component dealing specifically with translation to our system because of the lack of relevant available data.

4 System Description

This section describes the initial system used to establish a base for our experiments. Important concepts as well as parameters crucial for the experiments are explained.

4.1 Reference Text Pre-processing

We normalized all texts before training the Acoustic Models and Language Models. This step was necessary to obtain a standardized set of possible characters, which in this case were the letters a-z, ä, ö, and ü. Normalization was performed in multiple steps, starting by lower-casing the whole text and splitting it into sentences. All punctuation symbols were removed, except for periods and commas, which might serve as decimal points or mark ordinal numbers. Numbers were transliterated to words. Common abbreviations and symbols were replaced by their spoken form (e.g. "%" by "Prozent" or "kg" by "Kilogramm"). Letters with diacritics other than ä, ö, and ü were replaced by their counterparts without diacritics. Finally, any remaining unknown symbols were removed without replacement.
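A minimal sketch of such a pipeline is shown below. The replacement tables are illustrative toy versions of the much larger lists we actually used, and a full implementation would transliterate arbitrary numbers with a German number-to-words library.

```python
import re

# Illustrative replacement tables; the real lists were more extensive.
ABBREVIATIONS = {"%": " prozent", "kg": " kilogramm"}       # naive substring replacement
DIACRITICS = {"é": "e", "è": "e", "à": "a", "ç": "c"}       # ä, ö, ü are kept as-is
NUMBERS = {"1": "eins", "2": "zwei", "3": "drei", "4": "vier"}  # toy single-digit mapping

def normalize(text: str) -> str:
    text = text.lower()
    for sym, spoken in ABBREVIATIONS.items():
        text = text.replace(sym, spoken)
    for src, dst in DIACRITICS.items():
        text = text.replace(src, dst)
    # Transliterate digits to words (a real system handles multi-digit numbers).
    text = re.sub(r"\d", lambda m: " " + NUMBERS.get(m.group(), "") + " ", text)
    # Keep only the allowed characters: a-z, ä, ö, ü, and spaces.
    text = re.sub(r"[^a-zäöü ]", "", text)
    return re.sub(r" +", " ", text).strip()

print(normalize("Die Steuer stieg um 3 %."))  # "die steuer stieg um drei prozent"
```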
4.2 Acoustic Model

An Acoustic Model was used to predict linguistic units based on an audio signal. For this purpose, Jasper (Li et al., 2019), a DNN-based model, was applied. Jasper predicts a probability distribution over all possible characters at every time step based on mel-filterbank features as input. The input was augmented with SpecAugment (Park et al., 2019).

The model consists of convolutional layers structured in blocks and sub-blocks. A model B × R is defined by the number of blocks B and the number of sub-blocks R. Every sub-block consists of a 1D-convolution, batch normalization, ReLU, and dropout. The input of each block is connected to its last sub-block by a residual connection. We applied the Dense Residual configuration, shown in Figure 1, where the output of each block is additionally added to the inputs of all following blocks. For pre- and post-processing, one and three additional blocks were used, respectively.

[Figure 1: Visualization of the Jasper B × R Dense Residual model, from the Jasper GitHub repository (NVIDIA, 2020). It shows one pre-processing block, three post-processing blocks, and intermediate blocks with residual connections.]

During training, the CTC loss (Graves et al., 2006) was minimized using the Novograd optimizer introduced in Li et al. (2019).

4.3 Decoding

In order to obtain transcriptions from the Acoustic Model output, beam search was applied. Beam search tries to find the most probable text sequence given the probabilities of characters over time. Additionally, a Language Model was used to re-rank the beam search hypotheses. A Language Model penalizes unknown words and assigns a probability to each word given the words preceding it. The weight of the Language Model is controlled with a parameter α. A parameter β is used as a word insertion bonus to counteract a preference for long words. The Language Model we used was a 6-gram model trained with KenLM (Heafield, 2011).
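The combined score used to rank hypotheses can be made concrete as follows. This is a minimal sketch assuming the kenlm Python bindings; the model path lm_extended.arpa is a hypothetical name, and the decoder we used may implement the scoring slightly differently.

```python
import math
import kenlm  # Python bindings for KenLM (Heafield, 2011)

lm = kenlm.Model("lm_extended.arpa")  # hypothetical path to the 6-gram model

def hypothesis_score(acoustic_logprob: float, text: str,
                     alpha: float = 1.0, beta: float = 1.0) -> float:
    """Score used to rank beam search hypotheses:
    acoustic log-prob + alpha * LM log-prob + beta * word count."""
    # KenLM reports log10 probabilities; convert to natural log.
    lm_logprob = lm.score(text, bos=True, eos=False) * math.log(10)
    return acoustic_logprob + alpha * lm_logprob + beta * len(text.split())
```

Tuning α trades acoustic evidence against LM fluency, while β compensates for the fact that every additional word incurs an LM penalty.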
4.4 Pre-training on Standard German

The Acoustic Model requires a large amount of data for training. Therefore, the Standard German speech data listed in Table 1 was used to create a pre-trained model (accessible through https://github.com/german-asr/megs). Based on the given data sets, a combined version was created. Training, development, and test splits were kept if provided in the original data sets. Otherwise, custom splits were created with a size of 15% for test and validation, but with a maximum of 15'000 seconds each.

Table 1: List of speech corpora used for pre-training. We used the original training splits, if available, and removed all identified invalid samples (e.g. wrong transcriptions or corrupted audio). This resulted in training data consisting of 536.9 hours of speech.

Name                          Size (h)   Num. of Speakers
TuDa (Milde and Köhn, 2018)   183        179
CV (Ardila et al., 2019)      324        4852
VoxForge (VoxForge, 2019)     32         328
SWC (Baumann et al., 2018)    266        573
M-AILABS (M-AILABS, 2019)     233        -

For the size of the model, the configuration 10 × 5 was used. The model was trained with an initial learning rate of 0.015 on batches of size 64 for a total of 100 epochs.

4.5 Fine-Tuning

The pre-trained model was used as a base for fine-tuning on the task-specific data. The first few blocks serve as acoustic feature extraction. Since the acoustic features of Standard German and Swiss German are very close, only the weights of the post-processing blocks as well as the last three or five intermediate blocks were updated, depending on the experiment, as described in Section 5.2. Apart from the frozen blocks, the same hyperparameters were used as for pre-training. The model was trained for another 100 epochs for fine-tuning (see Figure 2 for the Word Error Rate progression over the 100 epochs).

[Figure 2: Word Error Rate progression on the internal development set.]
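In a PyTorch-style framework, such partial fine-tuning boils down to disabling gradients for the frozen blocks. The sketch below illustrates the idea; the attribute names blocks and post_blocks are hypothetical and do not correspond to the actual Jasper implementation.

```python
import torch

def freeze_for_finetuning(model: torch.nn.Module, trainable_blocks: int = 5) -> None:
    """Freeze everything except the post-processing blocks and the last
    `trainable_blocks` intermediate blocks (3 or 5 in our experiments)."""
    # Freeze all parameters first.
    for param in model.parameters():
        param.requires_grad = False
    # Unfreeze the last few intermediate blocks (hypothetical attribute name).
    for block in list(model.blocks)[-trainable_blocks:]:
        for param in block.parameters():
            param.requires_grad = True
    # Unfreeze the post-processing blocks (hypothetical attribute name).
    for param in model.post_blocks.parameters():
        param.requires_grad = True

# The optimizer then only receives the trainable parameters, e.g.:
# optimizer = Novograd((p for p in model.parameters() if p.requires_grad), lr=0.015)
```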
4.6 Performance

The acoustic models were trained on an NVIDIA DGX-1 system. Pre-training with about 540 hours of Standard German took approximately 197 hours using two NVIDIA Tesla V100 GPUs, while fine-tuning the acoustic model (AM-A-5x5-SP) with about 70 hours of Swiss German speech required approximately 21 hours on one V100 GPU. Inference was much faster, taking only about two minutes per 4 hours of speech on an NVIDIA Titan X GPU. Applying the Language Model (LM extended) required some additional computation time; however, both training the LM and decoding with it in combination with beam search took only a few minutes on a recent system.

5 Experiments

We describe the experiments we conducted in order to improve the baseline system in Section 5.2, present the results in Section 5.3, and reflect on them in Section 5.4.

5.1 System Components

The data set provided as part of the Shared Task was split into internal train, development, and test sets. The train set consisted of 32'978 utterances, the development set contained 1'778 utterances, and the test set comprised 1'816 utterances. This split approximates 90% training, 5% development, and 5% testing. A single speaker could not occur in different sets, and utterance lengths were taken into account for splitting.

The experiments consisted of fine-tuning the baseline system with additional text data and, in one case, of applying transcript post-processing.

Acoustic Models. The baseline Acoustic Model (called "AM base" below) was fine-tuned on the internal train set, first on three blocks (model "AM-E 3x5") and in a second version on five blocks (model "AM-E 5x5"). In the last step of Acoustic Model fine-tuning, the baseline model was re-trained on the complete official train set (internal train, development, and test sets combined), which resulted in the model called "AM-A 5x5". Additionally, we trained a model on the internal training set without applying any pre-training (model "AM-NOPRE").

Language Models. The language modelling setup is described in Section 4.3. We used two different Language Models (LMs). The basic Language Model ("LM base") consists of corpora 1-3 in Table 2. Since these corpora are from different domains than the task data, we injected additional data to fine-tune the LM: corpus 4 is a collection of 11'576 press releases by the Federal Chancellery (Bundeskanzlei). These were scraped from https://www.bk.admin.ch/bk/de/home/dokumentation/medienmitteilungen.msg-id-.html using a custom script which queried consecutive message IDs up to the most recent press release and extracted the content using XPath. Corpus 5 consists of the internal training set transcripts. The LM trained on all available corpora (1-5) is referred to as "LM extended".

Table 2: List of text corpora used for training Language Models. The first three corpora were used for the basic Language Model, while the last two were additionally included in the extended LM.

  Name                                    Num. of Sentences
1 News-Commentary (Bojar et al., 2018)    383'764
2 EuroParl (Koehn, 2005)                  1'920'208
3 Tuda-Text (Milde and Köhn, 2018)        7'776'674
4 Federal Chancellery Press Releases      174'520
5 Training set transcripts                32'977

Article Post-processing. During development we noticed a considerable number of errors due to incorrectly predicted articles (e.g. "der", "die", "das"); see Section 5.4 for details. We identified individual definite and indefinite articles in a predicted utterance, removed them, and queried the top 5 predictions of a BERT model (Devlin et al., 2019). If the originally predicted article appeared in the list of suggestions, we kept it. Otherwise, it was replaced by the article scored highest by BERT, making sure not to replace an indefinite article by a definite one or vice versa.
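For illustration, a minimal sketch of this article-correction step using the Hugging Face transformers fill-mask pipeline is shown below. The checkpoint name bert-base-german-cased and the article lists are assumptions for the sketch; our actual implementation may have differed in such details.

```python
from transformers import pipeline

# Illustrative German BERT checkpoint; the exact model used may differ.
fill_mask = pipeline("fill-mask", model="bert-base-german-cased")

DEFINITE = {"der", "die", "das", "dem", "den", "des"}
INDEFINITE = {"ein", "eine", "einem", "einen", "einer", "eines"}

def correct_article(tokens: list, idx: int) -> str:
    """Replace the article at position idx unless BERT's top-5 confirms it.
    Assumes tokens[idx] is a definite or indefinite article."""
    original = tokens[idx].lower()
    kind = DEFINITE if original in DEFINITE else INDEFINITE
    masked = " ".join(tokens[:idx] + [fill_mask.tokenizer.mask_token] + tokens[idx + 1:])
    suggestions = [s["token_str"].strip().lower() for s in fill_mask(masked, top_k=5)]
    if original in suggestions:
        return tokens[idx]  # keep the prediction if BERT agrees
    # Otherwise take the highest-scored suggestion of the same kind, so a
    # definite article stays definite and an indefinite one stays indefinite.
    for s in suggestions:
        if s in kind:
            return s
    return tokens[idx]

tokens = "wir haben die Antrag abgelehnt".split()
tokens[2] = correct_article(tokens, 2)  # e.g. "die" -> "den"
```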
5.2 Experimental Setup

In total, nine experiments were conducted to investigate the performance of the various models. The details of the experiments are presented in Table 3. The very first experiment ("base") was performed on the base model without any fine-tuning or post-processing, while the second one ("AMext3x5") evaluated the predictions of the "AM extended 3x5" model without applying any Language Model. In the third experiment we evaluated the model trained only on the internal Swiss German train set without any pre-training on Standard German ("AMno pretrain"). The next two experiments consisted of introducing and extending the Language Model ("AMch3x5 LMbase" and "AMch3x5 LMext"). Following that, we investigated data augmentation possibilities. In addition to SpecAugment, which is used in all experiments, we applied speed perturbation (Ko et al., 2015) to the Acoustic Model data (model "AMch3x5 sp LMext"). The sixth experiment ("AMch3x5 sp LMext artc") was an attempt to improve the results by performing transcript post-processing. We sought to reduce the number of substitutions resulting from incorrectly predicted articles by applying BERT as described above. In "AMch5x5 sp LMext" we introduced the Acoustic Model "AM extended 5x5", and we replaced it with "AM all 5x5" in the final experiment ("AMall5x5 sp LMext").

5.3 Evaluation

The results of all experiments were evaluated on the internal test set, except for the last one, "AMall5x5 sp LMext", where the internal test set was used for training the model. The five best-performing versions were submitted for evaluation on the public test set of the Shared Task. Table 3 provides an overview of all results.

Eventually, we achieved 40.29% WER on the official test set. Our best-performing system is a combination of the baseline Acoustic Model re-trained on 5 blocks with Swiss German data, speed perturbation, and a Language Model fine-tuned on in-domain data from Switzerland.

Table 3: Experiments overview. Note on Acoustic Models: AM-E-3x5-SP stands for AM extended 3x5 with speed perturbation, and AM-A-5x5-SP for AM all 5x5 with speed perturbation.

System Name             Acoustic Model   Language Model   Post-Processing   WER (internal)   WER (official)
base                    AM base          -                -                 92.1%            -
AMext3x5                AM-E 3x5         -                -                 54.1%            -
AMch3x5 LMbase          AM-E 3x5         LM base          -                 46.46%           -
AMno pretrain LMext     AM-NOPRE SP      LM extended      -                 46.82%           43.52%
AMch3x5 LMext           AM-E 3x5         LM extended      -                 45.52%           42.61%
AMch3x5 sp LMext        AM-E 3x5 SP      LM extended      -                 44.83%           41.76%
AMch3x5 sp LMext artc   AM-E 3x5 SP      LM extended      articles          45.17%           42.2%
AMch5x5 sp LMext        AM-E 5x5 SP      LM extended      -                 44.43%           41.16%
AMall5x5 sp LMext       AM-A 5x5 SP      LM extended      -                 -                40.29%

5.4 Discussion

The two largest performance improvements were achieved by fine-tuning the Acoustic Model on the task-specific data ("AMext3x5" vs "base": WER reduced by 38% absolute) and by using a general-purpose Language Model during decoding ("AMch3x5 LMbase" vs "AMext3x5": WER reduced by 7.64% absolute). Both are standard practices in ASR, and hence these improvements are neither surprising nor particularly insightful.

We identified articles as one distinct source of errors: around one sixth of substitution errors involved articles; hence, we decided to address these during post-processing (model "AMch3x5 sp LMext artc"). Our method using BERT (see Section 5.2) did not improve performance. There are several reasons for this. First, while some articles were indeed improved by this method, often there was insufficient context to accurately determine the correct article. Domain-specific abbreviations (e.g. party names such as SVP, EVP) also proved difficult. Second, we observed a number of article errors that are due to the non-exact nature of the transcription. These are linguistic or stylistic changes and improvements of the spoken text and can therefore not be addressed by our method. Examples include changing a spoken definite article to an indefinite one, using plural instead of singular, transcribing a spoken "es" as "das", or inserting an extra article into a coordinated noun phrase. Finally, there is also a challenge that relates to the specific language variety in this task: articles in Swiss German are rather difficult to detect, as they usually consist of single phonemes which are assimilated to the following noun. This means that articles may be missed at an earlier stage of processing and will not be present in the output passed to the post-processing.

Our extended Language Model brought a nearly 1% absolute WER improvement ("AMch3x5 LMext" vs "AMch3x5 LMbase"), which is less than we expected. However, this can be explained by the rather small amount of additional data: corpora 4 and 5 (see Table 2) only account for 2% of all sentences passed to the LM. Using more in-domain data in the LM could lead to a larger effect.

Further small improvements were obtained by using speed perturbation ("AMch3x5 sp LMext" vs "AMch3x5 LMext": -0.7% absolute on our internal test set and -0.85% on the task test set) and by retraining five Jasper blocks instead of three ("AMch5x5 sp LMext" vs "AMch3x5 sp LMext": -0.4% absolute on our internal test set and -0.6% absolute on the task test set).
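Speed perturbation resamples each training utterance at factors such as 0.9, 1.0, and 1.1 (Ko et al., 2015), effectively tripling the acoustic training data. The following sketch shows how such copies can be generated offline, assuming torchaudio's sox effect bindings are available; it is a generic recipe, not our exact pipeline.

```python
import torchaudio

def speed_perturb(path: str, factors=(0.9, 1.0, 1.1)):
    """Yield a (waveform, sample_rate) pair for each speed factor."""
    waveform, sample_rate = torchaudio.load(path)
    for factor in factors:
        # `speed` changes tempo and pitch; `rate` resamples back to the
        # original sample rate so all copies share one sample rate.
        effects = [["speed", str(factor)], ["rate", str(sample_rate)]]
        perturbed, sr = torchaudio.sox_effects.apply_effects_tensor(
            waveform, sample_rate, effects)
        yield perturbed, sr
```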
We also note that our performance on the task test set is consistently better than on our internal test set.

6 Training Data Challenges

Before we conclude, we would like to reflect on the properties of the task data and their repercussions for WER results.

Our analysis of the errors and the data showed that properties of the data often lead to an increase in WER where the ASR model provides an adequate transcription but is "punished" by data artefacts. We identified the following main issues:

• Transcriptions in the training set are inconsistent with respect to numerals, which are written as either numbers or words, so that transcribing the numeral four as "vier" when the reference transcript has "4" will lead to a substitution error. Since there is no consistency in the writing of numerals (e.g. always using words, always using numbers, using words when smaller than ten, etc.), this leads to errors that we could not prevent.

• Transcripts are polished (e.g. speech disfluencies such as repetitions, hesitations, and false starts are removed) and reformulated so they become more readable, which means they do not exactly represent the spoken text. For example, in training set item 19940.flac, the speaker starts by saying "mer hie enne" (DE: "wir hier drin", EN: "we in here"), but this was transcribed as "wir in diesem Saal" (EN: "we in this chamber"), leading to three errors (two substitutions and one deletion) when the model transcribes the spoken utterance faithfully.

• We also note issues with the segmentation of the audio files, which, according to the task organizers, was performed automatically. This leads to insertion errors (when extra audio is included beyond what is transcribed) or deletion errors (when portions of the audio are missing) that the model cannot mitigate.

Given the observed discrepancies between speech and transcript, additional evaluation measures might be considered. In CL-STT, BLEU scores are used for evaluation. Even though this metric has been criticized, it might fit the setup of this task better, since the paraphrasing might not be unique. Further, measures considering semantics and synonyms (Wieting et al., 2019; Kane et al., 2020) might prove helpful. However, in this specific case of official transcriptions, this would entail re-annotation, the cost of which would be prohibitive.

7 Conclusion

In this paper, we presented our contribution to the Shared Task on Low-Resource STT at GermEval 2020. Our solution consists of a CNN acoustic model based on Jasper (Li et al., 2019) with beam search decoding and CTC loss. Our most successful model uses transfer learning, where we re-train the last five blocks of the Acoustic Model on the task data. Additionally, we use speed perturbation and a Language Model trained on both out-of-domain and in-domain text data. These improvements reduced the WER by over 50% compared to the Standard German baseline system. Our best model achieved a WER of 40.29% on the official task test set, resulting in first place out of three contributions.

References

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. 2019. Common Voice: A Massively-Multilingual Speech Corpus. ArXiv, abs/1912.06670.

Timo Baumann, Arne Köhn, and Felix Hennig. 2018. The Spoken Wikipedia Corpus Collection: Harvesting, Alignment and an Application to Hyperlistening. Language Resources and Evaluation.

Alexandre Bérard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text Translation. arXiv preprint arXiv:1612.01744.

Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 Conference on Machine Translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 272–303, Brussels, Belgium. Association for Computational Linguistics.

Matthias Büchi. 2020. Speech Recognition Component for Search-Oriented Conversational Artificial Intelligence. Master's thesis, ZHAW Zurich University of Applied Sciences.

William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. 2016. Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4960–4964. IEEE.

Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J Weiss, Kanishka Rao, Ekaterina Gonina, et al. 2018. State-of-the-Art Speech Recognition with Sequence-to-Sequence Models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4774–4778. IEEE.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Linhao Dong, Shuang Xu, and Bo Xu. 2018. Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5884–5888. IEEE.

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376.
Kenneth Heafield. 2011. KenLM: Faster and Smaller Language Model Queries. In Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland, United Kingdom.

Navdeep Jaitly, Quoc V Le, Oriol Vinyals, Ilya Sutskever, David Sussillo, and Samy Bengio. 2016. An Online Sequence-to-Sequence Model Using Partial Conditioning. In Advances in Neural Information Processing Systems, pages 5067–5075.

Ye Jia, Melvin Johnson, Wolfgang Macherey, Ron J Weiss, Yuan Cao, Chung-Cheng Chiu, Naveen Ari, Stella Laurenzo, and Yonghui Wu. 2019a. Leveraging Weakly Supervised Data to Improve End-to-End Speech-to-Text Translation. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7180–7184. IEEE.

Ye Jia, Ron J Weiss, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson, Zhifeng Chen, and Yonghui Wu. 2019b. Direct Speech-to-Speech Translation with a Sequence-to-Sequence Model. arXiv preprint arXiv:1904.06037.

Hassan Kane, Muhammed Yusuf Kocyigit, Ali Abdalla, Pelkins Ajanoh, and Mohamed Coulibali. 2020. NUBIA: NeUral Based Interchangeability Assessor for Text Generation.

Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. 2015. Audio Augmentation for Speech Recognition. In INTERSPEECH.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation.

Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M Cohen, Huyen Nguyen, and Ravi Teja Gadde. 2019. Jasper: An End-to-End Convolutional Neural Acoustic Model. arXiv preprint arXiv:1904.03288.

M-AILABS. 2019. M-AILABS Speech Dataset. https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/. Accessed: 2019-12-10.

Benjamin Milde and Arne Köhn. 2018. Open Source Automatic Speech Recognition for German. In Proceedings of ITG 2018.

NVIDIA. 2020. Jasper Source Code. https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechRecognition/Jasper/images/jasper_dense_residual.png. Accessed: 2020-05-14.

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. ArXiv, abs/1904.08779.

Michel Plüss, Lukas Neukom, and Manfred Vogel. 2020a. Forced Alignment of Swiss German Speech to Standard German Text. In preparation.

Michel Plüss, Lukas Neukom, and Manfred Vogel. 2020b. GermEval 2020 Task 4: Low-Resource Speech-to-Text. In preparation.

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukáš Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlíček, Yanmin Qian, Petr Schwarz, Jan Silovský, Georg Stemmer, and Karel Veselý. 2011. The Kaldi Speech Recognition Toolkit. IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.

Hasim Sak, Matt Shannon, Kanishka Rao, and Françoise Beaufays. 2017. Recurrent Neural Aligner: An Encoder-Decoder Neural Network Model for Sequence to Sequence Mapping. In Interspeech, volume 8, pages 1298–1302.

Lahiru Samarakoon, Brian Mak, and Albert YS Lam. 2018. Domain Adaptation of End-to-End Speech Recognition in Low-Resource Settings. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 382–388. IEEE.

Beat Siebenhaar. 2013. Sprachgeographische Aspekte der Morphologie und Verschriftung in schweizerdeutschen Chats.

Beat Siebenhaar and Alfred Wyler. 1997. Dialekt und Hochsprache in der deutschsprachigen Schweiz. Pro Helvetia.

Mihaela C Stoian, Sameer Bansal, and Sharon Goldwater. 2019. Analyzing ASR Pretraining for Low-Resource Speech-to-Text Translation. arXiv preprint arXiv:1910.10762.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems, pages 5998–6008.
VoxForge. 2019. VoxForge. http://www.voxforge.org/de. Accessed: 2019-12-10.

John Wieting, Taylor Berg-Kirkpatrick, Kevin Gimpel, and Graham Neubig. 2019. Beyond BLEU: Training Neural Machine Translation with Semantic Similarity. arXiv preprint arXiv:1909.06694.

Klaus Zechner and Alex Waibel. 2000. Minimizing Word Error Rate in Textual Summaries of Spoken Language. In 1st Meeting of the North American Chapter of the Association for Computational Linguistics.

Neil Zeghidour, Qiantong Xu, Vitaliy Liptchinsky, Nicolas Usunier, Gabriel Synnaeve, and Ronan Collobert. 2018. Fully Convolutional Speech Recognition. arXiv preprint arXiv:1812.06864.