<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ZHAW-InIT at GermEval 2020 Task 4: Low-Resource Speech-to-Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matthias Büchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Malgorzata Anna Ulasik</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuela Hürlimann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fernando Benites</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pius von Däniken</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark Cieliebak</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Applied Information Technology, Zurich University of Applied Sciences</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>This paper presents the contribution of ZHAW-InIT to Task 4 “Low-Resource STT” at GermEval 2020. The goal of the task is to develop a system for translating Swiss German dialect speech into Standard German text in the domain of parliamentary debates. Our approach is based on Jasper, a CNN Acoustic Model, which we fine-tune on the task data. We enhance the base system with an extended Language Model containing in-domain data and with speed perturbation, and run further experiments with post-processing. Our submission achieved first place with a final Word Error Rate of 40.29%.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Automatic Speech Recognition (ASR) is defined
as mapping audio signals to text. A particular
challenge for ASR arises if a language does not
have a standardized writing system, as is the case
for Swiss German. In German-speaking
Switzerland, Swiss German is the default spoken language
on most occasions, from formal to informal;
however, the language of reading and writing is
Standard German (“medial diglossia”, Siebenhaar and
Wyler (1997)). Swiss German is increasingly used
for writing in informal contexts, especially on
social media, but users usually write phonetically in
their local dialect
        <xref ref-type="bibr" rid="ref27">(Siebenhaar, 2013)</xref>
        . The
particular dialects of Swiss German differ from each
other to such an extent that speakers of one dialect
might even have difficulty understanding dialects
from some other regions. An indirect consequence
is that many dialects are considered low-resource,
since there is not enough data for each dialect for
many natural language processing tasks.
Nonetheless, there is enough data to train ASR systems
for Standard German, which is spoken by a
substantially larger group of native speakers, being
an official language also in Germany and
Austria. On official occasions, speeches are written
down, transcribed, or logged in Standard German.
Since the linguistic distance between the Swiss
German dialects and the official language, Standard German,
is quite large, this poses a task similar to
Cross-Linguistic Speech-To-Text (CL-STT; also referred
to as speech-to-text translation), which is a difficult
interdisciplinary challenge, combining STT with
elements of Machine Translation (MT)
        <xref ref-type="bibr" rid="ref3">(Bérard et al., 2016)</xref>
        . Both fields have a long history of
methods and approaches, which are currently at
the point of converging thanks to the development
of deep learning technology. This combination
of ASR and MT is indeed needed in the context
of Swiss German dialects, as speeches are
paraphrased or even translated (see Section 6 for an
example).
      </p>
      <p>The Shared Task “Low-Resource STT” at
GermEval 2020 targeted a specific Swiss case
of CL-STT: translating Swiss German dialect
spoken in an official context to written Standard
German.</p>
      <p>
        In our approach, we applied a general
character-based ASR system
        <xref ref-type="bibr" rid="ref18">(Li et al., 2019)</xref>
        , pre-trained
on a large German corpus, and fine-tuned it on the
Shared Task data. We further enriched our
Language Model with additional publicly available
data.
      </p>
      <p>
The goal of this Shared Task was to develop a
system for translating Swiss German dialect speech
into Standard German text in the domain of
parliamentary debates.
      </p>
      <p>A data set of 36’572 utterances with a total
duration of 69.8 hours was made available for
training the systems, and a 4-hour test set was used for
evaluating solutions. The training data consists of
a set of recordings of debates held in the
parliament of the canton of Bern, with utterances
produced by 191 speakers. None of these 191
speakers occurs in the test set. The audio recordings
contain mostly Swiss German dialect speech with the
majority of the utterances being spoken in Bernese
dialect; however, there are also some recordings of
Standard German speech as well as a few English
utterances. Each utterance contains one sentence
and has an average duration of 6.9 seconds.</p>
      <p>
        All recordings have been manually transcribed
into Standard German, while the alignment
between audio and transcripts was performed
automatically by the task organizers
        <xref ref-type="bibr" rid="ref23 ref24">(Plüss et al.,
2020b,a)</xref>
        .
      </p>
      <p>
        The transcript accuracy is measured with the
Word Error Rate (WER), which is the standard
ASR evaluation metric. It is computed as the sum
of the number of insertions, deletions and
substitutions between predicted and reference sentences
divided by the number of words in the reference
        <xref ref-type="bibr" rid="ref33">(Zechner and Waibel, 2000)</xref>
        . The task organizers chose WER
over the BLEU score, which is usually applied
for automatic evaluation of translations, because the Swiss
German spoken in the parliament is comparatively
close to Standard German and the diversity of
possible correct translations is very limited. Prior
to evaluation, the task organizers normalized both
ground truth and transcribed utterances by
lowercasing them and removing punctuation.
      </p>
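      <p>The WER definition above can be illustrated with a minimal Python sketch. It is not the organizers' evaluation script: the function name is hypothetical, and only lowercasing is applied here, whereas the official normalization also removes punctuation. The number of insertions, deletions and substitutions is obtained as the word-level edit distance.</p>
      <preformat>
```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference words seen
    # so far (one fewer than in the current row) and the first j
    # hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Reference and faithful transcription of the example discussed in the
# section on training data challenges: two substitutions, one deletion.
print(word_error_rate("wir in diesem Saal", "wir hier drin"))
```
      </preformat>
      <p>Because the count is divided by the reference length, a hypothesis with many insertions can yield a WER above 100%.</p>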
    </sec>
    <sec id="sec-2">
      <title>3 Related Work</title>
      <p>
        The most recent developments in both ASR and
machine translation involve generalized methods
that can be relatively easily ported across the two
tasks, such as the encoder-decoder architecture.
One of the most prominent, “Listen, Attend and
Spell” (LAS)
        <xref ref-type="bibr" rid="ref6">(Chan et al., 2016)</xref>
        , uses an
encoder-decoder architecture with attention and a
pyramidal LSTM for the encoder. Chiu et al. (2018)
describe improvements to LAS, such as multi-head
attention, scheduled sampling, and label
smoothing, which achieved new state-of-the-art
performance, although only on proprietary voice search
data. Other encoder-decoder models include the
Neural Transducer
        <xref ref-type="bibr" rid="ref12 ref6">(Jaitly et al., 2016)</xref>
        ,
Recurrent Neural Aligner
        <xref ref-type="bibr" rid="ref25">(Sak et al., 2017)</xref>
        and models
based on the Transformer architecture
        <xref ref-type="bibr" rid="ref30">(Vaswani
et al., 2017)</xref>
        as in Dong et al. (2018). Zeghidour
et al. (2018) achieved state-of-the-art performance
on an end-to-end system based on convolutional
neural networks (CNN). Their system can predict
characters directly from raw waveforms, instead
of the commonly used log-MEL features.
      </p>
      <p>
        Li et al. (2019) propose a convolutional network
with residual connections, with state-of-the-art
results on the LibriSpeech and Wall Street Journal
ASR data sets. The network predicts a character at
each step (of 20 ms) and a Connectionist
Temporal Classification (CTC) loss
        <xref ref-type="bibr" rid="ref10">(Graves et al., 2006)</xref>
          is used for training. Beam search decoding allows
the prediction to match a pre-trained Language
Model. Preliminary work (Büchi, 2020) showed
that this approach was much easier to adapt and
train on a large German corpus than
hybrid systems trained with Kaldi (Povey et al.,
2011), which achieve similar results.
      </p>
      <p>
        While usually tens of thousands of hours of
audio are required for achieving state-of-the-art
ASR performance, some approaches target
languages where only a few hours of data are
available
        <xref ref-type="bibr" rid="ref26">(Samarakoon et al., 2018)</xref>
        . The use of
pre-training and transfer learning is especially
helpful in such challenging setups
        <xref ref-type="bibr" rid="ref29">(Stoian et al., 2019)</xref>
        .
      </p>
      <p>
        Although there are approaches which directly
target the speech translation setup
        <xref ref-type="bibr" rid="ref13 ref14 ref3">(Bérard et al.,
2016; Jia et al., 2019b,a)</xref>
        , and the Shared Task data
consists of translations and paraphrases of the
spoken utterances, we decided not to add an additional
component dealing specifically with translation to
our system because of the lack of relevant
available data.
      </p>
    </sec>
    <sec id="sec-3">
      <title>4 System Description</title>
      <p>This section describes the initial system used to
establish a base for our experiments. Important
concepts as well as parameters crucial for the
experiments are explained.</p>
      <sec id="sec-3-1">
        <title>4.1 Reference Text Pre-processing</title>
        <p>We normalized all texts before training the
Acoustic Models and Language Models. This step was
necessary to obtain a standardized set of possible
characters, which in this case were the letters
a-z, ä, ö and ü. Normalization was performed in
multiple steps, starting by lower-casing the whole
text and splitting it into sentences. All
punctuation symbols were removed, except for periods and
commas, which might serve as a decimal separator or
mark ordinal numbers. Numbers were transliterated
to words. Common abbreviations and symbols
were replaced by their spoken form (e.g. “%” by
“Prozent” or “kg” by “Kilogramm”). Letters with
diacritics other than ä, ö, and ü were replaced by
their counterpart without diacritics. Finally, any
remaining unknown symbols were removed
without replacement.</p>
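        <p>A minimal Python sketch of this kind of normalization is given below. It is an illustration under stated assumptions rather than the pipeline actually used: the function names are hypothetical, only the two replacements named in the text are included, and sentence splitting and the transliteration of numbers are omitted.</p>
        <preformat>
```python
import re
import unicodedata

# Target alphabet from the text (a-z, ä, ö, ü) plus the space character.
KEEP = set("abcdefghijklmnopqrstuvwxyzäöü ")


def strip_foreign_diacritics(text: str) -> str:
    """Remove diacritics from all letters except ä, ö and ü."""
    out = []
    for char in text:
        if char in "äöü":
            out.append(char)
        else:
            decomposed = unicodedata.normalize("NFD", char)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)


def normalize(text: str) -> str:
    # Lowercase and compose characters so that umlauts are single code points.
    text = unicodedata.normalize("NFC", text.lower())
    # Illustrative subset of the symbol and abbreviation replacements.
    text = text.replace("%", " prozent ")
    text = re.sub(r"\bkg\b", "kilogramm", text)
    text = strip_foreign_diacritics(text)
    # Drop every remaining symbol outside the target alphabet.
    text = "".join(c for c in text if c in KEEP)
    return " ".join(text.split())


print(normalize("Über das Café: ein kg!"))
```
        </preformat>
        <p>Applying the replacements before the final filtering step matters, since symbols such as “%” would otherwise be deleted before they can be spelled out.</p>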
      </sec>
      <sec id="sec-3-2">
        <title>4.2 Acoustic Model</title>
        <p>
          An Acoustic Model was used to predict
linguistic units based on an audio signal. For this
purpose, Jasper
          <xref ref-type="bibr" rid="ref18">(Li et al., 2019)</xref>
          , a DNN-based model,
was applied. Jasper predicts a probability
distribution over all possible characters at every time
step based on mel-filterbank features as input. The
input was augmented with SpecAugment
          <xref ref-type="bibr" rid="ref19">(Park et al., 2019)</xref>
          .
        </p>
        <p>The model consists of convolutional layers
structured in blocks and sub-blocks. A model
BxR is defined by the number of blocks B and
the number of sub-blocks R. Every sub-block consists
of a 1D-convolution, batch-normalization, ReLU,
and dropout. The input of each block is connected
to the last sub-block by a residual connection. We
applied the Dense Residual configuration, which
is shown in Figure 1, where the output of each
block is additionally added to the inputs of all
following blocks. For pre- and post-processing
one and three additional blocks were used,
respectively.</p>
        <p>
          During training, the CTC loss
          <xref ref-type="bibr" rid="ref10">(Graves et al.,
2006)</xref>
          was minimized using the Novograd optimizer introduced in
          <xref ref-type="bibr" rid="ref18">Li et al. (2019)</xref>
          .
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>4.3 Decoding</title>
        <p>
          In order to get transcriptions from the Acoustic
Model output, beam search was applied. Beam
search tries to find the most probable text sequence
given probabilities of characters over time.
Additionally, a Language Model was used to re-rank
the beam search hypotheses. A Language Model
penalizes words that are not known and assigns a
probability to each word given the words
preceding it. The weight of the Language Model is
controlled with parameter α. A parameter β is used as
the word insertion bonus to prevent a preference
for long words. The Language Model we used was
a 6-gram model trained with KenLM
          <xref ref-type="bibr" rid="ref11">(Heafield,
2011)</xref>
          .
        </p>
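        <p>How the Language Model weight and the word insertion bonus interact can be sketched as follows. The scoring formula is the one commonly used when an n-gram model is combined with a CTC decoder and is an assumption here, since the text only names the two parameters; the names Hypothesis, combined_score, rerank, alpha (LM weight) and beta (word insertion bonus) are hypothetical.</p>
        <preformat>
```python
from typing import List, NamedTuple


class Hypothesis(NamedTuple):
    text: str
    acoustic_log_prob: float  # log-probability from the acoustic model
    lm_log_prob: float        # log-probability from the n-gram LM


def combined_score(hyp: Hypothesis, alpha: float, beta: float) -> float:
    # Assumed scoring rule: acoustic score, weighted LM score,
    # and a bonus for every word in the hypothesis.
    word_count = len(hyp.text.split())
    return hyp.acoustic_log_prob + alpha * hyp.lm_log_prob + beta * word_count


def rerank(hyps: List[Hypothesis], alpha: float, beta: float) -> List[Hypothesis]:
    """Sort beam search hypotheses from best to worst combined score."""
    return sorted(hyps, key=lambda h: combined_score(h, alpha, beta),
                  reverse=True)


# Made-up scores: the second hypothesis is acoustically better but
# contains words that the LM considers unlikely.
beam = [Hypothesis("das ist gut", -4.0, -3.0),
        Hypothesis("das ischt guet", -3.5, -9.0)]
print(rerank(beam, alpha=1.0, beta=0.0)[0].text)
```
        </preformat>
        <p>With alpha set to zero the ranking depends on the acoustic scores alone, so in this toy example the second hypothesis would win.</p>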
      </sec>
      <sec id="sec-3-4">
        <title>4.4 Pre-training on Standard German</title>
        <p>The Acoustic Model requires a large amount of
data for training. Therefore, Standard German
speech data as listed in Table 1 was used to
create a pre-trained model<sup>1</sup>. Based on the given data
sets, a combined version was created. Training,
development and test splits were kept if given in
the original data sets. Otherwise, custom splits
were created with a size of 15% for test and
validation, but with a maximum of 15000 seconds.</p>
        <p>For the size of the model, the configuration
10x5 was used. The model was trained with an initial
learning rate of 0.015 on batches of size 64 for a
total of 100 epochs.</p>
      </sec>
      <sec id="sec-3-5">
        <title>4.5 Fine-Tuning</title>
        <p>The pre-trained model was used as a base for
fine-tuning using the task-specific data. The first few
blocks serve as acoustic feature extraction. Since
acoustic features of Standard German and Swiss
German are very close, only weights of the
post-processing blocks as well as the last three or five
intermediate blocks were updated, depending on
the experiment as described in Section 5.2. Apart
from the frozen blocks, the same hyperparameters
were used as for the pre-training. The model was
trained for another 100 epochs for fine-tuning (see
Figure 2 for Word Error Rate progression over the
100 epochs).</p>
        <p>The acoustic models were trained on an NVIDIA
DGX-1 system. Pre-training with about 540
hours of Standard German took approximately
197 hours using two NVIDIA Tesla V100 GPUs,
while fine-tuning of the acoustic model
(AM-A5x5-SP) with about 70 hours of Swiss German
speech required approximately 21 hours with one
V100 GPU. Inference was much faster
and took only about two minutes per 4 hours
of speech on an NVIDIA Titan X GPU. Applying
the language model (LM extended) required some
additional computation time. However, this took
only a few minutes on a recent system, both for training
and for decoding in combination with the
beam search algorithm.</p>
        <p><sup>1</sup>Accessible through https://github.com/german-asr/megs.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5 Experiments</title>
      <p>We describe the experiments we conducted in
order to improve the baseline system in Section 5.2,
present the results we obtained in Section 5.3 and
reflect on them in Section 5.4.</p>
      <sec id="sec-4-1">
        <title>5.1 System Components</title>
        <p>The data set provided as part of the Shared Task
was split into internal train, development and test
sets. The train set consisted of 32’978 utterances,
the development set contained 1’778 utterances,
while the test set comprised 1’816 utterances. This
split approximates 90% training, 5% development,
5% testing. A single speaker could not occur in
different sets and the utterance lengths were taken
into account for splitting.</p>
        <p>The experiments consisted of fine-tuning the
baseline system with the use of additional text
data and, in one case, applying transcript
post-processing.</p>
        <p>Acoustic Models. The baseline Acoustic Model
(called “AM base” below) was fine-tuned on the
internal train set, first on three blocks (model
“AM-E 3x5”) and in the second version on five
blocks (model “AM-E 5x5”). In the last step of
Acoustic Model fine-tuning, the baseline model
was re-trained on the complete official train set
(internal train, development and test sets
combined), which resulted in the model called
“AM-A 5x5”. Additionally, we trained a model on
the internal training set without applying any
pre-training (model “AM-NOPRE”).</p>
        <p>Language Models. The language modelling
setup is described in Section 4.3. We used
two different Language Models (LMs). The
basic Language Model (“LM base”) consists
of corpora 1-3 in Table 2. Since these
corpora are from different domains than the task
data, we added in-domain data to
the LM: corpus 4 is a collection of 11’576
press releases by the Federal Chancellery
(Bundeskanzlei). These were scraped from
https://www.bk.admin.ch/bk/de/home/dokumentation/medienmitteilungen.msg-id-&lt;ID&gt;.html
using a custom script,
where consecutive &lt;ID&gt;s up to the most recent
press release were queried and the content was
subsequently extracted using XPath. Corpus 5
consists of the internal training set transcripts.
The LM trained on all available corpora (1-5) is
referred to as “LM extended”.</p>
        <p>
          Article Post-processing. During development
we noticed that a considerable number
of errors were due to incorrectly predicted articles (e.g.
“der”, “die”, “das”) (see Section 5.4 for more
details). We identified individual definite and
indefinite articles in a predicted utterance, removed
them, and queried the top 5 predictions of a BERT
model
          <xref ref-type="bibr" rid="ref8">(Devlin et al., 2019)</xref>
          . If the originally
predicted article appeared in the list of suggestions,
we kept it. Otherwise it was replaced by the
article scored highest by BERT, making sure not to
replace an indefinite article by a definite one or
vice versa.
        </p>
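        <p>The replacement rule can be sketched in Python as follows. The sketch covers only the decision logic and assumes that the masked-language-model suggestions are already available as a list ordered from best to worst. The article inventories, the function name choose_article and the fallback of keeping the original article when no suggestion of the same type exists are assumptions made for illustration.</p>
        <preformat>
```python
from typing import List

# Assumed article inventories; the text does not list the forms used.
DEFINITE = {"der", "die", "das", "den", "dem", "des"}
INDEFINITE = {"ein", "eine", "einen", "einem", "einer", "eines"}


def choose_article(predicted: str, suggestions: List[str]) -> str:
    """Return the article to keep in the transcript.

    predicted:   article produced by the speech recognition system
    suggestions: top-k tokens proposed for the masked position, best first
    """
    if predicted in suggestions:
        return predicted
    same_type = DEFINITE if predicted in DEFINITE else INDEFINITE
    # Take the best-ranked suggestion of the same type, so that a definite
    # article is never swapped for an indefinite one or vice versa.
    for candidate in suggestions:
        if candidate in same_type:
            return candidate
    # Assumed fallback: leave the prediction unchanged.
    return predicted


print(choose_article("der", ["die", "das", "eine", "und", "in"]))
```
        </preformat>
        <p>Keeping any article that appears among the suggestions makes the rule conservative: the acoustic evidence is overridden only when the masked model does not consider the predicted article plausible at all.</p>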
      </sec>
      <sec id="sec-4-2">
        <title>5.2 Experimental Setup</title>
        <p>
          In total, nine experiments were conducted with
the goal of investigating the performance of the
various models. The details of the experiments
are presented in Table 3. The very first
experiment (“base”) was performed without any
fine-tuning or post-processing on the base model, while
the second one (“AMext3x5”) aimed at
evaluating the predictions from the “AM extended 3x5”
model without applying any Language Model.
In the third experiment we evaluated the model
trained only on the internal Swiss German train
set without any pre-training on Standard
German (“AMno pretrain”). The next two
experiments consisted of introducing and extending
the Language Model (“AMch3x5 LMbase” and
“AMch3x5 LMext”). Following that, we
investigated data augmentation possibilities. In
addition to SpecAugment, which is used in all
experiments, we applied speed perturbation
          <xref ref-type="bibr" rid="ref16">(Ko
et al., 2015)</xref>
          to the Acoustic Model data (model
“AMch3x5 sp LMext”). The next experiment
(“AMch3x5 sp LMext artc”) was an attempt to
improve the results by performing transcript
post-processing. We sought to reduce the number
of substitutions resulting from incorrect
prediction of articles by applying BERT as described
above. In “AMch5x5 sp LMext” we introduced
the Acoustic Model “AM extended 5x5” and
replaced it with “AM all 5x5” in the final experiment
(“AMall5x5 sp LMext”).
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>5.3 Evaluation</title>
        <p>The results of all experiments were evaluated
on the internal test set, except for the last one,
“AMall5x5 sp LMext”, where the internal test set
was used for training the models. The five
best-performing versions were submitted for evaluation
on the public test set of the Shared Task. Table 3
provides an overview of all results.</p>
        <p>Eventually, we achieved 40.29% WER on the
official test set. Our best-performing system is a
combination of the baseline Acoustic Model
retrained on 5 blocks with Swiss German data, speed
perturbation, and a Language Model fine-tuned on
in-domain data from Switzerland.</p>
      </sec>
      <sec id="sec-4-4">
        <title>5.4 Discussion</title>
        <p>The two largest performance improvements were
achieved by fine-tuning the Acoustic Model on
the task-specific data (“AMext3x5” vs “base”:
WER reduced by 38% absolute) and by using a
general-purpose Language Model during
decoding (“AMext3x5 LMbase” vs “AMext3x5”: WER
reduced by 7.64% absolute). Both of these are
standard practices in ASR and hence these
improvements are neither surprising nor particularly</p>
        <p>We identified articles as one distinct source
of errors: around one sixth of substitution
errors were articles; hence, we decided to
address these during post-processing (model
“AMch3x5 sp LMext artc”). Our method using
BERT (see Section 5.2) did not improve
performance. There are several reasons for this. First,
while some articles were indeed improved with
this method, often there was insufficient context to
accurately determine the correct article.
Domain-specific abbreviations (e.g. party names such as
SVP, EVP) also proved difficult. Second, we
observed a number of article errors that are due to the
non-exact nature of the transcription. These are
linguistic or stylistic changes and improvements of
the spoken text and can therefore not be addressed
by our method. For example: changing a spoken
definite article to an indefinite one, using plural
instead of singular, transcribing a spoken “es” with
“das”, or inserting an extra article into a
coordinated noun phrase.</p>
        <p>Finally, there is also a challenge that relates to
the specific language variety in this task: articles
in Swiss German are rather difficult to detect as
they usually consist of single phonemes which are
assimilated to the following noun. This means that
articles may be missed at an earlier stage of
processing and will not be present in the output passed
to the post-processing.</p>
        <p>Our extended Language Model brought
a nearly 1% absolute WER improvement
(“AMch3x5 LMext” vs “AMch3x5 LMbase”),
which is less than we expected. However, this
can be explained by the rather small amount of
additional data: corpora 4 and 5 (see Table 2)
only account for 2% of all sentences passed to the
LM. Using more in-domain data in the LM could
lead to a larger effect.</p>
        <p>Further small improvements were obtained by
using speed perturbation (“AMch3x5 sp LMext”
vs “AMch3x5 LMext”: -0.7% absolute on
our internal test set and -0.85% on the task
test set) and retraining five Jasper blocks
instead of three (“AMch5x5 sp LMext” vs
“AMch3x5 sp LMext”: -0.4% absolute on our
internal test set and -0.6% absolute on the task
test set).</p>
        <p>We also note that our performance on the task
test set is consistently better than the one on our
internal test set.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6 Training Data Challenges</title>
      <p>Before we conclude, we would like to reflect on
the properties of the task data and their
repercussions for WER results.</p>
      <p>Our analysis of the errors and the data showed
that properties of the data often lead to an increase
in WER, where the ASR model provides an
adequate transcription but is “punished” by data
artefacts. We identified the following main issues:</p>
      <p>We noticed that transcriptions in the training
set are inconsistent with respect to
numerals, which are written as either digits or
words, so that transcribing the numeral four
as “vier” when the reference transcript has
“4” will lead to a substitution error. Since
there is no consistency in the writing of
numerals (e.g. always using words, always
using digits, using words when smaller than
ten, etc.), this leads to errors that we could not
prevent.</p>
      <p>Transcripts are polished (e.g. speech
disfluencies such as repetitions, hesitations, and
false starts are removed) and reformulated
so they become more readable, which means
they do not exactly represent the spoken
text. For example, in training set item
19940.flac, the speaker starts by saying
“mer hie enne” (DE: “wir hier drin”, EN: “we
in here”), but this was transcribed as “wir
in diesem Saal” (EN: “we in this chamber”),
leading to three errors (two substitutions and
one deletion) when transcribed faithfully to
the spoken utterance by the model.</p>
      <p>We also note issues with the segmentation of
audio files, which, according to the task
organizers, was performed automatically. This
leads to insertion errors (when extra audio is
included beyond what is transcribed) or
deletion errors (when portions of the audio are
missing) of the model that cannot be
mitigated.</p>
      <p>
        Given the observed discrepancies between the
speech and transcript, additional evaluation
measures might be considered. In CL-STT, BLEU
scores are used for evaluation. Even though this
metric has been criticized, it might fit the setup
of this task better, since the paraphrasing might
not be unique. Further, measures considering
semantics and synonyms
        <xref ref-type="bibr" rid="ref15 ref32">(Wieting et al., 2019; Kane
et al., 2020)</xref>
        might prove helpful. However, in this
specific case of official transcriptions, this would
entail re-annotation, the cost of which would be
prohibitive.
      </p>
    </sec>
    <sec id="sec-6">
      <title>7 Conclusion</title>
      <p>
        In this paper, we presented our contribution to the
Shared Task on Low-Resource STT at GermEval
2020. Our solution consists of a CNN acoustic
model based on Jasper
        <xref ref-type="bibr" rid="ref18">(Li et al., 2019)</xref>
        with
beam search decoding and CTC loss. Our most
successful model uses Transfer Learning, where we
retrain the last five blocks of the Acoustic Model on
the task data. Additionally, we use speed
perturbation and a Language Model trained on both
out-of-domain and in-domain text data. These
improvements reduced the WER by over 50% compared
to the Standard German baseline system. Our best
model achieved a WER of 40.29% on the official
task test set, resulting in first place out of three
contributions.
      </p>
      <p>Neil Zeghidour, Qiantong Xu, Vitaliy Liptchinsky,
Nicolas Usunier, Gabriel Synnaeve, and Ronan
Collobert. 2018. Fully Convolutional Speech
Recognition. arXiv preprint arXiv:1812.06864.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Rosana</given-names>
            <surname>Ardila</surname>
          </string-name>
          , Megan Branson, Kelly Davis,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Henretty</surname>
          </string-name>
          , Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders,
          <string-name>
            <given-names>Francis M.</given-names>
            <surname>Tyers</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Gregor</given-names>
            <surname>Weber</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Common Voice: A Massively-Multilingual Speech Corpus</article-title>
          .
          <source>arXiv preprint arXiv:1912.06670</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Timo</given-names>
            <surname>Baumann</surname>
          </string-name>
          , Arne Köhn, and
          <string-name>
            <given-names>Felix</given-names>
            <surname>Hennig</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>The Spoken Wikipedia Corpus collection: Harvesting, alignment and an application to hyperlistening</article-title>
          .
          <source>Language Resources and Evaluation.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          Alexandre Bérard, Olivier Pietquin, Christophe Servan, and
          <string-name>
            <given-names>Laurent</given-names>
            <surname>Besacier</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text Translation</article-title>
          .
          <source>arXiv preprint arXiv:1612.01744</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Ondřej</given-names>
            <surname>Bojar</surname>
          </string-name>
          , Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Philipp Koehn, and
          <string-name>
            <given-names>Christof</given-names>
            <surname>Monz</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Findings of the 2018 Conference on Machine Translation (WMT18)</article-title>
          .
          <source>In Proceedings of the Third Conference on Machine Translation: Shared Task Papers</source>
          , pages
          <fpage>272</fpage>
          -
          <lpage>303</lpage>
          , Belgium, Brussels. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Matthias</given-names>
            <surname>Büchi</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Speech Recognition Component for Search-Oriented Conversational Artificial Intelligence</article-title>
          .
          <source>Master's thesis</source>
          , ZHAW Zurich University of Applied Sciences.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>William</given-names>
            <surname>Chan</surname>
          </string-name>
          , Navdeep Jaitly, Quoc Le, and
          <string-name>
            <given-names>Oriol</given-names>
            <surname>Vinyals</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition</article-title>
          .
          <source>In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , pages
          <fpage>4960</fpage>
          -
          <lpage>4964</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Chung-Cheng</given-names>
            <surname>Chiu</surname>
          </string-name>
          , Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J Weiss, Kanishka Rao,
          <string-name>
            <given-names>Ekaterina</given-names>
            <surname>Gonina</surname>
          </string-name>
          , et al.
          <year>2018</year>
          .
          <article-title>State-of-the-Art Speech Recognition with Sequence-to-Sequence Models</article-title>
          .
          <source>In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , pages
          <fpage>4774</fpage>
          -
          <lpage>4778</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers), pages
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          , Minneapolis, Minnesota. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Linhao</given-names>
            <surname>Dong</surname>
          </string-name>
          , Shuang Xu, and
          <string-name>
            <given-names>Bo</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition</article-title>
          .
          <source>In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , pages
          <fpage>5884</fpage>
          -
          <lpage>5888</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Alex</given-names>
            <surname>Graves</surname>
          </string-name>
          , Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber
          .
          <year>2006</year>
          .
          <article-title>Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks</article-title>
          .
          <source>In Proceedings of the 23rd International Conference on Machine Learning</source>
          , pages
          <fpage>369</fpage>
          -
          <lpage>376</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Kenneth</given-names>
            <surname>Heafield</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>KenLM: Faster and Smaller Language Model Queries</article-title>
          .
          <source>In Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation</source>
          , pages
          <fpage>187</fpage>
          -
          <lpage>197</lpage>
          , Edinburgh, Scotland, United Kingdom.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Navdeep</given-names>
            <surname>Jaitly</surname>
          </string-name>
          , Quoc V Le, Oriol Vinyals, Ilya Sutskever, David Sussillo, and
          <string-name>
            <given-names>Samy</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>An Online Sequence-to-Sequence Model Using Partial Conditioning</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>5067</fpage>
          -
          <lpage>5075</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Ye</given-names>
            <surname>Jia</surname>
          </string-name>
          , Melvin Johnson, Wolfgang Macherey, Ron J Weiss, Yuan Cao, Chung-Cheng Chiu, Naveen Ari, Stella Laurenzo, and
          <string-name>
            <given-names>Yonghui</given-names>
            <surname>Wu</surname>
          </string-name>
          . 2019a.
          <article-title>Leveraging Weakly Supervised Data to Improve End-to-End Speech-to-Text Translation</article-title>
          .
          <source>In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , pages
          <fpage>7180</fpage>
          -
          <lpage>7184</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Ye</given-names>
            <surname>Jia</surname>
          </string-name>
          , Ron J Weiss, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson, Zhifeng Chen, and
          <string-name>
            <given-names>Yonghui</given-names>
            <surname>Wu</surname>
          </string-name>
          . 2019b.
          <article-title>Direct Speech-to-Speech Translation with a Sequence-to-Sequence model</article-title>
          . arXiv preprint arXiv:1904.06037.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Hassan</given-names>
            <surname>Kane</surname>
          </string-name>
          , Muhammed Yusuf Kocyigit, Ali Abdalla, Pelkins Ajanoh, and
          <string-name>
            <given-names>Mohamed</given-names>
            <surname>Coulibali</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>NUBIA: NeUral Based Interchangeability Assessor for Text Generation</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Tom</given-names>
            <surname>Ko</surname>
          </string-name>
          , Vijayaditya Peddinti, Daniel Povey, and
          <string-name>
            <given-names>Sanjeev</given-names>
            <surname>Khudanpur</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Audio Augmentation for Speech Recognition</article-title>
          .
          <source>In INTERSPEECH.</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Philipp</given-names>
            <surname>Koehn</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Europarl: A Parallel Corpus for Statistical Machine Translation</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Jason</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Vitaly</given-names>
            <surname>Lavrukhin</surname>
          </string-name>
          , Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M Cohen,
          <string-name>
            <given-names>Huyen</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , and Ravi Teja Gadde.
          <year>2019</year>
          .
          <article-title>Jasper: An End-to-End Convolutional Neural Acoustic Model</article-title>
          . arXiv preprint arXiv:1904.03288.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>M-AILABS</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>M-AILABS Speech Dataset</article-title>
          . https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/. Accessed: 2019-12-10.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Milde</surname>
          </string-name>
          and Arne Köhn.
          <year>2018</year>
          .
          <article-title>Open Source Automatic Speech Recognition for German</article-title>
          .
          <source>In Proceedings of ITG 2018</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>NVIDIA</surname>
          </string-name>
          .
          <year>2020</year>
          . Jasper source https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechRecognition/Jasper/images/jasper_dense_residual.png. Accessed: 2020-05-14.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>Daniel S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>William</given-names>
            <surname>Chan</surname>
          </string-name>
          , Yu Zhang, Chung-Cheng Chiu, Barret Zoph,
          <string-name>
            <given-names>Ekin D.</given-names>
            <surname>Cubuk</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Quoc V.</given-names>
            <surname>Le</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition</article-title>
          . ArXiv, abs/1904.08779.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>Michel</given-names>
            <surname>Plüss</surname>
          </string-name>
          , Lukas Neukom, and Manfred Vogel
          . 2020a.
          <article-title>Forced Alignment of Swiss German Speech to Standard German Text</article-title>
          .
          <source>In preparation.</source>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>Michel</given-names>
            <surname>Plüss</surname>
          </string-name>
          , Lukas Neukom, and Manfred Vogel
          . 2020b.
          <article-title>GermEval 2020 Task 4: Low-Resource Speech-to-Text</article-title>
          .
          <source>In preparation.</source>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>Hasim</given-names>
            <surname>Sak</surname>
          </string-name>
          , Matt Shannon,
          <string-name>
            <given-names>Kanishka</given-names>
            <surname>Rao</surname>
          </string-name>
          , and Françoise Beaufays.
          <year>2017</year>
          .
          <article-title>Recurrent Neural Aligner: An Encoder-Decoder Neural Network Model for Sequence to Sequence Mapping</article-title>
          . In Interspeech, volume
          <volume>8</volume>
          , pages
          <fpage>1298</fpage>
          -
          <lpage>1302</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <given-names>Lahiru</given-names>
            <surname>Samarakoon</surname>
          </string-name>
          , Brian Mak, and Albert YS Lam
          .
          <year>2018</year>
          .
          <article-title>Domain Adaptation of End-to-end Speech Recognition in Low-Resource Settings</article-title>
          .
          <source>In 2018 IEEE Spoken Language Technology Workshop (SLT)</source>
          , pages
          <fpage>382</fpage>
          -
          <lpage>388</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <given-names>Beat</given-names>
            <surname>Siebenhaar</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Sprachgeographische Aspekte der Morphologie und Verschriftung in schweizerdeutschen Chats</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <given-names>Beat</given-names>
            <surname>Siebenhaar</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alfred</given-names>
            <surname>Wyler</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Dialekt und Hochsprache in der deutschsprachigen Schweiz</article-title>
          . Pro Helvetia.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <given-names>Mihaela C</given-names>
            <surname>Stoian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sameer</given-names>
            <surname>Bansal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Sharon</given-names>
            <surname>Goldwater</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Analyzing ASR Pretraining for Low-Resource Speech-to-Text Translation</article-title>
          . arXiv preprint arXiv:1910.10762.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <given-names>Ashish</given-names>
            <surname>Vaswani</surname>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
          <string-name>
            <given-names>Łukasz</given-names>
            <surname>Kaiser</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention Is All You Need</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>VoxForge</surname>
          </string-name>
          .
          <year>2019</year>
          . VoxForge. http://www.voxforge.org/de. Accessed: 2019-12-10.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <given-names>John</given-names>
            <surname>Wieting</surname>
          </string-name>
          , Taylor Berg-Kirkpatrick,
          <string-name>
            <given-names>Kevin</given-names>
            <surname>Gimpel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Graham</given-names>
            <surname>Neubig</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Beyond BLEU: Training Neural Machine Translation with Semantic Similarity</article-title>
          . arXiv preprint arXiv:1909.06694.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <given-names>Klaus</given-names>
            <surname>Zechner</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alex</given-names>
            <surname>Waibel</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Minimizing Word Error Rate in Textual Summaries of Spoken Language</article-title>
          .
          <source>In 1st Meeting of the North American Chapter of the Association for Computational Linguistics</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>