ZHAW-InIT at GermEval 2020 Task 4: Low-Resource Speech-to-Text

Matthias Büchi, Malgorzata Anna Ulasik, Manuela Hürlimann, Fernando Benites, Pius von Däniken, and Mark Cieliebak
Institute of Applied Information Technology
Zurich University of Applied Sciences
{buec, ulas, hueu, benf, vode, ciel}@zhaw.ch

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

This paper presents the contribution of ZHAW-InIT to Task 4 "Low-Resource STT" at GermEval 2020. The goal of the task is to develop a system for translating Swiss German dialect speech into Standard German text in the domain of parliamentary debates. Our approach is based on Jasper, a CNN Acoustic Model, which we fine-tune on the task data. We enhance the base system with an extended Language Model containing in-domain data and with speed perturbation, and run further experiments with post-processing. Our submission achieved first place with a final Word Error Rate of 40.29%.

1 Introduction

Automatic Speech Recognition (ASR) is defined as mapping audio signals to text. A particular challenge for ASR arises if a language does not have a standardized writing system, as is the case for Swiss German. In German-speaking Switzerland, Swiss German is the default spoken language on most occasions, from formal to informal; however, the language of reading and writing is Standard German ("medial diglossia", Siebenhaar and Wyler (1997)). Swiss German is increasingly used for writing in informal contexts, especially on social media, but users usually write phonetically in their local dialect (Siebenhaar, 2013). The particular dialects of Swiss German differ from each other to such an extent that speakers of one dialect might even have difficulty understanding dialects from some other regions. An indirect consequence is that many dialects are considered low-resource, since there is not enough data per dialect for many natural language processing tasks. Nonetheless, there is enough data to train ASR systems for Standard German, which is spoken by a substantially larger group of native speakers and is an official language in Germany and Austria as well. On official occasions, speeches are written down, transcribed, or logged in Standard German. Since the linguistic distance between the Swiss German dialects and the official language German is quite large, this poses a task similar to Cross-Linguistic Speech-To-Text (CL-STT; also referred to as speech-to-text translation), a difficult interdisciplinary challenge combining STT with elements of Machine Translation (MT) (Bérard et al., 2016). Both fields have a long history of methods and approaches, which are currently converging thanks to the development of deep learning technology. This combination of ASR and MT is indeed needed in the context of Swiss German dialects, as speeches are paraphrased or even translated (see Section 6 for an example).

The Shared Task "Low-Resource STT" at GermEval 2020 addressed exactly such a Swiss case of CL-STT: translating Swiss German dialect spoken in an official context into written Standard German.

In our approach, we applied a general character-based ASR system (Li et al., 2019), pre-trained it on a large German corpus, and fine-tuned it on the Shared Task data. We further enriched our Language Model with additional publicly available data.
2 Shared Task Description

The goal of this Shared Task was to develop a system for translating Swiss German dialect speech into Standard German text in the domain of parliamentary debates.

A data set of 36'572 utterances with a total duration of 69.8 hours was made available for training the systems, and a 4-hour test set was used for evaluating solutions. The training data consists of recordings of debates held in the parliament of the canton of Bern, with utterances produced by 191 speakers. None of these 191 speakers occur in the test set. The audio recordings contain mostly Swiss German dialect speech, with the majority of the utterances spoken in Bernese dialect; however, there are also some recordings of Standard German speech as well as a few English utterances. Each utterance contains one sentence and has an average duration of 6.9 seconds.

All recordings have been manually transcribed into Standard German, while the alignment between audio and transcripts was performed automatically by the task organizers (Plüss et al., 2020b,a).

The transcript accuracy is measured with the Word Error Rate (WER), the standard ASR evaluation metric. It is computed as the sum of the number of insertions, deletions, and substitutions between predicted and reference sentences, divided by the number of words in the reference (Zechner and Waibel, 2000). Selecting WER instead of the BLEU score, which is usually applied for the automatic evaluation of translations, is justified by the task organizers with the fact that the Swiss German spoken in the parliament is comparatively close to Standard German and the diversity of possible correct translations is very limited. Prior to evaluation, the task organizers normalized both ground truth and transcribed utterances by lower-casing them and removing punctuation.
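To make the metric concrete, the following sketch computes WER via Levenshtein distance over words, mirroring the organizers' lower-casing and punctuation removal. It is an illustrative implementation, not the official evaluation script.

```python
import re

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference length."""
    # Mirror the task normalization: lower-case and strip punctuation.
    normalize = lambda s: re.sub(r"[^\wäöü ]", "", s.lower()).split()
    ref, hyp = normalize(reference), normalize(hypothesis)
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("wir sind in diesem Saal", "wir sind hier im Saal"))  # 0.4 (two substitutions)
```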
3 Related Work

The most recent developments in both ASR and machine translation involve generalized methods that can be ported across the two tasks relatively easily, such as the encoder-decoder architecture. One of the most prominent, "Listen, Attend and Spell" (LAS) (Chan et al., 2016), uses an encoder-decoder architecture with attention and a pyramidal LSTM for the encoder. Chiu et al. (2018) describe improvements to LAS, such as multi-head attention, scheduled sampling, and label smoothing, which achieved new state-of-the-art performance, although only on proprietary voice search data. Other encoder-decoder models include the Neural Transducer (Jaitly et al., 2016), the Recurrent Neural Aligner (Sak et al., 2017), and models based on the Transformer architecture (Vaswani et al., 2017), as in Dong et al. (2018). Zeghidour et al. (2018) achieved state-of-the-art performance with an end-to-end system based on convolutional neural networks (CNN). Their system can predict characters directly from raw waveforms instead of the commonly used log-MEL features.

Li et al. (2019) propose a convolutional network with residual connections that obtains state-of-the-art results on the LibriSpeech and Wall Street Journal ASR data sets. The network predicts a character at each step (of 20 ms), and a Connectionist Temporal Classification (CTC) loss (Graves et al., 2006) is used for training. Beam search decoding allows the predictions to be matched against a pre-trained Language Model. Preliminary work (Büchi, 2020) showed that this approach was much easier to adapt and train on a large German corpus than hybrid systems trained with Kaldi (Povey et al., 2011), which achieve similar results.

While usually tens of thousands of hours of audio are required for achieving state-of-the-art ASR performance, some approaches target languages where only a few hours of data are available (Samarakoon et al., 2018). Pre-training and transfer learning are especially helpful in such challenging setups (Stoian et al., 2019).

Although there are approaches which directly target the speech translation setup (Bérard et al., 2016; Jia et al., 2019b,a), and the Shared Task data consists of translations and paraphrases of the spoken utterances, we decided not to add a component dealing specifically with translation to our system because of the lack of relevant available data.

4 System Description

This section describes the initial system used to establish a base for our experiments. Important concepts as well as parameters crucial for the experiments are explained.

4.1 Reference Text Pre-processing

We normalized all texts before training the Acoustic Models and Language Models. This step was necessary to obtain a standardized set of possible characters, which in this case were the letters a-z, ä, ö, and ü. Normalization was performed in multiple steps, starting by lower-casing the whole text and splitting it into sentences. All punctuation symbols were removed, except for periods and commas, which might serve as decimal points or mark ordinal numbers. Numbers were transliterated to words. Common abbreviations and symbols were replaced by their spoken form (e.g. "%" by "Prozent" or "kg" by "Kilogramm"). Letters with diacritics other than ä, ö, and ü were replaced by their counterparts without diacritics. Finally, any remaining unknown symbols were removed without replacement.
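A minimal sketch of such a pipeline is shown below. The replacement tables are illustrative toy versions of the much larger lists we actually used, and a full implementation would transliterate arbitrary numbers with a German number-to-words library.

```python
import re

# Illustrative replacement tables; the real lists were more extensive.
ABBREVIATIONS = {"%": " prozent", "kg": " kilogramm"}       # naive substring replacement
DIACRITICS = {"é": "e", "è": "e", "à": "a", "ç": "c"}       # ä, ö, ü are kept as-is
NUMBERS = {"1": "eins", "2": "zwei", "3": "drei", "4": "vier"}  # toy single-digit mapping

def normalize(text: str) -> str:
    text = text.lower()
    for sym, spoken in ABBREVIATIONS.items():
        text = text.replace(sym, spoken)
    for src, dst in DIACRITICS.items():
        text = text.replace(src, dst)
    # Transliterate digits to words (a real system handles multi-digit numbers).
    text = re.sub(r"\d", lambda m: " " + NUMBERS.get(m.group(), "") + " ", text)
    # Keep only the allowed characters: a-z, ä, ö, ü, and spaces.
    text = re.sub(r"[^a-zäöü ]", "", text)
    return re.sub(r" +", " ", text).strip()

print(normalize("Die Steuer stieg um 3 %."))  # "die steuer stieg um drei prozent"
```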
4.2 Acoustic Model

An Acoustic Model was used to predict linguistic units based on an audio signal. For this purpose, Jasper (Li et al., 2019), a DNN-based model, was applied. Jasper predicts a probability distribution over all possible characters at every time step based on mel-filterbank features as input. The input was augmented with SpecAugment (Park et al., 2019).

The model consists of convolutional layers structured in blocks and sub-blocks. A model B × R is defined by the number of blocks B and the number of sub-blocks R. Every sub-block consists of a 1D-convolution, batch normalization, ReLU, and dropout. The input of each block is connected to its last sub-block by a residual connection. We applied the Dense Residual configuration, shown in Figure 1, where the output of each block is additionally added to the inputs of all following blocks. For pre- and post-processing, one and three additional blocks were used, respectively.

[Figure 1: Visualization of the Jasper B × R Dense Residual model, from the Jasper GitHub repository (NVIDIA, 2020). It shows one pre-processing block, three post-processing blocks, and intermediate blocks with residual connections.]

During training, the CTC loss (Graves et al., 2006) was minimized using the Novograd optimizer introduced in Li et al. (2019).

4.3 Decoding

In order to obtain transcriptions from the Acoustic Model output, beam search was applied. Beam search tries to find the most probable text sequence given the probabilities of characters over time. Additionally, a Language Model was used to re-rank the beam search hypotheses. A Language Model penalizes unknown words and assigns a probability to each word given the words preceding it. The weight of the Language Model is controlled with a parameter α. A parameter β is used as a word insertion bonus to counteract a preference for long words. The Language Model we used was a 6-gram model trained with KenLM (Heafield, 2011).
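The combined score used to rank hypotheses can be made concrete as follows. This is a minimal sketch assuming the kenlm Python bindings; the model path lm_extended.arpa is a hypothetical name, and the decoder we used may implement the scoring slightly differently.

```python
import math
import kenlm  # Python bindings for KenLM (Heafield, 2011)

lm = kenlm.Model("lm_extended.arpa")  # hypothetical path to the 6-gram model

def hypothesis_score(acoustic_logprob: float, text: str,
                     alpha: float = 1.0, beta: float = 1.0) -> float:
    """Score used to rank beam search hypotheses:
    acoustic log-prob + alpha * LM log-prob + beta * word count."""
    # KenLM reports log10 probabilities; convert to natural log.
    lm_logprob = lm.score(text, bos=True, eos=False) * math.log(10)
    return acoustic_logprob + alpha * lm_logprob + beta * len(text.split())
```

Tuning α trades acoustic evidence against LM fluency, while β compensates for the fact that every additional word incurs an LM penalty.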
4.4 Pre-training on Standard German

The Acoustic Model requires a large amount of data for training. Therefore, the Standard German speech data listed in Table 1 was used to create a pre-trained model (accessible through https://github.com/german-asr/megs). Based on the given data sets, a combined version was created. Training, development, and test splits were kept if provided in the original data sets. Otherwise, custom splits were created with a size of 15% for test and validation, but with a maximum of 15'000 seconds each.

Table 1: List of speech corpora used for pre-training. We used the original training splits, if available, and removed all identified invalid samples (e.g. wrong transcriptions or corrupted audio). This resulted in training data consisting of 536.9 hours of speech.

Name                          Size (h)   Num. of Speakers
TuDa (Milde and Köhn, 2018)   183        179
CV (Ardila et al., 2019)      324        4852
VoxForge (VoxForge, 2019)     32         328
SWC (Baumann et al., 2018)    266        573
M-AILABS (M-AILABS, 2019)     233        -

For the size of the model, the configuration 10 × 5 was used. The model was trained with an initial learning rate of 0.015 on batches of size 64 for a total of 100 epochs.

4.5 Fine-Tuning

The pre-trained model was used as a base for fine-tuning on the task-specific data. The first few blocks serve as acoustic feature extraction. Since the acoustic features of Standard German and Swiss German are very close, only the weights of the post-processing blocks as well as the last three or five intermediate blocks were updated, depending on the experiment, as described in Section 5.2. Apart from the frozen blocks, the same hyperparameters were used as for pre-training. The model was trained for another 100 epochs for fine-tuning (see Figure 2 for the Word Error Rate progression over the 100 epochs).

[Figure 2: Word Error Rate progression on the internal development set.]
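In a PyTorch-style framework, such partial fine-tuning boils down to disabling gradients for the frozen blocks. The sketch below illustrates the idea; the attribute names blocks and post_blocks are hypothetical and do not correspond to the actual Jasper implementation.

```python
import torch

def freeze_for_finetuning(model: torch.nn.Module, trainable_blocks: int = 5) -> None:
    """Freeze everything except the post-processing blocks and the last
    `trainable_blocks` intermediate blocks (3 or 5 in our experiments)."""
    # Freeze all parameters first.
    for param in model.parameters():
        param.requires_grad = False
    # Unfreeze the last few intermediate blocks (hypothetical attribute name).
    for block in list(model.blocks)[-trainable_blocks:]:
        for param in block.parameters():
            param.requires_grad = True
    # Unfreeze the post-processing blocks (hypothetical attribute name).
    for param in model.post_blocks.parameters():
        param.requires_grad = True

# The optimizer then only receives the trainable parameters, e.g.:
# optimizer = Novograd((p for p in model.parameters() if p.requires_grad), lr=0.015)
```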
4.6 Performance

The acoustic models were trained on an NVIDIA DGX-1 system. Pre-training with about 540 hours of Standard German took approximately 197 hours using two NVIDIA Tesla V100 GPUs, while fine-tuning the acoustic model (AM-A-5x5-SP) with about 70 hours of Swiss German speech required approximately 21 hours on one V100 GPU. Inference was much faster, taking only about two minutes per 4 hours of speech on an NVIDIA Titan X GPU. Applying the Language Model (LM extended) required some additional computation time; however, both training the LM and decoding with it in combination with beam search took only a few minutes on a recent system.

5 Experiments

We describe the experiments we conducted in order to improve the baseline system in Section 5.2, present the results in Section 5.3, and reflect on them in Section 5.4.

5.1 System Components

The data set provided as part of the Shared Task was split into internal train, development, and test sets. The train set consisted of 32'978 utterances, the development set contained 1'778 utterances, and the test set comprised 1'816 utterances. This split approximates 90% training, 5% development, and 5% testing. A single speaker could not occur in different sets, and utterance lengths were taken into account for splitting.

The experiments consisted of fine-tuning the baseline system with additional text data and, in one case, of applying transcript post-processing.

Acoustic Models. The baseline Acoustic Model (called "AM base" below) was fine-tuned on the internal train set, first on three blocks (model "AM-E 3x5") and in a second version on five blocks (model "AM-E 5x5"). In the last step of Acoustic Model fine-tuning, the baseline model was re-trained on the complete official train set (internal train, development, and test sets combined), which resulted in the model called "AM-A 5x5". Additionally, we trained a model on the internal training set without applying any pre-training (model "AM-NOPRE").

Language Models. The language modelling setup is described in Section 4.3. We used two different Language Models (LMs). The basic Language Model ("LM base") consists of corpora 1-3 in Table 2. Since these corpora are from different domains than the task data, we injected additional data to fine-tune the LM: corpus 4 is a collection of 11'576 press releases by the Federal Chancellery (Bundeskanzlei). These were scraped from https://www.bk.admin.ch/bk/de/home/dokumentation/medienmitteilungen.msg-id-.html using a custom script which queried consecutive message IDs up to the most recent press release and extracted the content using XPath. Corpus 5 consists of the internal training set transcripts. The LM trained on all available corpora (1-5) is referred to as "LM extended".

Table 2: List of text corpora used for training Language Models. The first three corpora were used for the basic Language Model, while the last two were additionally included in the extended LM.

  Name                                    Num. of Sentences
1 News-Commentary (Bojar et al., 2018)    383'764
2 EuroParl (Koehn, 2005)                  1'920'208
3 Tuda-Text (Milde and Köhn, 2018)        7'776'674
4 Federal Chancellery Press Releases      174'520
5 Training set transcripts                32'977

Article Post-processing. During development we noticed a considerable number of errors due to incorrectly predicted articles (e.g. "der", "die", "das"); see Section 5.4 for details. We identified individual definite and indefinite articles in a predicted utterance, removed them, and queried the top 5 predictions of a BERT model (Devlin et al., 2019). If the originally predicted article appeared in the list of suggestions, we kept it. Otherwise, it was replaced by the article scored highest by BERT, making sure not to replace an indefinite article by a definite one or vice versa.
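For illustration, a minimal sketch of this article-correction step using the Hugging Face transformers fill-mask pipeline is shown below. The checkpoint name bert-base-german-cased and the article lists are assumptions for the sketch; our actual implementation may have differed in such details.

```python
from transformers import pipeline

# Illustrative German BERT checkpoint; the exact model used may differ.
fill_mask = pipeline("fill-mask", model="bert-base-german-cased")

DEFINITE = {"der", "die", "das", "dem", "den", "des"}
INDEFINITE = {"ein", "eine", "einem", "einen", "einer", "eines"}

def correct_article(tokens: list, idx: int) -> str:
    """Replace the article at position idx unless BERT's top-5 confirms it.
    Assumes tokens[idx] is a definite or indefinite article."""
    original = tokens[idx].lower()
    kind = DEFINITE if original in DEFINITE else INDEFINITE
    masked = " ".join(tokens[:idx] + [fill_mask.tokenizer.mask_token] + tokens[idx + 1:])
    suggestions = [s["token_str"].strip().lower() for s in fill_mask(masked, top_k=5)]
    if original in suggestions:
        return tokens[idx]  # keep the prediction if BERT agrees
    # Otherwise take the highest-scored suggestion of the same kind, so a
    # definite article stays definite and an indefinite one stays indefinite.
    for s in suggestions:
        if s in kind:
            return s
    return tokens[idx]

tokens = "wir haben die Antrag abgelehnt".split()
tokens[2] = correct_article(tokens, 2)  # e.g. "die" -> "den"
```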
5.2 Experimental Setup

In total, nine experiments were conducted to investigate the performance of the various models. The details of the experiments are presented in Table 3. The very first experiment ("base") was performed on the base model without any fine-tuning or post-processing, while the second one ("AMext3x5") evaluated the predictions of the "AM extended 3x5" model without applying any Language Model. In the third experiment we evaluated the model trained only on the internal Swiss German train set without any pre-training on Standard German ("AMno pretrain"). The next two experiments consisted of introducing and extending the Language Model ("AMch3x5 LMbase" and "AMch3x5 LMext"). Following that, we investigated data augmentation possibilities. In addition to SpecAugment, which is used in all experiments, we applied speed perturbation (Ko et al., 2015) to the Acoustic Model data (model "AMch3x5 sp LMext"). The sixth experiment ("AMch3x5 sp LMext artc") was an attempt to improve the results by performing transcript post-processing. We sought to reduce the number of substitutions resulting from incorrectly predicted articles by applying BERT as described above. In "AMch5x5 sp LMext" we introduced the Acoustic Model "AM extended 5x5", and we replaced it with "AM all 5x5" in the final experiment ("AMall5x5 sp LMext").

5.3 Evaluation

The results of all experiments were evaluated on the internal test set, except for the last one, "AMall5x5 sp LMext", where the internal test set was used for training the model. The five best-performing versions were submitted for evaluation on the public test set of the Shared Task. Table 3 provides an overview of all results.

Eventually, we achieved 40.29% WER on the official test set. Our best-performing system is a combination of the baseline Acoustic Model re-trained on 5 blocks with Swiss German data, speed perturbation, and a Language Model fine-tuned on in-domain data from Switzerland.

Table 3: Experiments overview. Note on Acoustic Models: AM-E-3x5-SP stands for AM extended 3x5 with speed perturbation, and AM-A-5x5-SP for AM all 5x5 with speed perturbation.

System Name             Acoustic Model   Language Model   Post-Processing   WER (internal)   WER (official)
base                    AM base          -                -                 92.1%            -
AMext3x5                AM-E 3x5         -                -                 54.1%            -
AMch3x5 LMbase          AM-E 3x5         LM base          -                 46.46%           -
AMno pretrain LMext     AM-NOPRE SP      LM extended      -                 46.82%           43.52%
AMch3x5 LMext           AM-E 3x5         LM extended      -                 45.52%           42.61%
AMch3x5 sp LMext        AM-E 3x5 SP      LM extended      -                 44.83%           41.76%
AMch3x5 sp LMext artc   AM-E 3x5 SP      LM extended      articles          45.17%           42.2%
AMch5x5 sp LMext        AM-E 5x5 SP      LM extended      -                 44.43%           41.16%
AMall5x5 sp LMext       AM-A 5x5 SP      LM extended      -                 -                40.29%

5.4 Discussion

The two largest performance improvements were achieved by fine-tuning the Acoustic Model on the task-specific data ("AMext3x5" vs "base": WER reduced by 38% absolute) and by using a general-purpose Language Model during decoding ("AMch3x5 LMbase" vs "AMext3x5": WER reduced by 7.64% absolute). Both are standard practices in ASR, and hence these improvements are neither surprising nor particularly insightful.

We identified articles as one distinct source of errors: around one sixth of substitution errors involved articles; hence, we decided to address these during post-processing (model "AMch3x5 sp LMext artc"). Our method using BERT (see Section 5.2) did not improve performance. There are several reasons for this. First, while some articles were indeed improved by this method, often there was insufficient context to accurately determine the correct article. Domain-specific abbreviations (e.g. party names such as SVP, EVP) also proved difficult. Second, we observed a number of article errors that are due to the non-exact nature of the transcription. These are linguistic or stylistic changes and improvements of the spoken text and can therefore not be addressed by our method. Examples include changing a spoken definite article to an indefinite one, using plural instead of singular, transcribing a spoken "es" as "das", or inserting an extra article into a coordinated noun phrase. Finally, there is also a challenge that relates to the specific language variety in this task: articles in Swiss German are rather difficult to detect, as they usually consist of single phonemes which are assimilated to the following noun. This means that articles may be missed at an earlier stage of processing and will not be present in the output passed to the post-processing.

Our extended Language Model brought a nearly 1% absolute WER improvement ("AMch3x5 LMext" vs "AMch3x5 LMbase"), which is less than we expected. However, this can be explained by the rather small amount of additional data: corpora 4 and 5 (see Table 2) only account for 2% of all sentences passed to the LM. Using more in-domain data in the LM could lead to a larger effect.

Further small improvements were obtained by using speed perturbation ("AMch3x5 sp LMext" vs "AMch3x5 LMext": -0.7% absolute on our internal test set and -0.85% on the task test set) and by retraining five Jasper blocks instead of three ("AMch5x5 sp LMext" vs "AMch3x5 sp LMext": -0.4% absolute on our internal test set and -0.6% absolute on the task test set).
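Speed perturbation resamples each training utterance at factors such as 0.9, 1.0, and 1.1 (Ko et al., 2015), effectively tripling the acoustic training data. The following sketch shows how such copies can be generated offline, assuming torchaudio's sox effect bindings are available; it is a generic recipe, not our exact pipeline.

```python
import torchaudio

def speed_perturb(path: str, factors=(0.9, 1.0, 1.1)):
    """Yield a (waveform, sample_rate) pair for each speed factor."""
    waveform, sample_rate = torchaudio.load(path)
    for factor in factors:
        # `speed` changes tempo and pitch; `rate` resamples back to the
        # original sample rate so all copies share one sample rate.
        effects = [["speed", str(factor)], ["rate", str(sample_rate)]]
        perturbed, sr = torchaudio.sox_effects.apply_effects_tensor(
            waveform, sample_rate, effects)
        yield perturbed, sr
```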
We also note that our performance on the task test set is consistently better than on our internal test set.

6 Training Data Challenges

Before we conclude, we would like to reflect on the properties of the task data and their repercussions for WER results.

Our analysis of the errors and the data showed that properties of the data often lead to an increase in WER where the ASR model provides an adequate transcription but is "punished" by data artefacts. We identified the following main issues:

• Transcriptions in the training set are inconsistent with respect to numerals, which are written as either numbers or words, so that transcribing the numeral four as "vier" when the reference transcript has "4" will lead to a substitution error. Since there is no consistency in the writing of numerals (e.g. always using words, always using numbers, using words when smaller than ten, etc.), this leads to errors that we could not prevent.

• Transcripts are polished (e.g. speech disfluencies such as repetitions, hesitations, and false starts are removed) and reformulated so they become more readable, which means they do not exactly represent the spoken text. For example, in training set item 19940.flac, the speaker starts by saying "mer hie enne" (DE: "wir hier drin", EN: "we in here"), but this was transcribed as "wir in diesem Saal" (EN: "we in this chamber"), leading to three errors (two substitutions and one deletion) when the model transcribes the spoken utterance faithfully.

• We also note issues with the segmentation of the audio files, which, according to the task organizers, was performed automatically. This leads to insertion errors (when extra audio is included beyond what is transcribed) or deletion errors (when portions of the audio are missing) that the model cannot mitigate.

Given the observed discrepancies between speech and transcript, additional evaluation measures might be considered. In CL-STT, BLEU scores are used for evaluation. Even though this metric has been criticized, it might fit the setup of this task better, since the paraphrasing might not be unique. Further, measures considering semantics and synonyms (Wieting et al., 2019; Kane et al., 2020) might prove helpful. However, in this specific case of official transcriptions, this would entail re-annotation, the cost of which would be prohibitive.

7 Conclusion

In this paper, we presented our contribution to the Shared Task on Low-Resource STT at GermEval 2020. Our solution consists of a CNN acoustic model based on Jasper (Li et al., 2019) with beam search decoding and CTC loss. Our most successful model uses transfer learning, where we re-train the last five blocks of the Acoustic Model on the task data. Additionally, we use speed perturbation and a Language Model trained on both out-of-domain and in-domain text data. These improvements reduced the WER by over 50% compared to the Standard German baseline system. Our best model achieved a WER of 40.29% on the official task test set, resulting in first place out of three contributions.

References

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. 2019. Common Voice: A Massively-Multilingual Speech Corpus. ArXiv, abs/1912.06670.

Timo Baumann, Arne Köhn, and Felix Hennig. 2018. The Spoken Wikipedia Corpus Collection: Harvesting, Alignment and an Application to Hyperlistening. Language Resources and Evaluation.

Alexandre Bérard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text Translation. arXiv preprint arXiv:1612.01744.

Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 Conference on Machine Translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 272–303, Brussels, Belgium. Association for Computational Linguistics.

Matthias Büchi. 2020. Speech Recognition Component for Search-Oriented Conversational Artificial Intelligence. Master's thesis, ZHAW Zurich University of Applied Sciences.

William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. 2016. Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4960–4964. IEEE.

Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J Weiss, Kanishka Rao, Ekaterina Gonina, et al. 2018. State-of-the-Art Speech Recognition with Sequence-to-Sequence Models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4774–4778. IEEE.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Linhao Dong, Shuang Xu, and Bo Xu. 2018. Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5884–5888. IEEE.

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376.
Kenneth Heafield. 2011. KenLM: Faster and Smaller Language Model Queries. In Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland, United Kingdom.

Navdeep Jaitly, Quoc V Le, Oriol Vinyals, Ilya Sutskever, David Sussillo, and Samy Bengio. 2016. An Online Sequence-to-Sequence Model Using Partial Conditioning. In Advances in Neural Information Processing Systems, pages 5067–5075.

Ye Jia, Melvin Johnson, Wolfgang Macherey, Ron J Weiss, Yuan Cao, Chung-Cheng Chiu, Naveen Ari, Stella Laurenzo, and Yonghui Wu. 2019a. Leveraging Weakly Supervised Data to Improve End-to-End Speech-to-Text Translation. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7180–7184. IEEE.

Ye Jia, Ron J Weiss, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson, Zhifeng Chen, and Yonghui Wu. 2019b. Direct Speech-to-Speech Translation with a Sequence-to-Sequence Model. arXiv preprint arXiv:1904.06037.

Hassan Kane, Muhammed Yusuf Kocyigit, Ali Abdalla, Pelkins Ajanoh, and Mohamed Coulibali. 2020. NUBIA: NeUral Based Interchangeability Assessor for Text Generation.

Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. 2015. Audio Augmentation for Speech Recognition. In INTERSPEECH.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation.

Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M Cohen, Huyen Nguyen, and Ravi Teja Gadde. 2019. Jasper: An End-to-End Convolutional Neural Acoustic Model. arXiv preprint arXiv:1904.03288.

M-AILABS. 2019. M-AILABS Speech Dataset. https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/. Accessed: 2019-12-10.

Benjamin Milde and Arne Köhn. 2018. Open Source Automatic Speech Recognition for German. In Proceedings of ITG 2018.

NVIDIA. 2020. Jasper Source Code. https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechRecognition/Jasper/images/jasper_dense_residual.png. Accessed: 2020-05-14.

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. ArXiv, abs/1904.08779.

Michel Plüss, Lukas Neukom, and Manfred Vogel. 2020a. Forced Alignment of Swiss German Speech to Standard German Text. In preparation.

Michel Plüss, Lukas Neukom, and Manfred Vogel. 2020b. GermEval 2020 Task 4: Low-Resource Speech-to-Text. In preparation.

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukáš Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlíček, Yanmin Qian, Petr Schwarz, Jan Silovský, Georg Stemmer, and Karel Veselý. 2011. The Kaldi Speech Recognition Toolkit. IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.

Hasim Sak, Matt Shannon, Kanishka Rao, and Françoise Beaufays. 2017. Recurrent Neural Aligner: An Encoder-Decoder Neural Network Model for Sequence to Sequence Mapping. In Interspeech, volume 8, pages 1298–1302.

Lahiru Samarakoon, Brian Mak, and Albert YS Lam. 2018. Domain Adaptation of End-to-End Speech Recognition in Low-Resource Settings. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 382–388. IEEE.

Beat Siebenhaar. 2013. Sprachgeographische Aspekte der Morphologie und Verschriftung in schweizerdeutschen Chats.

Beat Siebenhaar and Alfred Wyler. 1997. Dialekt und Hochsprache in der deutschsprachigen Schweiz. Pro Helvetia.

Mihaela C Stoian, Sameer Bansal, and Sharon Goldwater. 2019. Analyzing ASR Pretraining for Low-Resource Speech-to-Text Translation. arXiv preprint arXiv:1910.10762.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems, pages 5998–6008.
VoxForge. 2019. VoxForge. http://www.voxforge.org/de. Accessed: 2019-12-10.

John Wieting, Taylor Berg-Kirkpatrick, Kevin Gimpel, and Graham Neubig. 2019. Beyond BLEU: Training Neural Machine Translation with Semantic Similarity. arXiv preprint arXiv:1909.06694.

Klaus Zechner and Alex Waibel. 2000. Minimizing Word Error Rate in Textual Summaries of Spoken Language. In 1st Meeting of the North American Chapter of the Association for Computational Linguistics.

Neil Zeghidour, Qiantong Xu, Vitaliy Liptchinsky, Nicolas Usunier, Gabriel Synnaeve, and Ronan Collobert. 2018. Fully Convolutional Speech Recognition. arXiv preprint arXiv:1812.06864.