2nd Swiss German Speech to Standard German Text Shared Task at SwissText 2022

Michel Plüss¹, Yanick Schraner¹·*, Christian Scheller¹ and Manfred Vogel¹·*

¹ Institute for Data Science, University of Applied Sciences and Arts Northwestern Switzerland, Windisch, Switzerland
* Corresponding authors: yanick.schraner@fhnw.ch (Y. Schraner), manfred.vogel@fhnw.ch (M. Vogel)

Abstract

We present the results and findings of the 2nd Swiss German speech to Standard German text shared task at SwissText 2022. Participants were asked to build a sentence-level Swiss German speech to Standard German text system specialized in the Grisons dialect. The objective was to maximize the BLEU score on a test set of Grisons speech. Three teams participated, with the best-performing system achieving a BLEU score of 70.1.

1. Introduction

The topic of this task is automatic speech recognition (ASR) for Swiss German. Swiss German is a family of German dialects spoken in Switzerland, see Plüss et al. [1]. Swiss German ASR is concerned with the transcription of Swiss German speech to Standard German text and can be viewed as a speech translation task with similar source and target languages, see Plüss et al. [1].

This task has two predecessors. The 2020 task [2] provided a 70-hour labeled training set of automatically aligned Swiss German speech (predominantly Bernese dialect) and Standard German text. The test set also comprised mostly Bernese speech. The winning contribution by Büchi et al. [3] achieved a word error rate (WER) of 40.3 %. The 2021 task [1] provided an improved and extended 293-hour version of the 2020 training set, as well as a 1208-hour unlabeled speech dataset (predominantly Zurich dialect). The test set covered a large part of the Swiss German dialect landscape. The winning contribution by Arabskyy et al. [4] achieved a BLEU score [5] of 46.0.

The goal of this task is to build a system able to translate Swiss German speech to Standard German text and to optimize it for the Grisons dialect. To enable this, we provide the Swiss German labeled datasets SDS-200 [6] and SwissDial [7], both including a substantial amount of Grisons speech, as well as the Standard German, French, and Italian labeled datasets of Common Voice 9.0 [8].

2. Task Description

The goal of the task is to build a sentence-level Swiss German speech to Standard German text system specialized in the Grisons dialect. The submission with the best BLEU score on a test set of Grisons dialect speakers wins. Participants were encouraged to explore suitable transfer learning and fine-tuning approaches based on the Swiss German, Standard German, French, and Italian data provided.

2.1. Data

We provide 5 different training datasets to participants, all of which are collections of sentence-level transcribed speech. SDS-200 [6] is a Swiss German dataset with 200 hours of speech from all major Swiss German dialect regions, of which 6 hours are in Grisons dialect. SwissDial [7] is a Swiss German dataset with 34 hours of speech from all major Swiss German dialect regions, of which 11 hours are in Grisons dialect. From version 9.0 of the Common Voice project [8], we provide 1166 hours of Standard German, 926 hours of French, and 340 hours of Italian, all of which are official languages of Switzerland.

The test set was collected in a similar fashion to SDS-200 [6]. It consists of 5 hours of sentence-level transcribed Grisons speech by 11 speakers, of which 8 are female and 3 are male. The set is divided into two equally sized parts: a public part, whose score was displayed in the public ranking while the task was running, and a private part, on which the final ranking is based and which was not available while the task was running. Two thirds of the texts are from Swiss newspapers and one third is from the minutes of parliament debates in Aarau and Wettingen. Care was taken to avoid any overlap between the Swiss newspaper sentences in this test set and the ones in SDS-200 [6].

2.2. Evaluation

The submissions are evaluated using the BLEU score [5]. Our evaluation script, which uses the NLTK [9] BLEU implementation, is open-source (https://github.com/i4Ds/swisstext-2022-swiss-german-shared-task). The private part of the test set is used for the final ranking.

The test set contains the characters a-z, ä, ö, ü, 0-9, and spaces, and the participants' models should support exactly these. Punctuation and casing are ignored for the evaluation. Numbers are not used consistently in the test set, so sometimes they are written as digits and sometimes they are spelled out. We create a second reference by automatically spelling out all numbers and use both the original and this adjusted reference in the BLEU score calculation. Participants were advised to have their models always spell out numbers. All other characters are removed from the submission (see the evaluation script for details). Participants were therefore advised to replace each additional character in their training set with a sensible replacement.
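As an illustration, the following is a minimal sketch of such a dual-reference BLEU evaluation using the NLTK implementation. It is not the official evaluation script linked above; in particular, the exact normalization and the number spell-out via the num2words package are simplifying assumptions made for this example.

```python
import re

from nltk.translate.bleu_score import corpus_bleu
from num2words import num2words  # assumption: any German number speller would do

# Keep only the characters allowed in the test set: a-z, ä, ö, ü, 0-9 and spaces.
_DISALLOWED = re.compile(r"[^a-zäöü0-9 ]")


def normalize(text: str) -> str:
    """Lowercase the text, drop all disallowed characters, collapse whitespace."""
    return " ".join(_DISALLOWED.sub(" ", text.lower()).split())


def spell_out_numbers(text: str) -> str:
    """Build the second reference by spelling out every digit sequence in German."""
    return re.sub(r"\d+", lambda m: num2words(int(m.group()), lang="de"), text)


def evaluate(hypotheses: list[str], references: list[str]) -> float:
    """Corpus BLEU with two references per sentence: original and spelled-out."""
    refs = [
        [normalize(ref).split(), normalize(spell_out_numbers(ref)).split()]
        for ref in references
    ]
    hyps = [normalize(hyp).split() for hyp in hypotheses]
    return corpus_bleu(refs, hyps)


# A hypothesis that spells out the numbers matches the second reference exactly.
print(evaluate(["er kauft drei äpfel und zwei birnen"],
               ["Er kauft 3 Äpfel und 2 Birnen."]))  # 1.0
```

The second reference is what makes a system that always spells out numbers, as advised above, insensitive to the inconsistent number format of the test set.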
3. Results

Three teams participated in the shared task, including our baseline. Table 1 shows the final ranking.

Rank  Team           BLEU
1     Baseline       70.1
2     Stucki et al.  68.1
3     Nafisi et al.  55.3

Table 1: Final ranking of the shared task. The BLEU column shows the BLEU score on the private 50 % of the test set.

Our baseline achieves a BLEU score of 70.1. We use the Transformer Baseline model described in Plüss et al. [6]. We train the model from scratch on SDS-200, SwissDial, and the Standard German part of Common Voice. Contrary to Plüss et al. [6], we employ a Transformer-based language model (LM) with 12 decoder layers, 16 attention heads, an embedding dimension of 512, and a fully connected layer with 1024 units. The LM is trained on 67M Standard German sentences. We use a beam width of 60 during decoding. The same model achieves 65.3 BLEU on the 2021 task test set [1].
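To make this LM configuration concrete, the following is a minimal PyTorch sketch of a decoder-only Transformer language model with the dimensions stated above (12 layers, 16 attention heads, embedding dimension 512, feed-forward dimension 1024). This is not our training code: the class name, the learned positional embeddings, and the vocabulary size are illustrative assumptions. In the actual system, the LM scores are combined with the translation model's scores during beam search with a beam width of 60.

```python
import torch
import torch.nn as nn


class TransformerLM(nn.Module):
    """Decoder-only Transformer LM: 12 layers, 16 heads, d_model 512, FFN 1024."""

    def __init__(self, vocab_size: int, max_len: int = 512):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, 512)
        self.pos_embed = nn.Embedding(max_len, 512)  # learned positional embeddings
        layer = nn.TransformerEncoderLayer(
            d_model=512, nhead=16, dim_feedforward=1024, batch_first=True
        )
        self.layers = nn.TransformerEncoder(layer, num_layers=12)
        self.out_proj = nn.Linear(512, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        seq_len = tokens.size(1)
        positions = torch.arange(seq_len, device=tokens.device)
        # Causal mask: position i may only attend to positions <= i.
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=tokens.device),
            diagonal=1,
        )
        hidden = self.layers(
            self.token_embed(tokens) + self.pos_embed(positions), mask=causal_mask
        )
        return self.out_proj(hidden)  # next-token logits used for LM scoring


# Example: next-token logits for 2 sequences over a hypothetical 10k-subword vocabulary.
lm = TransformerLM(vocab_size=10_000)
logits = lm(torch.randint(0, 10_000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 10000])
```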
Stucki et al. achieve a BLEU score of 68.1. They use an XLS-R 1B model [10], pre-trained on 436K hours of unlabeled speech in 128 languages, not including Swiss German. They fine-tune the model on SDS-200 and SwissDial. A KenLM 5-gram LM [11] trained on the German Wikipedia is employed.

Nafisi et al. achieve a BLEU score of 55.3. They use an XLS-R 1B model [10], pre-trained on 436K hours of unlabeled speech in 128 languages, not including Swiss German. They fine-tune the model on SDS-200. No LM is employed.

4. Conclusion

We have described the 2nd Swiss German speech to Standard German text shared task at SwissText 2022. The best-performing system on the Grisons speech test set is our baseline with a BLEU score of 70.1. The same system achieves a BLEU score of 65.3 on the 2021 task test set [1], a relative improvement of 42 % over the highest score of the 2021 task. This highlights the large progress in the field over the last year. The main drivers of this progress seem to be the new dataset SDS-200 [6] as well as the use of models pre-trained on large amounts of unlabeled speech, as demonstrated by the teams Stucki et al. and Nafisi et al., who employed XLS-R models [10]. The addition of an LM seems to be especially important for XLS-R models. The main difference between Nafisi et al. and Stucki et al. is that the latter add an LM, leading to a relative improvement of 23 % BLEU.

On the other hand, none of the three participating teams made a significant effort to optimize their system for the Grisons dialect. The best approach to create an ASR system optimized for a specific dialect remains to be found in future work. Incorporating the provided French and Italian data for training is another possible direction for future research.

References

[1] M. Plüss, L. Neukom, M. Vogel, SwissText 2021 Task 3: Swiss German Speech to Standard German Text, in: Proceedings of the Swiss Text Analytics Conference 2021, 2021.
[2] M. Plüss, L. Neukom, M. Vogel, GermEval 2020 Task 4: Low-Resource Speech-to-Text, in: Proceedings of the 5th Swiss Text Analytics Conference (SwissText) & 16th Conference on Natural Language Processing (KONVENS), 2020.
[3] M. Büchi, M. A. Ulasik, M. Hürlimann, F. Benites, P. von Däniken, M. Cieliebak, ZHAW-InIT at GermEval 2020 Task 4: Low-Resource Speech-to-Text, in: Proceedings of the 5th Swiss Text Analytics Conference (SwissText) & 16th Conference on Natural Language Processing (KONVENS), 2020.
[4] Y. Arabskyy, A. Agarwal, S. Dey, O. Koller, Dialectal Speech Recognition and Translation of Swiss German Speech to Standard German Text: Microsoft's Submission to SwissText 2021, in: Proceedings of the Swiss Text Analytics Conference 2021, 2021.
[5] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a Method for Automatic Evaluation of Machine Translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002.
[6] M. Plüss, M. Hürlimann, M. Cuny, A. Stöckli, N. Kapotis, J. Hartmann, M. A. Ulasik, C. Scheller, Y. Schraner, A. Jain, J. Deriu, M. Cieliebak, M. Vogel, SDS-200: A Swiss German Speech to Standard German Text Corpus, in: Proceedings of the Language Resources and Evaluation Conference, 2022.
[7] P. Dogan-Schönberger, J. Mäder, T. Hofmann, SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German, CoRR abs/2103.11401 (2021). URL: https://arxiv.org/abs/2103.11401.
[8] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, G. Weber, Common Voice: A Massively-Multilingual Speech Corpus, in: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 2020.
[9] S. Bird, E. Klein, E. Loper, Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit, O'Reilly Media, Inc., 2009.
[10] A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, M. Auli, XLS-R: Self-Supervised Cross-Lingual Speech Representation Learning at Scale, arXiv abs/2111.09296 (2021).
[11] K. Heafield, KenLM: Faster and Smaller Language Model Queries, in: Proceedings of the Sixth Workshop on Statistical Machine Translation, 2011.