2nd Swiss German Speech to Standard German Text Shared Task at SwissText 2022

Michel Plüss¹, Yanick Schraner¹·*, Christian Scheller¹ and Manfred Vogel¹·*

¹ Institute for Data Science, University of Applied Sciences and Arts Northwestern Switzerland, Windisch, Switzerland
* Corresponding authors: yanick.schraner@fhnw.ch (Y. Schraner), manfred.vogel@fhnw.ch (M. Vogel)

Abstract

We present the results and findings of the 2nd Swiss German speech to Standard German text shared task at SwissText 2022. Participants were asked to build a sentence-level Swiss German speech to Standard German text system specialized in the Grisons dialect. The objective was to maximize the BLEU score on a test set of Grisons speech. Three teams participated, with the best-performing system achieving a BLEU score of 70.1.

1. Introduction

The topic of this task is automatic speech recognition (ASR) for Swiss German. Swiss German is a family of German dialects spoken in Switzerland, see Plüss et al. [1]. Swiss German ASR is concerned with the transcription of Swiss German speech to Standard German text and can be viewed as a speech translation task with similar source and target languages, see Plüss et al. [1].

This task has two predecessors. The 2020 task [2] provided a 70-hour labeled training set of automatically aligned Swiss German speech (predominantly Bernese dialect) and Standard German text. The test set also comprised mostly Bernese speech. The winning contribution by Büchi et al. [3] achieved a word error rate (WER) of 40.3 %. The 2021 task [1] provided an improved and extended 293-hour version of the 2020 training set, as well as a 1208-hour unlabeled speech dataset (predominantly Zurich dialect). The test set covered a large part of the Swiss German dialect landscape. The winning contribution by Arabskyy et al. [4] achieved a BLEU score [5] of 46.0.

The goal of this task is to build a system able to translate Swiss German speech to Standard German text and to optimize it for the Grisons dialect. To enable this, we provide the Swiss German labeled datasets SDS-200 [6] and SwissDial [7], both including a substantial amount of Grisons speech, as well as the Standard German, French, and Italian labeled datasets of Common Voice 9.0 [8].

2. Task Description

The goal of the task is to build a sentence-level Swiss German speech to Standard German text system specialized in the Grisons dialect. The submission with the best BLEU score on a test set of Grisons dialect speakers wins. Participants were encouraged to explore suitable transfer learning and fine-tuning approaches based on the Swiss German, Standard German, French, and Italian data provided.

2.1. Data

We provide 5 different training datasets to participants, all of which are collections of sentence-level transcribed speech. SDS-200 [6] is a Swiss German dataset with 200 hours of speech from all major Swiss German dialect regions, of which 6 hours are in Grisons dialect. SwissDial [7] is a Swiss German dataset with 34 hours of speech from all major Swiss German dialect regions, of which 11 hours are in Grisons dialect. From version 9.0 of the Common Voice project [8], we provide 1166 hours of Standard German, 926 hours of French, and 340 hours of Italian, all of which are official languages of Switzerland.

The test set was collected in a similar fashion to SDS-200 [6]. It consists of 5 hours of sentence-level transcribed Grisons speech by 11 speakers, of which 8 are female and 3 are male. The set is divided into two equally sized parts: a public part, whose score was displayed in the public ranking while the task was running, and a private part, on which the final ranking is based and which was not available while the task was running. Two thirds of the texts are from Swiss newspapers and one third is from the minutes of parliament debates in Aarau and Wettingen. Care was taken to avoid any overlap between the Swiss newspaper sentences in this test set and the ones in SDS-200 [6].

2.2. Evaluation

The submissions are evaluated using the BLEU score [5]. Our evaluation script, which uses the NLTK [9] BLEU implementation, is open-source (https://github.com/i4Ds/swisstext-2022-swiss-german-shared-task). The private part of the test set is used for the final ranking.

The test set contains the characters a-z, ä, ö, ü, 0-9, and spaces, and the participants' models should support exactly these. Punctuation and casing are ignored for the evaluation. Numbers are not used consistently in the test set, so sometimes they are written as digits and sometimes they are spelled out. We create a second reference by automatically spelling out all numbers and use both the original and this adjusted reference in the BLEU score calculation. Participants were advised to have their models always spell out numbers. All other characters are removed from the submission (see the evaluation script for details). Participants were therefore advised to replace each additional character in their training set with a sensible replacement.
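As an illustration, the following is a minimal sketch of such a dual-reference BLEU evaluation using the NLTK implementation. It is not the official evaluation script linked above; in particular, the exact normalization and the number spell-out via the num2words package are simplifying assumptions made for this example.

```python
import re

from nltk.translate.bleu_score import corpus_bleu
from num2words import num2words  # assumption: any German number speller would do

# Keep only the characters allowed in the test set: a-z, ä, ö, ü, 0-9 and spaces.
_DISALLOWED = re.compile(r"[^a-zäöü0-9 ]")


def normalize(text: str) -> str:
    """Lowercase the text, drop all disallowed characters, collapse whitespace."""
    return " ".join(_DISALLOWED.sub(" ", text.lower()).split())


def spell_out_numbers(text: str) -> str:
    """Build the second reference by spelling out every digit sequence in German."""
    return re.sub(r"\d+", lambda m: num2words(int(m.group()), lang="de"), text)


def evaluate(hypotheses: list[str], references: list[str]) -> float:
    """Corpus BLEU with two references per sentence: original and spelled-out."""
    refs = [
        [normalize(ref).split(), normalize(spell_out_numbers(ref)).split()]
        for ref in references
    ]
    hyps = [normalize(hyp).split() for hyp in hypotheses]
    return corpus_bleu(refs, hyps)


# A hypothesis that spells out the numbers matches the second reference exactly.
print(evaluate(["er kauft drei äpfel und zwei birnen"],
               ["Er kauft 3 Äpfel und 2 Birnen."]))  # 1.0
```

The second reference is what makes a system that always spells out numbers, as advised above, insensitive to the inconsistent number format of the test set.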
3. Results

Three teams participated in the shared task, including our baseline. Table 1 shows the final ranking.

Rank  Team           BLEU
1     Baseline       70.1
2     Stucki et al.  68.1
3     Nafisi et al.  55.3

Table 1: Final ranking of the shared task. The BLEU column shows the BLEU score on the private 50 % of the test set.

Our baseline achieves a BLEU score of 70.1. We use the Transformer Baseline model described in Plüss et al. [6]. We train the model from scratch on SDS-200, SwissDial, and the Standard German part of Common Voice. Contrary to Plüss et al. [6], we employ a Transformer-based language model (LM) with 12 decoder layers, 16 attention heads, an embedding dimension of 512, and a fully connected layer with 1024 units. The LM is trained on 67M Standard German sentences. We use a beam width of 60 during decoding. The same model achieves 65.3 BLEU on the 2021 task test set [1].
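To make this LM configuration concrete, the following is a minimal PyTorch sketch of a decoder-only Transformer language model with the dimensions stated above (12 layers, 16 attention heads, embedding dimension 512, feed-forward dimension 1024). This is not our training code: the class name, the learned positional embeddings, and the vocabulary size are illustrative assumptions. In the actual system, the LM scores are combined with the translation model's scores during beam search with a beam width of 60.

```python
import torch
import torch.nn as nn


class TransformerLM(nn.Module):
    """Decoder-only Transformer LM: 12 layers, 16 heads, d_model 512, FFN 1024."""

    def __init__(self, vocab_size: int, max_len: int = 512):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, 512)
        self.pos_embed = nn.Embedding(max_len, 512)  # learned positional embeddings
        layer = nn.TransformerEncoderLayer(
            d_model=512, nhead=16, dim_feedforward=1024, batch_first=True
        )
        self.layers = nn.TransformerEncoder(layer, num_layers=12)
        self.out_proj = nn.Linear(512, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        seq_len = tokens.size(1)
        positions = torch.arange(seq_len, device=tokens.device)
        # Causal mask: position i may only attend to positions <= i.
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=tokens.device),
            diagonal=1,
        )
        hidden = self.layers(
            self.token_embed(tokens) + self.pos_embed(positions), mask=causal_mask
        )
        return self.out_proj(hidden)  # next-token logits used for LM scoring


# Example: next-token logits for 2 sequences over a hypothetical 10k-subword vocabulary.
lm = TransformerLM(vocab_size=10_000)
logits = lm(torch.randint(0, 10_000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 10000])
```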
Stucki et al. achieve a BLEU score of 68.1. They use an XLS-R 1B model [10], pre-trained on 436K hours of unlabeled speech in 128 languages, not including Swiss German. They fine-tune the model on SDS-200 and SwissDial. A KenLM 5-gram LM [11] trained on the German Wikipedia is employed.

Nafisi et al. achieve a BLEU score of 55.3. They use an XLS-R 1B model [10], pre-trained on 436K hours of unlabeled speech in 128 languages, not including Swiss German. They fine-tune the model on SDS-200. No LM is employed.

4. Conclusion

We have described the 2nd Swiss German speech to Standard German text shared task at SwissText 2022. The best-performing system on the Grisons speech test set is our baseline with a BLEU score of 70.1. The same system achieves a BLEU score of 65.3 on the 2021 task test set [1], a relative improvement of 42 % over the highest score of the 2021 task. This highlights the large progress in the field over the last year. The main drivers of this progress seem to be the new dataset SDS-200 [6] as well as the use of models pre-trained on large amounts of unlabeled speech, as demonstrated by the teams Stucki et al. and Nafisi et al., who employed XLS-R models [10]. The addition of an LM seems to be especially important for XLS-R models. The main difference between Nafisi et al. and Stucki et al. is that the latter add an LM, leading to a relative improvement of 23 % BLEU.

On the other hand, none of the three participating teams made a significant effort to optimize their system for the Grisons dialect. The best approach to create an ASR system optimized for a specific dialect remains to be found in future work. Incorporating the provided French and Italian data for training is another possible direction for future research.

References

[1] M. Plüss, L. Neukom, M. Vogel, SwissText 2021 Task 3: Swiss German Speech to Standard German Text, in: Proceedings of the Swiss Text Analytics Conference 2021, 2021.
[2] M. Plüss, L. Neukom, M. Vogel, GermEval 2020 Task 4: Low-Resource Speech-to-Text, in: Proceedings of the 5th Swiss Text Analytics Conference (SwissText) & 16th Conference on Natural Language Processing (KONVENS), 2020.
[3] M. Büchi, M. A. Ulasik, M. Hürlimann, F. Benites, P. von Däniken, M. Cieliebak, ZHAW-InIT at GermEval 2020 Task 4: Low-Resource Speech-to-Text, in: Proceedings of the 5th Swiss Text Analytics Conference (SwissText) & 16th Conference on Natural Language Processing (KONVENS), 2020.
[4] Y. Arabskyy, A. Agarwal, S. Dey, O. Koller, Dialectal Speech Recognition and Translation of Swiss German Speech to Standard German Text: Microsoft's Submission to SwissText 2021, in: Proceedings of the Swiss Text Analytics Conference 2021, 2021.
[5] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a Method for Automatic Evaluation of Machine Translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002.
[6] M. Plüss, M. Hürlimann, M. Cuny, A. Stöckli, N. Kapotis, J. Hartmann, M. A. Ulasik, C. Scheller, Y. Schraner, A. Jain, J. Deriu, M. Cieliebak, M. Vogel, SDS-200: A Swiss German Speech to Standard German Text Corpus, in: Proceedings of the Language Resources and Evaluation Conference, 2022.
[7] P. Dogan-Schönberger, J. Mäder, T. Hofmann, SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German, CoRR abs/2103.11401 (2021). URL: https://arxiv.org/abs/2103.11401.
[8] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, G. Weber, Common Voice: A Massively-Multilingual Speech Corpus, in: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 2020.
[9] S. Bird, E. Klein, E. Loper, Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit, O'Reilly Media, Inc., 2009.
[10] A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, M. Auli, XLS-R: Self-Supervised Cross-Lingual Speech Representation Learning at Scale, arXiv abs/2111.09296 (2021).
[11] K. Heafield, KenLM: Faster and Smaller Language Model Queries, in: Proceedings of the Sixth Workshop on Statistical Machine Translation, 2011.