=Paper=
{{Paper
|id=Vol-2957/sg_paper1
|storemode=property
|title=SwissText 2021 Task 3: Swiss German Speech to Standard German Text (short paper)
|pdfUrl=https://ceur-ws.org/Vol-2957/sg_paper1.pdf
|volume=Vol-2957
|authors=Michel Plüss,Lukas Neukom,Manfred Vogel
|dblpUrl=https://dblp.org/rec/conf/swisstext/PlussNV21
}}
==SwissText 2021 Task 3: Swiss German Speech to Standard German Text (short paper)==
SwissText 2021 Task 3: Swiss German Speech to Standard German Text
Michel Plüss, Lukas Neukom, Manfred Vogel
Institute for Data Science
University of Applied Sciences and Arts Northwestern Switzerland
Windisch, Switzerland
michel.pluess@fhnw.ch
Abstract

We present the results and findings of SwissText 2021 Task 3 on Swiss German Speech to Standard German Text. Participants were asked to build a system translating Swiss German speech to Standard German text. The objective was to maximize the BLEU score on a new test set covering a large part of the Swiss German dialect landscape. Four teams participated, with the winning contribution achieving a BLEU score of 46.0.

1 Introduction

Swiss German is a family of dialects spoken by around five million people in Switzerland. It is different from Standard German regarding phonetics, vocabulary, morphology, and syntax. Swiss German is mostly a spoken language. While it is also used in writing, particularly in informal text messages, it lacks a standardized writing system. This leads to difficulties for automated text processing, such as spelling ambiguities and a huge vocabulary size. Therefore, most use cases for a Swiss German speech-to-text (STT) system require Standard German text as output. This can be viewed as a speech translation problem with similar source and target languages. For example, the Swiss German sentence "Ide Abfahrt hetter de sächsti Platz beleit" can be translated to the Standard German sentence "In der Abfahrt belegte er den sechsten Platz". Here, the sentence structure is very similar, but the past tense changes in Standard German.

Speech-to-text systems for well-resourced languages like English or Standard German work very well. Zhang et al. (2020) set the current state of the art on the popular LibriSpeech test-other benchmark (Panayotov et al., 2015) with a word error rate (WER) of 2.6 %. In comparison, the 2020 shared task on Swiss German STT (Plüss et al., 2020), this task's predecessor, was won by Büchi et al. (2020) with a WER of 40.3 %.

The goal of this task is to spur further progress in the field of Swiss German STT by providing a larger labeled training set, an additional unlabeled training set, and a test set with a dialect distribution similar to the real distribution of Swiss German dialects in Switzerland.

The remainder of this paper is structured as follows: the task, the data, and the evaluation of submissions are described in section 2. An overview of the submissions and results of this task can be found in section 3. Section 4 wraps up the paper and gives directions for future work.

2 Task Description

The objective of the task is to build a sentence-level Swiss German speech to Standard German text speech translation system. The submission with the best BLEU score (Papineni et al., 2002) wins. Participants were encouraged to explore and combine suitable supervised, semi-supervised, and unsupervised learning approaches.

2.1 Data

We provide two training datasets. The first one is the Swiss Parliaments Corpus (Plüss et al., 2021), a labeled 293-hour dataset of Swiss German debates from the Grosser Rat Kanton Bern parliament with corresponding Standard German sentence-level transcriptions [1]. The second one is an unlabeled collection of 1208 hours of Swiss German

[1] https://www.cs.technik.fhnw.ch/i4ds-datasets

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
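Submissions are ranked by BLEU, and the organizers' open-source evaluation script builds on NLTK's BLEU implementation (section 2.2). The sketch below illustrates the kind of normalization and scoring involved under simplifying assumptions: lowercasing, keeping only a-z, ä, ö, ü, and spaces, then computing corpus-level BLEU. The example sentences are hypothetical, and the official script remains the authoritative reference for the exact preprocessing.

```python
import re

from nltk.translate.bleu_score import corpus_bleu

def normalize(text: str) -> str:
    # Lowercase, drop every character outside a-z, ä, ö, ü, and space,
    # then collapse runs of spaces. This is a simplified approximation
    # of the preprocessing described in section 2.2, not the official script.
    text = re.sub(r"[^a-zäöü ]", "", text.lower())
    return re.sub(r" +", " ", text).strip()

# Hypothetical reference / hypothesis pair, not taken from the task data.
reference = normalize("In der Abfahrt belegte er den sechsten Platz.")
hypothesis = normalize("In der Abfahrt belegte er den sechsten Platz.")

# corpus_bleu takes, per segment, a list of tokenized references and one
# tokenized hypothesis; scores lie in [0, 1].
score = corpus_bleu([[reference.split()]], [hypothesis.split()])
print(round(100 * score, 1))  # 100.0 for an exact match
```

Whether scoring is corpus-level or averaged per sentence, and how edge cases are handled, is defined by the open-source evaluation script; the above only illustrates the mechanics.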
Figure 1: Comparison of the dialect prevalence in Switzerland's German-speaking population with the All Swiss German Dialects Test Set. To make this comparison possible, a dialect is defined as the average dialect spoken in a canton.
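The closeness of the two per-canton distributions in Figure 1 can be summarized in a single number, for example their total variation distance. The shares below are invented placeholders for illustration only; the actual values are the ones plotted in Figure 1.

```python
# Hypothetical per-canton shares (fractions summing to 1). These numbers
# are illustrative placeholders, NOT the values from Figure 1.
population = {"ZH": 0.28, "BE": 0.18, "SG": 0.09, "AG": 0.12, "LU": 0.08, "other": 0.25}
test_set = {"ZH": 0.20, "BE": 0.24, "SG": 0.13, "AG": 0.12, "LU": 0.07, "other": 0.24}

# Total variation distance: 0 for identical distributions, 1 for disjoint ones.
cantons = population.keys() | test_set.keys()
tvd = 0.5 * sum(abs(population.get(c, 0.0) - test_set.get(c, 0.0)) for c in cantons)
print(f"TVD = {tvd:.2f}")  # TVD = 0.10 for these placeholder shares
```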
debates from the Gemeinderat Zürich parliament [2]. The use of additional datasets is allowed, but has to be declared in the system description.

The test set created for this task, the All Swiss German Dialects Test Set, contains 13 hours of sentence-level Swiss German speech and Standard German text pairs [3]. The set is divided into two equally sized parts: a public part (the score on this part was displayed in the public ranking while the task was running) and a private part (the final ranking is based on this part; it was not available while the task was running). The texts are from the Common Voice project [4] and were spoken by 178 speakers from all over Switzerland. The set covers a large part of the Swiss German dialect landscape. Figure 1 compares the test set dialect distribution with the real distribution of Swiss German dialects in Switzerland. The comparison highlights the good match between the test set dialect distribution and the real distribution. There are some exceptions, e.g. there is no data from the cantons AI, AR, and OW due to their small size. Also, BE and SG speakers are overrepresented whereas ZH speakers are underrepresented. There was no distinction made between BL and BS during the collection of the dialect metadata for the test set; BS speakers are therefore included in BL.

2.2 Evaluation

The submissions are evaluated using the BLEU score (Papineni et al., 2002). Our evaluation script, which uses the NLTK (Bird et al., 2009) BLEU implementation, is open-source [5]. The private part of the test set is used for the final ranking. The test set contains the characters a-z, ä, ö, ü, and spaces, and the participants' models should support exactly these. Punctuation and casing are ignored for the evaluation. Numbers are spelled out. All other characters are removed from the submission (see the evaluation script for details). Participants were therefore advised to replace each additional character in their training set with a sensible replacement.

3 Results

Four teams participated in the shared task. Table 1 shows the final ranking.

  Rank  Team             BLEU
  1     Arabskyy et al.  46.0
  2     Plüss et al.     41.0
  3     Ulasik et al.    39.4
  4     DeJa             17.1

Table 1: Final ranking of the shared task. The BLEU column shows the BLEU score on the private 50 % of the All Swiss German Dialects Test Set.

The team in first place, Arabskyy et al. (2021), achieved a BLEU score of 46.0. They use a hybrid system with a lexicon that incorporates translations, a first-pass language model that deals with Swiss German particularities, an acoustic model transfer-learned from a large Standard German dataset, and a strong neural language model for second-pass rescoring.

Our baseline ranks second with 41.0 BLEU. The system is described in (Plüss et al., 2021), section 5. We train an end-to-end Conformer (Gulati et al., 2020) model using a hybrid CTC/attention encoder-decoder framework. The training data consists of the Swiss Parliaments Corpus (Plüss et al., 2021), an additional 250-hour corpus of automatically aligned Swiss German parliament debates, and the Standard German Common Voice corpus (Ardila et al., 2019).

The team in third place, Ulasik et al. (2021), achieved a BLEU score of 39.4. Their approach combines three models trained on multilingual, Standard German, and Swiss German data using ensembling.

The team called DeJa ranked fourth and achieved a BLEU score of 17.1. We have not received a system description for this submission.

4 Conclusion

We have described SwissText 2021 Task 3 on Swiss German Speech to Standard German Text. Submissions were evaluated on the All Swiss German Dialects Test Set, which we introduced in this work. It covers a large part of the Swiss German dialect landscape. Four teams participated in the task, with the winning team reaching a BLEU score of 46.0. The results are hard to compare to those of this task's predecessor, GermEval 2020 Task 4 (Plüss et al., 2020), due to the different test set and metric. Last year's winning contribution achieved a WER of 40.3 %. In our experiments in (Plüss et al., 2021), ranking second in this year's task, we achieved a WER of 27.8 % on a test set comparable to GermEval 2020 Task 4. The relative improvement of 31 % indicates that a lot of progress has been made in the field of Swiss German STT over the past year.

Despite recent advances in semi-supervised and unsupervised learning for STT, see e.g. (Park et al., 2020) and (Baevski et al., 2020), none of the participants made use of the provided unlabeled training set. This seems to be a promising direction for further improvements of Swiss German STT, given that the amount of available labeled training data is still comparatively small.

Acknowledgments

We thank our participants for their interest in the shared task, for their participation, and for their timely feedback, which have helped us make this task a success.

We also thank Elias Schorr for his great work on the submission and evaluation website.

[2] https://www.cs.technik.fhnw.ch/i4ds-datasets
[3] https://www.cs.technik.fhnw.ch/i4ds-datasets
[4] https://github.com/common-voice/common-voice/tree/main/server/data/de
[5] https://github.com/i4Ds/swisstext-2021-task-3

References

Yuriy Arabskyy, Aashish Agarwal, Subhadeep Dey, and Oscar Koller. 2021. Dialectal speech recognition and translation of Swiss German speech to Standard German text: Microsoft's submission to SwissText 2021. In preparation.

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. 2019. Common Voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670.

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc.

Matthias Büchi, Malgorzata Anna Ulasik, Manuela Hürlimann, Fernando Benites, Pius von Däniken, and Mark Cieliebak. 2020. ZHAW-InIT at GermEval 2020 Task 4: Low-resource speech-to-text. In SwissText & KONVENS 2020, Proceedings of the 5th Swiss Text Analytics Conference (SwissText) & 16th Conference on Natural Language Processing (KONVENS).

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-augmented Transformer for speech recognition. In Proceedings of Interspeech, pages 5036–5040.

V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Daniel S. Park, Yu Zhang, Ye Jia, Wei Han, Chung-Cheng Chiu, Bo Li, Yonghui Wu, and Quoc V. Le. 2020. Improved noisy student training for automatic speech recognition. Interspeech 2020.

Michel Plüss, Lukas Neukom, Christian Scheller, and Manfred Vogel. 2021. Swiss Parliaments Corpus, an automatically aligned Swiss German speech to Standard German text corpus.

Michel Plüss, Lukas Neukom, and Manfred Vogel. 2020. GermEval 2020 Task 4: Low-resource speech-to-text. In SwissText & KONVENS 2020, Proceedings of the 5th Swiss Text Analytics Conference (SwissText) & 16th Conference on Natural Language Processing (KONVENS).

Malgorzata Anna Ulasik, Manuela Hürlimann, Bogumila Dubel, Yves Kaufmann, Silas Rudolf, Jan Deriu, Katsiaryna Mlynchyk, Hans-Peter Hutter, and Mark Cieliebak. 2021. ZHAW-CAI: Ensemble method for Swiss German speech to Standard German text. In preparation.