=Paper=
{{Paper
|id=Vol-2957/sg_paper1
|storemode=property
|title=SwissText 2021 Task 3: Swiss German Speech to Standard German Text (short paper)
|pdfUrl=https://ceur-ws.org/Vol-2957/sg_paper1.pdf
|volume=Vol-2957
|authors=Michel Plüss,Lukas Neukom,Manfred Vogel
|dblpUrl=https://dblp.org/rec/conf/swisstext/PlussNV21
}}
==SwissText 2021 Task 3: Swiss German Speech to Standard German Text (short paper)==
SwissText 2021 Task 3: Swiss German Speech to Standard German Text
Michel Plüss, Lukas Neukom, Manfred Vogel
Institute for Data Science
University of Applied Sciences and Arts Northwestern Switzerland
Windisch, Switzerland
michel.pluess@fhnw.ch
Abstract

We present the results and findings of SwissText 2021 Task 3 on Swiss German Speech to Standard German Text. Participants were asked to build a system translating Swiss German speech to Standard German text. The objective was to maximize the BLEU score on a new test set covering a large part of the Swiss German dialect landscape. Four teams participated, with the winning contribution achieving a BLEU score of 46.0.

1 Introduction

Swiss German is a family of dialects spoken by around five million people in Switzerland. It is different from Standard German regarding phonetics, vocabulary, morphology, and syntax. Swiss German is mostly a spoken language. While it is also used in writing, particularly in informal text messages, it lacks a standardized writing system. This leads to difficulties for automated text processing, such as spelling ambiguities and a huge vocabulary size. Therefore, most use cases for a Swiss German speech-to-text (STT) system require Standard German text as output. This can be viewed as a speech translation problem with similar source and target languages. For example, the Swiss German sentence "Ide Abfahrt hetter de sächsti Platz beleit" can be translated to the Standard German sentence "In der Abfahrt belegte er den sechsten Platz". Here, the sentence structure is very similar, but the past tense changes in Standard German.

Speech-to-text systems for well-resourced languages like English or Standard German work very well. Zhang et al. (2020) set the current state of the art on the popular LibriSpeech test-other benchmark (Panayotov et al., 2015) with a word error rate (WER) of 2.6 %. In comparison, the 2020 shared task on Swiss German STT (Plüss et al., 2020), this task's predecessor, was won by Büchi et al. (2020) with a WER of 40.3 %.

The goal of this task is to spur further progress in the field of Swiss German STT by providing a larger labeled training set, an additional unlabeled training set, and a test set with a dialect distribution similar to the real distribution of Swiss German dialects in Switzerland.

The remainder of this paper is structured as follows: the task, the data, and the evaluation of submissions are described in section 2. An overview of the submissions and results of this task can be found in section 3. Section 4 wraps up the paper and gives directions for future work.

2 Task Description

The objective of the task is to build a sentence-level Swiss German speech to Standard German text speech translation system. The submission with the best BLEU score (Papineni et al., 2002) wins. Participants were encouraged to explore and combine suitable supervised, semi-supervised, and unsupervised learning approaches.

2.1 Data

We provide two training datasets. The first one is the Swiss Parliaments Corpus (Plüss et al., 2021), a labeled 293-hour dataset of Swiss German debates from the Grosser Rat Kanton Bern parliament with corresponding Standard German sentence-level transcriptions [1]. The second one is an unlabeled collection of 1208 hours of Swiss German

[1] https://www.cs.technik.fhnw.ch/i4ds-datasets

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
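Submissions are ranked by BLEU, and the organizers' open-source evaluation script builds on NLTK's BLEU implementation (section 2.2). The sketch below illustrates the kind of normalization and scoring involved under simplifying assumptions: lowercasing, keeping only a-z, ä, ö, ü, and spaces, then computing corpus-level BLEU. The example sentences are hypothetical, and the official script remains the authoritative reference for the exact preprocessing.

```python
import re

from nltk.translate.bleu_score import corpus_bleu

def normalize(text: str) -> str:
    # Lowercase, drop every character outside a-z, ä, ö, ü, and space,
    # then collapse runs of spaces. This is a simplified approximation
    # of the preprocessing described in section 2.2, not the official script.
    text = re.sub(r"[^a-zäöü ]", "", text.lower())
    return re.sub(r" +", " ", text).strip()

# Hypothetical reference / hypothesis pair, not taken from the task data.
reference = normalize("In der Abfahrt belegte er den sechsten Platz.")
hypothesis = normalize("In der Abfahrt belegte er den sechsten Platz.")

# corpus_bleu takes, per segment, a list of tokenized references and one
# tokenized hypothesis; scores lie in [0, 1].
score = corpus_bleu([[reference.split()]], [hypothesis.split()])
print(round(100 * score, 1))  # 100.0 for an exact match
```

Whether scoring is corpus-level or averaged per sentence, and how edge cases are handled, is defined by the open-source evaluation script; the above only illustrates the mechanics.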
Figure 1: Comparison of the dialect prevalence in Switzerland's German-speaking population with the All Swiss German Dialects Test Set. To make this comparison possible, a dialect is defined as the average dialect spoken in a canton.
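The closeness of the two per-canton distributions in Figure 1 can be summarized in a single number, for example their total variation distance. The shares below are invented placeholders for illustration only; the actual values are the ones plotted in Figure 1.

```python
# Hypothetical per-canton shares (fractions summing to 1). These numbers
# are illustrative placeholders, NOT the values from Figure 1.
population = {"ZH": 0.28, "BE": 0.18, "SG": 0.09, "AG": 0.12, "LU": 0.08, "other": 0.25}
test_set = {"ZH": 0.20, "BE": 0.24, "SG": 0.13, "AG": 0.12, "LU": 0.07, "other": 0.24}

# Total variation distance: 0 for identical distributions, 1 for disjoint ones.
cantons = population.keys() | test_set.keys()
tvd = 0.5 * sum(abs(population.get(c, 0.0) - test_set.get(c, 0.0)) for c in cantons)
print(f"TVD = {tvd:.2f}")  # TVD = 0.10 for these placeholder shares
```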
debates from the Gemeinderat Zürich parliament [2]. The use of additional datasets is allowed, but has to be declared in the system description.

The test set created for this task, the All Swiss German Dialects Test Set, contains 13 hours of sentence-level Swiss German speech and Standard German text pairs [3]. The set is divided into two equally sized parts: a public part (the score on this part was displayed in the public ranking while the task was running) and a private part (the final ranking is based on this part; it was not available while the task was running). The texts are from the Common Voice project [4] and were spoken by 178 speakers from all over Switzerland. The set covers a large part of the Swiss German dialect landscape. Figure 1 compares the test set dialect distribution with the real distribution of Swiss German dialects in Switzerland. The comparison highlights the good match between the test set dialect distribution and the real distribution. There are some exceptions, e.g. there is no data from the cantons AI, AR, and OW due to their small size. Also, BE and SG speakers are overrepresented whereas ZH speakers are underrepresented. There was no distinction made between BL and BS during the collection of the dialect metadata for the test set; BS speakers are therefore included in BL.

2.2 Evaluation

The submissions are evaluated using the BLEU score (Papineni et al., 2002). Our evaluation script, which uses the NLTK (Bird et al., 2009) BLEU implementation, is open-source [5]. The private part of the test set is used for the final ranking. The test set contains the characters a-z, ä, ö, ü, and spaces, and the participants' models should support exactly these. Punctuation and casing are ignored for the evaluation. Numbers are spelled out. All other characters are removed from the submission (see the evaluation script for details). Participants were therefore advised to replace each additional character in their training set with a sensible replacement.

3 Results

Four teams participated in the shared task. Table 1 shows the final ranking.

  Rank  Team             BLEU
  1     Arabskyy et al.  46.0
  2     Plüss et al.     41.0
  3     Ulasik et al.    39.4
  4     DeJa             17.1

Table 1: Final ranking of the shared task. The BLEU column shows the BLEU score on the private 50 % of the All Swiss German Dialects Test Set.

The team in first place, Arabskyy et al. (2021), achieved a BLEU score of 46.0. They use a hybrid system with a lexicon that incorporates translations, a first-pass language model that deals with Swiss German particularities, an acoustic model transfer-learned from a large Standard German dataset, and a strong neural language model for second-pass rescoring.

Our baseline ranks second with 41.0 BLEU. The system is described in (Plüss et al., 2021), section 5. We train an end-to-end Conformer (Gulati et al., 2020) model using a hybrid CTC/attention encoder-decoder framework. The training data consists of the Swiss Parliaments Corpus (Plüss et al., 2021), an additional 250-hour corpus of automatically aligned Swiss German parliament debates, and the Standard German Common Voice corpus (Ardila et al., 2019).

The team in third place, Ulasik et al. (2021), achieved a BLEU score of 39.4. Their approach combines three models trained on multilingual, Standard German, and Swiss German data using ensembling.

The team called DeJa ranked fourth and achieved a BLEU score of 17.1. We have not received a system description for this submission.

4 Conclusion

We have described SwissText 2021 Task 3 on Swiss German Speech to Standard German Text. Submissions were evaluated on the All Swiss German Dialects Test Set, which we introduced in this work. It covers a large part of the Swiss German dialect landscape. Four teams participated in the task, with the winning team reaching a BLEU score of 46.0. The results are hard to compare to those of this task's predecessor, GermEval 2020 Task 4 (Plüss et al., 2020), due to the different test set and metric. Last year's winning contribution achieved a WER of 40.3 %. In our experiments in (Plüss et al., 2021), ranking second in this year's task, we achieved a WER of 27.8 % on a test set comparable to GermEval 2020 Task 4. The relative improvement of 31 % indicates that a lot of progress has been made in the field of Swiss German STT over the past year.

Despite recent advances in semi-supervised and unsupervised learning for STT, see e.g. (Park et al., 2020) and (Baevski et al., 2020), none of the participants made use of the provided unlabeled training set. This seems to be a promising direction for further improvements of Swiss German STT, given that the amount of available labeled training data is still comparatively small.

Acknowledgments

We thank our participants for their interest in the shared task, for their participation, and for their timely feedback, which have helped us make this task a success.

We also thank Elias Schorr for his great work on the submission and evaluation website.

[2] https://www.cs.technik.fhnw.ch/i4ds-datasets
[3] https://www.cs.technik.fhnw.ch/i4ds-datasets
[4] https://github.com/common-voice/common-voice/tree/main/server/data/de
[5] https://github.com/i4Ds/swisstext-2021-task-3

References

Yuriy Arabskyy, Aashish Agarwal, Subhadeep Dey, and Oscar Koller. 2021. Dialectal speech recognition and translation of Swiss German speech to Standard German text: Microsoft's submission to SwissText 2021. In preparation.

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. 2019. Common Voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670.

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc.

Matthias Büchi, Malgorzata Anna Ulasik, Manuela Hürlimann, Fernando Benites, Pius von Däniken, and Mark Cieliebak. 2020. ZHAW-InIT at GermEval 2020 Task 4: Low-resource speech-to-text. In SwissText & KONVENS 2020, Proceedings of the 5th Swiss Text Analytics Conference (SwissText) & 16th Conference on Natural Language Processing (KONVENS).

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-augmented Transformer for speech recognition. In Proceedings of Interspeech, pages 5036–5040.

V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Daniel S. Park, Yu Zhang, Ye Jia, Wei Han, Chung-Cheng Chiu, Bo Li, Yonghui Wu, and Quoc V. Le. 2020. Improved noisy student training for automatic speech recognition. Interspeech 2020.

Michel Plüss, Lukas Neukom, Christian Scheller, and Manfred Vogel. 2021. Swiss Parliaments Corpus, an automatically aligned Swiss German speech to Standard German text corpus.

Michel Plüss, Lukas Neukom, and Manfred Vogel. 2020. GermEval 2020 Task 4: Low-resource speech-to-text. In SwissText & KONVENS 2020, Proceedings of the 5th Swiss Text Analytics Conference (SwissText) & 16th Conference on Natural Language Processing (KONVENS).

Malgorzata Anna Ulasik, Manuela Hürlimann, Bogumila Dubel, Yves Kaufmann, Silas Rudolf, Jan Deriu, Katsiaryna Mlynchyk, Hans-Peter Hutter, and Mark Cieliebak. 2021. ZHAW-CAI: Ensemble method for Swiss German speech to Standard German text. In preparation.