<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>2nd Swiss German Speech to Standard German Text Shared Task at SwissText 2022</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michel Plüss</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yanick Schraner</string-name>
          <email>yanick.schraner@fhnw.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Scheller</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manfred Vogel</string-name>
          <email>manfred.vogel@fhnw.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Lugano, Switzerland</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Data Science, University of Applied Sciences and Arts Northwestern Switzerland</institution>
          ,
          <addr-line>Windisch</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We present the results and findings of the 2nd Swiss German speech to Standard German text shared task at SwissText 2022. Participants were asked to build a sentence-level Swiss German speech to Standard German text system specialized on the Grisons dialect. The objective was to maximize the BLEU score on a test set of Grisons speech. 3 teams participated, with the best-performing system achieving a BLEU score of 70.1.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The topic of this task is automatic speech recognition (ASR) for Swiss German. Swiss German is a family of German dialects spoken in Switzerland. Swiss German ASR is concerned with the transcription of Swiss German speech to Standard German text and can be viewed as a speech translation task with similar source and target languages, see Plüss et al. [<xref ref-type="bibr" rid="ref1">1</xref>].</p>
      <p>This task has two predecessors. The 2020 task [<xref ref-type="bibr" rid="ref2">2</xref>] provided a 70-hours labeled training set of automatically aligned Swiss German speech (predominantly Bernese dialect) and Standard German text. The test set also comprised mostly Bernese speech. The winning contribution by Büchi et al. [<xref ref-type="bibr" rid="ref3">3</xref>] achieved a word error rate (WER) of 40.3 %. The 2021 task [<xref ref-type="bibr" rid="ref1">1</xref>] provided an improved and extended 293-hours version of the 2020 training set, as well as a 1208-hours unlabeled speech dataset (predominantly Zurich dialect). The test set covered a large part of the Swiss German dialect landscape. The winning contribution by Arabskyy et al. [<xref ref-type="bibr" rid="ref4">4</xref>] achieved the highest BLEU score [<xref ref-type="bibr" rid="ref5">5</xref>] of that edition.</p>
      <p>The goal of this task is to build a system able to translate Swiss German speech to Standard German text and optimize it for the Grisons dialect. To enable this, we provide the Swiss German labeled datasets SDS-200 [<xref ref-type="bibr" rid="ref6">6</xref>] and SwissDial [7], both including a substantial amount of Grisons speech, as well as the Standard German, French, and Italian labeled datasets of Common Voice 9.0 [8].</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task Description</title>
      <p>The goal of the task is to build a sentence-level Swiss German speech to Standard German text system specialized on the Grisons dialect. The submission with the best BLEU score on a test set of Grisons dialect speakers wins. Participants were encouraged to explore suitable transfer learning and fine-tuning approaches based on the data provided.</p>
      <sec id="sec-2-1">
        <title>2.1. Data</title>
        <p>We provide 5 different training datasets to participants, all of which are collections of sentence-level transcribed speech. SDS-200 [<xref ref-type="bibr" rid="ref6">6</xref>] is a Swiss German dataset with 200 hours of speech from all major Swiss German dialect regions, of which 6 hours are in Grisons dialect. SwissDial [7] is a Swiss German dataset with 34 hours of speech from all major Swiss German dialect regions, of which 11 hours are in Grisons dialect. From version 9.0 of the Common Voice project [8], we provide 1166 hours of Standard German, 926 hours of French, and 340 hours of Italian, all of which are official languages of Switzerland.</p>
        <p>The test set was collected in a similar fashion to SDS-200 [<xref ref-type="bibr" rid="ref6">6</xref>]. It consists of 5 hours of sentence-level transcribed Grisons speech by 11 speakers. Care was taken to avoid any overlap between the Swiss newspaper sentences in this test set and the ones in SDS-200 [<xref ref-type="bibr" rid="ref6">6</xref>].</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Evaluation</title>
        <p>The submissions are evaluated using BLEU score [<xref ref-type="bibr" rid="ref5">5</xref>]. Our evaluation script, which uses the NLTK [9] BLEU implementation, is open-source (https://github.com/i4Ds/swisstext-2022-swiss-german-shared-task). The private part of the test set is used for the final ranking.</p>
        <p>The test set contains the characters a-z, ä, ö, ü, 0-9, and spaces, and the participants’ models should support exactly these. Punctuation and casing are ignored for the evaluation. Numbers are not used consistently in the test set: sometimes they are written as digits and sometimes they are spelled out. We create a second reference by automatically spelling out all numbers and use both the original and this adjusted reference in the BLEU score calculation. Participants were advised to have their models always spell out numbers. All other characters are removed from the submission (see the evaluation script for details). Participants were therefore advised to replace each additional character in their training set with a sensible replacement.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Conclusion</title>
      <p>We have described the 2nd Swiss German speech to Standard German text shared task at SwissText 2022. The best-performing system on the Grisons speech test set is our baseline with a BLEU score of 70.1. The same system achieves a BLEU score of 65.3 on the 2021 task test set [<xref ref-type="bibr" rid="ref1">1</xref>], a relative improvement of 42 % over the highest score of the 2021 task. This highlights the large progress in the field over the last year. The main drivers for this progress seem to be the new dataset SDS-200 [<xref ref-type="bibr" rid="ref6">6</xref>] as well as the use of models pre-trained on large amounts of unlabeled speech, as demonstrated by the teams Stucki et al. and Nafisi et al., who employed XLS-R models [10]. The addition of an LM seems to be especially important for XLS-R models. The main difference between Nafisi et al. and Stucki et al. is that the latter add an LM, leading to a relative improvement of 23 % BLEU.</p>
      <p>On the other hand, none of the 3 participating teams made a significant effort to optimize their system for the Grisons dialect. The best approach to create an ASR system optimized for a specific dialect remains to be found in future work. Incorporating the provided French and Italian data for training is another possible direction for future research.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Final ranking of the shared task. The BLEU column shows the BLEU score on the private 50 % of the test set.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Team</th><th>BLEU</th></tr>
          </thead>
          <tbody>
            <tr><td>Baseline (ours)</td><td>70.1</td></tr>
            <tr><td>Stucki et al.</td><td>68.1</td></tr>
            <tr><td>Nafisi et al.</td><td>55.3</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
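    <p>A minimal sketch of the evaluation procedure from Section 2.2, assuming NLTK is available. This is an illustration, not the official evaluation script; the automatic number spell-out step is omitted, and the two references are simply passed in explicitly.</p>

```python
# Sketch of the evaluation from Section 2.2: normalize hypothesis and
# references to the allowed character set (a-z, ä, ö, ü, 0-9, spaces),
# then score with the NLTK BLEU implementation using two references per
# segment (the original, and one with numbers spelled out).
import re
from nltk.translate.bleu_score import corpus_bleu

ALLOWED = re.compile(r"[^a-zäöü0-9 ]")

def normalize(text):
    """Lowercase, strip punctuation, keep only the allowed characters."""
    return ALLOWED.sub("", text.lower()).split()

def score(hypotheses, references_original, references_spelled_out):
    """Corpus BLEU with two references per segment."""
    refs = [[normalize(a), normalize(b)]
            for a, b in zip(references_original, references_spelled_out)]
    hyps = [normalize(h) for h in hypotheses]
    return corpus_bleu(refs, hyps)

hyp = ["das sind zwanzig neue Fälle pro Tag"]
ref_digits = ["Das sind 20 neue Fälle pro Tag."]
ref_words = ["Das sind zwanzig neue Fälle pro Tag."]
print(round(score(hyp, ref_digits, ref_words), 2))  # 1.0 (matches ref_words)
```

    <p>Because the spelled-out reference matches the hypothesis exactly after normalization, the multi-reference BLEU is 1.0 here even though the digit reference differs.</p>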
    <sec id="sec-4">
      <title>3. Results</title>
      <sec id="sec-4-1">
        <p>3 teams participated in the shared task, including our baseline. Table 1 shows the final ranking.</p>
        <p>Our baseline achieves a BLEU score of 70.1. We use the Transformer Baseline model described in Plüss et al. [<xref ref-type="bibr" rid="ref6">6</xref>]. We train the model from scratch on SDS-200, SwissDial, and the Standard German part of Common Voice. In contrast to Plüss et al. [<xref ref-type="bibr" rid="ref6">6</xref>], we employ a Transformer-based language model (LM) with 12 decoder layers, 16 attention heads, an embedding dimension of 512, and a fully connected layer with 1024 units. The LM is trained on 67M Standard German sentences. We use a beam width of 60 during decoding. The same model achieves 65.3 BLEU on the 2021 task test set [<xref ref-type="bibr" rid="ref1">1</xref>].</p>
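        <p>The LM configuration above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the actual baseline implementation: the vocabulary size is a placeholder, and the decoder-only LM is realized as an encoder stack with a causal mask.</p>

```python
# Minimal PyTorch sketch of a decoder-only Transformer LM with the
# baseline's stated configuration: 12 layers, 16 attention heads,
# embedding dimension 512, and a 1024-unit feed-forward layer.
# VOCAB_SIZE is a placeholder; the paper does not specify the vocabulary.
import torch
import torch.nn as nn

VOCAB_SIZE = 10_000  # placeholder assumption

class TransformerLM(nn.Module):
    def __init__(self, vocab_size=VOCAB_SIZE, d_model=512, n_heads=16,
                 n_layers=12, d_ff=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=d_ff, batch_first=True)
        # A decoder-only LM is an encoder stack run with a causal mask.
        self.stack = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        seq_len = tokens.size(1)
        # Causal mask: each position attends only to itself and earlier ones.
        causal = torch.full((seq_len, seq_len), float("-inf")).triu(1)
        hidden = self.stack(self.embed(tokens), mask=causal)
        return self.out(hidden)  # next-token logits

lm = TransformerLM()
tokens = torch.randint(0, VOCAB_SIZE, (2, 8))  # batch of 2, length 8
print(lm(tokens).shape)  # torch.Size([2, 8, 10000])
```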
        <p>Stucki et al. achieve a BLEU score of 68.1. They use
an XLS-R 1B model [10], pre-trained on 436K hours of
unlabeled speech in 128 languages, not including Swiss
German. They fine-tune the model on SDS-200 and
SwissDial. A KenLM 5-gram LM [11] trained on the German
Wikipedia is employed.</p>
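        <p>Both XLS-R submissions decode CTC outputs. The basic CTC collapse rule behind such decoding (merge repeated labels, drop blanks) can be illustrated as follows. This is a greedy-decoding sketch, not the teams' code; Stucki et al. additionally combine beams with a KenLM n-gram LM, which is not shown here.</p>

```python
# Illustration of CTC output collapsing as used when decoding fine-tuned
# XLS-R/CTC models: take the per-frame label sequence, merge consecutive
# repeats, and drop the blank token.

BLANK = "_"  # CTC blank symbol (placeholder choice)

def ctc_greedy_collapse(frame_labels):
    """Collapse a per-frame label sequence into the output string."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)

# Frames "hh_aa__ll_llo" collapse to "hallo": repeats merge, blanks vanish,
# while the blank between the two "l" groups keeps the double consonant.
print(ctc_greedy_collapse(list("hh_aa__ll_llo")))  # hallo
```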
        <p>Nafisi et al. achieve a BLEU score of 55.3. They use an XLS-R 1B model [10], pre-trained on 436K hours of unlabeled speech in 128 languages, not including Swiss German. They fine-tune the model on SDS-200. No LM is employed.</p>
      </sec>
    </sec>
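    <p>The relative improvements quoted in the results and conclusion follow directly from the reported BLEU scores; a quick arithmetic check (the 2021 best score is only inferred here from the stated 42 %, it is not reported in this paper):</p>

```python
# Arithmetic check of the relative improvements quoted in this paper,
# using only the reported BLEU scores.

stucki, nafisi = 68.1, 55.3  # XLS-R 1B with vs. without an LM
lm_gain = (stucki - nafisi) / nafisi * 100
print(round(lm_gain))  # 23 (% relative improvement from adding the LM)

baseline_on_2021 = 65.3  # our baseline on the 2021 task test set
# A 42 % relative improvement implies a 2021 best score of roughly:
implied_2021_best = baseline_on_2021 / 1.42
print(round(implied_2021_best, 1))  # 46.0
```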
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Plüss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Neukom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vogel</surname>
          </string-name>
          ,
          <article-title>Swisstext 2021 task 3: Swiss german speech to standard german text</article-title>
          ,
          <source>in: Proceedings of the Swiss Text Analytics Conference</source>
          <year>2021</year>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Plüss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Neukom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vogel</surname>
          </string-name>
          ,
          <article-title>Germeval 2020 task 4: Low-resource speech-to-text</article-title>
          ,
          <source>in: Proceedings of the 5th Swiss Text Analytics Conference (SwissText) &amp; 16th Conference on Natural Language Processing (KONVENS)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Büchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Ulasik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hürlimann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Benites</surname>
          </string-name>
          , P. von Däniken, M. Cieliebak,
          <article-title>Zhaw-init at germeval 2020 task 4: Low-resource speech-to-text</article-title>
          ,
          <source>in: Proceedings of the 5th Swiss Text Analytics Conference (SwissText) &amp; 16th Conference on Natural Language Processing (KONVENS)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Arabskyy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Koller</surname>
          </string-name>
          ,
          <article-title>Dialectal speech recognition and translation of swiss german speech to standard german text: Microsoft's submission to swisstext 2021, in:</article-title>
          <source>Proceedings of the Swiss Text Analytics Conference</source>
          <year>2021</year>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          , W.-J. Zhu,
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>
          ,
          <source>in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] M. Plüss, M. Hürlimann, M. Cuny, A. Stöckli, N. Kapotis, J. Hartmann, M. A. Ulasik, C. Scheller, Y. Schraner, A. Jain, J. Deriu, M. Cieliebak, M. Vogel,
          <article-title>SDS-200: A Swiss German speech to Standard German text corpus</article-title>,
          <source>in: Proceedings of the Language Resources and Evaluation Conference</source>,
          <year>2022</year>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] P. Dogan-Schönberger, J. Mäder, T. Hofmann,
          <article-title>SwissDial: Parallel multidialectal corpus of spoken Swiss German</article-title>,
          <source>CoRR abs/2103.11401</source>
          (<year>2021</year>). URL: https://arxiv.org/abs/2103.11401.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, G. Weber,
          <article-title>Common Voice: A Massively Multilingual Speech Corpus</article-title>,
          <source>in: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)</source>,
          <year>2020</year>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] S. Bird, E. Klein, E. Loper,
          <article-title>Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit</article-title>,
          <source>O'Reilly Media, Inc.</source>,
          <year>2009</year>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, M. Auli,
          <article-title>XLS-R: Self-supervised cross-lingual speech representation learning at scale</article-title>,
          <source>arXiv abs/2111.09296</source>
          (<year>2021</year>).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] K. Heafield,
          <article-title>KenLM: Faster and smaller language model queries</article-title>,
          <source>in: Proceedings of the Sixth Workshop on Statistical Machine Translation</source>,
          <year>2011</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>