    ZHAW-CAI: Ensemble Method for Swiss German Speech to Standard
                          German Text
       Malgorzata Anna Ulasik, Manuela Hürlimann, Bogumila Dubel, Yves Kaufmann,
     Silas Rudolf, Jan Deriu, Katsiaryna Mlynchyk, Hans-Peter Hutter, and Mark Cieliebak

                                 Centre for Artificial Intelligence
                             Zurich University of Applied Sciences
                          {ulas, hueu, deri, mlyn, huhp, ciel}@zhaw.ch
                 bodubel@gmail.com, y.kaufmann@yagan.ch, silasrudolf@gmail.com

                        Abstract

    This paper presents the contribution of ZHAW-CAI to the Shared Task "Swiss German Speech to Standard German Text" at the SwissText 2021 conference. Our approach combines three models based on the Fairseq, Jasper and Wav2vec architectures, trained on multilingual, German and Swiss German data. We applied an ensembling algorithm to the predictions of the three models in order to retrieve the most reliable candidate out of the provided translations for each spoken utterance. With the ensembling output, we achieved a BLEU score of 39.39 on the private test set, which gave us the third place out of four contributors in the competition.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1   Introduction

Speech-to-Text (STT) enables transcribing spoken utterances into text. For successfully performing a transformation from speech to text, the existence of a standardised writing system of the target language is of prime importance. This is where Swiss German [1] poses a substantial challenge: it does not have a standardised orthography, since it functions as the default spoken language in both formal and informal situations, while for writing, the Standard German language is used. This phenomenon, called "medial diglossia" (Siebenhaar and Wyler, 1997), occurs in the entire German-speaking part of Switzerland, which is additionally characterised by a high dialect diversity. Swiss German is increasingly used for writing in informal contexts, but since there is no single standard writing system, Swiss German speakers usually write phonetically in their local dialect in informal situations (Siebenhaar, 2003). On formal occasions such as work meetings and political debates, speech is typically transcribed into Standard German. As there is a considerable linguistic distance between Swiss German dialects and Standard German, developing a model for transcribing Swiss German speech into Standard German text actually involves Speech Translation, which combines STT with Machine Translation (MT) (Bérard et al., 2016).

[1] To be precise, there is no single "Swiss German" language, but rather a collection of many different regional dialects that are subsumed under this term.

   As a response to the Shared Task "Swiss German Speech to Standard German Text" organised at SwissText 2021, we provided a solution consisting of three models based on different architectures: Fairseq (Wang et al., 2020a), Jasper (Li et al., 2019) and Wav2vec XLSR-53 (Baevski et al., 2020), which were trained with various data sets, both in Standard German and Swiss German. Their predictions were subsequently fed into a majority voting algorithm with the aim of selecting the most reliable translation.
   The remainder of this paper is structured as follows: Section 2 provides the description of the Shared Task and Section 3 discusses relevant literature. In Section 4 we present the systems which make up our final solution, their architecture and the applied training data. In Section 5 we provide an overview of all experiments performed with these models and their outputs. Section 6 lays out the ensembling approach and Section 7 presents the post-processing experiments we performed on the predictions of the models. The paper ends with a conclusion presented in Section 8.

2   Shared Task Description

The goal of the Shared Task was to build a system for translating speech in any Swiss German dialect into Standard German text (Plüss et al., 2021).
   The organisers provided a labelled data set
containing 293 hours of audio recordings, mostly in the Bernese dialect, transcribed in Standard German. Since the alignment between the recordings and the transcripts was done automatically, each utterance has an Intersection over Union (IoU) score reflecting its alignment quality. Additionally, there was an unlabelled data set consisting of 1208 hours of recordings, mostly in the Zurich dialect. The solutions were evaluated on a 13-hour test set, which contains recordings of speakers coming from all German-speaking parts of Switzerland. The dialect distribution of the test set is close to the actual Swiss German dialect distribution in Switzerland.
   The translation accuracy of the provided solutions is measured using BLEU, a standard metric for automatic evaluation of machine translation (Papineni et al., 2002). The metric counts n-grams in the candidate translation matching n-grams in the reference translation, without taking the word order into account. It ranges from 0 to 100: a perfect match results in a score of 100, and a score of 0 occurs if there are no matches. The tool used by the organisers for evaluating solutions is the NLTK implementation of the BLEU score with default parameters (https://www.nltk.org/api/nltk.translate.html#nltk.translate.bleu_score.corpus_bleu). Prior to evaluation, both the references and the translations are normalised: the utterances are lowercased, the punctuation is removed, the numbers are spelled out and all non-ASCII characters except for the letters "ä", "ö", "ü" are removed.
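   For illustration, this evaluation protocol can be sketched as follows, assuming NLTK's corpus_bleu with default parameters and a simplified re-implementation of the normalisation steps (the organisers' script may differ in details); the example sentences are invented.

# Minimal sketch of the evaluation described above; the organisers' exact
# normalisation script may differ in details.
import re
from nltk.translate.bleu_score import corpus_bleu
from num2words import num2words

def normalise(utterance: str) -> str:
    """Lowercase, spell out numbers, drop punctuation and non-ASCII chars except ä, ö, ü."""
    utterance = utterance.lower()
    utterance = re.sub(r"\d+", lambda m: num2words(int(m.group()), lang="de"), utterance)
    utterance = re.sub(r"[^a-zäöü ]", " ", utterance)
    return " ".join(utterance.split())

def bleu(references: list[str], hypotheses: list[str]) -> float:
    # corpus_bleu expects one or more reference token lists per segment
    refs = [[normalise(r).split()] for r in references]
    hyps = [normalise(h).split() for h in hypotheses]
    return 100 * corpus_bleu(refs, hyps)  # default parameters: 4-gram BLEU, uniform weights

# Invented example: numbers are spelled out, punctuation and case are ignored.
print(bleu(["Das sind 3 Beispiele."], ["das sind drei beispiele"]))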
   The test set was split into a public and a private subset of equal sizes. For all evaluations presented in this paper, the public test set was used.

3   Related Work

Speech Translation (ST) is the task of translating spoken text in a source language to text or speech in a target language. The approaches to solve this problem can be put into two categories: cascading approaches and end-to-end approaches (Sperber and Paulik, 2020).

Cascaded Approaches   work by splitting the task into two steps: first, an STT model transcribes speech of the source language to text in the target language, and then a machine translation (MT) module translates the generated text into the target language (Waibel et al., 1991). The main issue with the cascaded approach is the fact that errors made by the STT module are propagated to the MT module (Ney, 1999). Thus, efforts are put into coupling the STT and MT modules to prevent error propagation, for instance by generating multiple hypotheses of the STT system via n-best search or the creation of lattices (Woszczyna et al., 1993; Schultz et al., 2004).

End-to-End Approaches   model ST as a single task, where the input is speech in the source language and the output consists of text or speech in the target language. The main issue with this modelling approach is the lack of sufficient training data. Whereas data for STT typically consists of several hundreds of hours of transcribed data, most ST datasets contain only a fraction of this amount. For instance, the Europarl-ST corpus contains on average only 42 hours of transcribed data per language pair (Iranzo-Sánchez et al., 2020), whereas the Librispeech STT corpus contains around 1000 hours of transcribed data (Panayotov et al., 2015). For this reason, end-to-end approaches nowadays rely on leveraging multi-task learning and single-language pre-training of the STT and MT submodules and use the ST dataset for fine-tuning (Wang et al., 2020b).
   Most cascading approaches rely on data where access to both the source language transcript and its target language translation is needed. However, in our scenario we do not have access to written text of the source language, since Swiss German is a spoken language and thus often directly transcribed into Standard German (see Section 1 for more details). Thus, our models follow the end-to-end approach.

4   Systems Description

This section describes the architecture of the three models which form the foundation for the experiments presented in Section 5 and are components of the final solution, which combines the three models' outputs in an ensembling algorithm. The section also explains what data was used for training the models.

4.1   Fairseq

4.1.1   Model
Fairseq is based on the transformer architecture for Speech-to-Text provided by the Fairseq S2T toolkit (Wang et al., 2020a), which combines the tasks of STT and ST under the same encoder-decoder
architecture (Changhan Wang, 2020). The experiments were trained with the small transformer model with 256 dimensions, a 12-layer encoder, a 6-layer decoder, 27M parameters, the Adam optimiser, and an inverse square root learning rate scheduler. Decoding is executed with a character-based SentencePiece model (Taku Kudo, 2018) using an n-best decoding strategy with n=5. The acoustic model (encoder) can be pre-trained with the same transformer architecture as described above.

4.1.2   Data
The audios were extracted to 80-dimensional log mel-scale filterbank features (windows with 25 ms size and 10 ms shift) and saved in NumPy format for the training. To alleviate overfitting, the SpecAugment speech data transforms (Park et al., 2019), as adopted by Fairseq S2T, were applied. For text normalisation we used the script provided by the task organisers. Additionally, numbers were spelled out using num2words (https://pypi.org/project/num2words/). We use three additional datasets:

   • SwissDial (Pelin Dogan-Schönberger, 2021): 26 hours of Swiss German

   • ArchiMob (Tanja Samardzic, 2016): 80 hours of Swiss German

   • Common Voice German v4 (https://commonvoice.mozilla.org/en/datasets/): 483 hours of German

The SwissDial dataset consists of 26 hours of audios in 8 different Swiss dialects with corresponding transcriptions in the Swiss dialect and Standard German translations. The Swiss German transcription rules differ between dialects. ArchiMob contains 70 hours of audios in 14 different Swiss dialects with transcriptions in Swiss German, where each word is additionally provided with a Standard German normalisation. The transcription rules are normalised and are equal for all dialects (Dieth transcription, (Dieth and Schmid-Cadalbert, 1986)). Common Voice German v4 consists of 483 hours of audios in Standard German with corresponding transcriptions.
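   A minimal sketch of this feature extraction is shown below, using torchaudio's Kaldi-compatible filterbank computation as a stand-in for the Fairseq S2T audio utilities; the file names are illustrative only.

# Illustrative sketch of the feature extraction described above; Fairseq S2T
# ships its own audio utilities, here torchaudio's Kaldi-compatible filterbank
# computation is used as a stand-in.
import numpy as np
import torchaudio

def extract_fbank(wav_path: str, out_path: str) -> None:
    waveform, sample_rate = torchaudio.load(wav_path)
    # 80-dimensional log mel filterbank features, 25 ms windows with a 10 ms shift
    features = torchaudio.compliance.kaldi.fbank(
        waveform,
        sample_frequency=sample_rate,
        num_mel_bins=80,
        frame_length=25.0,
        frame_shift=10.0,
    )
    np.save(out_path, features.numpy())  # saved in NumPy format for training

extract_fbank("utterance_0001.wav", "utterance_0001.npy")  # illustrative file names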
                                                       we performed to improve the individual models and
4.2    Jasper                                          provide the BLEU scores achieved in each experi-
4.2.1 Model                                            ment. We also discuss approaches to improve the
We used the Jasper (Li et al., 2019) configuration     model outputs with the use of ensembling (Section
corresponding to our best submission in the pre-       6) and post-processing (Section 7).
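   A sketch of this filtering and normalisation is shown below; the exact character set treated as "special characters" is an assumption, and the example utterances are invented.

# Sketch of the transcript filtering and number normalisation described above;
# the allowed character set is an assumption.
import re
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException
from num2words import num2words

ALLOWED = re.compile(r"^[a-zäöüß0-9 .,?!'-]+$", re.IGNORECASE)

def keep(transcript: str) -> bool:
    """Drop utterances with unexpected characters or not detected as German."""
    if not ALLOWED.match(transcript):
        return False
    try:
        return detect(transcript) == "de"
    except LangDetectException:
        return False

def spell_out_numbers(transcript: str) -> str:
    return re.sub(r"\d+", lambda m: num2words(int(m.group()), lang="de"), transcript)

# Invented examples: the second utterance is dropped at the character check.
labelled = [
    ("utt_0001.wav", "Wir treffen uns um 3 Uhr."),
    ("utt_0002.wav", "haben gewonnen 🎉"),
]
labelled = [(audio, spell_out_numbers(text)) for audio, text in labelled if keep(text)]
print(labelled)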
5   Experiments on Individual Models

Sections 5.1 through 5.3 present the experiments we performed to improve the individual models and provide the BLEU scores achieved in each experiment. We also discuss approaches to improve the model outputs with the use of ensembling (Section 6) and post-processing (Section 7).
5.1   Fairseq

Below we describe the different models and experimental results obtained with Fairseq. All experiments are trained with the same configuration as described in Section 4.1 and can be divided into three groups: extension of the training data, inclusion of a pre-trained encoder, and ensembling.

5.1.1   Extending the training data
Fairseq F-SP-0.9   For F-SP-0.9 we trained the model from scratch on the Shared Task training data. We used 176 hours, corresponding to an Intersection over Union (IoU) greater than or equal to 0.9.

Fairseq F-SP-All   We noted that the model F-SP-0.9 generalises very poorly, so for F-SP-All we trained a new model with the entire task training data, which corresponds to 293 hours. Despite partially poorly aligned translations, the model benefits from the new data: the BLEU score is improved by about 4.32 points.

Fairseq F-SP-SD   We decided to extend the training data with the SwissDial corpus. For this, we trained a new model F-SP-SD with the entire task training data plus all data from SwissDial. This data extension improves the score by an additional 4.81 BLEU points in comparison to F-SP-All.

5.1.2   Including a pre-trained encoder
Fairseq F-SP-DE   We also investigated how to improve the encoder (acoustic model). We pre-trained a Standard German (DE) encoder on the Common Voice German v4 dataset. For F-SP-DE, we added the pre-trained encoder and trained the model on the entire Shared Task training data. Including the DE encoder improves the score by 3.36 BLEU points in comparison to F-SP-All.

Fairseq F-SP-SD-DE   Since both models F-SP-SD and F-SP-DE improved the BLEU score, we decided to bring the two approaches together. We trained a new model F-SP-SD-DE with the entire Shared Task training data and the SwissDial data, and included the pre-trained DE encoder in the training. This brings an improvement of 8.37 BLEU points in comparison to F-SP-All.

Fairseq F-SP-AM-DE   In this model we used the entire task training data plus the data from ArchiMob. For the training we included the pre-trained DE encoder. This setup improves the BLEU score by 14.01 points in comparison to F-SP-All.

Fairseq F-SP-SD-CH   In order to further improve the acoustic model, we trained an encoder in Swiss German (CH) on the SwissDial and ArchiMob datasets. We trained a new model F-SP-SD-CH with the entire Shared Task training data and SwissDial, and included the pre-trained CH encoder in the training. The BLEU score in comparison to F-SP-All is improved by 12.54 points.

5.1.3   Ensembling
Fairseq Ensemble F-SP-SD & F-SP-DE (F-E1)   In this experiment, we ensembled the models F-SP-SD and F-SP-DE. F-E1 achieves a BLEU score of 28.74. Ensembling is done with the implementation provided by the Fairseq S2T toolkit (https://github.com/pytorch/fairseq/issues/223). In comparison to F-SP-SD-DE, which combines in its training setup the same training dataset (SwissDial) as F-SP-SD and the same DE encoder as F-SP-DE, the ensembling performs slightly better. In comparison to F-SP-All, the BLEU score improves by 9.94 points.

Fairseq Ensemble F-SP-AM-DE & F-SP-SD-CH (F-E2)   After the good performance of F-E1, we decided to ensemble F-SP-AM-DE and F-SP-SD-CH. This ensembling improves the BLEU score in comparison to F-SP-All by 17.00 points.

Fairseq F-E2 extended (F-E3)   Finally, we trained a model on the entire available data for Swiss German (task, SwissDial and ArchiMob) and used this model to perform ensembling on top of F-E2. For time reasons, we were not able to complete the training, and the output of this model could not be included in the final solution presented in Section 6. We only evaluated an intermediate status of the model and achieved a score of 36.83 BLEU points. In comparison to F-SP-All, it improves the score by 18.03 points.

   Table 1 shows the BLEU scores obtained with the Fairseq models on the public part of the Shared Task test set. The table contains additional information about the applied train sets and encoders. F-E3 achieved the best performance with a BLEU score of 36.83 on the public part of the test set (37.4 on the private part).
In addition to ensembling, the inclusion of a CH encoder in the training process as well as the extension of the training data with the ArchiMob corpus benefited the model performance most.

Table 1: Fairseq results.

  Model         Train set          Encoder    BLEU
  F-SP-0.9      task 0.9           training   14.48
  F-SP-All      task all           training   18.8
  F-SP-SD       task, SwissDial    training   23.61
  F-SP-DE       task               DE         22.16
  F-SP-SD-DE    task, SwissDial    DE         27.17
  F-SP-AM-DE    task, ArchiMob     DE         32.81
  F-SP-SD-CH    task, SwissDial    CH         31.34
  F-E1          -                  -          28.74
  F-E2          -                  -          35.80
  F-E3          -                  -          36.83

5.2   Jasper

Below we describe the different models and experimental results obtained with Jasper.

Jasper-FT   For Jasper-FT we fine-tuned the pre-trained Standard German model on the Shared Task training data. We used 169 hours, sampled from the set with an IoU greater than or equal to 0.9, which were augmented to 507 hours using 90% and 110% speed perturbation as in Büchi et al. (2020).

Jasper-PL   We noted that the task test set differs acoustically from the training data, since different dialects are present and the audio quality tends to be lower. This motivated the creation of Jasper-PL, where we used pseudo-labelling on the test set. More precisely, we used the hypotheses of Jasper-FT on the task test set to fine-tune Jasper-FT for 20 additional epochs.
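   Conceptually, this pseudo-labelling step can be sketched as follows; the manifest format is a simplified NeMo-style JSON-lines file (duration omitted) and transcribe_fn is a placeholder for the Jasper-FT inference routine.

# Conceptual sketch of the pseudo-labelling step behind Jasper-PL; transcribe_fn
# is a placeholder for the model's inference routine, not an actual Jasper API.
import json

def write_pseudo_label_manifest(transcribe_fn, test_audio_paths, manifest_path):
    """Reuse the model's own hypotheses on the (unlabelled) test audio as training targets."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for audio_path in test_audio_paths:
            hypothesis = transcribe_fn(audio_path)  # Jasper-FT hypothesis for this utterance
            f.write(json.dumps({"audio_filepath": audio_path, "text": hypothesis}) + "\n")

# Fine-tuning Jasper-FT on this manifest for 20 additional epochs yields Jasper-PL.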
Jasper-PL-E   We decided to further work on the (comparatively) low-quality audio of the task test set and used the Dolby Media Enhance API v1.1 (https://dolby.io/developers/media-processing/api-reference/enhance) to create an "enhanced" version of the task test set. The Enhance API automatically improves the quality of audio files, e.g. by correcting the volume and reducing noise and hum. We then fine-tuned Jasper-FT on this data, this time using the hypotheses provided by Jasper-PL as labels, since these achieve a higher BLEU score.

   Table 2 shows the public BLEU scores obtained with the Jasper models on the two different test sets (Jasper-PL-E was only evaluated on the enhanced test set). The best-performing Jasper model is Jasper-PL with a BLEU score of 32.97 on the public part of the test set. Using the enhanced audio data does not confer any advantage on either prediction or pseudo-label fine-tuning compared to the as-is data. We can, however, see the benefit of rather naive pseudo-labelling in this setting where training and testing data are quite different. Future work could expand on the use of pseudo-labelling by using more advanced setups, such as confidence-based (Kahn et al., 2020) or iterative (Xu et al., 2020) pseudo-labelling.

Table 2: Jasper results.

  Model          Test set     BLEU
  Jasper-FT      task         30.8
  Jasper-FT      enhanced     26.4
  Jasper-PL      task         32.97
  Jasper-PL      enhanced     31.92
  Jasper-PL-E    enhanced     32.92

5.3   Wav2vec XLSR-53

Below we describe the model and experimental results obtained with wav2vec XLSR-53.

wav2vec XLSR-53 FT   For wav2vec XLSR-53 FT we fine-tuned the pre-trained baseline (as published on HuggingFace: https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on the Shared Task training data. We used 227 hours, corresponding to an IoU greater than or equal to 0.8. The data was pre-processed as outlined in Section 4.3.2.

Table 3: wav2vec XLSR-53 result.

  Model                 Train set    BLEU
  wav2vec XLSR-53 FT    task 0.8     30.39

6   Ensembling

Having trained and evaluated the three models described in Sections 4.1, 4.2 and 4.3, we performed experiments with two ensembling methods: majority voting and a hybrid technique combining majority voting with perplexity calculation.
We used the outputs of the best-performing models of each of the three systems, aiming to select the most reliable translation for each utterance from among them. The best-performing models were F-E2 (BLEU score of 35.80; F-E3, as a last-minute submission, could not be used for ensembling), Jasper-PL (BLEU score of 32.97) and wav2vec XLSR-53 FT (BLEU score of 30.4).
   The models were first categorised based on their BLEU scores into a primary, a first auxiliary and a second auxiliary model. F-E2, with the highest score, was selected as the primary model, Jasper-PL, with the second-best score, was set as the first auxiliary model, and wav2vec XLSR-53 FT was used as the second auxiliary model.
   In the first step, we aligned the hypotheses of the three models and extracted the text passages where all three hypotheses agree, leaving only text excerpts where the hypotheses disagree.

Majority Voting (MV)   The majority voting consisted of collecting votes for each text excerpt defined in the previous step: a particular hypothesis receives a vote for each word it has in common with any other hypothesis. The hypothesis with the most votes is chosen as the best candidate translation. If multiple hypotheses score the same, the output of the model categorised higher in the hierarchy (primary, first auxiliary, second auxiliary) is selected.
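   The sketch below shows one possible reading of this voting rule for a single disagreeing excerpt; the preceding alignment step (finding the spans on which all hypotheses agree) is assumed to have been done already, and the hypotheses are ordered by model hierarchy.

# Sketch of the voting rule described above, applied to one disagreeing excerpt;
# excerpts are ordered by model hierarchy (primary, first auxiliary, second auxiliary).
def vote_on_excerpt(excerpts: list[str]) -> str:
    """Each hypothesis gets one vote per word it shares with any other hypothesis;
    ties are broken in favour of the higher-ranked model."""
    scores = []
    for i, excerpt in enumerate(excerpts):
        other_words = set()
        for j, other in enumerate(excerpts):
            if j != i:
                other_words.update(other.split())
        votes = sum(1 for word in excerpt.split() if word in other_words)
        scores.append((votes, -i))  # -i: on equal votes, prefer the earlier (higher-ranked) model
    best = max(range(len(excerpts)), key=lambda k: scores[k])
    return excerpts[best]

# Invented example: models 1 and 3 tie, and the primary model wins the tie-break.
print(vote_on_excerpt(["dem zug", "den zug", "dem zug"]))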
Hybrid Ensembling (HE)   The hybrid ensembling method combines majority voting with perplexity calculation. If more than one hypothesis achieves the maximum number of votes and these hypotheses are not identical, the perplexity of the hypotheses is calculated. To this end, we extended the particular text excerpt with the 3 context words preceding and following the excerpt. For these text segments, we calculated the perplexity with a pre-trained uncased German BERT model (https://github.com/dbmdz/berts#german-bert). The hypothesis with the lower perplexity was selected.
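   Since BERT is a masked language model, "perplexity" can be approximated by a pseudo-log-likelihood obtained by masking one token at a time; the sketch below illustrates this with the dbmdz uncased German BERT checkpoint, although the exact scoring procedure we used may differ.

# Hedged sketch of the perplexity comparison with a masked LM: each token is
# masked in turn and its negative log-likelihood is accumulated.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-uncased")
model = AutoModelForMaskedLM.from_pretrained("dbmdz/bert-base-german-uncased")
model.eval()

def pseudo_perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    nll, n = 0.0, 0
    for pos in range(1, len(ids) - 1):            # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, pos]
        log_probs = torch.log_softmax(logits, dim=-1)
        nll -= log_probs[ids[pos]].item()
        n += 1
    return float(torch.exp(torch.tensor(nll / n)))

# The candidate with the lower (pseudo-)perplexity is selected; in practice the
# excerpt is first extended with three context words on each side.
candidates = ["wir fahren mit dem zug nach bern", "wir fahren mit den zug nach bern"]
print(min(candidates, key=pseudo_perplexity))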
   The results of the experiments are presented in Table 4. Of the two algorithms we applied to the data, better results were achieved with majority voting. The BLEU score improved by 2.9 points, from 35.80 to 38.70, when compared to the result of the best individual model (F-E2).

Table 4: Ensembling results. The BLEU score achieved by each model separately and the BLEU score resulting from applying the ensembling methods on the models' outputs (Majority Voting and Hybrid Ensembling).

  F-E2     Jasper-PL    wav2vec XLSR-53 FT    MV       HE
  35.80    32.97        30.39                 38.70    37.62

7   Transcript Post-processing

In addition to the language models used for speech recognition, we evaluated an approach to using text-only data by training a supervised "spelling correction" (SC) model to correct the errors made by the STT model explicitly. Instead of predicting the likelihood of emitting a word based on the surrounding context, the SC model only needs to identify likely errors in the STT model output and propose alternatives. Intuitively, this task highly depends on the baseline model's quality: if the model transcribes very well, this task can be reduced to simply copying the input transcript directly to the output.
   Most recent approaches for transcript post-processing use a transformer-based method: Liao et al. (2021) use a modified RoBERTa structure and show an increase of 17.53 BLEU points on the self-augmented English Conversational Telephone Speech data set. On the LibriSpeech dataset, Hrinchuk et al. (2019) show promising results using a pre-trained BERT as initialisation for their spell correction model, while Guo et al. (2019) take a different approach with a bidirectional LSTM.
   We compared different Transformer architectures with their corresponding open-sourced pre-trained models and other post-processing methods. The objective for all transformer models was set to next-sentence prediction (sequence-to-sequence generation) with a vocabulary size of 30'000, a batch size of 16, and a beam size for beam search set to 5. The models were initialised with pre-trained German embeddings and fine-tuned for up to 120'000 steps on the Shared Task training set described in Section 2.

   • BERT (Devlin et al., 2018), having both encoder and decoder initialised with pre-trained weights.

   • DistilBERT (Sanh et al., 2020), the lightweight alternative to BERT, reducing the training time by up to 60%.
   • ELECTRA (Clark et al., 2020), which uses a more sample-efficient pre-training approach for the encoder, called replaced token detection.

   • SymSpell (Garbe, 2020), which is a spelling correction algorithm that corrects spelling errors based on Damerau-Levenshtein distances against a pre-trained dictionary (see the sketch below).
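   As an illustration, dictionary-based correction with the symspellpy package can be sketched as follows; the German frequency dictionary file is an assumption (any word-frequency list, e.g. one built from the Standard German references, could be used).

# Sketch of dictionary-based spelling correction with symspellpy; the German
# frequency dictionary file name is an assumption.
from symspellpy import SymSpell

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
# One "word frequency" pair per line, e.g. built from the Standard German references.
sym_spell.load_dictionary("de_frequency_dictionary.txt", term_index=0, count_index=1)

def correct(transcript: str) -> str:
    # lookup_compound corrects the whole multi-word string at once,
    # using Damerau-Levenshtein distance against the dictionary
    suggestions = sym_spell.lookup_compound(transcript, max_edit_distance=2)
    return suggestions[0].term if suggestions else transcript

print(correct("wir fahren mit dem zuug nach bern"))  # invented example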
   Table 5 shows the BLEU scores on the public test set when performing post-processing on the output of the majority voting algorithm as described in Section 6. The baseline refers to the BLEU score of the non-processed majority voting output.

Table 5: Post-processing BLEU scores on the public test set.

  System        Baseline    Post-processed
  BERT          38.70       23.26
  DistilBERT    38.70       26.66
  ELECTRA       38.70       14.77
  SymSpell      38.70       30.65

   As the evaluations show, most post-processing attempts decrease the overall BLEU score, with SymSpell, the most straightforward approach, performing best. Compared with previous work in this area, this could be explained by the limited amount of data available for training the transformer models. Due to this lack of performance, we excluded the post-processing step from our final solution.

8   Conclusion

In this paper, we presented our contribution to the Shared Task "Swiss German Speech to Standard German Text" at SwissText 2021. Our solution combines the outputs of three models based on the Fairseq, Jasper and Wav2vec XLSR-53 architectures. Because of time and resource constraints, we used only the labelled data set. Out of the 21 experiments we performed with the models, including transcript post-processing and ensembling, we achieved the best result by applying an ensembling method on the outputs of the Fairseq model F-E2 (BLEU score of 35.80) as the primary model, and Jasper-PL (32.97) and wav2vec XLSR-53 FT (30.39) as auxiliary models. We processed the three models' predictions with a majority voting algorithm and in this way retrieved the most reliable candidate out of the provided translations for each utterance in the public test set. With this solution, we achieved a BLEU score of 39.39 on the private test set, which resulted in the third place out of four contributors in the competition.
   Swiss German is a low-resource language, which makes training an STT or a Speech Translation system a challenging task. However, our experiments show that applying ensembling both to various models of the same architecture (as in the Fairseq models F-E1, F-E2 and F-E3) and to models based on various architectures (as implemented in our final solution) trained with limited data can lead to a score improvement of several BLEU points. Pseudo-labelling is another approach which contributes to model enhancement, as we could observe with the Jasper-PL model. We will be further investigating these two methods, aiming to improve the results despite the limited data currently available for Swiss German.

References

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Facebook AI.

Alexandre Bérard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text Translation. arXiv preprint arXiv:1612.01744.

Matthias Büchi, Malgorzata Anna Ulasik, Manuela Hürlimann, Fernando Benites, Pius von Däniken, and Mark Cieliebak. 2020. ZHAW-InIT at GermEval 2020 Task 4: Low-Resource Speech-to-Text. In Proceedings of the 5th Swiss Text Analytics Conference (SwissText) & 16th Conference on Natural Language Processing (KONVENS). CEUR-WS.

Changhan Wang, Juan Pino, and Jiatao Gu. 2020. Improving Cross-Lingual Transfer Learning for End-to-End Speech Recognition with Speech Translation.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

Eugen Dieth and Christian Schmid-Cadalbert. 1986. Schwyzertütschi dialäktschrift. Sauerländer, Aarau, 2.
Wolf Garbe. 2020. SymSpell: Fast spell correction algorithm.

Jinxi Guo, Tara N. Sainath, and Ron J. Weiss. 2019. A Spelling Correction Model for End-to-End Speech Recognition.

Oleksii Hrinchuk, Mariya Popova, and Boris Ginsburg. 2019. Correction of Automatic Speech Recognition with Transformer Sequence-to-sequence Model.

Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerdà, Javier Jorge, Nahuel Roselló, Adrià Giménez, Albert Sanchis, Jorge Civera, and Alfons Juan. 2020. Europarl-ST: A Multilingual Corpus for Speech Translation of Parliamentary Debates. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8229–8233.

Jacob Kahn, Ann Lee, and Awni Hannun. 2020. Self-training for End-to-End Speech Recognition. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7084–7088. IEEE.

Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M Cohen, Huyen Nguyen, and Ravi Teja Gadde. 2019. Jasper: An End-to-End Convolutional Neural Acoustic Model. In Proceedings of Interspeech 2019, pages 71–75.

Junwei Liao, Yu Shi, Ming Gong, Linjun Shou, Sefik Eskimez, Liyang Lu, Hong Qu, and Michael Zeng. 2021. Generating Human Readable Transcript for Automatic Speech Recognition with Pre-trained Language Model.

H. Ney. 1999. Speech Translation: Coupling of Recognition and Translation. In 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258), volume 1, pages 517–520.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. 2019. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition.

Pelin Dogan-Schönberger, Julian Mäder, and Thomas Hofmann. 2021. SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German.

Michel Plüss, Lukas Neukom, and Manfred Vogel. 2021. SwissText 2021 Task 3: Swiss German Speech to Standard German Text. In preparation.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.

Tanja Schultz, S. Jou, S. Vogel, and S. Saleem. 2004. Using Word Lattice Information for a Tighter Coupling in Speech Translation Systems. In INTERSPEECH.

Beat Siebenhaar. 2003. Sprachgeographische Aspekte der Morphologie und Verschriftung in schweizerdeutschen Chats. Linguistik online, 15(3).

Beat Siebenhaar and Alfred Wyler. 1997. Dialekt und Hochsprache in der deutschsprachigen Schweiz. Pro Helvetia.

Matthias Sperber and Matthias Paulik. 2020. Speech Translation and the End-to-End Promise: Taking Stock of Where We Are. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7409–7421.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing.

Tanja Samardzic, Yves Scherrer, and Elvira Glaser. 2016. ArchiMob - A Corpus of Spoken Swiss German.

A. Waibel, A.N. Jain, A.E. McNair, H. Saito, A.G. Hauptmann, and J. Tebelskis. 1991. JANUS: a speech-to-speech translation system using connectionist and symbolic processing strategies. In [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing, pages 793–796, vol. 2.

Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, and Juan Pino. 2020a. fairseq S2T: Fast Speech-to-Text Modeling with fairseq.

Chengyi Wang, Yu Wu, Shujie Liu, Zhenglu Yang, and Ming Zhou. 2020b. Bridging the Gap between Pre-Training and Fine-Tuning for End-to-End Speech Translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9161–9168.

M. Woszczyna, N. Coccaro, A. Eisele, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Sloboda, M. Tomita, J. Tsutsumi, N. Aoki-Waibel, A. Waibel, and W. Ward. 1993. Recent Advances in Janus: A Speech Translation System. In Proceedings of the Workshop on Human Language Technology, HLT '93, pages 211–216, USA. Association for Computational Linguistics.

Qiantong Xu, Tatiana Likhomanenko, Jacob Kahn, Awni Hannun, Gabriel Synnaeve, and Ronan Collobert. 2020. Iterative Pseudo-Labeling for Speech Recognition. In Proceedings of Interspeech 2020, pages 1006–1010.