=Paper=
{{Paper
|id=Vol-2957/sg_paper3
|storemode=property
|title=ZHAW-CAI: Ensemble Method for Swiss German Speech to Standard German Text (short paper)
|pdfUrl=https://ceur-ws.org/Vol-2957/sg_paper3.pdf
|volume=Vol-2957
|authors=Malgorzata Anna Ulasik,Manuela Hürlimann,Bogumila Dubel,Yves Kaufmann,Silas Rudolf,Jan Deriu,Katsiaryna Mlynchyk,Hans-Peter Hutter,Mark Cieliebak
|dblpUrl=https://dblp.org/rec/conf/swisstext/UlasikHDKRDMHC21
}}
==ZHAW-CAI: Ensemble Method for Swiss German Speech to Standard German Text (short paper)==
ZHAW-CAI: Ensemble Method for Swiss German Speech to Standard German Text

Malgorzata Anna Ulasik, Manuela Hürlimann, Bogumila Dubel, Yves Kaufmann, Silas Rudolf, Jan Deriu, Katsiaryna Mlynchyk, Hans-Peter Hutter, and Mark Cieliebak
Centre for Artificial Intelligence, Zurich University of Applied Sciences
{ulas, hueu, deri, mlyn, huhp, ciel}@zhaw.ch, bodubel@gmail.com, y.kaufmann@yagan.ch, silasrudolf@gmail.com

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

This paper presents the contribution of ZHAW-CAI to the Shared Task "Swiss German Speech to Standard German Text" at the SwissText 2021 conference. Our approach combines three models based on the Fairseq, Jasper and Wav2vec architectures, trained on multilingual, German and Swiss German data. We applied an ensembling algorithm to the predictions of the three models in order to retrieve the most reliable candidate out of the provided translations for each spoken utterance. With the ensembling output, we achieved a BLEU score of 39.39 on the private test set, which gave us the third place out of four contributors in the competition.

1 Introduction

Speech-to-Text (STT) enables transcribing spoken utterances into text. For successfully performing a transformation from speech to text, the existence of a standardised writing system of the target language is of prime importance. This is where Swiss German¹ poses a substantial challenge: it does not have a standardised orthography, since it functions as the default spoken language in both formal and informal situations, while Standard German is used for writing. This phenomenon, called "medial diglossia" (Siebenhaar and Wyler, 1997), occurs in the entire German-speaking part of Switzerland, which is additionally characterised by high dialect diversity. Swiss German is increasingly used for writing in informal contexts, but since there is no single standard writing system, Swiss German speakers usually write phonetically in their local dialect in informal situations (Siebenhaar, 2003). On formal occasions such as work meetings and political debates, speech is typically transcribed into Standard German. As there is considerable linguistic distance between the Swiss German dialects and Standard German, developing a model for transcribing Swiss German speech into Standard German text actually involves Speech Translation, which combines STT with Machine Translation (MT) (Bérard et al., 2016).

As a response to the Shared Task "Swiss German Speech to Standard German Text" organised at SwissText 2021, we provided a solution consisting of three models based on different architectures: Fairseq (Wang et al., 2020a), Jasper (Li et al., 2019) and Wav2vec XLSR-53 (Baevski et al., 2020), which were trained with various data sets in both Standard German and Swiss German. Their predictions were subsequently fed into a majority voting algorithm with the aim of selecting the most reliable translation.

The remainder of this paper is structured as follows: Section 2 describes the Shared Task and Section 3 discusses relevant literature. In Section 4 we present the systems which make up our final solution, their architectures and the training data used. In Section 5 we provide an overview of all experiments performed with these models and their outputs. Section 6 lays out the ensembling approach and Section 7 presents the post-processing experiments we performed on the models' predictions. The paper ends with a conclusion in Section 8.

¹ To be precise, there is no single "Swiss German" language, but rather a collection of many different regional dialects that are subsumed under this term.
2 Shared Task Description

The goal of the Shared Task was to build a system for translating speech in any Swiss German dialect into Standard German text (Plüss et al., 2021). The organisers provided a labelled data set containing 293 hours of audio recordings, mostly in the Bernese dialect, transcribed in Standard German. Since the alignment between the recordings and the transcripts was done automatically, each utterance has an Intersection over Union (IoU) score reflecting its alignment quality. Additionally, there was an unlabelled data set consisting of 1208 hours of recordings, mostly in the Zurich dialect. The solutions were evaluated on a 13-hour test set, which contains recordings of speakers from all German-speaking parts of Switzerland. The dialect distribution of the test set is close to the actual Swiss German dialect distribution in Switzerland.

The translation accuracy of the provided solutions is measured using BLEU, a standard metric for automatic evaluation of machine translation (Papineni et al., 2002). The approach consists in counting n-grams in the candidate translation matching n-grams in the reference translation, without taking the word order into account. The metric ranges from 0 to 100: a perfect match results in a score of 100, while a score of 0 occurs if there are no matches. The tool used by the organisers for evaluating solutions is the NLTK implementation of the BLEU score with default parameters². Prior to evaluation, both the references and the translations are normalised: the utterances are lowercased, the punctuation is removed, the numbers are spelled out, and all non-ASCII characters except for the letters "ä", "ö", "ü" are removed.

The test set was split into a public and a private subset of equal size. For all evaluations presented in this paper, the public test set was used.

² https://www.nltk.org/api/nltk.translate.html#nltk.translate.bleu_score.corpus_bleu
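To make the evaluation protocol concrete, here is a minimal sketch of how it can be reproduced with NLTK's corpus_bleu. The normalisation follows the description above (numbers are assumed to be spelled out beforehand, e.g. with num2words); the organisers' actual script may differ in detail.

```python
import re
from nltk.translate.bleu_score import corpus_bleu

def normalise(text):
    """Normalise as described above: lowercase, drop punctuation and all
    characters outside a-z plus the umlauts ä/ö/ü (numbers are assumed
    to be spelled out beforehand, e.g. with num2words)."""
    text = text.lower()
    text = re.sub(r"[^a-zäöü ]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def bleu(references, hypotheses):
    """Corpus-level BLEU (0-100) with NLTK default parameters."""
    refs = [[normalise(r).split()] for r in references]  # one reference per utterance
    hyps = [normalise(h).split() for h in hypotheses]
    return 100 * corpus_bleu(refs, hyps)

print(bleu(["Das ist ein Test."], ["das ist ein test"]))  # -> 100.0
```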
3 Related Work

Speech Translation (ST) is the task of translating spoken text in a source language to text or speech in a target language. The approaches to solving this problem can be put into two categories: cascading approaches and end-to-end approaches (Sperber and Paulik, 2020).

Cascaded Approaches split the task into two steps: first, an STT model transcribes speech in the source language to text in the source language, and then a machine translation (MT) module translates the generated text into the target language (Waibel et al., 1991). The main issue with the cascaded approach is the fact that errors made by the STT module are propagated to the MT module (Ney, 1999). Thus, efforts are put into coupling the STT and MT modules to prevent error propagation, for instance by generating multiple hypotheses of the STT system via n-best search or the creation of lattices (Woszczyna et al., 1993; Schultz et al., 2004).

End-to-End Approaches model ST as a single task, where the input is speech in the source language and the output consists of text or speech in the target language. The main issue with this modelling approach is the lack of sufficient training data. Whereas data for STT typically consists of several hundreds of hours of transcribed data, most ST datasets contain only a fraction of this amount. For instance, the Europarl-ST corpus contains on average only 42 hours of transcribed data per language pair (Iranzo-Sánchez et al., 2020), whereas the Librispeech STT corpus contains around 1000 hours of transcribed data (Panayotov et al., 2015). For this reason, end-to-end approaches nowadays rely on leveraging multi-task learning and single-language pre-training of the STT and MT submodules and use the ST dataset for fine-tuning (Wang et al., 2020b).

Most cascading approaches rely on data where access to both the source language transcript and its target language translation is needed. However, in our scenario we do not have access to written text of the source language, since Swiss German is a spoken language and thus often directly transcribed into Standard German (see Section 1 for more details). Our models therefore follow the end-to-end approach.

4 Systems Description

This section describes the architecture of the three models which build the foundation for the experiments presented in Section 5 and are components of the final solution, which combines the three models' outputs in an ensembling algorithm. The section also explains what data was used for training the models.

4.1 Fairseq

4.1.1 Model

Fairseq is based on the transformer architecture for Speech-to-Text provided by the Fairseq S2T Toolkit (Wang et al., 2020a), which combines the tasks of STT and ST under the same encoder-decoder architecture (Wang et al., 2020c). The experiments were trained with the small transformer model with 256 dimensions, a 12-layer encoder, a 6-layer decoder, 27M parameters, the Adam optimiser, and an inverse square root learning rate scheduler. Decoding is executed with a character-based SentencePiece model (Kudo and Richardson, 2018) using an n-best decoding strategy with n=5. The acoustic model (encoder) can be pre-trained with the same transformer architecture as described above.

4.1.2 Data

The audios were extracted to 80-dimensional log mel-scale filterbank features (windows with 25 ms size and 10 ms shift) and saved in NumPy format for the training. To alleviate overfitting, the SpecAugment speech data transform (Park et al., 2019), as adopted by Fairseq S2T, was applied. For text normalisation we used the script provided by the task organisers. Additional numbers were spelled out using num2words³.

³ https://pypi.org/project/num2words/
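Fairseq S2T ships its own feature-extraction pipeline; the sketch below shows roughly equivalent preprocessing with torchaudio. The file name and the SpecAugment masking parameters are illustrative, not the exact values used here.

```python
import numpy as np
import torch
import torchaudio

# Compute 80-dim log mel filterbank features (25 ms window, 10 ms shift)
# for one utterance; "utterance.wav" is a placeholder, 16 kHz mono assumed.
waveform, sample_rate = torchaudio.load("utterance.wav")
feats = torchaudio.compliance.kaldi.fbank(
    waveform,
    num_mel_bins=80,
    frame_length=25.0,  # window size in ms
    frame_shift=10.0,   # shift in ms
    sample_frequency=sample_rate,
)  # -> tensor of shape (num_frames, 80)

np.save("utterance.npy", feats.numpy())  # saved in NumPy format for training

# SpecAugment-style masking at training time; mask widths are illustrative.
specaugment = torch.nn.Sequential(
    torchaudio.transforms.FrequencyMasking(freq_mask_param=27),
    torchaudio.transforms.TimeMasking(time_mask_param=100),
)
augmented = specaugment(feats.T.unsqueeze(0)).squeeze(0).T  # masks (freq, time)
```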
We used three additional datasets:

• SwissDial (Dogan-Schönberger et al., 2021): 26 hours of Swiss German
• ArchiMob (Samardzic et al., 2016): 80 hours of Swiss German
• Common Voice German v4⁴: 483 hours of German

The SwissDial dataset consists of 26 hours of audio in 8 different Swiss dialects with corresponding transcriptions in the Swiss dialect and Standard German translations. The Swiss German transcription rules differ between dialects. ArchiMob contains 70 hours of audio in 14 different Swiss dialects with transcriptions in Swiss German, where each word is additionally provided with a Standard German normalisation. The transcription rules are normalised and equal for all dialects (Dieth transcription, Dieth and Schmid-Cadalbert (1986)). Common Voice German v4 consists of 483 hours of audio in Standard German with corresponding transcriptions.

4.2 Jasper

4.2.1 Model

We used the Jasper (Li et al., 2019) configuration corresponding to our best submission in the predecessor of this Shared Task (Büchi et al., 2020). The acoustic model as per Büchi et al. (2020) consists of 10x5 blocks and was pre-trained on 537 hours of Standard German data (see Büchi et al. (2020), Table 2). In all reported experiments, we fine-tuned five blocks on the Shared Task data as described in Section 5.2 below. We used last year's extended language model, a 6-gram model trained with KenLM, without further fine-tuning on this year's data. For the data sources, see Table 2 in Büchi et al. (2020). Decoding was done using beam search with a beam size of 1024.

4.2.2 Data

We extracted the audios to 64-dimensional mel-filterbank features with 20 ms window size and 10 ms overlap as input to the Jasper acoustic model. The reference texts were preprocessed as described in Büchi et al. (2020). No additional Swiss German audio data was used for training Jasper.

4.3 Wav2vec XLSR-53

4.3.1 Model

Wav2vec XLSR-53 is a cross-lingual extension of wav2vec 2.0 as per Baevski et al. (2020). Pre-trained on 53 different languages, it attempts to learn a quantisation of the latent representations shared across languages by solving a contrastive task over masked speech representations. In the experiments below, we fine-tuned wav2vec XLSR-53 on the Shared Task data. No explicit language model was used.

4.3.2 Data

The labelled data used for fine-tuning XLSR-53 was based on the task training data. However, it was further pre-processed by removing all utterances which contained special characters or were detected as not being in German using langdetect⁵. Numeric values were replaced by strings using num2words⁶ (a sketch of this pre-processing follows below).

⁴ https://commonvoice.mozilla.org/en/datasets/
⁵ https://github.com/Mimino666/langdetect
⁶ https://pypi.org/project/num2words/
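A minimal sketch of the XLSR-53 pre-processing just described, under the assumptions above; the exact definition of "special characters" is not given in the paper, so the ALLOWED pattern is hypothetical.

```python
import re
from langdetect import detect
from num2words import num2words

# "Special characters" are not enumerated in the paper; this pattern is a guess.
ALLOWED = re.compile(r"^[a-zA-ZäöüÄÖÜß0-9 .,?!'-]+$")

def spell_out_numbers(text):
    """Replace numeric tokens by German words, e.g. '3' -> 'drei'."""
    return re.sub(r"\d+", lambda m: num2words(int(m.group()), lang="de"), text)

def keep(text):
    """Drop utterances with disallowed characters or non-German text."""
    if not ALLOWED.match(text):
        return False
    try:
        return detect(text) == "de"
    except Exception:  # langdetect raises on degenerate input
        return False

transcripts = ["Wir treffen uns um 3 Uhr.", "☺ ok"]  # toy examples
cleaned = [spell_out_numbers(t) for t in transcripts if keep(t)]
```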
5 Experiments on Individual Models

Sections 5.1 through 5.3 present the experiments we performed to improve the individual models and provide the BLEU scores achieved in each experiment. We also discuss approaches to improving the model outputs with the use of ensembling (Section 6) and post-processing (Section 7).

5.1 Fairseq

Below we describe the different models and experimental results obtained with Fairseq. All experiments are trained with the same configuration as described in Section 4.1 and can be divided into three groups: extension of the training data, inclusion of a pre-trained encoder, and ensembling.

5.1.1 Extending the training data

Fairseq F-SP-0.9: For F-SP-0.9 we trained the model from scratch on the Shared Task training data. We used 176 hours, corresponding to an Intersection over Union (IoU) greater than or equal to 0.9 (a sketch of this filtering step follows after Table 1).

Fairseq F-SP-All: We noted that the model F-SP-0.9 generalises very poorly, so for F-SP-All we trained a new model with the entire task training data, which corresponds to 293 hours. Despite partially poorly aligned translations, the model benefits from the additional data: the BLEU score improves by about 4.32 points.

Fairseq F-SP-SD: We decided to extend the training data with the SwissDial corpus. For this, we trained a new model F-SP-SD on the entire task training data plus all data from SwissDial. This data extension improves the score by an additional 4.81 BLEU points in comparison to F-SP-All.

5.1.2 Including a pre-trained encoder

Fairseq F-SP-DE: We also investigated how to improve the encoder (acoustic model). We pre-trained a Standard German (DE) encoder on the Common Voice German v4 dataset. For F-SP-DE, we added the pre-trained encoder and trained the model on the entire Shared Task training data. Including the DE encoder improves the score by 3.36 BLEU points in comparison to F-SP-All.

Fairseq F-SP-SD-DE: Since both F-SP-SD and F-SP-DE improved the BLEU score, we decided to bring the two approaches together. We trained a new model F-SP-SD-DE on the entire Shared Task training data plus the SwissDial data and included the pre-trained DE encoder in the training. This brings an improvement of 8.37 BLEU points in comparison to F-SP-All.

Fairseq F-SP-AM-DE: In this model we used the entire task training data plus the data from ArchiMob. For the training we included the pre-trained DE encoder. This setup improves the BLEU score by 14.01 points in comparison to F-SP-All.

Fairseq F-SP-SD-CH: In order to further improve the acoustic model, we pre-trained a Swiss German (CH) encoder on the SwissDial and ArchiMob datasets. We then trained a new model F-SP-SD-CH on the entire Shared Task training data plus SwissDial and included the pre-trained CH encoder in the training. The BLEU score improves by 12.54 points in comparison to F-SP-All.

5.1.3 Ensembling

Fairseq Ensemble F-SP-SD & F-SP-DE (F-E1): In this experiment, we ensembled the models F-SP-SD and F-SP-DE. F-E1 achieves a BLEU score of 28.74; ensembling is done with the implementation provided by the Fairseq S2T Toolkit⁷. In comparison to F-SP-SD-DE, which combines the same SwissDial training data as F-SP-SD and the same DE encoder as F-SP-DE in a single training setup, the ensembling performs slightly better. In comparison to F-SP-All the BLEU score improves by 9.94 points.

Fairseq Ensemble F-SP-AM-DE & F-SP-SD-CH (F-E2): After the good performance of F-E1, we decided to ensemble F-SP-AM-DE and F-SP-SD-CH. This ensembling improves the BLEU score in comparison to F-SP-All by 17.00 points.

Fairseq F-E2 extended (F-E3): Finally, we trained a model on the entire available data for Swiss German (task, SwissDial and ArchiMob) and used this model to perform ensembling on top of F-E2. For time reasons, we were not able to complete the training, and the output of this model could not be included in the final solution presented in Section 6. We only evaluated an intermediate state of the model and achieved a score of 36.83 BLEU points, an improvement of 18.03 points in comparison to F-SP-All.

Table 1 shows the BLEU scores obtained with the Fairseq models on the public part of the Shared Task test set, along with the applied train sets and encoders. F-E3 achieved the best performance with a BLEU score of 36.83 on the public part of the test set (37.4 on the private part). In addition to ensembling, the inclusion of a CH encoder in the training process as well as the extension of the training data with the ArchiMob corpus benefited the model performance most.

Table 1: Fairseq results.

Model | Train set | Encoder | BLEU
F-SP-0.9 | task 0.9 | training | 14.48
F-SP-All | task all | training | 18.80
F-SP-SD | task, SwissDial | training | 23.61
F-SP-DE | task | DE | 22.16
F-SP-SD-DE | task, SwissDial | DE | 27.17
F-SP-AM-DE | task, ArchiMob | DE | 32.81
F-SP-SD-CH | task, SwissDial | CH | 31.34
F-E1 | - | - | 28.74
F-E2 | - | - | 35.80
F-E3 | - | - | 36.83

⁷ https://github.com/pytorch/fairseq/issues/223
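As an illustration of the IoU-based filtering used here (and again in Sections 5.2 and 5.3), the sketch below filters a training manifest by alignment score. The manifest layout and column names are assumptions, not the Shared Task's actual format.

```python
import pandas as pd

# Hypothetical manifest: one row per utterance with its audio path, transcript,
# duration in seconds and automatic alignment IoU score (column names assumed).
manifest = pd.read_csv("train.tsv", sep="\t")

subset = manifest[manifest["iou"] >= 0.9]  # keep only well-aligned utterances
print(f"kept {subset['duration'].sum() / 3600:.0f} h "
      f"of {manifest['duration'].sum() / 3600:.0f} h")  # ~176 h of 293 h per the paper

subset.to_csv("train_iou_0.9.tsv", sep="\t", index=False)
```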
Using the enhanced Model Train set Encoder BLEU audio data does not confer any advantage on either F-SP-0.9 task 0.9 training 14.48 prediction or pseudo-label fine-tuning compared F-SP-All task all training 18.8 to the as-is data. We can, however, see the bene- F-SP-SD task, SwissDial training 23.61 fit of rather naive pseudo-labelling in this setting F-SP-DE task DE 22.16 where training and testing data are quite different. F-SP-SD-DE task, SwissDial DE 27.17 Future work could expand on the use of pseudo- F-SP-AM-DE task, ArchiMob DE 32.81 labelling by using more advanced setups, such as F-SP-SD-CH task, SwissDial CH 31.34 confidence-based (Kahn et al., 2020) or iterative F-E1 - - 28.74 (Xu et al., 2020) pseudo-labelling. F-E2 - - 35.80 F-E3 - - 36.83 Table 2: Jasper results. 5.2 Jasper Model Test set BLEU Jasper-FT task 30.8 Below we describe the different models and exper- Jasper-FT enhanced 26.4 imental results obtained with Jasper. Jasper-PL task 32.97 Jasper-FT For Jasper-FT we fine-tune the Jasper-PL enhanced 31.92 pre-trained Standard German model on the Shared Jasper-PL-E enhanced 32.92 Task training data. We used 169 hours, sampled from the set with an IoU greater or equal to 0.9, 5.3 Wav2vec XLSR-53 which were augmented to 507 hours using 90% and Below we describe the model and experimental 110% speed perturbation as in Büchi et al. (2020). results obtained with wav2vec XLSR-53. Jasper-PL We noted that the task test set dif- wav2vec XLSR-53 FT For wav2vec fers acoustically from the training data since dif- XLSR-53 FT we fine-tuned the pre-trained ferent dialects are present and the audio quality baseline (as published on HuggingFace9 ) on the tends to be lower. This motivated the creation of Shared Task training data. We used 227 hours, Jasper-PL, where we used pseudo-labeling on corresponding to an IoU greater or equal than 0.8. the test set. More precisely, we used the hypothe- The data was pre-processed as outlined in Section ses of Jasper-FT on the task test set to fine-tune 4.3.2. Jasper-FT for 20 additional epochs. Table 3: wav2vec XLSR-53 result. Jasper-PL-E We decided to further work on the (comparatively) low-quality audio of the task test Model Train set BLEU set and used the Dolby Media Enhance API v1.18 wav2vec XLSR-53 FT task 0.8 30.39 to create an ”enhanced” version of the task test set. The Enhance API automatically improves the 6 Ensembling quality of audio files, e.g. by correcting the volume and reducing noise and hum. We then fine-tuned Having trained and evaluated the three models de- Jasper-FT on this data, this time using the hy- scribed in Sections 4.1, 4.2 and 4.3, we performed potheses provided by Jasper-PL as labels since experiments with two ensembling methods: ma- these achieve a higher BLEU score. jority voting and a hybrid technique combining majority voting with perplexity calculation. We used the outputs of the best-performing models of each of the three systems, aiming to select the 8 9 https://dolby.io/developers/ https://huggingface.co/facebook/ media-processing/api-reference/enhance wav2vec2-large-xlsr-53 most reliable translation for each utterance from Table 4: Ensembling results. The BLEU score achieved among them. The best-performing models were by each model separately and the BLEU score resulting from applying ensembling methods on the models’ out- F-E2 (BLEU score of 35.8010 ), Jasper-PL puts (Majority Voting and Hybrid Ensembling) (BLEU score of 32.97) and wav2vec XLSR-53 FT (BLEU score of 30.4). 
6 Ensembling

Having trained and evaluated the three models described in Sections 4.1, 4.2 and 4.3, we performed experiments with two ensembling methods: majority voting and a hybrid technique combining majority voting with perplexity calculation. We used the outputs of the best-performing model of each of the three systems, aiming to select the most reliable translation for each utterance from among them. The best-performing models were F-E2 (BLEU score of 35.80¹⁰), Jasper-PL (BLEU score of 32.97) and wav2vec XLSR-53 FT (BLEU score of 30.39).

The models were first categorised based on their BLEU scores into a primary, a first auxiliary and a second auxiliary model. F-E2 with the highest score was selected as the primary model, Jasper-PL with the second-best score was set as the first auxiliary model, and wav2vec XLSR-53 FT was used as the second auxiliary model.

In the first step, we aligned the hypotheses of the three models and extracted the text passages where all three hypotheses agree, leaving only text excerpts where the hypotheses disagree.

Majority Voting (MV): The majority voting consisted in collecting votes for each text excerpt defined in the previous step: a particular hypothesis receives a vote for each word it has in common with any other hypothesis. The hypothesis with the most votes is chosen as the best candidate translation. If multiple hypotheses score the same, the output of the model categorised higher in the hierarchy (primary, first auxiliary, second auxiliary) is selected (a sketch follows after Table 4).

Hybrid Ensembling (HE): The hybrid ensembling method combines majority voting with perplexity calculation. If more than one hypothesis scores maximum and the hypotheses with the maximum score are not equal, the perplexity of the hypotheses is calculated. To this end, we extended the particular text excerpt with 3 context words preceding and following the excerpt. For these text segments, we calculated the perplexity with a pre-trained uncased German BERT model¹¹. The hypothesis with the lower perplexity was selected.

The results of the experiments are presented in Table 4. Of the two algorithms we applied to the data, better results were achieved with majority voting: the BLEU score improved by 2.9 points, from 35.80 to 38.70, compared to the result of the best individual model (F-E2).

Table 4: Ensembling results: the BLEU score achieved by each model separately and the BLEU scores resulting from applying the ensembling methods (Majority Voting and Hybrid Ensembling) to the models' outputs.

F-E2 | Jasper-PL | wav2vec XLSR-53 FT | MV | HE
35.80 | 32.97 | 30.39 | 38.70 | 37.62

¹⁰ F-E3, as a last-minute submission, could not be used for the ensembling.
¹¹ https://github.com/dbmdz/berts#german-bert
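To make the voting rule concrete, here is a minimal sketch for a single disputed excerpt. It assumes the alignment step has already produced one variant per model, ordered by hierarchy; the alignment itself and the perplexity tie-break of the hybrid method are omitted.

```python
def majority_vote(excerpts):
    """Resolve one disputed excerpt by majority voting.

    `excerpts` holds the aligned variants in hierarchy order: primary
    model first, then first and second auxiliary model.  A variant earns
    one vote per word it shares with any other variant; ties go to the
    model ranked higher in the hierarchy.
    """
    def votes(i):
        others = [set(e.split()) for j, e in enumerate(excerpts) if j != i]
        return sum(1 for w in excerpts[i].split()
                   if any(w in other for other in others))

    # Higher vote count wins; on ties, the smaller index (higher rank) wins.
    best = max(range(len(excerpts)), key=lambda i: (votes(i), -i))
    return excerpts[best]

# Toy example: two models agree on "gehe", so the deviating variant loses.
print(majority_vote(["ich gehe heim", "ich gehen heim", "ich gehe heim"]))
```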
7 Transcript Post-processing

Complementary to the language models used for speech recognition, we evaluated an approach to using text-only data by training a supervised "spelling correction" (SC) model to explicitly correct the errors made by the STT model. Instead of predicting the likelihood of emitting a word based on the surrounding context, the SC model only needs to identify likely errors in the STT model output and propose alternatives. Intuitively, this task highly depends on the baseline model's quality: if the model transcribes very well, the task reduces to simply copying the input transcript directly to the output.

Most recent approaches to transcript post-processing use a transformer-based method: Liao et al. (2021) use a modified RoBERTa structure and show an increase of 17.53 BLEU points on the self-augmented English Conversational Telephone Speech data set. On the LibriSpeech dataset, Hrinchuk et al. (2019) show promising results using a pre-trained BERT as initialisation for their spelling correction model, while Guo et al. (2019) take a different approach with a bidirectional LSTM.

We compared different Transformer architectures with their corresponding open-sourced pre-trained models and other post-processing methods. The objective for all transformer models was set to next-sentence prediction (sequence-to-sequence generation) with a vocabulary size of 30'000, a batch size of 16, and a beam size of 5 for beam search. The models were initialised with pre-trained German embeddings and fine-tuned for up to 120'000 steps on the Shared Task training set described in Section 2. We evaluated:

• BERT (Devlin et al., 2018), having both encoder and decoder initialised with pre-trained weights.
• DistilBERT (Sanh et al., 2020), the lightweight alternative to BERT, reducing the training time by up to 60%.
• ELECTRA (Clark et al., 2020), which uses a more sample-efficient pre-training approach for the encoder, called replaced token detection.
• SymSpell (Garbe, 2020), a spelling correction algorithm based on Damerau-Levenshtein distances to entries of a pre-trained dictionary.

Table 5 shows the BLEU scores on the public test set when performing post-processing on the output of the majority voting algorithm described in Section 6. The baseline refers to the BLEU score of the unprocessed majority voting output.

Table 5: Post-processing BLEU scores on the public test set.

System | Baseline | Post-processed
BERT | 38.70 | 23.26
DistilBERT | 38.70 | 26.66
ELECTRA | 38.70 | 14.77
SymSpell | 38.70 | 30.65

As the evaluations show, most post-processing attempts decrease the overall BLEU score, with SymSpell, the most straightforward approach, performing best (a usage sketch follows below). Compared with previous work in this area, this could be explained by the limited amount of data available for training the transformer models. Due to the lack of performance, we excluded the post-processing step from our final solution.
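For illustration, a minimal sketch of SymSpell-style correction using the symspellpy port; the paper does not state which implementation was used, and the frequency dictionary file is hypothetical (e.g. derived from the training transcripts).

```python
from symspellpy import SymSpell, Verbosity

# "de_frequency.txt" is a hypothetical "word count" frequency dictionary,
# e.g. built from the Shared Task training transcripts.
sym_spell = SymSpell(max_dictionary_edit_distance=2)
sym_spell.load_dictionary("de_frequency.txt", term_index=0, count_index=1)

def correct(sentence):
    """Replace each word by its closest in-dictionary candidate."""
    corrected = []
    for word in sentence.split():
        suggestions = sym_spell.lookup(word, Verbosity.TOP,
                                       max_edit_distance=2, include_unknown=True)
        corrected.append(suggestions[0].term)  # include_unknown keeps OOV words
    return " ".join(corrected)

print(correct("das ist ein tset"))  # -> "das ist ein test", given a suitable dictionary
```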
8 Conclusion

In this paper, we presented our contribution to the Shared Task "Swiss German Speech to Standard German Text" at SwissText 2021. Our solution combines the outputs of three models based on the Fairseq, Jasper and Wav2vec XLSR-53 architectures. Because of time and resource constraints, we used only the labelled data set. Out of the 21 experiments we performed with the models, including transcript post-processing and ensembling, we achieved the best result by applying an ensembling method to the outputs of the Fairseq model F-E2 (BLEU score of 35.80) as the primary model, and Jasper-PL (32.97) and wav2vec XLSR-53 FT (30.39) as auxiliary models. We processed the three models' predictions with a majority voting algorithm and this way retrieved the most reliable candidate out of the provided translations for each utterance in the public test set. With this solution, we achieved a BLEU score of 39.39 on the private test set, which resulted in the third place out of four contributors in the competition.

Swiss German is a low-resource language, which makes training an STT or Speech Translation system a challenging task. However, our experiments show that applying ensembling both to various models of the same architecture (as in the Fairseq models F-E1, F-E2 and F-E3) and to models based on different architectures (as implemented in our final solution), trained with limited data, can lead to a score improvement of several BLEU points. Pseudo-labelling is another approach which contributes to model enhancement, as we observed with the Jasper-PL model. We will further investigate these two methods, aiming to improve the results despite the limited data currently available for Swiss German.

References

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Facebook AI.

Alexandre Bérard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text Translation. arXiv preprint arXiv:1612.01744.

Matthias Büchi, Malgorzata Anna Ulasik, Manuela Hürlimann, Fernando Benites, Pius von Däniken, and Mark Cieliebak. 2020. ZHAW-InIT at GermEval 2020 Task 4: Low-Resource Speech-to-Text. In Proceedings of the 5th Swiss Text Analytics Conference (SwissText) & 16th Conference on Natural Language Processing (KONVENS). CEUR-WS.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

Eugen Dieth and Christian Schmid-Cadalbert. 1986. Schwyzertütschi Dialäktschrift. Sauerländer, Aarau, 2nd edition.

Pelin Dogan-Schönberger, Julian Mäder, and Thomas Hofmann. 2021. SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German.

Wolf Garbe. 2020. SymSpell: Fast spell correction algorithm.

Jinxi Guo, Tara N. Sainath, and Ron J. Weiss. 2019. A Spelling Correction Model for End-to-End Speech Recognition.

Oleksii Hrinchuk, Mariya Popova, and Boris Ginsburg. 2019. Correction of Automatic Speech Recognition with Transformer Sequence-to-sequence Model.

Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerdà, Javier Jorge, Nahuel Roselló, Adrià Giménez, Albert Sanchis, Jorge Civera, and Alfons Juan. 2020. Europarl-ST: A Multilingual Corpus for Speech Translation of Parliamentary Debates. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8229–8233.

Jacob Kahn, Ann Lee, and Awni Hannun. 2020. Self-training for End-to-End Speech Recognition. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7084–7088. IEEE.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing.

Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M. Cohen, Huyen Nguyen, and Ravi Teja Gadde. 2019. Jasper: An End-to-End Convolutional Neural Acoustic Model. In Proceedings of Interspeech 2019, pages 71–75.

Junwei Liao, Yu Shi, Ming Gong, Linjun Shou, Sefik Eskimez, Liyang Lu, Hong Qu, and Michael Zeng. 2021. Generating Human Readable Transcript for Automatic Speech Recognition with Pre-trained Language Model.

H. Ney. 1999. Speech Translation: Coupling of recognition and translation. In 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258), volume 1, pages 517–520.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition.

Michel Plüss, Lukas Neukom, and Manfred Vogel. 2021. SwissText 2021 Task 3: Swiss German Speech to Standard German Text. In preparation.

Tanja Samardzic, Yves Scherrer, and Elvira Glaser. 2016. ArchiMob - A Corpus of Spoken Swiss German.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.

Tanja Schultz, S. Jou, S. Vogel, and S. Saleem. 2004. Using Word Lattice Information for a Tighter Coupling in Speech Translation Systems. In INTERSPEECH.

Beat Siebenhaar. 2003. Sprachgeographische Aspekte der Morphologie und Verschriftung in schweizerdeutschen Chats. Linguistik online, 15(3).

Beat Siebenhaar and Alfred Wyler. 1997. Dialekt und Hochsprache in der deutschsprachigen Schweiz. Pro Helvetia.

Matthias Sperber and Matthias Paulik. 2020. Speech Translation and the End-to-End Promise: Taking Stock of Where We Are. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7409–7421.

A. Waibel, A. N. Jain, A. E. McNair, H. Saito, A. G. Hauptmann, and J. Tebelskis. 1991. JANUS: a speech-to-speech translation system using connectionist and symbolic processing strategies. In [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing, pages 793–796 vol. 2.

Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, and Juan Pino. 2020a. fairseq S2T: Fast Speech-to-Text Modeling with fairseq.

Chengyi Wang, Yu Wu, Shujie Liu, Zhenglu Yang, and Ming Zhou. 2020b. Bridging the Gap between Pre-Training and Fine-Tuning for End-to-End Speech Translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9161–9168.

Changhan Wang, Juan Pino, and Jiatao Gu. 2020c. Improving Cross-Lingual Transfer Learning for End-to-End Speech Recognition with Speech Translation.

M. Woszczyna, N. Coccaro, A. Eisele, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Sloboda, M. Tomita, J. Tsutsumi, N. Aoki-Waibel, A. Waibel, and W. Ward. 1993. Recent Advances in Janus: A Speech Translation System. In Proceedings of the Workshop on Human Language Technology, HLT '93, pages 211–216, USA. Association for Computational Linguistics.

Qiantong Xu, Tatiana Likhomanenko, Jacob Kahn, Awni Hannun, Gabriel Synnaeve, and Ronan Collobert. 2020. Iterative Pseudo-Labeling for Speech Recognition. In Proceedings of Interspeech 2020, pages 1006–1010.