=Paper=
{{Paper
|id=Vol-2624/germeval-task4-paper1
|storemode=property
|title=GermEval 2020 Task 4: Low-Resource Speech-to-Text
|pdfUrl=https://ceur-ws.org/Vol-2624/germeval-task4-paper1.pdf
|volume=Vol-2624
|authors=Michel Plüss,Lukas Neukom,Manfred Vogel
|dblpUrl=https://dblp.org/rec/conf/swisstext/PlussNV20
}}
==GermEval 2020 Task 4: Low-Resource Speech-to-Text==
<pdf width="1500px">https://ceur-ws.org/Vol-2624/germeval-task4-paper1.pdf</pdf>
<pre>
                GermEval 2020 Task 4: Low-Resource Speech-to-Text

                         Michel Plüss    Lukas Neukom         Manfred Vogel
                                      Institute for Data Science
                  University of Applied Sciences and Arts Northwestern Switzerland
                                        Windisch, Switzerland
                                   michel.pluess@fhnw.ch


Copyright © 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

We present the results and findings of GermEval 2020 Task 4 on Low-Resource
Speech-to-Text. Participants were asked to build a system translating Swiss
German speech to Standard German text and minimize its word error rate. The
task was based on a new dataset for Swiss German to Standard German speech
translation, which contains 74 hours of sentence-level speech-text pairs.
Three teams participated, with the winning contribution reaching a word error
rate of 40.29 %.

1   Introduction

Speech-to-text methods for well-resourced languages like English or Standard
German work very well. Lüscher et al. (2019) set the current state of the art
on the popular LibriSpeech test-other benchmark (Panayotov et al., 2015) with
a word error rate (WER) as low as 5 %. This is in stark contrast to the
situation for Swiss German, the most frequently used spoken language in
Switzerland, for which almost no training data for speech-to-text is publicly
available. Apart from the comparatively low number of speakers of around
5 million, the main reason for this is the lack of a standardized writing
system. Thus, the official written language in the German part of Switzerland
is Standard German, not Swiss German. Despite this, a lot of Swiss German
speakers write in Swiss German, especially in informal conversations using
messaging apps, but they resort to phonetic writing in their local dialect.
The multitude of different written variants for each word makes direct
speech-to-text almost impossible. Therefore, for most use cases, a
speech-to-text method for Swiss German has to simultaneously translate to
Standard German. This combination of speech recognition and translation is
also referred to as speech translation.

For this shared task, we built a new dataset for Swiss German to Standard
German speech translation. We provide data for training containing audio and
text as well as data for testing containing only the audio. The testing data
has to be transcribed and submitted on the task website for evaluation.

The remainder of this paper is structured as follows: The task and the
evaluation of submissions are described in section 2. Our speech translation
dataset is introduced in section 3. An overview of the submissions and results
of this task can be found in section 4. Finally, section 5 wraps up the paper
and gives directions for future work.

2   Task Description and Evaluation

The goal of the task is to build a sentence-level Swiss German to Standard
German speech translation system. The submission with the lowest WER wins. We
chose WER as opposed to the BLEU score (Papineni et al., 2002), which is often
used for machine translation and speech translation, for two main reasons.
Firstly, manual inspection of a sample of our dataset showed that in most
cases, the Standard German transcription does not deviate much from the Swiss
German speech. Secondly, we only have a single reference transcription per
utterance but BLEU would require multiple reference transcriptions to work
well. Additionally, there is not a lot of margin for alternative
transcriptions.

Apart from working with the provided dataset for training, participants were
encouraged to explore data augmentation methods and transfer
learning approaches to build a better model in this low-resource setting.

At evaluation time, the following pre-processing steps are taken before
comparing a submission against the reference transcriptions:

   • Transform to lower case

   • Remove punctuation , ; : . ? !

   • Remove leading and trailing whitespace

Numbers are not normalized in any way.

3   Data

For this task, we created what is, to the best of our knowledge, the first
publicly available dataset¹ for Swiss German to Standard German speech
translation. It consists of 70 (train) + 4 (test) hours of mostly Swiss German
and some Standard German speech from the parliament of the canton of Bern and
corresponding Standard German transcriptions. The Swiss German speech is
predominantly in the Bernese dialect. Some parliament members speak in
Standard German, hence the small part of Standard German speech. Speech-text
pairs consist of a single sentence and were obtained from the raw data using a
fully automated alignment procedure described in Plüss et al. (2020). The raw
data consisted of audio recordings of full meetings, usually between 2 and 3
hours long, and the transcript in a PDF file². While the alignment quality is
fairly good, it is certainly not perfect. The most common errors are missing
or additional words at the beginning or end of a speech utterance compared to
the transcript. Table 1 lists a few examples of this. This obviously makes the
task harder and leads to unavoidable mistakes. The transcripts were
pre-processed as follows:

   • Transform to lower case

   • Replace or remove all characters except a-z, ä, ö, ü, 0-9, space,
     punctuation , ; : . ? !

   • Remove leading and trailing whitespace

Details of the replacement and removal operations can be found in our code on
GitHub³.

The structure of the data is the same as in the Mozilla Common Voice project⁴.
Table 2 gives a short description of the metadata provided in the TSV file.

   ¹ https://drive.switch.ch/index.php/s/PpUArRmN5Ba5C8J
   ² https://www.gr.be.ch/gr/de/index/sessionen/sessionen.html
   ³ https://github.com/festivalhopper/germeval-2020-task-4/blob/master/transcript_preprocessing.py
   ⁴ https://voice.mozilla.org/en/datasets

    Sentence in Recording                                Sentence in Transcript
    ...der Fall. Wir Motionäre wurden zusammen           Wir Motionäre wurden zusammen mit anderen
    mit anderen Interessengruppen sehr schnell           Interessengruppen sehr schnell eingeladen.
    eingela...

    Das Pricing des Stroms ist relativ klar, es ist      Das Pricing des Stroms ist relativ klar, es ist
    geregelt und die Gewinnspanne garantiert. Ich        geregelt und die Gewinnspanne garantiert.
    will...

Table 1: Examples of alignment errors.

    Attribute                                            Description
    client_id                                            Speaker ID
    path                                                 Name of the audio file in the clips folder
    sentence                                             Ground truth Standard German transcription
    up_votes, down_votes, age, gender, accent            Not available in the current version of the
                                                         dataset

Table 2: Description of the metadata in the TSV file of the dataset.

4   Results

Three teams participated in the shared task and submitted a solution. Table 3
shows an overview of the results on the public part of the test set.

    Rank     Team                  WER in %
    1        Büchi et al.          40.29
    2        Kew et al.            45.45
    3        Agarwal and Zesch     58.93

Table 3: Overview of the shared task's results, taken from the public ranking
on the 22nd of May 2020. The WER column shows the word error rate in % on the
public 50 % of the test set.

The team in first place, Büchi et al. (2020), achieved a WER of 40.29 %. Their
approach is based on a CNN acoustic model called Jasper (Li et al., 2019). It
is first trained on additional Standard German data and then fine-tuned on the
task data. The model was trained using the CTC loss function (Graves et al.,
2006). To further improve the results, a language model and data augmentation
methods were applied. Table 4 shows some examples of true and predicted
sentences and the corresponding WER. Sentence 1 is a good prediction,
especially considering that the words “gefährdet ist” are missing in the
recording due to an alignment error. In sentence 2, the alignment is perfect,
but the model chooses the word “Führungskontrolle” rather than
“Feuerungskontrolleur”. The latter is a rarely used word in Swiss German and
therefore hard to get right. Sentence 3 is actually a good prediction of what
can be heard in the recording, but the sentence was reformulated in the
transcription. In this case, the BLEU score with multiple reference
transcriptions would better fit the task. Finally, the predicted sentence
number 4 does not make much sense and shows that the model still has
considerable potential for improvement.

The team in second place, Kew et al. (2020), achieved a WER of 45.45 %. They
follow a DNN-HMM approach for the acoustic model using a time delay neural
network. No additional speech-to-text data is used. They create a
pronunciation lexicon specifically adapted to this task. Like Büchi et al.,
they use a language model and apply data augmentation methods.

The team in third place, Agarwal and Zesch (2020), achieved a WER of 58.93 %.
Their approach is based on DeepSpeech (Hannun et al., 2014), an end-to-end
deep learning system. They use cascaded transfer learning, first training the
model with English data, then transferring to Standard German, then finally to
Swiss German. The ArchiMob (Samardžić et al., 2016) dataset is used as
additional Swiss German training data. Like the other participants, they use a
language model and apply data augmentation methods.

5   Conclusion

We have described GermEval 2020 Task 4 on Low-Resource Speech-to-Text. The
task used a newly created dataset for Swiss German to Standard German speech
translation described in section 3. Three teams participated in the task, with
the winning team reaching a WER of 40.29 %. This is a good result given that
little research has been done on this topic and considering the alignment
errors present in the dataset due to the fully automated alignment procedure.
An open question is how well the winning model would generalize to other Swiss
German to Standard German speech translation datasets or to a Standard German
speech-to-text task.

The evaluation of the results of all teams indicates that data augmentation
methods and language models work well in this low-resource setting. More
details about the individual systems can be found in their respective system
description papers, which are published in the SwissText & KONVENS 2020
proceedings.

We have made the dataset publicly available⁵ to the research community beyond
the GermEval competition, hoping to facilitate future research on this
important topic.

In future work, we plan to minimize errors made by the automatic alignment
procedure and substantially increase the dataset size by aligning additional
raw data.

   ⁵ https://drive.switch.ch/index.php/s/PpUArRmN5Ba5C8J

Acknowledgments

First of all, we would like to thank the parliamentary services of the canton
of Bern for their work on the transcription of the debates and for publishing
recordings and transcripts on their website. Without them, this task would not
have been possible.

We would also like to thank the GermEval 2020 organizers for hosting the
Low-Resource Speech-to-Text task and for replying promptly to all our
inquiries.

We especially thank the GermEval 2020 Task 4 participants for their interest
in the shared task, for their participation, and for their timely feedback,
which have helped us make the shared task
a success.

 ID  True Sentence                           Predicted Sentence                      WER in %
 1   insbesondere kann der kanton mit        insbesondere kann der kanton mit        28.57
     finanziellen zuschüssen steuernd        finanziellen zuschüsse steuernd
     eingreifen, wenn die                    eingreifen die versorgungssicherheit
     versorgungssicherheit gefährdet ist.
 2   der feuerungskontrolleur, der von       die führungskontrolle die von den       25.00
     den gemeinden gewählt und               gemeinden gewählt und eingesetzt
     eingesetzt wird, ist neutral.           wird ist neutral
 3   dabei ist zu beachten, dass der sinn    das ist vielleicht in die debatte       52.94
     dieser brückenangebote auch von der     auch wichtig dass der sinn dieser
     mehrheit nicht infrage gestellt         brückenangebote ist auch von der
     wird.                                   mehrheit nicht infrage gestellt
 4   wissenschaft läuft nicht so, dass       es läuft nicht so dass denen ans        52.00
     ein mäzen, wie ein hansjörg wyss        jürg weise und von er mit seiner
     und vor allem nicht er, mit seiner      fachkompetenz einsprechen in
     fachkompetenz , so in                   wissenschaftliche forschung wissen
     wissenschaftliche forschung
     reinreden würde.

Table 4: Examples of true sentences compared to the sentences predicted by
Büchi et al., with the corresponding word error rate. Punctuation is removed
before calculating the WER.

References

Aashish Agarwal and Torsten Zesch. 2020. LTL-UDE at low-resource
  speech-to-text shared task: Investigating Mozilla DeepSpeech in a
  low-resource setting. In preparation.

Matthias Büchi, Malgorzata Anna Ulasik, Manuela Hürlimann, Fernando Benites,
  Pius von Däniken, and Mark Cieliebak. 2020. ZHAW-InIT at GermEval 2020
  Task 4: Low-resource speech-to-text. In preparation.

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006.
  Connectionist temporal classification: Labelling unsegmented sequence data
  with recurrent neural networks. In Proceedings of the 23rd International
  Conference on Machine Learning, ICML ’06, pages 369–376, New York, NY, USA.
  Association for Computing Machinery.

Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich
  Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and
  Andrew Y. Ng. 2014. Deep speech: Scaling up end-to-end speech recognition.

Tannon Kew, Iuliia Nigmatulina, Lorenz Nagele, and Tanja Samardžić. 2020.
  UZH TILT: A Kaldi recipe for Swiss German speech to Standard German text.
  In preparation.

Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev,
  Jonathan M. Cohen, Huyen Nguyen, and Ravi Teja Gadde. 2019. Jasper: An
  end-to-end convolutional neural acoustic model.

Christoph Lüscher, Eugen Beck, Kazuki Irie, Markus Kitza, Wilfried Michel,
  Albert Zeyer, Ralf Schlüter, and Hermann Ney. 2019. RWTH ASR systems for
  LibriSpeech: Hybrid vs attention - w/o data augmentation. In INTERSPEECH.

V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. 2015. LibriSpeech: An ASR
  corpus based on public domain audio books. In 2015 IEEE International
  Conference on Acoustics, Speech and Signal Processing (ICASSP), pages
  5206–5210.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei Jing Zhu. 2002. BLEU: a
  method for automatic evaluation of machine translation.

Michel Plüss, Lukas Neukom, and Manfred Vogel. 2020. Forced alignment of
  Swiss German speech to Standard German text. In preparation.

Tanja Samardžić, Yves Scherrer, and Elvira Glaser. 2016. ArchiMob - a corpus
  of spoken Swiss German. In Proceedings of the Tenth International Conference
  on Language Resources and Evaluation (LREC 2016), pages 4061–4066.
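
The evaluation described in section 2 can be illustrated with a minimal Python
sketch. It applies the three pre-processing steps (lower case, punctuation
removal, whitespace stripping) and computes the WER as the word-level edit
distance divided by the number of reference words. The function names
normalize and wer are assumptions made for this illustration and are not taken
from the official evaluation code. The example sentences are row 2 of Table 4.

```python
import re

# Punctuation removed at evaluation time (section 2): , ; : . ? !
PUNCTUATION = re.compile(r"[,;:.?!]")


def normalize(text: str) -> str:
    """Lower-case, remove punctuation, strip leading/trailing whitespace."""
    return PUNCTUATION.sub("", text.lower()).strip()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j]: edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = ("der feuerungskontrolleur, der von den gemeinden gewählt "
             "und eingesetzt wird, ist neutral.")
hypothesis = ("die führungskontrolle die von den gemeinden gewählt "
              "und eingesetzt wird ist neutral")
print(f"{100 * wer(reference, hypothesis):.2f}")  # 25.00
```

Three of the twelve reference words are substituted, which gives the 25.00 %
listed in Table 4. Numbers are compared as written, since the task does not
normalize them.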

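The metadata layout summarized in Table 2 can be read with Python's standard
csv module. The following is a sketch only: it assumes a tab-separated file
with a header row in the Common Voice style and the column spellings
client_id, path and sentence (with underscores). The helper read_metadata and
the sample rows are hypothetical and are not an excerpt from the dataset.

```python
import csv
import io
from typing import Dict, Iterable, List


def read_metadata(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse Common Voice-style TSV rows into dicts keyed by column name."""
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        {
            "client_id": row["client_id"],  # speaker ID
            "path": row["path"],            # audio file name in the clips folder
            "sentence": row["sentence"],    # Standard German transcription
        }
        for row in reader
    ]


# Hypothetical sample standing in for the real TSV file;
# speaker IDs and file names are made up for this illustration.
sample = io.StringIO(
    "client_id\tpath\tsentence\n"
    "speaker_a\tclip_0001.flac\tdas ist ein beispiel.\n"
    "speaker_b\tclip_0002.flac\tnoch ein satz.\n"
)
rows = read_metadata(sample)
print(len(rows), rows[0]["path"])
```

For a file on disk, the same function can be passed a file object opened with
encoding="utf-8" and newline="". Quoting is disabled in this sketch so that
quotation marks inside a transcription are kept as literal characters.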
</pre>