=Paper=
{{Paper
|id=Vol-2624/germeval-task4-paper1
|storemode=property
|title=GermEval 2020 Task 4: Low-Resource Speech-to-Text
|pdfUrl=https://ceur-ws.org/Vol-2624/germeval-task4-paper1.pdf
|volume=Vol-2624
|authors=Michel Plüss,Lukas Neukom,Manfred Vogel
|dblpUrl=https://dblp.org/rec/conf/swisstext/PlussNV20
}}
==GermEval 2020 Task 4: Low-Resource Speech-to-Text==
Michel Plüss, Lukas Neukom, Manfred Vogel

Institute for Data Science, University of Applied Sciences and Arts Northwestern Switzerland, Windisch, Switzerland

michel.pluess@fhnw.ch

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===Abstract===
We present the results and findings of GermEval 2020 Task 4 on Low-Resource Speech-to-Text. Participants were asked to build a system translating Swiss German speech to Standard German text and minimize its word error rate. The task was based on a new dataset for Swiss German to Standard German speech translation, which contains 74 hours of sentence-level speech-text pairs. Three teams participated, with the winning contribution reaching a word error rate of 40.29 %.

===1 Introduction===
Speech-to-text methods for well-resourced languages like English or Standard German work very well. Lüscher et al. (2019) set the current state of the art on the popular LibriSpeech test-other benchmark (Panayotov et al., 2015) with a word error rate (WER) as low as 5 %. This is in stark contrast to the situation for Swiss German, the most frequently used spoken language in Switzerland, for which almost no publicly available training data for speech-to-text is available. Apart from the comparatively low number of speakers of around 5 million, the main reason for this is the lack of a standardized writing system. Thus, the official written language in the German part of Switzerland is Standard German, not Swiss German. Despite this, a lot of Swiss German speakers write in Swiss German, especially in informal conversations using messaging apps, but they resort to phonetic writing in their local dialect. The multitude of different written variants for each word makes direct speech-to-text almost impossible. Therefore, for most use cases, a speech-to-text method for Swiss German has to simultaneously translate to Standard German. This combination of speech recognition and translation is also referred to as speech translation.

For this shared task, we built a new dataset for Swiss German to Standard German speech translation. We provide data for training containing audio and text as well as data for testing containing only the audio. The testing data has to be transcribed and submitted on the task website for evaluation.

The remainder of this paper is structured as follows: The task and the evaluation of submissions are described in section 2. Our speech translation dataset is introduced in section 3. An overview of the submissions and results of this task can be found in section 4. Finally, section 5 wraps up the paper and gives directions for future work.

===2 Task Description and Evaluation===
The goal of the task is to build a sentence-level Swiss German to Standard German speech translation system. The submission with the lowest WER wins. We chose WER as opposed to the BLEU score (Papineni et al., 2002), which is often used for machine translation and speech translation, for 2 main reasons. Firstly, manual inspection of a sample of our dataset showed that in most cases, the Standard German transcription does not deviate much from the Swiss German speech. Secondly, we only have a single reference transcription per utterance, but BLEU would require multiple reference transcriptions to work well. Additionally, there is not a lot of margin for alternative transcriptions.

Apart from working with the provided dataset for training, participants were encouraged to explore data augmentation methods and transfer learning approaches to build a better model in this low-resource setting.

At evaluation time, the following pre-processing steps are taken before comparing a submission against the reference transcriptions:
* Transform to lower case
* Remove punctuation , ; : . ? !
* Remove leading and trailing whitespace

Numbers are not normalized in any way.

===3 Data===
For this task, we created what is, to the best of our knowledge, the first publicly available dataset (https://drive.switch.ch/index.php/s/PpUArRmN5Ba5C8J) for Swiss German to Standard German speech translation. It consists of 70 (train) + 4 (test) hours of mostly Swiss German and some Standard German speech from the parliament of the canton of Bern and corresponding Standard German transcriptions. The Swiss German speech is predominantly in the Bernese dialect. Some parliament members speak in Standard German, hence the small part of Standard German speech. Speech-text pairs consist of a single sentence and were obtained from the raw data using a fully automated alignment procedure described in Plüss et al. (2020). The raw data consisted of audio recordings of full meetings, usually between 2 and 3 hours, and the transcript in a PDF file (https://www.gr.be.ch/gr/de/index/sessionen/sessionen.html). While the alignment quality is fairly good, it is certainly not perfect. The most common errors are missing or additional words at the beginning or end of a speech utterance compared to the transcript. Table 1 lists a few examples of this. This makes the task harder and leads to unavoidable mistakes.

{| class="wikitable"
|+ Table 1: Examples of alignment errors.
! Sentence in Recording !! Sentence in Transcript
|-
| ...der Fall. Wir Motionäre wurden zusammen mit anderen Interessengruppen sehr schnell eingela... || Wir Motionäre wurden zusammen mit anderen Interessengruppen sehr schnell eingeladen.
|-
| Das Pricing des Stroms ist relativ klar, es ist geregelt und die Gewinnspanne garantiert. Ich will... || Das Pricing des Stroms ist relativ klar, es ist geregelt und die Gewinnspanne garantiert.
|}

The transcripts were pre-processed as follows:
* Transform to lower case
* Replace or remove all characters except a-z, ä, ö, ü, 0-9, space, punctuation , ; : . ? !
* Remove leading and trailing whitespace

Details of the replacement and removal operations can be found in our code on GitHub (https://github.com/festivalhopper/germeval-2020-task-4/blob/master/transcript_preprocessing.py).

The structure of the data is the same as in the Mozilla Common Voice project (https://voice.mozilla.org/en/datasets). Table 2 gives a short description of the metadata provided in the TSV file.

{| class="wikitable"
|+ Table 2: Description of the metadata in the TSV file of the dataset.
! Attribute !! Description
|-
| client_id || Speaker ID
|-
| path || Name of the audio file in the clips folder
|-
| sentence || Ground truth Standard German transcription
|-
| up_votes, down_votes, age, gender, accent || Not available in the current version of the dataset
|}

===4 Results===
Three teams participated in the shared task and submitted a solution. Table 3 shows an overview of the results on the public part of the test set.

{| class="wikitable"
|+ Table 3: Overview of the shared task's results, taken from the public ranking on the 22nd of May 2020. The WER column shows the word error rate in % on the public 50 % of the test set.
! Rank !! Team !! WER in %
|-
| 1 || Büchi et al. || 40.29
|-
| 2 || Kew et al. || 45.45
|-
| 3 || Agarwal et al. || 58.93
|}

The team in first place, Büchi et al. (2020), achieved a WER of 40.29 %. Their approach is based on a CNN acoustic model called Jasper (Li et al., 2019). It is first trained on additional Standard German data and then fine-tuned on the task data. The model was trained using the CTC loss function (Graves et al., 2006). To further improve the results, a language model and data augmentation methods were applied. Table 4 shows some examples of true and predicted sentences and the corresponding WER. Sentence 1 is a good prediction, especially considering that the words "gefährdet ist" are missing in the recording due to an alignment error. In sentence 2, the alignment is perfect, but the model chooses the word "Führungskontrolle" rather than "Feuerungskontrolleur". This is a seldom used word in Swiss German and therefore hard to get right. Sentence 3 is actually a good prediction of what can be heard in the recording, but the sentence was reformulated in the transcription. In this case, the BLEU score with multiple reference transcriptions would better fit the task. Finally, predicted sentence 4 does not make much sense and shows that the model still has considerable potential for improvement.

{| class="wikitable"
|+ Table 4: Examples of true sentences compared to the predicted sentences by Büchi et al. with the corresponding word error rate. Punctuation is removed before calculating the WER.
! ID !! True Sentence !! Predicted Sentence !! WER in %
|-
| 1 || insbesondere kann der kanton mit finanziellen zuschüssen steuernd eingreifen, wenn die versorgungssicherheit gefährdet ist. || insbesondere kann der kanton mit finanziellen zuschüsse steuernd eingreifen die versorgungssicherheit || 28.57
|-
| 2 || der feuerungskontrolleur, der von den gemeinden gewählt und eingesetzt wird, ist neutral. || die führungskontrolle die von den gemeinden gewählt und eingesetzt wird ist neutral || 25.00
|-
| 3 || dabei ist zu beachten, dass der sinn dieser brückenangebote auch von der mehrheit nicht infrage gestellt wird. || das ist vielleicht in die debatte auch wichtig dass der sinn dieser brückenangebote ist auch von der mehrheit nicht infrage gestellt || 52.94
|-
| 4 || wissenschaft läuft nicht so, dass ein mäzen, wie ein hansjörg wyss und vor allem nicht er, mit seiner fachkompetenz , so in wissenschaftliche forschung reinreden würde. || es läuft nicht so dass denen ans jürg weise und von er mit seiner fachkompetenz einsprechen in wissenschaftliche forschung wissen || 52.00
|}

The team in second place, Kew et al. (2020), achieved a WER of 45.45 %. They follow a DNN-HMM approach for the acoustic model using a time delay neural network. No additional speech-to-text data is used. They create a pronunciation lexicon specifically adapted to this task. Like Büchi et al., they use a language model and apply data augmentation methods.

The team in third place, Agarwal et al. (2020), achieved a WER of 58.93 %. Their approach is based on DeepSpeech (Hannun et al., 2014), an end-to-end deep learning system. They use cascaded transfer learning, first training the model with English data, then transferring to Standard German, then finally to Swiss German. The ArchiMob (Samardžić et al., 2016) dataset is used as additional Swiss German training data. Like the other participants, they use a language model and apply data augmentation methods.

===5 Conclusion===
We have described GermEval 2020 Task 4 on Low-Resource Speech-to-Text. The task used a newly created dataset for Swiss German to Standard German speech translation described in section 3. Three teams participated in the task, with the winning team reaching a WER of 40.29 %. This is a good result given that little research has been done on this topic and considering the alignment errors present in the dataset due to the fully automated alignment procedure. An open question is how well this model would generalize to other Swiss German to Standard German speech translation datasets or to a Standard German speech-to-text task.

The evaluation of the results of all teams indicates that data augmentation methods and language models work well in this low-resource setting. More details about the individual systems can be found in their respective system description papers, which are published in the SwissText & KONVENS 2020 proceedings.

We have made the dataset publicly available (https://drive.switch.ch/index.php/s/PpUArRmN5Ba5C8J) to the research community beyond the GermEval competition, hoping to facilitate future research on this important topic.

In future work, we plan to minimize errors made by the automatic alignment procedure and substantially increase the dataset size by aligning additional raw data.

===Acknowledgments===
First of all, we would like to thank the parliamentary services of the canton of Bern for their work on the transcription of the debates and for publishing recordings and transcripts on their website. Without them, this task would not have been possible.

We would also like to thank the GermEval 2020 organizers for hosting the Low-Resource Speech-to-Text task and for replying promptly to all our inquiries.

We especially thank the GermEval 2020 Task 4 participants for their interest in the shared task, for their participation, and for their timely feedback, which have helped us make the shared task a success.

===References===
* Aashish Agarwal and Torsten Zesch. 2020. LTL-UDE at low-resource speech-to-text shared task: Investigating Mozilla DeepSpeech in a low-resource setting. In preparation.
* Matthias Büchi, Malgorzata Anna Ulasik, Manuela Hürlimann, Fernando Benites, Pius von Däniken, and Mark Cieliebak. 2020. ZHAW-InIT at GermEval 2020 Task 4: Low-resource speech-to-text. In preparation.
* Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 369–376, New York, NY, USA. Association for Computing Machinery.
* Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. 2014. Deep Speech: Scaling up end-to-end speech recognition.
* Tannon Kew, Iuliia Nigmatulina, Lorenz Nagele, and Tanja Samardžić. 2020. UZH TILT: A Kaldi recipe for Swiss German speech to Standard German text. In preparation.
* Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M. Cohen, Huyen Nguyen, and Ravi Teja Gadde. 2019. Jasper: An end-to-end convolutional neural acoustic model.
* Christoph Lüscher, Eugen Beck, Kazuki Irie, Markus Kitza, Wilfried Michel, Albert Zeyer, Ralf Schlüter, and Hermann Ney. 2019. RWTH ASR systems for LibriSpeech: Hybrid vs attention - w/o data augmentation. In INTERSPEECH.
* V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. 2015. LibriSpeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210.
* Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation.
* Michel Plüss, Lukas Neukom, and Manfred Vogel. 2020. Forced alignment of Swiss German speech to Standard German text. In preparation.
* Tanja Samardžić, Yves Scherrer, and Elvira Glaser. 2016. ArchiMob - a corpus of spoken Swiss German. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 4061–4066.
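
As an illustration of the evaluation described in section 2, the following is a minimal sketch of the pre-processing steps (lower-casing, removing the punctuation , ; : . ? !, trimming whitespace) followed by a word-level edit distance. It is not the task's official scoring code. The function names `normalize` and `word_error_rate` are hypothetical, and the sketch assumes that WER is computed per sentence as (substitutions + deletions + insertions) divided by the number of reference words, and that runs of whitespace separate words.

```python
import re

# Punctuation removed at evaluation time, as listed in section 2: , ; : . ? !
PUNCTUATION = re.compile(r"[,;:.?!]")


def normalize(text: str) -> str:
    """Lower-case, remove the listed punctuation, trim surrounding whitespace."""
    return PUNCTUATION.sub("", text.lower()).strip()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Sentence-level WER: word edit distance / number of reference words.

    Assumption (not from the paper): words are split on whitespace and
    the score is computed per sentence rather than over the whole corpus.
    """
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous DP row.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Sentence 2 from Table 4: three substituted words out of twelve.
true_2 = ("der feuerungskontrolleur, der von den gemeinden gewählt "
          "und eingesetzt wird, ist neutral.")
pred_2 = ("die führungskontrolle die von den gemeinden gewählt "
          "und eingesetzt wird ist neutral")
print(round(100 * word_error_rate(true_2, pred_2), 2))  # 25.0
```

Under these assumptions, the sketch reproduces the 25.00 % reported for sentence 2 in Table 4. Because numbers are not normalized, "3" and "drei" would count as different words.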