UZH TILT: A Kaldi recipe for Swiss German Speech to Standard German Text

Tannon Kew, Iuliia Nigmatulina, Lorenz Nagele, Tanja Samardžić
University of Zurich
{iuliia.nigmatulina, tannon.kew, lorenz.nagele, tanja.samardzic}@uzh.ch

Abstract

Swiss German Speech-to-Text (STT) is a challenging task due to the fact that no single dominant pronunciation or standardised orthography exists. This is compounded by a severe lack of appropriate training data. One potential avenue, and that which is investigated as part of the GermEval 2020 Task 4 on Low-Resource Speech-to-Text, is to translate spoken Swiss German into standard German text implicitly through STT. In this paper, we describe our proposed system, which makes use of the Kaldi Speech Recognition Toolkit to implement a time delay neural network (TDNN) Acoustic Model (AM) with an extended pronunciation lexicon and language model. Using this approach, we achieve a word error rate of 45.45% on the held-out test set.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

In this paper, we describe our approach for the GermEval 2020 Task 4 on Low-Resource Speech-to-Text (Plüss et al., 2020), held as part of the 5th SwissText and the 16th KONVENS Joint Conference 2020. The goal of this shared task is to develop an STT system capable of converting Swiss German speech utterances into standard German text.

Our system makes use of the Kaldi Speech Recognition Toolkit (Povey et al., 2011). Specifically, we adapt the WSJ chain recipe to integrate a time delay neural network (TDNN) component in the process of training the acoustic model (AM) with iVectors (Peddinti et al., 2015). The TDNN architecture allows for better learning of long-term temporal dependencies between phonemes in a sequence. Using iVectors potentially contributes to better generalisation to unseen data and to DNN adaptation through additional feature normalisation (Saon et al., 2013; Miao et al., 2015). In addition to the data set provided by the organisers, we use an external pronunciation lexicon (Schmidt et al., 2020) and the German section of the Sparcling corpus (Graën et al., 2019), described in detail as 'FEP 9' by Graën (2018), to build a robust N-gram language model suitable for the target domain.

The layout of this report is as follows: Section 2 describes the aim of the shared task and the data provided. In Section 3, we describe our approach and the individual components used in our system. We report the overall performance of our system on a held-out development set and the task test set in Section 4. Finally, in Section 5, we conclude with a discussion of some of the advantages and limitations of our approach.

2 Data

The dataset for this shared task was provided by the organisers and comprises a training set of approximately 70 hours of annotated speech data from Swiss parliamentary discussions, plus an additional 4 hours of audio recordings for system evaluation. This test set only contains recordings of speakers that are not present in the training data.

The training data includes a total of 36,572 utterances spoken by 191 different speakers. Each utterance is annotated with its transcription in standard German and a unique speaker ID. According to the description of the shared task data, spoken utterances are predominantly in the Bernese dialect, with some in standard German.

We enrich the training data with two external sources. First, we derive a high-coverage pronunciation lexicon containing more than 38,000 standard German words with an approximate Swiss German pronunciation to facilitate AM training. Second, we add to the N-gram language model (LM) trained on the shared task data an additional 4-gram LM trained on the German section of the Sparcling corpus (Graën et al., 2019). These steps are described in more detail in Sections 3.2 and 3.3, respectively.

2.1 Preprocessing

Utterance transcriptions are already partially preprocessed, with character mapping to a defined set of allowable characters and lowercasing applied (see https://www.cs.technik.fhnw.ch/speech-to-text-labeling-tool/swisstext-2020/competition/1). Therefore, we only apply one further text preprocessing step, namely tokenisation. We use a simple, general-purpose tokeniser for German from the Python NLTK module (https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.toktok).

Once tokenised, we set aside 10% of the training data as a development set for the purpose of fine-tuning model parameters. Table 1 gives an overview of the dataset splits used for this task.

Split   No. of Utterances
Train   32,916
Dev     3,656
Test    2,014

Table 1: Distribution of the datasets.
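As an illustration of this preprocessing step, the following minimal Python sketch tokenises the transcriptions with NLTK's ToktokTokenizer (the toktok module referenced above) and holds out 10% of the utterances as a development set. The input file name and its tab-separated format are assumptions made purely for illustration.

```python
import random

from nltk.tokenize import ToktokTokenizer

tokenizer = ToktokTokenizer()

# Hypothetical input: one "<utterance_id>\t<transcription>" pair per line,
# already lowercased and character-mapped by the shared task organisers.
with open("train_transcriptions.tsv", encoding="utf-8") as f:
    utterances = [line.rstrip("\n").split("\t") for line in f]

# Tokenise each transcription and rejoin the tokens with single spaces.
tokenised = [(utt_id, " ".join(tokenizer.tokenize(text)))
             for utt_id, text in utterances]

# Set aside 10% of the training data as a held-out development set.
random.seed(42)
random.shuffle(tokenised)
cut = int(0.9 * len(tokenised))
train_split, dev_split = tokenised[:cut], tokenised[cut:]
```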
3 Methods

In this section, we present the main components of our STT system, namely the acoustic model, the pronunciation lexicon and the language model.

3.1 Acoustic Model

We base our STT system for Swiss German on the WSJ chain recipe with the time delay neural network (TDNN) architecture provided in the Kaldi toolkit. The alignment between acoustic signal segments and transcriptions is attained with a GMM-HMM model trained discriminatively with the Maximum Mutual Information (MMI) criterion, using 4,000 senones and 40,000 Gaussians.

We use 13-dimensional Mel-Frequency Cepstral Coefficient (MFCC) features with cepstral mean-variance normalisation (CMVN), first and second derivatives, and Linear Discriminant Analysis (LDA) and Maximum Likelihood Linear Transform (MLLT) transformations. In addition, we include 100-dimensional iVectors extracted from each speech frame in order to normalise the variation between speakers and dialectal varieties.

To increase the amount of training data and improve the robustness of the AM, we perform popular data augmentation techniques: audio speed perturbation with speed factors of 0.9, 1.0 and 1.1, followed by volume perturbation with volume factors sampled from the interval [0.125, 2.0] (Ko et al., 2015).

The AM was trained on NVIDIA Tesla K80 GPUs and took around 14 hours.
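Within Kaldi, these perturbations are applied with the toolkit's standard data-directory scripts; the standalone Python sketch below only illustrates the equivalent transformations using the SoX command-line tool. The sox binary is assumed to be installed, and the file-naming scheme is hypothetical.

```python
import random
import subprocess

SPEED_FACTORS = [0.9, 1.0, 1.1]   # 3-way speed perturbation (Ko et al., 2015)
VOLUME_RANGE = (0.125, 2.0)       # volume factors sampled from this interval


def perturb_recording(wav_in: str, stem: str) -> list:
    """Create speed- and volume-perturbed copies of one recording with SoX."""
    outputs = []
    for speed in SPEED_FACTORS:
        volume = random.uniform(*VOLUME_RANGE)
        wav_out = "{}_sp{}.wav".format(stem, speed)
        # SoX's 'speed' effect resamples the signal (changing both tempo and
        # pitch); 'vol' scales the amplitude by the given factor.
        subprocess.run(
            ["sox", wav_in, wav_out, "speed", str(speed), "vol", str(volume)],
            check=True,
        )
        outputs.append(wav_out)
    return outputs
```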
3.2 Pronunciation Lexicon

For the pronunciation lexicon, we make use of an 11,000-word dictionary mapping standard German words to their Swiss German pronunciations (Schmidt et al., 2020). This dictionary contains manually annotated pronunciation strings in the SAMPA alphabet (Wells et al., 1997) for six major regional varieties, namely Zurich, St. Gallen, Bern, Basel, Valais and Nidwalden. Since the task data predominantly consists of Bernese dialect, we use the pronunciation strings for this regional variety only. Furthermore, we normalise the standard German words using the same text preprocessing steps as provided in the shared task description (i.e. character mapping and converting to lowercase).

Initially, the SAMPA dictionary provides only 15% lexical coverage of the shared task dataset. In order to increase this, we train a transformer-based grapheme-to-phoneme (g2p) model (https://github.com/cmusphinx/g2p-seq2seq) on the available (standard German, Swiss SAMPA) pairs and apply it to the words from the dataset for which a manual Swiss SAMPA annotation is missing. We train the g2p model with the default settings (size of each hidden layer = 256, number of layers = 3, size of the filter layer in a convolutional layer = 512, number of heads in the multi-attention mechanism = 4).

As a result of this process, we attain a lexicon that provides 97.5% coverage of the shared task dataset. The remaining 2.5% of items not covered in the extended lexicon include tokens consisting of digits (e.g. numbers, dates, etc.) and punctuation (e.g. web addresses), since the original SAMPA dictionary does not contain such characters. While the overall word-level accuracy of the g2p model, estimated on a held-out test set, is only 39%, the output of the model is still useful for the STT system since it provides good coverage with plausible g2p mappings, as confirmed by manual inspection of the output.

3.3 Language Model

The language modelling component used in our system is a statistical N-gram backoff LM. We train two 4-gram LMs with interpolated modified Kneser-Ney smoothing (Chen and Goodman, 1999) using the MITLM toolkit (Hsu and Glass, 2008; https://github.com/mitlm/mitlm) and combine them using linear interpolation. The first LM is estimated on the basis of our training data split (see Table 1). For simplicity, we refer to this model as the shared task LM (STLM). While this LM ensures that we capture the domain of the shared task data well, it is limited in terms of size and vocabulary. In order to improve the robustness of our system, we incorporate additional language data by estimating a second 4-gram LM on the German section of the Sparcling corpus (Graën et al., 2019).

The Sparcling corpus is a cleaned and normalised version of the Europarl corpus (Koehn, 2005), which contains a large collection of parallel texts based on debates published in the proceedings of the European Parliament. In total, the Sparcling corpus provides 1.75M German utterances, which are considered to be close to the target domain. The resulting LM is too large to be used directly, so we prune it using the SRILM toolkit (Stolcke, 2002) with a threshold of 10^-8. We refer to this model as the Sparcling LM (SparcLM). Once pruned, we linearly interpolate the STLM and the SparcLM with weights λ = 0.7 and λ = 0.3, respectively.
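To make the combination explicit: under linear interpolation, the combined model assigns each N-gram the probability P(w | h) = 0.7 · P_STLM(w | h) + 0.3 · P_SparcLM(w | h). The toy Python sketch below illustrates this with two invented stand-in unigram distributions; the real models are 4-gram ARPA LMs estimated with MITLM and pruned with SRILM, and all values here are fabricated for illustration.

```python
# Invented stand-in unigram probabilities for the two component models;
# the real STLM and SparcLM are 4-gram backoff LMs.
stlm = {"regierungsrat": 0.02, "der": 0.05}
sparclm = {"regierungsrat": 0.001, "der": 0.06}

LAMBDA_STLM, LAMBDA_SPARCLM = 0.7, 0.3


def p_interpolated(word: str) -> float:
    """Probability of a word under the linearly interpolated model."""
    return (LAMBDA_STLM * stlm.get(word, 0.0)
            + LAMBDA_SPARCLM * sparclm.get(word, 0.0))


print(p_interpolated("regierungsrat"))  # 0.7 * 0.02 + 0.3 * 0.001 = 0.0143
```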
4 Results

AM            LM             Dev     Test
TDNN-iVector  STLM+SparcLM   43.69   45.45

Table 2: WER results attained on the held-out development set and 50% of the test set.

Table 2 reports the WER results attained by our TDNN-iVector STT system (see Section 3.3) on the held-out development set and on the test set provided for the submission: the system achieves a WER of 43.69% and 45.45%, respectively. The test result is automatically calculated and published on the shared task's public leaderboard upon submission. The model was tuned with language model weights (LMWT) ranging from 7 to 17 and different word insertion penalty (WIP) values of 0.0, 0.5 and 1.0. The optimal parameters (LMWT = 9 and WIP = 0.0) were determined according to the best WER on the held-out development set and then applied in order to decode the test set for this submission.
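For reference, the WER values in Table 2 follow the standard definition WER = (S + D + I) / N, i.e. the word-level Levenshtein distance between hypothesis and reference (substitutions, deletions, insertions) normalised by the reference length N. Kaldi computes this during scoring; the following is only a minimal sketch of the metric itself, with a made-up example.

```python
def wer(ref, hyp):
    """Word error rate in percent: word-level edit distance / len(ref)."""
    # Dynamic-programming (Levenshtein) alignment between word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)


# Toy example: one inserted word against a three-word reference.
print(wer("der regierungsrat beantragt".split(),
          "der regierungsrat beantragt ablehnung".split()))  # 33.33...
```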
5 Discussion

An assessment of our system's output transcriptions against audio samples from the test data reveals that the results are comprehensible and depict the speech utterance well in most cases. Common errors include single missing words in the transcription, separated writing of compounds (e.g. bildung direktor instead of bildungsdirektor) and the absence of numbers. The latter can easily be explained by the fact that our lexicon does not include digits and thus needs to be further extended in order to cover such common lexical items.

We also noticed that words at the beginning and end of the audio samples are cut off in many cases, making it difficult for the system to recognise these words correctly. In addition, it is clear that speech utterances do not necessarily correspond to single sentences, but rather to sentence fragments or, in some cases, multiple sentences: a manual evaluation of a sample of 100 speech utterances from the test set shows that 58 are not complete sentences, of which 25 also contain fragments from the preceding or following utterance. The LM, however, is trained largely on complete sentences and could thus fail to account for N-gram sequences that bridge typical sentence boundaries.

6 Conclusion

In this paper, we have described our proposed solution for the GermEval 2020 Task 4: Low-Resource Speech-to-Text challenge. We have implemented an advanced TDNN AM using popular acoustic speech data augmentation techniques available as part of the Kaldi Speech Recognition Toolkit. Our model achieves a WER of 45.45% on the public part of the task's test set, which we believe is competitive given the amount of training data and the major challenges involved in STT for languages with a high degree of dialectal variability such as Swiss German.

Acknowledgments

We would like to thank Fransisco Campillo from Spitch AG, who helped us in setting up our initial STT system, which was used as a springboard for our experiments and investigations.

References

Stanley F. Chen and Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359-394.

Johannes Graën. 2018. Exploiting Alignment in Multiparallel Corpora for Applications in Linguistics and Language Learning. Ph.D. thesis, University of Zurich.

Johannes Graën, Tannon Kew, Anastassia Shaitarova, and Martin Volk. 2019. Modelling large parallel corpora: The Zurich Parallel Corpus Collection. In Challenges in the Management of Large Corpora (CMLC-7). Leibniz-Institut für Deutsche Sprache.

Bo-June (Paul) Hsu and James R. Glass. 2008. Iterative language model estimation: Efficient data structure & algorithms. In Proceedings of the Ninth Annual Conference of the International Speech Communication Association, pages 841-844, Brisbane, Australia.

Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. 2015. Audio augmentation for speech recognition. In Sixteenth Annual Conference of the International Speech Communication Association.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79-86. Citeseer.

Yajie Miao, Hao Zhang, and Florian Metze. 2015. Speaker adaptive training of deep neural network acoustic models using i-vectors. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(11):1938-1949.

Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. 2015. A time delay neural network architecture for efficient modeling of long temporal contexts. In Sixteenth Annual Conference of the International Speech Communication Association.

Michel Plüss, Lukas Neukom, and Manfred Vogel. 2020. GermEval 2020 Task 4: Low-Resource Speech-to-Text. In preparation.

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al. 2011. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society.

George Saon, Hagen Soltau, David Nahamoo, and Michael Picheny. 2013. Speaker adaptation of neural network acoustic models using i-vectors. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pages 55-59. IEEE.

Larissa Schmidt, Lucy Linder, Sandra Djambazovska, Alexandros Lazaridis, Tanja Samardžić, and Claudiu Musat. 2020. A Swiss German dictionary: Variation in speech and writing.

Andreas Stolcke. 2002. SRILM - An extensible language modeling toolkit. In Proceedings of the Seventh International Conference on Spoken Language Processing (ICSLP), Denver, USA.

John C. Wells et al. 1997. SAMPA computer readable phonetic alphabet. Handbook of Standards and Resources for Spoken Language Systems, 4.