=Paper=
{{Paper
|id=Vol-2624/germeval-task4-paper4
|storemode=property
|title=UZH TILT: A Kaldi recipe for Swiss German Speech to Standard German Text
|pdfUrl=https://ceur-ws.org/Vol-2624/germeval-task4-paper4.pdf
|volume=Vol-2624
|authors=Tannon Kew,Iuliia Nigmatulina,Lorenz Nagele,Tanja Samardžić
|dblpUrl=https://dblp.org/rec/conf/swisstext/KewNNS20
}}
==UZH TILT: A Kaldi recipe for Swiss German Speech to Standard German Text==
Tannon Kew
Iuliia Nigmatulina
Lorenz Nagele
Tanja Samardžić
University of Zurich
iuliia.nigmatulina, tannon.kew, lorenz.nagele, tanja.samardzic@uzh.ch
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Swiss German Speech-to-Text (STT) is a challenging task due to the fact that no single dominant pronunciation or standardised orthography exists. This is compounded by a severe lack of appropriate training data. One potential avenue, and that which is investigated as part of the GermEval 2020 Task 4 on Low-Resource Speech-to-Text, is to translate spoken Swiss German into standard German text implicitly through STT. In this paper, we describe our proposed system, which makes use of the Kaldi Speech Recognition Toolkit to implement a time delay neural network (TDNN) Acoustic Model (AM) with an extended pronunciation lexicon and language model. Using this approach, we achieve a word error rate of 45.45% on the held-out test set.

1 Introduction

In this paper, we describe our approach for the GermEval 2020 Task 4 on Low-Resource Speech-to-Text (Plüss et al., 2020), held as part of the 5th SwissText and the 16th KONVENS Joint Conference 2020. The goal of this shared task is to develop an STT system capable of converting Swiss German speech utterances into standard German text.

Our system makes use of the Kaldi Speech Recognition Toolkit (Povey et al., 2011). Specifically, we adapt the WSJ chain recipe to integrate a time delay neural network (TDNN) component in the process of training the acoustic model (AM) with iVectors (Peddinti et al., 2015). The TDNN architecture allows for better learning of long-term temporal dependencies between phonemes in a sequence. Using iVectors potentially contributes to better generalisation to unseen data, and to DNN adaptation with the additional feature normalisation (Saon et al., 2013; Miao et al., 2015). In addition to the data set provided by the organisers, we use an external pronunciation lexicon (Schmidt et al., 2020) and the German section of the Sparcling corpus¹ (Graën et al., 2019) to build a robust N-gram language model suitable for the target domain.

¹ The Sparcling corpus is described in detail as ‘FEP 9’ in (Graën, 2018).

The layout of this report is as follows: Section 2 describes the aim of the shared task and the data provided. In Section 3, we describe our approach and the individual components used in our system. We report the overall performance of our system based on a held-out development set and the task test set in Section 4. Finally, in Section 5, we conclude with a discussion of some of the advantages and limitations of our approach.

2 Data

The dataset for this shared task was provided by the organisers and comprises a training set of approximately 70 hours of annotated speech data from Swiss parliamentary discussions, plus an additional 4 hours of audio recordings for system evaluation. This test set only contains recordings of speakers that are not present in the training data.

The training data includes a total of 36,572 utterances spoken by 191 different speakers. Each utterance is annotated with its transcription in standard German and a unique speaker ID. According to the description of the shared task data, spoken utterances are predominantly in the Bernese dialect, with some in standard German.
We enrich the training data with two external sources. First, we derive a high-coverage pronunciation lexicon containing more than 38,000 standard German words with an approximate Swiss German pronunciation to facilitate AM training. Second, we add to the N-gram language model (LM) trained on the shared task data an additional 4-gram LM trained on the German section of the Sparcling corpus (Graën et al., 2019). These steps are described in more detail in Sections 3.2 and 3.3, respectively.

2.1 Preprocessing

Utterance transcriptions are already partially preprocessed, with character mapping to a defined set of allowable characters and lowercasing applied². Therefore, we only apply one further step for text preprocessing, namely tokenisation. We use a simple, general-purpose tokeniser trained on German from the Python NLTK module³.

² https://www.cs.technik.fhnw.ch/speech-to-text-labeling-tool/swisstext-2020/competition/1
³ https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.toktok

Once tokenised, we set aside 10% of the training data as a development set for the purpose of fine-tuning model parameters. Table 1 gives an overview of the dataset splits used for this task.

Split   No. of Utterances
Train   32,916
Dev     3,656
Test    2,014

Table 1: Distribution of the datasets.

3 Methods

In this section, we present the main components of our STT system, namely the acoustic model, pronunciation lexicon and language model.

3.1 Acoustic Model

We base our STT system for Swiss German on the WSJ chain recipe with the time delay neural network (TDNN) architecture provided in the Kaldi toolkit. The alignment between acoustic signal segments and transcriptions is attained with a GMM-HMM discriminative model trained with a Maximum Mutual Information (MMI) criterion, with 4,000 senones and 40,000 Gaussians.

We use 13-dimensional Mel-Frequency Cepstral Coefficient (MFCC) features with cepstral mean-variance normalisation (CMVN), the first and second derivatives, and Linear Discriminant Analysis (LDA) and Maximum Likelihood Linear Transform (MLLT) transformations. In addition, we include 100-dimensional iVectors extracted from each speech frame in order to normalise the variation between speakers and dialectal varieties.

To increase the amount of training data and improve the robustness of the AM, we perform popular data augmentation techniques, such as audio speed perturbation with speed factors of 0.9, 1.0 and 1.1, followed by volume perturbation with volume factors sampled from the interval [0.125, 2.0] (Ko et al., 2015).

The AM was trained with NVIDIA Tesla K80 GPUs and took around 14 hours.

3.2 Pronunciation Lexicon

For the pronunciation lexicon, we make use of an 11,000-word dictionary mapping standard German words to their Swiss German pronunciations (Schmidt et al., 2020). This dictionary contains manually annotated pronunciation strings (in the SAMPA alphabet (Wells et al., 1997)) for six major regional varieties, namely Zurich, St. Gallen, Bern, Basel, Valais and Nidwalden. Since the task data predominantly consists of Bernese dialect, we use the pronunciation strings for this regional variety only. Furthermore, we normalise the standard German words using the same text preprocessing steps as provided in the shared task description (i.e. character mapping and converting to lowercase).

Initially, the SAMPA dictionary provides only 15% lexical coverage of the shared task dataset. In order to increase this, we train a transformer-based grapheme-to-phoneme (g2p) model⁴ on the available (standard German, Swiss SAMPA) pairs and apply it to the words from the dataset for which manual Swiss SAMPA annotation is missing. We train the g2p model with the default settings⁵.

⁴ https://github.com/cmusphinx/g2p-seq2seq
⁵ Default settings for g2p-seq2seq are as follows: size of each hidden layer = 256, number of layers = 3, size of the filter layer in a convolutional layer = 512, number of heads in the multi-attention mechanism = 4.

As a result of this process, we attain a lexicon that provides 97.5% coverage of the shared task dataset. The remaining 2.5% of items not covered in the extended lexicon include tokens consisting of digits (e.g. numbers, dates, etc.) and punctuation (e.g. web addresses), since the original SAMPA dictionary does not contain such characters. While the overall word-level accuracy of the g2p model, estimated on a held-out test set, is only 39%, the output of the model is still useful for the STT system since it provides good coverage with plausible g2p mappings, confirmed by manual inspection of the output.
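The extended lexicon described above is, in effect, a merge in which manual SAMPA entries take precedence and g2p predictions fill the gaps. A minimal sketch of this merge, with hypothetical toy entries rather than the actual dictionary contents:

```python
def build_lexicon(manual, g2p, vocab):
    """Merge manual pronunciations with g2p predictions for OOV words.

    manual: dict word -> SAMPA string (hand-annotated, preferred)
    g2p:    dict word -> SAMPA string (model predictions)
    vocab:  iterable of words that should be covered
    """
    lexicon = {}
    for word in vocab:
        if word in manual:
            lexicon[word] = manual[word]
        elif word in g2p:
            lexicon[word] = g2p[word]
        # words with neither entry (e.g. digit strings) remain uncovered
    return lexicon
```

Words covered by neither source, such as digit tokens, are exactly the residual 2.5% discussed above.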
3.3 Language Model

The language modelling component used in our system is a statistical N-gram backoff LM. We train two 4-gram LMs with interpolated modified Kneser-Ney smoothing (Chen and Goodman, 1999) using the MITLM toolkit⁶ (Hsu and Glass, 2008) and combine them using linear interpolation. The first LM is estimated on the basis of our training data split (see Table 1). For simplicity, we refer to this model as the shared task LM (STLM). While this LM ensures that we capture the domain of the shared task data well, it is limited in terms of size and vocabulary. In order to improve the robustness of our system, we incorporate additional language data by estimating a second 4-gram LM on the German section of the Sparcling corpus (Graën et al., 2019).

⁶ https://github.com/mitlm/mitlm

The Sparcling corpus is a cleaned and normalised version of the Europarl corpus (Koehn, 2005), which contains a large collection of parallel texts based on debates published in the proceedings of the European Parliament. In total, the Sparcling corpus provides 1.75M German utterances, which are considered to be close to the target domain. The resulting LM is too large to be used directly, so we prune it using the SRILM toolkit (Stolcke, 2002), setting a threshold of 10⁻⁸. We refer to this model as the Sparcling LM (SparcLM). Once pruned, we linearly interpolate the STLM and the SparcLM with weights λ = 0.7 and λ = 0.3, respectively.

4 Results

Table 2 reports the WER results attained by our TDNN-iVector STT system (see Sections 3.1 and 3.3) on the held-out development set and on the test set provided for the submission⁷: the system achieves a WER of 43.69% and 45.45%, respectively. The model was tuned with language model weights (LMWT) ranging from 7 to 17 and different word insertion penalty (WIP) values of 0.0, 0.5 and 1.0. The optimal parameters (LMWT = 9 and WIP = 0.0) were determined according to the best WER on the held-out development set and then applied in order to decode the test set for this submission.

⁷ This result is automatically calculated and published on the shared task's public leader board upon submission.

AM            LM             Dev    Test
TDNN-iVector  STLM+SparcLM   43.69  45.45

Table 2: WER results attained on a held-out development set and 50% of the test set.

5 Discussion

An assessment of our system's output transcriptions against audio samples from the test data reveals that the results are comprehensible and depict the speech utterance well in most cases. Common errors include single missing words in the transcription, separated writing of compounds (e.g. bildung direktor instead of bildungsdirektor) and the absence of numbers. The latter can easily be explained by the fact that our lexicon does not include digits and thus needs to be further extended in order to cover such common lexical items.

We also noticed that words at the beginning and end of the audio samples are cut off in many cases, making it difficult for the system to recognise these words correctly. In addition, it is clear that speech utterances do not necessarily correspond to single sentences in many cases, but rather to sentence fragments or, in some cases, multiple sentences⁸. The LM, however, is trained largely on complete sentences and could thus fail to account for N-gram sequences that bridge typical sentence boundaries.

⁸ A manual evaluation of a sample of 100 speech utterances from the test set shows that 58 are not complete sentences, of which 25 also contain fragments from the preceding or following utterance.
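The compound-splitting errors noted above are doubly penalised by word-level WER, since one reference word becomes a substitution plus an insertion. A minimal Levenshtein-based sketch (not the official scoring tool) makes this concrete:

```python
def wer(ref, hyp):
    """Word error rate via Levenshtein distance over whitespace tokens."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / len(r)

# wer("bildungsdirektor", "bildung direktor") counts one substitution plus
# one insertion against a single reference word, i.e. a WER of 2.0.
```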
6 Conclusion

In this paper, we have described our proposed solution for the GermEval 2020 Task 4: Low-Resource Speech-to-Text challenge. We have implemented an advanced TDNN AM using popular acoustic speech data augmentation techniques available as part of the Kaldi Speech Recognition Toolkit. Our model achieves a WER of 45.45% on the public part of the task's test set, which we believe is competitive given the amount of training data and the major challenges involved in STT for languages with a high degree of dialectal variability such as Swiss German.

Acknowledgments

We would like to thank Fransisco Campillo from Spitch AG, who helped us in setting up our initial STT system that was used as a springboard for our experiments and investigations.

References

Stanley F. Chen and Joshua Goodman. 1999. An Empirical Study of Smoothing Techniques for Language Modeling. Computer Speech & Language, 13(4):359–394.

Johannes Graën. 2018. Exploiting Alignment in Multiparallel Corpora for Applications in Linguistics and Language Learning. Ph.D. thesis, University of Zurich.

Johannes Graën, Tannon Kew, Anastassia Shaitarova, and Martin Volk. 2019. Modelling Large Parallel Corpora: The Zurich Parallel Corpus Collection. In Challenges in the Management of Large Corpora (CMLC-7). Leibniz-Institut für Deutsche Sprache.

Bo-June (Paul) Hsu and James R. Glass. 2008. Iterative Language Model Estimation: Efficient Data Structure & Algorithms. In Proceedings of the Ninth Annual Conference of the International Speech Communication Association, pages 841–844, Brisbane, Australia.

Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. 2015. Audio Augmentation for Speech Recognition. In Sixteenth Annual Conference of the International Speech Communication Association.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In MT Summit, volume 5, pages 79–86. Citeseer.

Yajie Miao, Hao Zhang, and Florian Metze. 2015. Speaker Adaptive Training of Deep Neural Network Acoustic Models Using i-vectors. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(11):1938–1949.

Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. 2015. A Time Delay Neural Network Architecture for Efficient Modeling of Long Temporal Contexts. In Sixteenth Annual Conference of the International Speech Communication Association.

Michel Plüss, Lukas Neukom, and Manfred Vogel. 2020. GermEval 2020 Task 4: Low-Resource Speech-to-Text. In preparation.

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al. 2011. The Kaldi Speech Recognition Toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society.

George Saon, Hagen Soltau, David Nahamoo, and Michael Picheny. 2013. Speaker Adaptation of Neural Network Acoustic Models Using i-vectors. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pages 55–59. IEEE.

Larissa Schmidt, Lucy Linder, Sandra Djambazovska, Alexandros Lazaridis, Tanja Samardžić, and Claudiu Musat. 2020. A Swiss German Dictionary: Variation in Speech and Writing.

Andreas Stolcke. 2002. SRILM — An Extensible Language Modeling Toolkit. In Proceedings of the Seventh International Conference on Spoken Language Processing (ICSLP), Denver, USA.

John C. Wells et al. 1997. SAMPA Computer Readable Phonetic Alphabet. Handbook of Standards and Resources for Spoken Language Systems, 4.