UZH TILT: A Kaldi recipe for Swiss German Speech to Standard German Text

Tannon Kew, Iuliia Nigmatulina, Lorenz Nagele, Tanja Samardžić
University of Zurich
{iuliia.nigmatulina, tannon.kew, lorenz.nagele, tanja.samardzic}@uzh.ch

Abstract

Swiss German Speech-to-Text (STT) is a challenging task due to the fact that no single dominant pronunciation or standardised orthography exists. This is compounded by a severe lack of appropriate training data. One potential avenue, and that which is investigated as part of the GermEval 2020 Task 4 on Low-Resource Speech-to-Text, is to translate spoken Swiss German into standard German text implicitly through STT. In this paper, we describe our proposed system, which makes use of the Kaldi Speech Recognition Toolkit to implement a time delay neural network (TDNN) Acoustic Model (AM) with an extended pronunciation lexicon and language model. Using this approach, we achieve a word error rate of 45.45% on the held-out test set.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

In this paper, we describe our approach for the GermEval 2020 Task 4 on Low-Resource Speech-to-Text (Plüss et al., 2020), held as part of the 5th SwissText and the 16th KONVENS Joint Conference 2020. The goal of this shared task is to develop an STT system capable of converting Swiss German speech utterances into standard German text.

Our system makes use of the Kaldi Speech Recognition Toolkit (Povey et al., 2011). Specifically, we adapt the WSJ chain recipe to integrate a time delay neural network (TDNN) component in the process of training the acoustic model (AM) with iVectors (Peddinti et al., 2015). The TDNN architecture allows for better learning of long-term temporal dependencies between phonemes in a sequence. Using iVectors potentially contributes to better generalisation to unseen data and to DNN adaptation through additional feature normalisation (Saon et al., 2013; Miao et al., 2015). In addition to the data set provided by the organisers, we use an external pronunciation lexicon (Schmidt et al., 2020) and the German section of the Sparcling corpus (Graën et al., 2019), described in detail as 'FEP 9' by Graën (2018), to build a robust N-gram language model suitable for the target domain.

The layout of this report is as follows: Section 2 describes the aim of the shared task and the data provided. In Section 3, we describe our approach and the individual components used in our system. We report the overall performance of our system on a held-out development set and the task test set in Section 4. Finally, in Section 5, we conclude with a discussion of some of the advantages and limitations of our approach.

2 Data

The dataset for this shared task was provided by the organisers and comprises a training set of approximately 70 hours of annotated speech data from Swiss parliamentary discussions, plus an additional 4 hours of audio recordings for system evaluation. This test set only contains recordings of speakers that are not present in the training data.

The training data includes a total of 36,572 utterances spoken by 191 different speakers. Each utterance is annotated with its transcription in standard German and a unique speaker ID. According to the description of the shared task data, spoken utterances are predominantly in the Bernese dialect, with some in standard German.

We enrich the training data with two external sources. First, we derive a high-coverage pronunciation lexicon containing more than 38,000 standard German words with an approximate Swiss German pronunciation to facilitate AM training. Second, we add to the N-gram language model (LM) trained on the shared task data an additional 4-gram LM trained on the German section of the Sparcling corpus (Graën et al., 2019). These steps are described in more detail in Sections 3.2 and 3.3, respectively.

2.1 Preprocessing

Utterance transcriptions are already partially preprocessed, with character mapping to a defined set of allowable characters and lowercasing applied (see https://www.cs.technik.fhnw.ch/speech-to-text-labeling-tool/swisstext-2020/competition/1). Therefore, we only apply one further text preprocessing step, namely tokenisation. We use a simple, general-purpose tokeniser for German from the Python NLTK module (https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.toktok).

Once tokenised, we set aside 10% of the training data as a development set for the purpose of fine-tuning model parameters. Table 1 gives an overview of the dataset splits used for this task.

Split   No. of Utterances
Train   32,916
Dev     3,656
Test    2,014

Table 1: Distribution of the datasets.
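As an illustration of this preprocessing step, the following minimal Python sketch tokenises the transcriptions with NLTK's ToktokTokenizer (the toktok module referenced above) and holds out 10% of the utterances as a development set. The input file name and its tab-separated format are assumptions made purely for illustration.

```python
import random

from nltk.tokenize import ToktokTokenizer

tokenizer = ToktokTokenizer()

# Hypothetical input: one "<utterance_id>\t<transcription>" pair per line,
# already lowercased and character-mapped by the shared task organisers.
with open("train_transcriptions.tsv", encoding="utf-8") as f:
    utterances = [line.rstrip("\n").split("\t") for line in f]

# Tokenise each transcription and rejoin the tokens with single spaces.
tokenised = [(utt_id, " ".join(tokenizer.tokenize(text)))
             for utt_id, text in utterances]

# Set aside 10% of the training data as a held-out development set.
random.seed(42)
random.shuffle(tokenised)
cut = int(0.9 * len(tokenised))
train_split, dev_split = tokenised[:cut], tokenised[cut:]
```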
3 Methods

In this section, we present the main components of our STT system, namely the acoustic model, the pronunciation lexicon and the language model.

3.1 Acoustic Model

We base our STT system for Swiss German on the WSJ chain recipe with the time delay neural network (TDNN) architecture provided in the Kaldi toolkit. The alignment between acoustic signal segments and transcriptions is attained with a GMM-HMM model trained discriminatively with the Maximum Mutual Information (MMI) criterion, using 4,000 senones and 40,000 Gaussians.

We use 13-dimensional Mel-Frequency Cepstral Coefficient (MFCC) features with cepstral mean-variance normalisation (CMVN), first and second derivatives, and Linear Discriminant Analysis (LDA) and Maximum Likelihood Linear Transform (MLLT) transformations. In addition, we include 100-dimensional iVectors extracted from each speech frame in order to normalise the variation between speakers and dialectal varieties.

To increase the amount of training data and improve the robustness of the AM, we perform popular data augmentation techniques: audio speed perturbation with speed factors of 0.9, 1.0 and 1.1, followed by volume perturbation with volume factors sampled from the interval [0.125, 2.0] (Ko et al., 2015).

The AM was trained on NVIDIA Tesla K80 GPUs and took around 14 hours.
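Within Kaldi, these perturbations are applied with the toolkit's standard data-directory scripts; the standalone Python sketch below only illustrates the equivalent transformations using the SoX command-line tool. The sox binary is assumed to be installed, and the file-naming scheme is hypothetical.

```python
import random
import subprocess

SPEED_FACTORS = [0.9, 1.0, 1.1]   # 3-way speed perturbation (Ko et al., 2015)
VOLUME_RANGE = (0.125, 2.0)       # volume factors sampled from this interval


def perturb_recording(wav_in: str, stem: str) -> list:
    """Create speed- and volume-perturbed copies of one recording with SoX."""
    outputs = []
    for speed in SPEED_FACTORS:
        volume = random.uniform(*VOLUME_RANGE)
        wav_out = "{}_sp{}.wav".format(stem, speed)
        # SoX's 'speed' effect resamples the signal (changing both tempo and
        # pitch); 'vol' scales the amplitude by the given factor.
        subprocess.run(
            ["sox", wav_in, wav_out, "speed", str(speed), "vol", str(volume)],
            check=True,
        )
        outputs.append(wav_out)
    return outputs
```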
3.2 Pronunciation Lexicon

For the pronunciation lexicon, we make use of an 11,000-word dictionary mapping standard German words to their Swiss German pronunciations (Schmidt et al., 2020). This dictionary contains manually annotated pronunciation strings in the SAMPA alphabet (Wells et al., 1997) for six major regional varieties, namely Zurich, St. Gallen, Bern, Basel, Valais and Nidwalden. Since the task data predominantly consists of Bernese dialect, we use the pronunciation strings for this regional variety only. Furthermore, we normalise the standard German words using the same text preprocessing steps as provided in the shared task description (i.e. character mapping and converting to lowercase).

Initially, the SAMPA dictionary provides only 15% lexical coverage of the shared task dataset. In order to increase this, we train a transformer-based grapheme-to-phoneme (g2p) model (https://github.com/cmusphinx/g2p-seq2seq) on the available (standard German, Swiss SAMPA) pairs and apply it to the words from the dataset for which a manual Swiss SAMPA annotation is missing. We train the g2p model with the default settings (size of each hidden layer = 256, number of layers = 3, size of the filter layer in a convolutional layer = 512, number of heads in the multi-attention mechanism = 4).

As a result of this process, we attain a lexicon that provides 97.5% coverage of the shared task dataset. The remaining 2.5% of items not covered in the extended lexicon include tokens consisting of digits (e.g. numbers, dates, etc.) and punctuation (e.g. web addresses), since the original SAMPA dictionary does not contain such characters. While the overall word-level accuracy of the g2p model, estimated on a held-out test set, is only 39%, the output of the model is still useful for the STT system since it provides good coverage with plausible g2p mappings, as confirmed by manual inspection of the output.

3.3 Language Model

The language modelling component used in our system is a statistical N-gram backoff LM. We train two 4-gram LMs with interpolated modified Kneser-Ney smoothing (Chen and Goodman, 1999) using the MITLM toolkit (Hsu and Glass, 2008; https://github.com/mitlm/mitlm) and combine them using linear interpolation. The first LM is estimated on the basis of our training data split (see Table 1). For simplicity, we refer to this model as the shared task LM (STLM). While this LM ensures that we capture the domain of the shared task data well, it is limited in terms of size and vocabulary. In order to improve the robustness of our system, we incorporate additional language data by estimating a second 4-gram LM on the German section of the Sparcling corpus (Graën et al., 2019).

The Sparcling corpus is a cleaned and normalised version of the Europarl corpus (Koehn, 2005), which contains a large collection of parallel texts based on debates published in the proceedings of the European Parliament. In total, the Sparcling corpus provides 1.75M German utterances, which are considered to be close to the target domain. The resulting LM is too large to be used directly, so we prune it using the SRILM toolkit (Stolcke, 2002) with a threshold of 10^-8. We refer to this model as the Sparcling LM (SparcLM). Once pruned, we linearly interpolate the STLM and the SparcLM with weights λ = 0.7 and λ = 0.3, respectively.
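To make the combination explicit: under linear interpolation, the combined model assigns each N-gram the probability P(w | h) = 0.7 · P_STLM(w | h) + 0.3 · P_SparcLM(w | h). The toy Python sketch below illustrates this with two invented stand-in unigram distributions; the real models are 4-gram ARPA LMs estimated with MITLM and pruned with SRILM, and all values here are fabricated for illustration.

```python
# Invented stand-in unigram probabilities for the two component models;
# the real STLM and SparcLM are 4-gram backoff LMs.
stlm = {"regierungsrat": 0.02, "der": 0.05}
sparclm = {"regierungsrat": 0.001, "der": 0.06}

LAMBDA_STLM, LAMBDA_SPARCLM = 0.7, 0.3


def p_interpolated(word: str) -> float:
    """Probability of a word under the linearly interpolated model."""
    return (LAMBDA_STLM * stlm.get(word, 0.0)
            + LAMBDA_SPARCLM * sparclm.get(word, 0.0))


print(p_interpolated("regierungsrat"))  # 0.7 * 0.02 + 0.3 * 0.001 = 0.0143
```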
4 Results

AM            LM             Dev     Test
TDNN-iVector  STLM+SparcLM   43.69   45.45

Table 2: WER results attained on the held-out development set and 50% of the test set.

Table 2 reports the WER results attained by our TDNN-iVector STT system (see Section 3.3) on the held-out development set and on the test set provided for the submission: the system achieves a WER of 43.69% and 45.45%, respectively. The test result is automatically calculated and published on the shared task's public leaderboard upon submission. The model was tuned with language model weights (LMWT) ranging from 7 to 17 and different word insertion penalty (WIP) values of 0.0, 0.5 and 1.0. The optimal parameters (LMWT = 9 and WIP = 0.0) were determined according to the best WER on the held-out development set and then applied in order to decode the test set for this submission.
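For reference, the WER values in Table 2 follow the standard definition WER = (S + D + I) / N, i.e. the word-level Levenshtein distance between hypothesis and reference (substitutions, deletions, insertions) normalised by the reference length N. Kaldi computes this during scoring; the following is only a minimal sketch of the metric itself, with a made-up example.

```python
def wer(ref, hyp):
    """Word error rate in percent: word-level edit distance / len(ref)."""
    # Dynamic-programming (Levenshtein) alignment between word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)


# Toy example: one inserted word against a three-word reference.
print(wer("der regierungsrat beantragt".split(),
          "der regierungsrat beantragt ablehnung".split()))  # 33.33...
```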
5 Discussion

An assessment of our system's output transcriptions against audio samples from the test data reveals that the results are comprehensible and depict the speech utterance well in most cases. Common errors include single missing words in the transcription, separated writing of compounds (e.g. bildung direktor instead of bildungsdirektor) and the absence of numbers. The latter can easily be explained by the fact that our lexicon does not include digits and thus needs to be further extended in order to cover such common lexical items.

We also noticed that words at the beginning and end of the audio samples are cut off in many cases, making it difficult for the system to recognise these words correctly. In addition, it is clear that speech utterances do not necessarily correspond to single sentences, but rather to sentence fragments or, in some cases, multiple sentences: a manual evaluation of a sample of 100 speech utterances from the test set shows that 58 are not complete sentences, of which 25 also contain fragments from the preceding or following utterance. The LM, however, is trained largely on complete sentences and could thus fail to account for N-gram sequences that bridge typical sentence boundaries.

6 Conclusion

In this paper, we have described our proposed solution for the GermEval 2020 Task 4: Low-Resource Speech-to-Text challenge. We have implemented an advanced TDNN AM using popular acoustic speech data augmentation techniques available as part of the Kaldi Speech Recognition Toolkit. Our model achieves a WER of 45.45% on the public part of the task's test set, which we believe is competitive given the amount of training data and the major challenges involved in STT for languages with a high degree of dialectal variability such as Swiss German.

Acknowledgments

We would like to thank Fransisco Campillo from Spitch AG, who helped us in setting up our initial STT system, which was used as a springboard for our experiments and investigations.

References

Stanley F. Chen and Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359-394.

Johannes Graën. 2018. Exploiting Alignment in Multiparallel Corpora for Applications in Linguistics and Language Learning. Ph.D. thesis, University of Zurich.

Johannes Graën, Tannon Kew, Anastassia Shaitarova, and Martin Volk. 2019. Modelling large parallel corpora: The Zurich Parallel Corpus Collection. In Challenges in the Management of Large Corpora (CMLC-7). Leibniz-Institut für Deutsche Sprache.

Bo-June (Paul) Hsu and James R. Glass. 2008. Iterative language model estimation: Efficient data structure & algorithms. In Proceedings of the Ninth Annual Conference of the International Speech Communication Association, pages 841-844, Brisbane, Australia.

Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. 2015. Audio augmentation for speech recognition. In Sixteenth Annual Conference of the International Speech Communication Association.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79-86. Citeseer.

Yajie Miao, Hao Zhang, and Florian Metze. 2015. Speaker adaptive training of deep neural network acoustic models using i-vectors. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(11):1938-1949.

Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. 2015. A time delay neural network architecture for efficient modeling of long temporal contexts. In Sixteenth Annual Conference of the International Speech Communication Association.

Michel Plüss, Lukas Neukom, and Manfred Vogel. 2020. GermEval 2020 Task 4: Low-Resource Speech-to-Text. In preparation.

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al. 2011. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society.

George Saon, Hagen Soltau, David Nahamoo, and Michael Picheny. 2013. Speaker adaptation of neural network acoustic models using i-vectors. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pages 55-59. IEEE.

Larissa Schmidt, Lucy Linder, Sandra Djambazovska, Alexandros Lazaridis, Tanja Samardžić, and Claudiu Musat. 2020. A Swiss German dictionary: Variation in speech and writing.

Andreas Stolcke. 2002. SRILM - An extensible language modeling toolkit. In Proceedings of the Seventh International Conference on Spoken Language Processing (ICSLP), Denver, USA.

John C. Wells et al. 1997. SAMPA computer readable phonetic alphabet. Handbook of Standards and Resources for Spoken Language Systems, 4.