<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UZH TILT: A Kaldi recipe for Swiss German Speech to Standard German Text</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Zurich</institution>
          ,
          <addr-line>iuliia.nigmatulina, tannon.kew, lorenz.nagele</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Swiss German Speech-to-Text (STT) is a challenging task because no single dominant pronunciation or standardised orthography exists. This is compounded by a severe lack of appropriate training data. One potential avenue, and the one investigated in the GermEval 2020 Task 4 on Low-Resource Speech-to-Text, is to translate spoken Swiss German into standard German text implicitly through STT. In this paper, we describe our proposed system, which uses the Kaldi Speech Recognition Toolkit to implement a time delay neural network (TDNN) Acoustic Model (AM) with an extended pronunciation lexicon and language model. Using this approach, we achieve a word error rate of 45.45% on the held-out test set.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        In this paper, we describe our approach to the
GermEval 2020 Task 4 on Low-Resource
Speech-to-Text
        <xref ref-type="bibr" rid="ref9">(Plüss et al., 2020)</xref>
        held as part of the 5th
SwissText and the 16th KONVENS Joint
Conference 2020. The goal of this shared task is to
develop an STT system capable of converting Swiss
German speech utterances into standard German
text.
      </p>
      <p>
        Our system makes use of the Kaldi Speech
Recognition Toolkit
        <xref ref-type="bibr" rid="ref10">(Povey et al., 2011)</xref>
        .
Specifically, we adapt the WSJ chain recipe to integrate
a time delay neural network (TDNN) component
in the process of training the acoustic model (AM)
with iVectors
        <xref ref-type="bibr" rid="ref5 ref8">(Peddinti et al., 2015)</xref>
        . The TDNN
architecture allows for better learning of long-term
temporal dependencies between phonemes in a
sequence. Using iVectors potentially contributes to
better generalisation to unseen data, and to DNN
adaptation with the additional feature normalisation
        <xref ref-type="bibr" rid="ref11 ref7">(Saon et al., 2013; Miao et al., 2015)</xref>
        . In addition
to the data set provided by the organisers, we use
an external pronunciation lexicon
        <xref ref-type="bibr" rid="ref12">(Schmidt et al.,
2020)</xref>
        and the German section of the Sparcling
corpus<sup>1</sup>
        <xref ref-type="bibr" rid="ref3">(Graën et al., 2019)</xref>
        to build a robust N-gram
language model suitable for the target domain.
      </p>
      <p>The layout of this report is as follows: Section
2 describes the aim of the shared task and the data
provided. In Section 3, we describe our approach
and the individual components used in our system.
We report the overall performance of our system
based on a held-out development set and the task
test set in Section 4. Finally, in Section 5, we
conclude with a discussion on some of the advantages
and limitations of our approach.
</p>
    </sec>
    <sec id="sec-2">
      <title>2 Data</title>
      <p>The dataset for this shared task was provided by the
organisers and comprises a training set of
approximately 70 hours of annotated speech data from
Swiss parliamentary discussions plus an additional
4 hours of audio recordings for system evaluation.
This test set only contains recordings of speakers
that are not present in the training data.</p>
      <p>The training data includes a total of 36,572
utterances spoken by 191 different speakers. Each
utterance is annotated with its transcription in
standard German and a unique speaker ID. According
to the description of the shared task data, spoken
utterances are predominantly in the Bernese dialect,
with some in standard German.</p>
      <p>
        We enrich the training data with two external
sources. First, we derive a high-coverage
pronunciation lexicon containing more than 38,000 standard
German words with an approximate Swiss German
pronunciation to facilitate AM training. Second,
we complement the N-gram language model (LM) trained
on the shared task data with an additional 4-gram LM
trained on the German section of the Sparcling
corpus
        <xref ref-type="bibr" rid="ref3">(Graën et al., 2019)</xref>
        . These steps are described
in more detail in Sections 3.2 and 3.3, respectively.
      </p>
      <p>
        <sup>1</sup> The Sparcling corpus is described in detail as ‘FEP 9’ in
        <xref ref-type="bibr" rid="ref2">(Graën, 2018)</xref>
        .
      </p>
      <sec id="sec-2-1">
        <title>2.1 Preprocessing</title>
        <p>Utterance transcriptions are already partially
preprocessed with character mapping to a defined set
of allowable characters and lowercasing applied<sup>2</sup>.
Therefore, we only apply one further step for text
preprocessing, namely tokenisation. We use a
simple, general-purpose tokeniser trained on German
from the Python NLTK module<sup>3</sup>.</p>
        <p>Once tokenised, we set aside 10% of the
training data as a development set for the purpose of
fine-tuning model parameters. Table 1 gives an
overview of the dataset splits used for this task.</p>
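        <p>The 10% hold-out described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used for the submission: the function name <monospace>split_train_dev</monospace>, the fixed random seed and the utterance-level shuffling are all hypothetical, since the text does not specify how the development utterances were selected.</p>
        <preformat>
```python
import random


def split_train_dev(utterance_ids, dev_fraction=0.1, seed=0):
    """Shuffle utterance IDs and hold out a fraction as a development set."""
    ids = list(utterance_ids)
    # A fixed seed keeps the split reproducible (an assumption, not from the paper).
    random.Random(seed).shuffle(ids)
    n_dev = round(len(ids) * dev_fraction)
    return ids[n_dev:], ids[:n_dev]  # (train, dev)


# 36,572 is the number of training utterances reported in Section 2.
train, dev = split_train_dev(range(36572))
print(len(train), len(dev))
```
</preformat>
        <p>A speaker-level split would be a reasonable alternative if the development set should mimic the unseen-speaker condition of the test set.</p>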
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Methods</title>
      <p>In this section, we present the main components
of our STT system, namely, the acoustic model,
pronunciation lexicon and language model.</p>
      <sec id="sec-3-1">
        <title>3.1 Acoustic Model</title>
        <p>We base our STT system for Swiss German on
the WSJ chain recipe with the time delay neural
network (TDNN) architecture provided in the Kaldi
toolkit. The alignment between acoustic signal
segments and transcriptions is obtained with a
GMM-HMM discriminative model trained with
the Maximum Mutual Information (MMI) criterion,
with 4,000 senones and 40,000 Gaussians.</p>
        <p><sup>2</sup> https://www.cs.technik.fhnw.ch/speech-to-text-labeling-tool/swisstext-2020/competition/1</p>
        <p><sup>3</sup> https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.toktok</p>
        <p>We use 13-dimensional Mel-Frequency Cepstral
Coefficient (MFCC) features with cepstral
mean-variance normalisation (CMVN), the first and
second derivatives, and Linear Discriminant
Analysis (LDA) and Maximum Likelihood Linear
Transform (MLLT) transformations. In addition, we
include 100-dimensional iVectors extracted from
each speech frame in order to normalise the
variation between speakers and dialectal varieties.</p>
        <p>
          To increase the amount of training data and
improve the robustness of the AM, we apply common
data augmentation techniques: audio speed
perturbation with speed factors of 0.9, 1.0 and 1.1,
followed by volume perturbation with volume factors
sampled from the interval [0.125, 2.0]
          <xref ref-type="bibr" rid="ref5">(Ko et al.,
2015)</xref>
          .
        </p>
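        <p>To make the effect of this augmentation concrete, the sketch below builds an augmentation plan in which every utterance yields three copies (one per speed factor), each with its own volume factor. It processes no audio. Uniform sampling of the volume factor, the function name <monospace>augmentation_plan</monospace> and the ID prefix scheme are assumptions made for illustration; only the speed factors and the volume interval come from the text.</p>
        <preformat>
```python
import random

SPEED_FACTORS = (0.9, 1.0, 1.1)  # from the text
VOLUME_RANGE = (0.125, 2.0)      # from the text


def augmentation_plan(utterance_ids, seed=0):
    """Return one (new_id, speed, volume) entry per utterance and speed factor."""
    rng = random.Random(seed)
    plan = []
    for utt in utterance_ids:
        for speed in SPEED_FACTORS:
            # Uniform sampling is an assumption; the text only gives the interval.
            volume = rng.uniform(*VOLUME_RANGE)
            plan.append((f"sp{speed}-{utt}", speed, volume))
    return plan


plan = augmentation_plan(["utt1", "utt2"])
for entry in plan:
    print(entry)
```
</preformat>
        <p>With three speed factors, the plan contains three times as many entries as there are input utterances.</p>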
        <p>The AM was trained on NVIDIA Tesla K80
GPUs; training took around 14 hours.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Pronunciation Lexicon</title>
        <p>
          For the pronunciation lexicon, we make use of
an 11,000-word dictionary mapping standard
German words to their Swiss German pronunciations
          <xref ref-type="bibr" rid="ref12">(Schmidt et al., 2020)</xref>
          . This dictionary contains
manually annotated pronunciation strings (in the
SAMPA alphabet
          <xref ref-type="bibr" rid="ref14">(Wells et al., 1997)</xref>
          ) for six
major regional varieties, namely Zurich, St. Gallen,
Bern, Basel, Valais and Nidwalden. Since the task
data predominantly consists of Bernese dialect, we
use the pronunciation strings for this regional
variety only. Furthermore, we normalise the standard
German words using the same text preprocessing
steps as provided in the shared task description (i.e.
character mapping and converting to lowercase).
        </p>
        <p>Initially, the SAMPA dictionary provides only
15% lexical coverage of the shared task dataset.
In order to increase this, we train a
transformer-based grapheme-to-phoneme (g2p) model<sup>4</sup> on the
available pairs (standard German, Swiss SAMPA)
and apply it to the words from the dataset for which
manual Swiss SAMPA annotation is missing. We
train the g2p model with the default settings.<sup>5</sup></p>
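        <p>Lexical coverage, as used here, can be read as the share of the dataset vocabulary that has an entry in the lexicon. The sketch below computes it over distinct word types. Whether coverage was measured over types or over running tokens is not stated, so the type-level definition is an assumption, as are the function name <monospace>lexical_coverage</monospace> and the invented toy data.</p>
        <preformat>
```python
def lexical_coverage(corpus_tokens, lexicon_words):
    """Fraction of distinct corpus word types that have a lexicon entry."""
    vocabulary = set(corpus_tokens)
    if not vocabulary:
        return 0.0
    covered = vocabulary.intersection(lexicon_words)
    return len(covered) / len(vocabulary)


# Toy data, invented for illustration only.
tokens = "der bildungsdirektor spricht im rat der stadt".split()
lexicon = {"der", "im", "rat", "stadt"}
coverage = lexical_coverage(tokens, lexicon)
print(f"{coverage:.1%}")
```
</preformat>
        <p>Words that remain uncovered after this check are the ones passed to the g2p model.</p>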
        <p>As a result of this process, we attain a lexicon
that provides 97.5% coverage of the shared task
dataset. The remaining 2.5% of items not covered
in the extended lexicon include tokens consisting
of digits (e.g. numbers, dates, etc.) and
punctuation (e.g. web addresses), since the original SAMPA
dictionary does not contain such characters. While
the overall word-level accuracy of the g2p model,
estimated on a held-out test set, is only 39%, the
output of the model is still useful for the STT
system, since it provides good coverage with plausible
g2p mappings, as confirmed by manual inspection of
the output.</p>
        <p><sup>4</sup> https://github.com/cmusphinx/g2p-seq2seq</p>
        <p><sup>5</sup> Default settings for g2p-seq2seq are as follows: size of
each hidden layer = 256, number of layers = 3, size of the
filter layer in a convolutional layer = 512, number of heads in
multi-attention mechanism = 4.</p>
        <p>[Table residue: system "TDNN-iVector STLM+SparcLM"; column "Dev".]</p>
        <sec id="sec-3-2-2">
          <title>3.3 Language Model</title>
          <p>
The language modeling component used in our
system is a statistical N-gram backoff LM. We
train two 4-gram LMs with interpolated modified
Kneser-Ney smoothing
            <xref ref-type="bibr" rid="ref1">(Chen and Goodman, 1999)</xref>
            using the MITLM toolkit<sup>6</sup>
            <xref ref-type="bibr" rid="ref4">(Hsu and Glass, 2008)</xref>
            and combine them using linear interpolation. The
first LM is estimated on the basis of our training
data split (see Table 1). For simplicity, we refer to
this model as the shared task LM (STLM). While
this LM ensures that we capture the domain of the
shared task data well, it is limited in terms of size
and vocabulary. In order to improve the robustness
of our system, we incorporate additional language
data by estimating a second 4-gram LM on the
German section of the Sparcling corpus
            <xref ref-type="bibr" rid="ref3">(Graën et al.,
2019)</xref>
            .
          </p>
          <p>
            The Sparcling corpus is a cleaned and
normalised version of the Europarl corpus
            <xref ref-type="bibr" rid="ref6">(Koehn,
2005)</xref>
            , which contains a large collection of parallel
texts based on debates published in the proceedings
of the European Parliament. In total, the Sparcling
corpus provides 1.75M German utterances, which
are considered to be close to the target domain. The
resulting LM is too large to be used directly, so we
prune it using the SRILM toolkit
            <xref ref-type="bibr" rid="ref13">(Stolcke, 2002)</xref>
            ,
setting a threshold of 10<sup>-8</sup>. We refer to this model
as the Sparcling LM (SparcLM). Once pruned, we
linearly interpolate the STLM and the SparcLM
with weights of 0.7 and 0.3, respectively.
          </p>
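          <p>Linear interpolation assigns each word the weighted sum of the probabilities given by the two models in the same context. The sketch below applies the weights 0.7 (STLM) and 0.3 (SparcLM) to two toy next-word distributions. The function name <monospace>interpolate</monospace> and all probabilities are invented for illustration; in the actual system, interpolation is carried out on full backoff N-gram models by the LM toolkit.</p>
          <preformat>
```python
def interpolate(p_stlm, p_sparclm, stlm_weight=0.7):
    """Linearly interpolate two next-word distributions for the same context."""
    vocabulary = set(p_stlm) | set(p_sparclm)
    return {
        word: stlm_weight * p_stlm.get(word, 0.0)
        + (1.0 - stlm_weight) * p_sparclm.get(word, 0.0)
        for word in vocabulary
    }


# Hypothetical next-word probabilities for a single context.
stlm = {"rat": 0.5, "kanton": 0.5}
sparclm = {"rat": 0.2, "parlament": 0.8}
mixed = interpolate(stlm, sparclm)
for word in sorted(mixed):
    print(word, round(mixed[word], 2))
```
</preformat>
          <p>Because the weights sum to one, the interpolated values again form a probability distribution, and words known to only one model still receive non-zero probability.</p>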
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Results</title>
      <p>On the held-out development set and on the test set used
for the submission<sup>7</sup>, the system achieves a WER
of 43.69% and 45.45%, respectively. The model
was tuned with language model weights (LMWT)
ranging from 7 to 17, and word insertion
penalty (WIP) values of 0.0, 0.5 and 1.0. The
optimal parameters (LMWT = 9 and WIP = 0.0) were
determined according to the best WER on the
held-out development set and then applied to
decode the test set for this submission.</p>
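      <p>For reference, WER is the word-level edit distance (substitutions, deletions and insertions) between a hypothesis and its reference, divided by the number of reference words. The following sketch is a plain dynamic-programming implementation and not the scoring script used by the shared task. The function name <monospace>word_error_rate</monospace> and the sentence pair, which contains a split compound, are invented for illustration.</p>
      <preformat>
```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j]: edit distance between the reference prefix seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing in hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "der bildungsdirektor spricht"
hypothesis = "der bildung direktor spricht"
wer = word_error_rate(reference, hypothesis)
print(f"{wer:.2%}")
```
</preformat>
      <p>In this pair, the split compound counts as one substitution plus one insertion, so a single compound written as two words produces two errors against a three-word reference.</p>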
    </sec>
    <sec id="sec-5">
      <title>5 Discussion</title>
      <p>An assessment of our system output transcriptions
against audio samples from the test data reveals
that the results are comprehensible and depict the
speech utterance well in most cases. Common
errors include single missing words in the
transcription, separated writing of compounds (e.g.
bildung direktor instead of bildungsdirektor) and the
absence of numbers. The latter can easily be
explained by the fact that our lexicon does not include
digits and thus needs to be further extended in order
to cover such common lexical items.</p>
      <p>We also noticed that words at the beginning and
end of the audio samples are cut off in many cases,
making it difficult for the system to recognise these
words correctly. In addition, speech
utterances often do not correspond to single
sentences, but rather to sentence
fragments or, in some cases, multiple sentences<sup>8</sup>. The
LM, however, is trained largely on complete
sentences and could thus fail to account for N-gram
sequences that bridge typical sentence boundaries.</p>
    </sec>
    <sec id="sec-6">
      <title>6 Conclusion</title>
      <p>In this paper, we have described our proposed
solution for the GermEval 2020 Task 4: Low-Resource
Speech-to-Text challenge. We have implemented
an advanced TDNN AM using popular acoustic
speech data augmentation techniques available as
part of the Kaldi Speech Recognition Toolkit. Our
model achieves a WER of 45.45% on the public
part of the task’s test set, which we believe is
competitive given the amount of training data and the
major challenges involved in STT for languages
with a high degree of dialectal variability such as
Swiss German.</p>
      <p><sup>7</sup> This result is automatically calculated and published on
the shared task’s public leaderboard upon submission.</p>
      <p><sup>8</sup> A manual evaluation of a sample of 100 speech utterances
from the test set shows that 58 are not complete sentences,
of which 25 also contain fragments from the preceding or
following utterance.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>We would like to thank Fransisco Campillo from
Spitch AG, who helped us in setting up our initial
STT system that was used as a springboard for our
experiments and investigations.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
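          <string-name>
            <given-names>Stanley F.</given-names>
            <surname>Chen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Joshua</given-names>
            <surname>Goodman</surname>
          </string-name>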
          .
          <year>1999</year>
          .
          <article-title>An Empirical Study of Smoothing Techniques for Language Modeling</article-title>
          .
          <source>Computer Speech &amp; Language</source>
          ,
          <volume>13</volume>
          (
          <issue>4</issue>
          ):
          <fpage>359</fpage>
          -
          <lpage>394</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Johannes</given-names>
            <surname>Graën</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Exploiting alignment in multiparallel corpora for applications in linguistics and language learning</article-title>
          .
          <source>Ph.D. thesis</source>
          , University of Zurich.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Johannes</given-names>
            <surname>Graën</surname>
          </string-name>
          , Tannon Kew, Anastassia Shaitarova, and
          <string-name>
            <given-names>Martin</given-names>
            <surname>Volk</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Modelling large parallel corpora: The zurich parallel corpus collection</article-title>
          .
          <source>In Challenges in the Management of Large Corpora (CMLC-7)</source>
          . Leibniz-Institut für Deutsche Sprache.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Bo-June (Paul)</given-names>
            <surname>Hsu</surname>
          </string-name>
          and
          <string-name>
            <given-names>James R.</given-names>
            <surname>Glass</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Iterative language model estimation: Efficient data structure &amp; algorithms</article-title>
          .
          <source>In Proceedings of the Ninth Annual Conference of the International Speech Communication Association</source>
          , pages
          <fpage>841</fpage>
          -
          <lpage>844</lpage>
          , Brisbane, Australia.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Tom</given-names>
            <surname>Ko</surname>
          </string-name>
          , Vijayaditya Peddinti, Daniel Povey, and
          <string-name>
            <given-names>Sanjeev</given-names>
            <surname>Khudanpur</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Audio augmentation for speech recognition</article-title>
          .
          <source>In Sixteenth Annual Conference of the International Speech Communication Association.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Philipp</given-names>
            <surname>Koehn</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Europarl: A parallel corpus for statistical machine translation</article-title>
          .
          <source>In MT summit</source>
          , volume
          <volume>5</volume>
          , pages
          <fpage>79</fpage>
          -
          <lpage>86</lpage>
          . Citeseer.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Yajie</given-names>
            <surname>Miao</surname>
          </string-name>
          , Hao Zhang, and
          <string-name>
            <given-names>Florian</given-names>
            <surname>Metze</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Speaker adaptive training of deep neural network acoustic models using i-vectors</article-title>
          .
          <source>IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>
          ,
          <volume>23</volume>
          (
          <issue>11</issue>
          ):
          <fpage>1938</fpage>
          -
          <lpage>1949</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Vijayaditya</given-names>
            <surname>Peddinti</surname>
          </string-name>
          , Daniel Povey, and
          <string-name>
            <given-names>Sanjeev</given-names>
            <surname>Khudanpur</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>A time delay neural network architecture for efficient modeling of long temporal contexts</article-title>
          .
          <source>In Sixteenth Annual Conference of the International Speech Communication Association.</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Michel</given-names>
            <surname>Plüss</surname>
          </string-name>
          , Lukas Neukom, and
          <string-name>
            <given-names>Manfred</given-names>
            <surname>Vogel</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>GermEval 2020 Task 4: Low-Resource Speech-to-Text</article-title>
          . In preparation.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Povey</surname>
          </string-name>
          , Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian,
          <string-name>
            <given-names>Petr</given-names>
            <surname>Schwarz</surname>
          </string-name>
          , et al.
          <year>2011</year>
          .
          <article-title>The kaldi speech recognition toolkit</article-title>
          .
          <source>In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>George</given-names>
            <surname>Saon</surname>
          </string-name>
          , Hagen Soltau, David Nahamoo, and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Picheny</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Speaker adaptation of neural network acoustic models using i-vectors</article-title>
          .
          <source>In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding</source>
          , pages
          <fpage>55</fpage>
          -
          <lpage>59</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Larissa</given-names>
            <surname>Schmidt</surname>
          </string-name>
          , Lucy Linder, Sandra Djambazovska, Alexandros Lazaridis, Tanja Samardžić, and
          <string-name>
            <given-names>Claudiu</given-names>
            <surname>Musat</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>A swiss german dictionary: Variation in speech and writing</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Stolcke</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>SRILM - An Extensible Language Modeling Toolkit</article-title>
          .
          <source>In Proceedings of the Seventh International Conference on Spoken Language Processing (ICSLP)</source>
          , Denver, USA.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>John C.</given-names>
            <surname>Wells</surname>
          </string-name>
          et al.
          <year>1997</year>
          .
          <article-title>SAMPA computer readable phonetic alphabet</article-title>
          .
          <source>Handbook of Standards and Resources for Spoken Language Systems</source>
          , 4
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>