Future Actions for Swiss German —
                                  Workshop Results at SwissText 2018
           Tanja Samardžić          Mark Cieliebak                                            Jan Milan Deriu
       Language and Space Lab     School of Engineering                                     School of Engineering
         University of Zurich       Zurich University                                         Zurich University
    tanja.samardzic@uzh.ch         of Applied Sciences                                       of Applied Sciences
                              mark.cieliebak@zhaw.ch                                       jan.deriu@zhaw.ch


In: Mark Cieliebak, Don Tuggener and Fernando Benites (eds.): Proceedings of the 3rd Swiss
Text Analytics Conference (SwissText 2018), Winterthur, Switzerland, June 2018

                         Abstract

     The goal of this workshop was to initiate collaborations among companies and academic
     institutions for developing Swiss German resources and activities. The need for such
     an initiative arises from a growing interest in applying automatic text processing
     technologies to Swiss German, in a context where data sets are particularly scarce.
     We considered potential modes of collaborative data development and management. The
     outcomes of the workshop are defined common interests and priorities, and the first
     steps towards future synchronised efforts.

1    Introduction

Automatic processing of Swiss German has long been regarded as unnecessary, since standard
German is regularly used in public communication in Switzerland. This view has recently
changed, however, following the increased presence of local varieties in public
communication (mostly on the Internet). This has led several companies and academic
institutions to start working on automatic processing of Swiss German, which has resulted
in initial data sets that can be used for training models for automatic speech and text
processing. These data sets are, however, scattered across the institutions that produced
them and are not easily accessible to researchers outside the host institutions. Moreover,
each data set on its own is too small to allow good performance on tasks such as speech
recognition, translation, or normalisation. All involved parties need more data and would
greatly benefit from an exchange and future joint development. The goal of this workshop
was to bring together researchers working on Swiss German in companies and in academic
institutions in order to identify common needs and modes of future collaboration.

2   Overview of the Current Activities

As part of the preparations for the workshop, we conducted an informal survey among the
researchers and institutions that we knew were interested in, or already working on,
automatic processing of Swiss German. We asked them to share with us an overview of their
data sets, tools, and general activities related to Swiss German. We received ten
responses, coming from the following organisations:

   • Institute of Computational Linguistics, University of Zurich
   • Swiss Re
   • School of Business and Engineering Vaud
   • Swisscom
   • Institute of Applied Information Technology, Zurich University of Applied Sciences
   • School of Applied Linguistics, Zurich University of Applied Sciences
   • Spitch AG
   • University of Helsinki
   • Slowsoft

Based on these responses and our own insight, we have composed an initial overview of the
existing
data, tools, and addressed processing tasks. This information is presented in the remainder
of this section. While we did our best to collect as much information as possible at
present, this review is not to be regarded as an exhaustive inventory, but rather as a
first step towards a complete inventory that will be developed through collaborative work.

2.1 The List of Institutions

In research on automatic processing of Swiss German, the practical and scholarly sides are
equally pronounced. To make standard natural language processing work for Swiss German, we
need to understand and address the details of its particular and complex usage practices.
There is thus considerable overlap between the work on developing end-user applications,
primarily done in companies, the design of algorithmic solutions, typically at universities
of applied sciences, and the data-driven study of linguistic variation in Swiss German,
primarily performed in academic institutions.

We have identified several groups inside academic institutions where some work relevant to
the automatic processing of Swiss German is taking place. They are listed in Table 1. For
Swiss institutions, we give in parentheses the corresponding abbreviations, which are
easily recognisable to the Swiss audience.

 Institution                                               Group/Individual
 School of Business and Engineering Vaud (HEIG-VD)         Andrei Popescu-Belis
 The Idiap Research Institute (Idiap)                      Walliserdeutsch
 Schweizerisches Idiotikon
 University of Geneva (UniGe)                              Manuela Schönenberger, Eric Haeberli
 University of Helsinki                                    Yves Scherrer
 University of Zurich (UZH)                                Language and Space Lab, German Department,
                                                           Romance Department, Institute of
                                                           Computational Linguistics
 Zurich University of Applied Sciences (ZHAW)              Text Analytics and Dialogue Systems Group

                      Table 1: Research groups in academic institutions in alphabetical order

An additional institution where some work on Swiss German is likely to take place is ETH
Zurich, but we have not established a relevant contact up to this point.

Considerable efforts have already been invested in processing Swiss German in the companies
listed in Table 2. Additionally, one company, Telepathy Labs, is associated with a
published piece of work on Swiss German, but their engagement is yet to be confirmed.

                     Company
                     Recapp
                     Spitch
                     SpinningBytes
                     Slowsoft
                     Swisscom

 Table 2: Companies working on/with Swiss German

2.2 Data Sets

Here we list the Swiss German data sets produced and made available by the groups listed
above and by some other groups who worked on Swiss German in the past. This list relies on
three publications that appeared in the proceedings of the Language Resources and
Evaluation Conference (LREC 2018), all providing informative overviews of the state of the
art:

   • SB-CH: A Swiss German Corpus with Sentiment Annotations
     R. Grubenmann, D. Tuggener, P. von Däniken, J. Deriu, M. Cieliebak

   • Machine Translation of Low-Resource Spoken Dialects: Strategies for Normalizing Swiss
     German
     P.-E. Honnet, A. Popescu-Belis, C. Musat, M. Baeriswyl

   • Strategies and Challenges for Crowdsourcing Regional Dialect Perception Data for Swiss
     German and Swiss French
     J.-P. Goldman, S. Clematide, M. Avanzi, R. Tandler


We divide the data sets into two major types: text corpora (Table 3) and lexica (Table 4).
For each listed item, we specify the institution or the group that developed it and the
most important characteristics of the data set. The column “Text” specifies whether the
resource contains text. “Sound” specifies whether it contains recorded speech (both are
present when sound recordings are transcribed). “Norm/Trans” specifies whether there is
word-level normalisation of writing or full translation to standard German. We join these
two features together because normalisation typically involves standard German writing
applied to Swiss German. The difference between the two is that normalised text is not
necessarily proper standard German in the sense of orthography, grammar and style. The last
column, “PoS”, specifies whether the text is annotated with part-of-speech tags. We list
the institution name where it is clear which institution is responsible for the resource;
otherwise, we provide information about its authors.

There are two specific remarks regarding Table 3. First, note that two corpora contain
additional annotation: sentiment in the SB-CH corpus and syntax in the UniGe corpus.
Second, the data used in the project “din dialäkt” are often taken from already existing
resources, which means that there are considerable overlaps between this set and other
known sources.

Regarding Table 4, we use parentheses “( )” to signal two remarks. First, the mapping to
standard German referred to in the column “Norm/Trans” exists in the resources built by the
Idiotikon team, but it is not encoded in the same way as in the corpora. This applies to
the part-of-speech information too: while there is information on the word types in the
dictionary, these codes do not follow the usual German tag sets. Second, the data used for
the projects by Leemann et al. are drawn from other sources (SDS), resulting in a
considerable overlap. The mapping to standard German and the part-of-speech information are
also likely inherited from SDS.

Overall, Tables 3 and 4 show that most available data sets come from academic institutions.
While companies can be expected to have developed their own resources too, descriptions of
these resources are yet to be shared.

2.3    Processing Tasks and Tools

Natural language processing can potentially involve many different tasks for which specific
tools are developed. We list here those tasks that have been addressed for Swiss German or
that are mentioned as current activities in the contacted institutions. Viewing natural
language processing as a pipeline, or a stream, we divide the tasks into two groups:
upstream tasks (Table 5) and end-user tasks (including annotation tools, Table 6). The
output of the first group of tasks is not necessarily visible to the end user, but is
rather used as input to the end-user tasks.

The parentheses in these two tables indicate different remarks. In Table 5, they are used
to specify the institution where the work on the given task is performed. In Table 6, the
parentheses indicate that it is not clear at this point whether the tasks are attempted
specifically for Swiss German, since most of the institutions develop their applications
for multiple languages. Although it is not surprising, it is interesting to note that
Table 5 lists mostly academic institutions, while companies are more involved in the tasks
listed in Table 6.

3     Future Actions

Most of the groups listed in our overview were represented at the workshop, which was
generally very well attended. The live discussion that followed the overview of the current
state of the resources and tools addressed the following points.

Comments on the overview

The comments from the audience on the presented review showed that several participants
were involved in projects on speech recognition with Recapp, which were not covered by the
overview. They also pointed out the resources developed by Slowsoft (transcribed Swiss
German sentences, pronounced by one speaker), which were not listed in the overview due to
a miscommunication. Other potential sources of data were mentioned, such as SRF subtitles,
which are in standard German but aligned with the Swiss German audio.

Identifying common needs and priorities for further development

The discussions on the common needs revolved around the question of writing for Swiss
German


            Project                 Who                        Text   Sound    Norm/Trans     PoS
            ArchiMob                UZH+Spitch                  X       X          X           X
            BE-Novel                Honnet et al.               X                  X
            NOAH                    UZH                         X                              X
            Phonogram               UZH                         X       X           X
            SB-CH (Sentiment)       ZHAW+SpinningBytes          X
            sms4science             UZH+Swisscom                X                   X          X
            walliserdeutsch         Idiap                       X       X           X
            Wil corpus (Syntax)     UniGe                       X       X                      X
            WUS (WhatsApp)          UZH                         X                   X          X
            din dialäkt             UZH                        (X)     (X)         (X)        (X)
            In progress             UZH                         X       X

      Table 3: An initial inventory of Swiss German text corpora available for training processing tools.
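
The column scheme of Table 3 lends itself to a machine-readable form. The following is only a minimal sketch of how such an inventory could be encoded and queried; the record layout, the field names, and the `select` helper are our own assumptions for illustration and are not part of any existing resource. Only four rows of Table 3 are copied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Corpus:
    """One row of the inventory; field names are hypothetical."""
    project: str
    who: str
    text: bool = False
    sound: bool = False
    norm_trans: bool = False
    pos: bool = False


# A few rows copied from Table 3 (not the complete inventory).
INVENTORY = [
    Corpus("ArchiMob", "UZH+Spitch", text=True, sound=True, norm_trans=True, pos=True),
    Corpus("NOAH", "UZH", text=True, pos=True),
    Corpus("SB-CH (Sentiment)", "ZHAW+SpinningBytes", text=True),
    Corpus("Wil corpus (Syntax)", "UniGe", text=True, sound=True, pos=True),
]


def select(inventory, **required):
    """Return the project names whose columns match all required values."""
    return [
        corpus.project
        for corpus in inventory
        if all(getattr(corpus, name) == value for name, value in required.items())
    ]


# Corpora that contain recorded speech and part-of-speech annotation.
print(select(INVENTORY, sound=True, pos=True))
```

A shared inventory in this spirit would let a group looking for, say, transcribed and tagged speech find candidate resources without contacting each institution separately.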


            Project                     Who                         Text   Sound    Norm/Trans     PoS
            Swiss German Atlas (SDS)    Schweizerisches Idiotikon    X                 (X)         (X)
            Swiss German Dictionary     Schweizerisches Idiotikon    X                 (X)         (X)
            Dialäkt Äpp                 Leemann et al.              (X)                (X)         (X)
            Voice Äpp                   Leemann et al.              (X)     (X)        (X)         (X)
            Pronunciation               Spitch                       X       X
            BE-Lexicon                  Honnet et al.                X                  X
            ZH-Lexicon                  Honnet et al.                X                  X

Table 4: An initial inventory of Swiss German lexical resources (potentially) relevant to automatic processing.


        Task                        Who
        Speech recognition          Spitch, UZH
        Normalisation               ArchiMob, SMS, WhatsApp (UZH)
        Anonymisation               Swisscom, UZH
        Morphology (finite-state)   Scherrer PhD thesis (UniGe), Baumgartner MA thesis (UZH)
        PoS                         ArchiMob, NOAH, SMS, WhatsApp (UZH)
        Syntax                      Forst MA thesis (UniL), Aepli MA thesis (UZH)

                     Table 5: Upstream tasks that have been attempted for Swiss German


                     Task                         Who
                     Active learning              Swisscom, UZH
                     Dialect identification       Swisscom, UZH, ZHAW
                     Normalisation                UZH
                     Sentiment annotation         Swisscom, SpinningBytes
                     Sentiment classification     (Swisscom), SpinningBytes, ZHAW, (Spitch)
                     Speaker identification       (Swisscom), Spitch
                     Speech Synthesis             (Slowsoft), (Swisscom), (Spitch)
                     Transcription                Swisscom-AILA

        Table 6: End-user applications and annotation tools that are being developed for Swiss German

texts. As there is no official standard, text is likely to be written either in a
non-standard way (as in user-generated content) or in standard German (as in the case of
SRF transcriptions, for instance). Mapping speech to a standard written form emerged as a
common need. Defining and implementing a common writing standard seems to be one of the
potential topics for collaboration. The discussion also showed that, to define other common
needs, we would need to analyse real use scenarios and identify the tasks based on them.

Modes of collaboration

The discussion on this topic showed that there is an interest in collaboration, but that
considerable effort will be needed to make it work. One obstacle is the fact that companies
typically work with sensitive data that cannot be shared. Another problem is that sharing
data requires additional work (such as anonymisation and detailed documentation). With a
clear idea of the potential benefits, the groups working on Swiss German might be ready to
invest more effort in order to facilitate collaboration.

Funding possibilities

For the moment, the foundation InnoSuisse seems like the best choice for submitting
proposals. It enables developing solutions for an identified need without a concrete
business plan. Other foundations targeting the exchange between academia and companies can
be considered.

Next steps

Based on the previous discussions, we have defined the following actions as the next steps
towards establishing a collaborative network for Swiss German:

   • Start collaborating by proposing the first InnoSuisse grant as soon as possible.

   • Organise a follow-up workshop shortly, where the ideas for the InnoSuisse proposal
     will be sketched (2/3 of the participants were interested in attending such a
     workshop).

   • Elaborate and share a detailed inventory of the existing resources and tools,
     including information on the conditions of use and data samples.

   • Formulate processing tasks based on use cases.

This plan can be considered the main outcome of the workshop, together with the established
contacts and identified commitments to work together on synchronising the efforts invested
in processing Swiss German.

