<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tanja Samardžić</string-name>
          <email>tanja.samardzic@uzh.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark Cieliebak</string-name>
          <email>mark.cieliebak@zhaw.ch</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Milan Deriu</string-name>
          <email>jan.deriu@zhaw.ch</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>In: Mark Cieliebak, Don Tuggener and Fernando Benites (eds.):</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Institute of Computational Linguistics, Uni-</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Language and Space Lab, University of Zurich</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Proceedings of the 3rd Swiss Text Analytics Conference (SwissText 2018)</institution>
          ,
          <addr-line>Winterthur, Switzerland, June 2018</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Engineering, Zurich University of Applied Sciences</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>versity of Zurich</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>The goal of this workshop was to initiate collaborations among companies and academic institutions for developing Swiss German resources and activities. The need for such an initiative is created by a growing interest in applying automatic text processing technologies to Swiss German, which takes place in the context of particularly scarce data sets. We considered potential modes of collaborative data development and management. The outcomes of the workshop are a set of defined common interests and priorities, and the first steps in future synchronised efforts.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Automatic processing of Swiss German has long been
regarded as unnecessary, as standard German is regularly
used in public communication in Switzerland. This
view, however, has recently changed following the
increased presence of local varieties in public
communication (mostly on the Internet). This has led several
companies and academic institutions to start working
on automatic processing of Swiss German, which has
resulted in the development of initial data sets that can
be used for training models for automatic speech and
text processing. These data sets are, however,
scattered across the institutions that produced them
and not easily accessible to researchers outside the
host institutions. Moreover, each data set
individually is too small to allow good performance on
tasks such as speech recognition, translation, or
normalisation. All involved parties need more data and
would greatly benefit from an exchange and future
joint development. The goal of this workshop was to
bring together researchers working on Swiss German
in companies and in academic institutions in order to
identify the common needs and modes of future
collaboration.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Overview of the Current Activities</title>
      <p>As part of the preparations for the workshop, we
conducted an informal survey among the
researchers and institutions we knew were interested in or
already working on automatic processing of Swiss German.
We asked the persons we contacted to share with us
an overview of their data sets, tools, and general
activities related to Swiss German. We received ten
responses, coming from the following organisations:</p>
      <p>• Swisscom
• School of Business and Engineering Vaud
• Institute of Applied Information Technology, Zurich University of Applied Sciences
• School of Applied Linguistics, Zurich University of Applied Sciences
• Spitch AG
• University of Helsinki
• Slowsoft</p>
      <p>Based on these responses and our own insight, we have composed an initial overview of the existing
data, tools, and addressed processing tasks. This
information is presented in the remainder of this section.
While we did our best to collect as much information
as possible at present, this review is not
to be regarded as an exhaustive inventory, but rather
as a first step towards a complete inventory that will
be developed through collaborative work.</p>
      <sec id="sec-2-7">
        <table-wrap>
          <label>Table 1</label>
          <table>
            <thead>
              <tr><th>Institution</th><th>Group/Individual</th></tr>
            </thead>
            <tbody>
              <tr><td>School of Business and Engineering Vaud (HEIG-VD)</td><td>Andrei Popescu-Belis</td></tr>
              <tr><td>The Idiap Research Institute (Idiap)</td><td>Walliserdeutsch</td></tr>
              <tr><td>Schweizerisches Idiotikon</td><td></td></tr>
              <tr><td>University of Geneva (UniGe)</td><td>Manuela Schönenberger, Eric Haeberli</td></tr>
              <tr><td>University of Helsinki</td><td>Yves Scherrer</td></tr>
              <tr><td>University of Zurich (UZH)</td><td>Language and Space Lab, German Department, Romance Department, Institute of Computational Linguistics</td></tr>
              <tr><td>Zurich University of Applied Sciences (ZHAW)</td><td>Text Analytics and Dialogue Systems Group</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <table-wrap>
          <label>Table 2</label>
          <table>
            <thead>
              <tr><th>Company</th></tr>
            </thead>
            <tbody>
              <tr><td>Recapp</td></tr>
              <tr><td>Spitch</td></tr>
              <tr><td>SpinningBytes</td></tr>
              <tr><td>Slowsoft</td></tr>
              <tr><td>Swisscom</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-2-7-1">
          <title>2.1 The List of Institutions</title>
          <p>In research on automatic processing of Swiss
German, the practical and scholarly sides are equally
pronounced. In order to make standard natural language
processing work for Swiss German, we need to
understand and address the details of its particular and
complex usage practices. There is thus a considerable
overlap between the work on developing end-user
applications, primarily done in companies, the design
of algorithmic solutions, typically at universities of applied sciences,
and the data-driven study of linguistic variation in
Swiss German, primarily performed in academic
institutions.</p>
          <p>We have identified several groups inside academic
institutions where some work relevant to the
automatic processing of Swiss German is taking place.
They are listed in Table 1. For Swiss institution, we</p>
        </sec>
        <sec id="sec-2-7-2">
          <title>2.2 Data Sets</title>
          <p>Here we list the Swiss German data sets produced and
made available by the groups listed above and some
other groups that worked on Swiss German in the
past. This list relies on three publications that
appeared in the proceedings of the Language Resources
and Evaluation Conference (LREC 2018), all
providing informative overviews of the state of the art:
• SB-CH: A Swiss German Corpus with Sentiment Annotations
R. Grubenmann, D. Tuggener, P. von Däniken, J. Deriu, M. Cieliebak
• Machine Translation of Low-Resource Spoken Dialects: Strategies for Normalizing Swiss German
P.-E. Honnet, A. Popescu-Belis, C. Musat, M. Baeriswyl
• Strategies and Challenges for Crowdsourcing Regional Dialect Perception Data for Swiss German and Swiss French
J.-P. Goldman, S. Clematide, M. Avanzi, R. Tandler</p>
          <p>We divide the data sets into two major types: text
corpora (Table 3) and lexica (Table 4). For each listed
item, we specify the institution or the group that
developed it and the most important characteristics of
the data set. The column “Text” specifies whether
the resource contains text, and “Sound” whether it
contains recorded speech (both are present when sound
recordings are transcribed). “Norm/Trans” specifies
whether there is word-level normalisation of writing
or full translation to standard German. We join these
two features together because normalisation typically
involves standard German writing applied to Swiss
German. The difference between the two is that
normalised text is not necessarily proper standard
German in the sense of orthography, grammar and style.
The last column, “PoS”, specifies whether the text is
annotated with part-of-speech tags. We list the institution
name where it is clear which institution is responsible
for the resource; otherwise, we provide
information about its authors.</p>
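          <p>The column layout described above can be read as a simple record type. The following is a minimal sketch, assuming a Python representation; the class name, the field names, the helper function and the two example rows are hypothetical placeholders and are not taken from Table 3 or Table 4.</p>
          <preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """One inventory row, mirroring the columns described in the text."""
    project: str
    who: str           # institution, or authors where no institution is clear
    text: bool         # "Text": the resource contains text
    sound: bool        # "Sound": the resource contains recorded speech
    norm_trans: bool   # "Norm/Trans": normalisation or translation to standard German
    pos: bool          # "PoS": annotated with part-of-speech tags


def transcribed(resources):
    """Projects with both text and sound, i.e. transcribed recordings."""
    return [r.project for r in resources if r.text and r.sound]


# Hypothetical placeholder rows, NOT actual entries of Table 3 or Table 4.
inventory = [
    Resource("example-speech-corpus", "Group A", True, True, True, False),
    Resource("example-text-corpus", "Group B", True, False, False, True),
]

print(transcribed(inventory))
```
          </preformat>
          <p>Keeping the four features as separate boolean fields makes it straightforward to select, for example, only those resources that could serve as training data for speech recognition.</p>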
          <p>There are two specific remarks regarding Table 3.
First, note that two corpora contain additional
annotation: the SB-CH corpus is annotated with sentiment and the UniGe
corpus with syntax. Second, the data used in the project
“din dialäkt” are often taken from already existing
resources, which means that there are considerable
overlaps between this set and other known sources.</p>
          <p>Regarding Table 4, we use parentheses “( )” to
signal two remarks. First, the mapping to standard German
referred to in the column “Norm/Trans” exists in the
resources built by the Idiotikon team, but it is not
encoded in the same way as in the corpora. This applies
to the part-of-speech information too: while there is
information on the word types in the dictionary, these
codes do not follow the usual German tag sets. Second,
the data used for the projects by Leemann et al. are
drawn from other sources (SDS), resulting in a
considerable overlap. The mapping to standard
German and the part-of-speech information are also likely inherited
from SDS.</p>
          <p>Overall, Tables 3 and 4 show that most available
data sets come from academic institutions. While
companies can be expected to have developed their
own resources too, descriptions of these resources are
yet to be shared.</p>
        </sec>
        <sec id="sec-2-7-3">
          <title>2.3 Processing Tasks and Tools</title>
          <p>Natural language processing can potentially involve
many different tasks for which specific tools are
developed. We list here those tasks that have been
addressed for Swiss German or that are mentioned as
current activities in the contacted institutions.
Assuming the view of natural language processing as
a pipeline, or a stream, we divide the tasks into two
groups: upstream tasks (Table 5) and end-user tasks
(including annotation tools, Table 6). The output of
the first group of tasks is not necessarily visible to the
end user, but rather used as input to the end-user tasks.</p>
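          <p>The pipeline view can be illustrated with a toy example in which an upstream task (word-level normalisation) feeds an end-user task (sentiment classification). This is only a sketch under stated assumptions: the lexicon entries, the polarity word lists and both function names are illustrative and do not correspond to any of the tools or resources listed in this overview.</p>
          <preformat>
```python
# Toy word-level lexicon from Swiss German spellings to standard German.
# Illustrative assumption, not drawn from any listed resource.
LEXICON = {"isch": "ist", "nöd": "nicht", "guet": "gut"}

# Toy polarity word lists in standard German, also an assumption.
POSITIVE = {"gut"}
NEGATIVE = {"schlecht"}


def normalise(tokens):
    """Upstream task: replace each known token by its standard German form."""
    return [LEXICON.get(t.lower(), t) for t in tokens]


def classify(tokens):
    """End-user task: naive polarity count where 'nicht' flips the next word."""
    score = 0
    negate = False
    for t in tokens:
        if t == "nicht":
            negate = True
            continue
        if t in POSITIVE:
            score += -1 if negate else 1
        elif t in NEGATIVE:
            score += 1 if negate else -1
        negate = False
    if score > 0:
        return "positive"
    if score == 0:
        return "neutral"
    return "negative"


normalised = normalise("Das isch nöd guet".split())
print(normalised)
print(classify(normalised))
```
          </preformat>
          <p>The end user only sees the final label; the normalised token sequence is an intermediate output, which is what distinguishes the upstream task from the end-user task in this view.</p>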
          <p>The parentheses have a different meaning in each of
these two tables. In Table 5, they are used to specify the
institution where the work on the given task is
performed. In Table 6, the parentheses indicate that it is
not clear at this point whether the tasks are attempted
specifically for Swiss German, since most of the
institutions develop their applications for multiple
languages. Although it is not surprising, it is interesting
to note that Table 5 lists mostly academic institutions,
while companies are more involved in the tasks listed in
Table 6.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Future Actions</title>
      <p>Most of the groups listed in our overview were
represented at the workshop, which was generally very
well attended. The live discussion that followed the
overview of the current state of the resources and tools
addressed the following points.</p>
      <sec id="sec-3-1">
        <title>Comments on the overview</title>
        <p>The comments from the audience on the presented
review showed that several participants were involved
in speech recognition projects with Recapp that were not
covered by the overview. They also pointed out the
resources developed by Slowsoft (transcribed Swiss
German sentences, pronounced by one speaker), which
were not listed in the overview due to a
miscommunication. Other potential sources of data were
mentioned, such as SRF subtitles, which are in standard
German but aligned with a Swiss German audio source.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Identifying common needs and priorities for further development</title>
        <p>The discussions on the common needs revolved
around the question of how to write Swiss German
texts. As there is no official standard, the text is likely
to be written either in a non-standard way (as in
user-generated content) or in standard German (as in
the case of SRF transcriptions, for instance). Mapping
speech to a standard writing emerged as a common
need. Defining and implementing a common
writing standard seems to be one of the potential topics
for collaboration. The discussion also showed that, to
define other common needs, we would need to
analyse real use scenarios and identify the tasks based on
them.</p>
        <table-wrap>
          <label>Table 3</label>
          <table>
            <thead>
              <tr><th>Project</th><th>Who</th></tr>
            </thead>
            <tbody>
              <tr><td>ArchiMob</td><td>UZH+Spitch</td></tr>
              <tr><td>BE-Novel</td><td>Honnet et al.</td></tr>
              <tr><td>NOAH</td><td>UZH</td></tr>
              <tr><td>Phonogram</td><td>UZH</td></tr>
              <tr><td>SB-CH (Sentiment)</td><td>ZHAW+SpinningBytes</td></tr>
              <tr><td>sms4science</td><td>UZH+Swisscom</td></tr>
              <tr><td>walliserdeutsch</td><td>Idiap</td></tr>
              <tr><td>Wil corpus (Syntax)</td><td>UniGe</td></tr>
              <tr><td>WUS (WhatsApp)</td><td>UZH</td></tr>
              <tr><td>din dialäkt</td><td>UZH</td></tr>
              <tr><td>In progress</td><td>UZH</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <table-wrap>
          <label>Table 4</label>
          <table>
            <thead>
              <tr><th>Project</th><th>Who</th></tr>
            </thead>
            <tbody>
              <tr><td>Swiss German Atlas (SDS)</td><td>Schweizerisches Idiotikon</td></tr>
              <tr><td>Swiss German Dictionary</td><td>Schweizerisches Idiotikon</td></tr>
              <tr><td>Dialäkt Äpp</td><td>Leemann et al.<sup>1</sup></td></tr>
              <tr><td>Voice Äpp</td><td>Leemann et al.<sup>2</sup></td></tr>
              <tr><td>Pronunciation</td><td>Spitch</td></tr>
              <tr><td>BE-Lexicon</td><td>Honnet et al.</td></tr>
              <tr><td>ZH-Lexicon</td><td>Honnet et al.</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <table-wrap>
          <table>
            <thead>
              <tr><th>Task</th></tr>
            </thead>
            <tbody>
              <tr><td>Active learning</td></tr>
              <tr><td>Dialect identification</td></tr>
              <tr><td>Normalisation</td></tr>
              <tr><td>Sentiment annotation</td></tr>
              <tr><td>Sentiment classification</td></tr>
              <tr><td>Speaker identification</td></tr>
              <tr><td>Speech Synthesis</td></tr>
              <tr><td>Transcription</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-3-3">
        <title>Modes of collaboration</title>
        <p>The discussion on this topic showed that there is
interest in collaboration, but that considerable
effort will be needed to make it work. One obstacle
is the fact that companies typically work with
sensitive data that cannot be shared. Another problem
is that sharing data requires additional work (such as
anonymisation and detailed documentation). With a clear
idea of the potential benefits, the groups working on
Swiss German might be ready to invest more effort in
order to facilitate collaboration.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Funding possibilities</title>
        <p>For the moment, the foundation InnoSuisse seems like
the best choice for submitting proposals. It enables
developing solutions for an identified need without a
concrete business plan. Other foundations targeting
the exchange between academia and companies can
be considered.</p>
      </sec>
      <sec id="sec-3-5">
        <title>Next steps</title>
        <p>Based on the previous discussions, we have defined
the following actions as the next steps towards
establishing a collaborative network for Swiss German:
• Start collaborating by proposing the first
InnoSuisse grant as soon as possible.
• Organise a follow-up workshop soon, where
the ideas for the InnoSuisse proposal will be
sketched (2/3 of the participants were interested in
attending such a workshop).
• Elaborate and share a detailed inventory of the
existing resources and tools, including
information on the conditions of use and data samples.
• Formulate processing tasks based on use cases.</p>
        <p>This plan can be considered the main outcome of
the workshop, together with the established contacts
and identified commitments to work together on
synchronising efforts invested in processing Swiss
German.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>