Future Actions for Swiss German - Workshop Results at SwissText 2018

Tanja Samardžić
Language and Space Lab, University of Zurich
tanja.samardzic@uzh.ch

Mark Cieliebak
School of Engineering, Zurich University of Applied Sciences
mark.cieliebak@zhaw.ch

Jan Milan Deriu
School of Engineering, Zurich University of Applied Sciences
jan.deriu@zhaw.ch

In: Mark Cieliebak, Don Tuggener and Fernando Benites (eds.): Proceedings of the 3rd Swiss Text Analytics Conference (SwissText 2018), Winterthur, Switzerland, June 2018

Abstract

The goal of this workshop was to initiate collaborations among companies and academic institutions for developing Swiss German resources and activities. The need for such an initiative arises from a growing interest in applying automatic text processing technologies to Swiss German, in a context where data sets are particularly scarce. We have considered potential modes of collaborative data development and management. The outcomes of the workshop are defined common interests, priorities, and the first steps in future synchronised efforts.

1 Introduction

Automatic processing of Swiss German has long been regarded as unnecessary, as standard German is regularly used in public communication in Switzerland. This view, however, has recently changed following the increased presence of local varieties in public communication (mostly on the Internet). This has led several companies and academic institutions to start working on automatic processing of Swiss German, which has resulted in the development of initial data sets that can be used for training models for automatic speech and text processing. These data sets are, however, scattered across the institutions that produced them and are not easily accessible to researchers outside the host institutions. At the same time, each data set individually is too small to allow good performance on tasks such as speech recognition, translation, or normalisation. All involved parties need more data and would greatly benefit from an exchange and future joint development. The goal of this workshop was to bring together researchers working on Swiss German in companies and in academic institutions in order to identify the common needs and modes of future collaboration.

2 Overview of the Current Activities

As part of the preparations for the workshop, we conducted an informal survey among the researchers and institutions we knew were interested in or already working on automatic processing of Swiss German. We asked the persons contacted to share with us an overview of their data sets, tools, and general activities related to Swiss German. We received ten responses, coming from the following organisations:

• Institute of Computational Linguistics, University of Zurich
• Swiss Re
• School of Business and Engineering Vaud
• Swisscom
• Institute of Applied Information Technology, Zurich University of Applied Sciences
• School of Applied Linguistics, Zurich University of Applied Sciences
• Spitch AG
• University of Helsinki
• Slowsoft

Based on these responses and our own insight, we have composed an initial overview of the existing data, tools, and addressed processing tasks. This information is presented in the remainder of this section. While we did our best to collect as much information as possible at present, this review is not to be regarded as an exhaustive inventory, but rather as a first step towards a complete inventory that will be developed through collaborative work.

2.1 The List of Institutions

In the research on automatic processing of Swiss German, the practical and scholarly sides are equally pronounced. In order to get standard natural language processing to work for Swiss German, we need to understand and address the details of its particular and complex usage practices. There is thus a considerable overlap between the work on developing end-user applications, primarily done in companies, the design of algorithmic solutions, typically done at applied universities, and the data-driven study of linguistic variation in Swiss German, primarily performed in academic institutions.

We have identified several groups inside academic institutions where some work relevant to the automatic processing of Swiss German is taking place. They are listed in Table 1. For Swiss institutions, we give in parentheses the corresponding abbreviations, which are easily recognisable to the Swiss audience.

Institution                                          Group/Individual
School of Business and Engineering Vaud (HEIG-VD)    Andrei Popescu-Belis
The Idiap Research Institute (Idiap)                 Walliserdeutsch
Schweizerisches Idiotikon
University of Geneva (UniGe)                         Manuela Schönenberger, Eric Haeberli
University of Helsinki                               Yves Scherrer
University of Zurich (UZH)                           Language and Space Lab, German Department, Romance Department, Institute of Computational Linguistics
Zurich University of Applied Sciences (ZHAW)         Text Analytics and Dialogue Systems Group

Table 1: Research groups in academic institutions in alphabetic order

An additional institution where some work on Swiss German is likely to take place is ETH Zurich, but we have not established a relevant contact up to this point.

Considerable efforts have already been invested in processing Swiss German in the companies listed in Table 2. Additionally, one company, Telepathy Labs, is associated with a published piece of work on Swiss German, but their engagement is yet to be confirmed.

Company
Recapp
Spitch
SpinningBytes
Slowsoft
Swisscom

Table 2: Companies working on/with Swiss German

2.2 Data Sets

Here we list the Swiss German data sets produced and made available by the groups listed above and by some other groups who worked on Swiss German in the past. This list relies on three publications that appeared in the proceedings of the Language Resources and Evaluation Conference (LREC 2018), all providing informative overviews of the state of the art:

• SB-CH: A Swiss German Corpus with Sentiment Annotations
  R. Grubenmann, D. Tuggener, P. von Däniken, J. Deriu, M. Cieliebak
• Machine Translation of Low-Resource Spoken Dialects: Strategies for Normalizing Swiss German
  P.-E. Honnet, A. Popescu-Belis, C. Musat, M. Baeriswyl
• Strategies and Challenges for Crowdsourcing Regional Dialect Perception Data for Swiss German and Swiss French
  J.-P. Goldman, S. Clematide, M. Avanzi, R. Tandler

We divide the data sets into two major types: text corpora (Table 3) and lexica (Table 4). For each listed item, we specify the institution or the group that developed it and the most important characteristics of the data set. The column “Text” specifies whether the resource contains text, and “Sound” whether it contains recorded speech (both are present when sound recordings are transcribed). “Norm/Trans” specifies whether there is word-level normalisation of writing or full translation to standard German. We join these two features together because normalisation typically involves standard German writing applied to Swiss German. The difference between the two is that normalised text is not necessarily proper standard German in the sense of orthography, grammar and style. The last column, “PoS”, specifies whether the text is annotated with part-of-speech tags. We list the institution name where it is clear which institution is responsible for the resource; otherwise, we provide information about its authors.

There are two specific remarks regarding Table 3. First, note that two corpora contain additional annotation: sentiment in the SB-CH corpus and syntax in the UniGe corpus. Second, the data used in the project “din dialäkt” are often taken from already existing resources, which means that there are considerable overlaps between this set and other known sources.

Regarding Table 4, we use parentheses “( )” to signal two remarks. First, the mapping to standard German referred to in the column “Norm/Trans” exists in the resources built by the Idiotikon team, but it is not encoded in the same way as in the corpora. This applies to the part-of-speech information too: while there is information on the word types in the dictionary, these codes do not follow the usual German tag sets. Second, the data used for the projects by Leemann et al. are drawn from other sources (SDS), resulting in a considerable overlap. The mapping to standard German and the part-of-speech information are also likely inherited from SDS.

Overall, Tables 3 and 4 show that most available data sets come from academic institutions. While companies can be expected to have developed their own resources too, descriptions of these resources are yet to be shared.

Project               Who                   Text, Sound, Norm/Trans, PoS (marks in source order; column alignment lost)
ArchiMob              UZH+Spitch            X X X X
BE-Novel              Honnet et al.         X X
NOAH                  UZH                   X X
Phonogram             UZH                   X X X
SB-CH (Sentiment)     ZHAW+SpinningBytes    X
sms4science           UZH+Swisscom          X X X
walliserdeutsch       Idiap                 X X X
Wil corpus (Syntax)   UniGe                 X X X
WUS (WhatsApp)        UZH                   X X X
din dialäkt           UZH                   (X) (X) (X) (X)
In progress           UZH                   X X

Table 3: An initial inventory of Swiss German text corpora available for training processing tools.

Project                     Who                          Text, Sound, Norm/Trans, PoS (marks in source order; column alignment lost)
Swiss German Atlas (SDS)    Schweizerisches Idiotikon    X (X) (X)
Swiss German Dictionary     Schweizerisches Idiotikon    X (X) (X)
Dialäkt Äpp                 Leemann et al.¹              (X) (X) (X)
Voice Äpp                   Leemann et al.²              (X) (X) (X) (X)
Pronunciation               Spitch                       X X
BE-Lexicon                  Honnet et al.                X X
ZH-Lexicon                  Honnet et al.                X X

Table 4: An initial inventory of Swiss German lexical resources (potentially) relevant to automatic processing.

2.3 Processing Tasks and Tools

Natural language processing can potentially involve many different tasks for which specific tools are developed. We list here those tasks that have been addressed for Swiss German or that are mentioned as current activities in the contacted institutions. Taking the view of natural language processing as a pipeline, or a stream, we divide the tasks into two groups: upstream tasks (Table 5) and end-user tasks (including annotation tools, Table 6). The output of the first group of tasks is not necessarily visible to the end user, but is rather used as input to the end-user tasks.

The parentheses in these two tables have different meanings. In Table 5, they specify the institution where the work on the given task is performed. In Table 6, they indicate that it is not clear at this point whether the tasks are attempted specifically for Swiss German, since most of the institutions develop their applications for multiple languages. Although it is not surprising, it is interesting to note that Table 5 lists mostly academic institutions, while companies are more involved in the tasks listed in Table 6.

Task                         Who
Speech recognition           Spitch, UZH
Normalisation                ArchiMob, SMS, WhatsApp (UZH)
Anonymisation                Swisscom, UZH
Morphology (finite-state)    Scherrer PhD thesis (UniGe), Baumgartner MA thesis (UZH)
PoS                          ArchiMob, NOAH, SMS, WhatsApp (UZH)
Syntax                       Forst MA thesis (UniL), Aepli MA thesis (UZH)

Table 5: Upstream tasks that have been attempted for Swiss German

Task                        Who
Active learning             Swisscom, UZH
Dialect identification      Swisscom, UZH, ZHAW
Normalisation               UZH
Sentiment annotation        Swisscom, SpinningBytes
Sentiment classification    (Swisscom), SpinningBytes, ZHAW, (Spitch)
Speaker identification      (Swisscom), Spitch
Speech Synthesis            (Slowsoft), (Swisscom), (Spitch)
Transcription               Swisscom-AILA

Table 6: End-user applications and annotation tools that are being developed for Swiss German

3 Future Actions

Most of the groups listed in our overview were represented at the workshop, which was generally very well attended. The live discussion that followed the overview of the current state of the resources and tools addressed the following points.

Comments on the overview

The comments from the audience on the presented review showed that several participants were involved in projects on speech recognition with Recapp that were not covered by the overview. They also pointed out the resources developed by Slowsoft (transcribed Swiss German sentences, pronounced by one speaker), which were not listed in the overview due to a miscommunication. Other potential sources of data were mentioned, such as SRF subtitles, which are in standard German but aligned with the Swiss German audio.

Identifying common needs and priorities for further development

The discussions on the common needs revolved around the question of writing for Swiss German texts. As there is no official standard, the text is likely to be written either in a non-standard way (as in user-generated content) or in standard German (as in the case of SRF transcriptions, for instance). Mapping speech to a standard writing emerged as a common need. Defining and implementing a common writing standard seems to be one of the potential topics for collaboration. The discussion also showed that, to define other common needs, we would need to analyse real use scenarios and identify the tasks based on them.

Modes of collaboration

The discussion on this topic showed that there is an interest in collaboration, but that considerable effort will be needed to make it work. One obstacle is the fact that companies typically work with sensitive data that cannot be shared. Another problem is that sharing data requires additional work (such as anonymisation and detailed documentation). With a clear idea of the potential benefits, the groups working on Swiss German might be ready to invest more effort in order to facilitate collaboration.

Funding possibilities

For the moment, the foundation InnoSuisse seems like the best choice for submitting proposals. It enables developing solutions for an identified need without a concrete business plan. Other foundations targeting the exchange between academia and companies can be considered.

Next steps

Based on the previous discussions, we have defined the following actions as the next steps towards establishing a collaborative network for Swiss German:

• Start collaborating by proposing the first InnoSuisse grant as soon as possible.
• Organise a follow-up workshop soon, where the ideas for the InnoSuisse proposal will be sketched (2/3 of the participants were interested in attending such a workshop).
• Elaborate and share a detailed inventory of the existing resources and tools, including information on the conditions of use and data samples.
• Formulate processing tasks based on use cases.

This plan can be considered the main outcome of the workshop, together with the established contacts and identified commitments to work together on synchronising efforts invested in processing Swiss German.