5th Swiss Text Analytics Conference (SwissText) & 16th Conference on Natural Language Processing (KONVENS): Abstracts of the Applied Track

Contents

1 FLIE: Form Labelling for Information Extraction
2 Ranking of Social Reading Reviews Based on Richness in Narrative Absorption
3 The “Multilingual Anonymisation Toolkit for Public Administrations” (MAPA) Project
4 Showcase: Language analytics and semantic search for unknown document varieties
5 Towards a regionally representative and socio-demographically diverse resource of Swiss German
6 Deep learning and visual tools for analyzing and monitoring integrity risks
7 Exploring German BERT model pre-training from scratch
8 Speech-to-Text Insights
9 Enabling conversational-based leadership training through advanced natural language understanding
10 Interactive Poem Generation: when Language Models support Human Creativity
11 A conversational recommender system based on neural NLP models
12 Swiss German Speech-to-Text with Kaldi
13 Biomedical relation extraction with state-of-the-art neural models
14 MedMon: multilingual social media mining for disease monitoring
15 A Methodology for Creating Question Answering Corpora Using Inverse Data Annotation
16 Following, understanding, and supporting service-oriented person-to-person communications
17 Text Mining Technologies for Animal Health Surveillance
18 Assigning Grant Applications to Reviewers via Text Analysis
19 Named entity recognition for job description mining
20 Company Name Disambiguation

1 FLIE: Form Labelling for Information Extraction

Ela Pustulka-Hunt, Thomas Hanne, Phillip Gachnang and Pasquale Biafora

Information extraction from forms is a challenging topic with high practical relevance, in particular for the insurance industry in Switzerland. We have gathered over 20’000 anonymized insurance policies and related documents in German, French, English and Italian and have prototyped an automated method for information extraction. We tested this method with three policy types in German. Given a user schema, expressed as a list of attributes to be found in an insurance policy, we extract the relevant information and map it to the attributes. To do that, we first extract the text from the PDF and generate the bounding boxes as a CSV file. We then reconstruct a page, group the text boxes into horizontal groups and into columns within groups, and annotate the geometry. 24 policies from various insurers, representing three policy types, were annotated manually by the user with the desired attribute names. Machine learning was used to propagate this annotation in two steps: first, text was tagged as being metadata or data, and in the second step, attribute names were mapped to the extracted text. The accuracy of the first step is now at 88%, and in the second step we can map the attributes which appear more than 8 times in the documents with similar accuracy, while other attributes are often singletons and cannot be mapped yet. Data extraction uses these annotations to produce the required output for the user. With more annotated data, we will be able to reach the required accuracy of over 90%.
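The first pipeline step, extracting text with bounding boxes from a PDF into a CSV file and grouping the boxes into horizontal lines, could be sketched as follows. This is an illustrative reconstruction, not the FLIE code; the pdfplumber library, the CSV columns and the grouping tolerance are assumptions of this sketch.

# Illustrative sketch (not the FLIE implementation): extract words with their
# bounding boxes from a PDF, write them to a CSV file, and group the word
# boxes into horizontal lines by their vertical position.
import csv
import pdfplumber  # assumed library choice

def pdf_to_bbox_csv(pdf_path, csv_path):
    rows = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_no, page in enumerate(pdf.pages, start=1):
            for word in page.extract_words():
                rows.append({
                    "page": page_no,
                    "text": word["text"],
                    "x0": word["x0"],
                    "x1": word["x1"],
                    "top": word["top"],
                    "bottom": word["bottom"],
                })
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["page", "text", "x0", "x1", "top", "bottom"])
        writer.writeheader()
        writer.writerows(rows)
    return rows

def group_into_lines(words, y_tolerance=3.0):
    """Group word boxes into horizontal lines: words on the same page whose
    'top' coordinates differ by less than y_tolerance join the same line."""
    lines = []
    for word in sorted(words, key=lambda w: (w["page"], w["top"], w["x0"])):
        if lines and word["page"] == lines[-1][0]["page"] \
                and abs(word["top"] - lines[-1][0]["top"]) < y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return lines

Grouping into columns within a line would proceed analogously on the horizontal gaps between x1 of one word and x0 of the next.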
2 Ranking of Social Reading Reviews Based on Richness in Narrative Absorption

Piroska Lendvai, Uwe Reichel, Simone Rebora and Moniek Kuijpers

Book reviews on social platforms are generated in large quantities by non-specialist avid readers, and contain subjective evaluations pertaining to one’s own reading experience. Social reading reviews often feature an under-researched phenomenon: Narrative Absorption, i.e. the extent to which immersion into the book’s narrative took place during reading. Absorption can be reflected by statements such as ’I was completely hooked’ and pertains to several dimensions such as attention, emotional engagement, mental imagery, and transportation. Based on a set of user-generated reviews that we manually annotated (cf. Rebora et al. 2020), the detection of reading absorption with NLP approaches has been investigated in e.g. Lendvai, Rebora and Kuijpers (2019) and Lendvai et al. (2020).

We work on a pipeline to retrieve and rank absorption-rich user reviews from a large, unlabeled document dump (6+ million reviews in English), in order to allow for the preselection of subsets of the dump that undergo manual annotation. We fine-tuned BERT (Devlin et al., 2018) for a supervised absorption detection task on 16k review sentences absorption-annotated by us (Absorption vs. Non-absorption), and evaluated it on a held-out dataset of 149 reviews, achieving a mean macro F1 of .75 (support: 1,011 vs. 3,510 sentences).

Our current focus has been to create a model that aggregates sentence-level prediction scores at the document level. To this end, BERT’s sentence-level absorption probabilities were averaged per review and used to train a linear regression model on the full corpus to predict Absorption Richness, defined as the proportion of sentences annotated as expressing absorption in a review. Review-level Absorption Richness regression lowers classification error relative to the baseline, defined as the review-level proportion of absorption classifications obtained by taking the argmax of BERT’s logits (mean average errors of .08 vs. .11 and Spearman correlations of .73 vs. .65, respectively). The increase in Spearman’s rank correlation coefficient directly expresses that a review ranking by linear regression predictions corresponds more closely to the ground truth ranking than a ranking based solely on BERT. We utilize the regression model for Absorption-Richness-based document filtering, to facilitate the benchmarking and analysis of social reading reviews in our large document dump.

References:

J. Devlin, M.W. Chang, K. Lee, K. Toutanova (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

P. Lendvai, S. Rebora, M. Kuijpers (2019). Identification of Reading Absorption in User-Generated Book Reviews. In: Proc. of the 15th Conference on Natural Language Processing (KONVENS 2019): Kaleidoscope Abstracts. Erlangen, Germany: German Society for Computational Linguistics and Language Technology, pp. 271-272.

P. Lendvai, S. Daranyi, C. Geng, M. Kuijpers, O. Lopez de Lacalle, J.C. Mensonides, S. Rebora, U. Reichel (2020). Detection of Reading Absorption in User-Generated Book Reviews: Resources Creation and Evaluation. In: Proc. of the 12th Language Resources and Evaluation Conference (LREC 2020), pp. 4835-4841.

S. Rebora, P. Lendvai, M. Kuijpers (2020). Annotating Reader Absorption. In: Proc. of the Digital Humanities Conference (DH2020).
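A minimal sketch of the review-level aggregation described in Section 2: average per-sentence absorption probabilities (hypothetical numbers standing in for the fine-tuned BERT classifier’s outputs) and fit a linear regression against the annotated proportion of absorption sentences. scikit-learn’s LinearRegression and the toy values are assumptions of this illustration, not the authors’ exact setup.

# Illustrative sketch: aggregate sentence-level absorption probabilities into a
# review-level "Absorption Richness" score with a linear regression model.
# The probability values below are invented; in the described pipeline they
# would come from a fine-tuned BERT sentence classifier.
import numpy as np
from sklearn.linear_model import LinearRegression

# One entry per review: per-sentence P(absorption) scores (hypothetical).
reviews = [
    [0.91, 0.12, 0.85, 0.40],
    [0.05, 0.10, 0.08],
    [0.70, 0.66, 0.20, 0.90, 0.81],
]
# Ground truth: proportion of sentences annotated as expressing absorption.
richness = np.array([0.50, 0.00, 0.60])

# Feature: mean sentence-level probability per review.
X = np.array([[np.mean(probs)] for probs in reviews])

model = LinearRegression().fit(X, richness)
predicted = model.predict(X)

# Reviews can then be ranked by predicted Absorption Richness for preselection.
ranking = np.argsort(-predicted)
print(ranking, predicted)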
3 The “Multilingual Anonymisation Toolkit for Public Administrations” (MAPA) Project

Paula Reichenberg, Artūrs Vasiļevskis and Manuel Herranz

The European Union’s new ‘Open Data Directive’ aims to stimulate the publishing and sharing of dynamic data by public administrations, thus furthering the development of language technologies, NLP research and translation. However, such data sharing is only possible subject to compliance with the General Data Protection Regulation (GDPR). For this reason, the European Commission has commissioned the development of a multilingual anonymisation toolkit for public administrations.

Pangeanic and Tilde, together with CNRS (www.cnrs.fr), ELDA (www.elra.info/en), the University of Malta (www.um.edu.mt), Vicomtech (www.vicomtech.org) and SEAD (the Spanish Agency for Digital Advancement), have been awarded EU funds to develop such an open-source toolkit for all EU languages, able to detect and de-identify personal data (names, addresses, emails, credit card and bank account numbers, etc.). The anonymisation toolkit is based on Named-Entity Recognition (NER) techniques using neural network approaches. Pre-trained models such as BERT (Devlin et al., 2018) and preprocessing of the text using regular expressions are included. The toolkit will support EU public administrations in complying with GDPR requirements, in particular in the health and legal fields.

In this short presentation, Manuel Herranz, CEO of Pangeanic, and Artūrs Vasiļevskis, Head of Machine Translation Solutions at Tilde, will discuss the challenges of the MAPA project, their strategy, the results reached so far and the perspectives it opens for public administrations and the industry.

4 Showcase: Language analytics and semantic search for unknown document varieties

Holger Keibel, Elisabeth Maier and Tobias Christen

HIBU is a proprietary solution platform on which Karakun (Basel) builds customer solutions around Enterprise Search and Text Analytics. In this talk, we present a solution by DSwiss (Zürich): high-security digital safes which allow users to store, exchange, and also search any type of documents and other security-relevant data. The focus will be on the text analytics aspects of the solution developed with HIBU.

Since the uploaded data can contain any sort of content, the solution supports users in organizing their data in two ways: by a hierarchical folder structure and by means of facets (search filters). Some of the default facets are derived from structured metadata such as file format or date, while others are populated dynamically by semantic taggers and classifiers, e.g. semantic document type, or persons and locations mentioned in the document. Especially these filters have proven very useful in supporting document and data retrieval.

We touch on the challenges of analyzing and indexing documents in a highly secure, multiply-encrypted environment and will then discuss joint ongoing work to support the individual needs of users even better: (1) use state-of-the-art neural network architectures to classify and extract more types of information from documents to provide a broader range of filters; (2) personalize the trained models that create the search filters; and (3) add a workflow engine with text-based triggers (e.g. proposing a specific folder when uploading a document).
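Both the MAPA toolkit (Section 3) and the HIBU facets (Section 4) rely on named-entity recognition over arbitrary documents. The sketch below illustrates the general pattern with spaCy’s small German model plus a regular expression; the library, the model, the entity labels used and the placeholder scheme are assumptions of this illustration, not components of either project.

# Illustrative sketch: populate person/location facets (Section 4) or flag and
# replace personal data (Section 3) with a pre-trained NER model plus a regex.
# spaCy and "de_core_news_sm" are assumed stand-ins, not the projects' actual models.
import re
import spacy

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

nlp = spacy.load("de_core_news_sm")  # install via: python -m spacy download de_core_news_sm

def extract_facets(text):
    doc = nlp(text)
    return {
        "persons": sorted({ent.text for ent in doc.ents if ent.label_ == "PER"}),
        "locations": sorted({ent.text for ent in doc.ents if ent.label_ == "LOC"}),
        "emails": EMAIL_RE.findall(text),
    }

def pseudonymise(text):
    """Replace detected personal data with placeholders (toy de-identification)."""
    doc = nlp(text)
    # Replace from the end of the string backwards so earlier offsets stay valid.
    for ent in sorted(doc.ents, key=lambda e: e.start_char, reverse=True):
        if ent.label_ in {"PER", "LOC"}:
            text = text[:ent.start_char] + f"[{ent.label_}]" + text[ent.end_char:]
    return EMAIL_RE.sub("[EMAIL]", text)

print(extract_facets("Anna Müller wohnt in Basel und schreibt an anna@example.ch."))
print(pseudonymise("Anna Müller wohnt in Basel und schreibt an anna@example.ch."))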
5 Towards a regionally representative and socio-demographically diverse resource of Swiss German

Péter Jeszenszky, Burcu Demiray, Carina Steiner and Adrian Leemann

When it comes to representing its vast regional diversity, Swiss German is under-resourced for text-to-speech and speech-to-text tasks. The database we aim to build enriches existing resources by representing low-resource regional varieties and by matching dialect variation to diverse socio-demographic backgrounds. We plan to compile a database based on two projects.

The SDATS [1] (Swiss German Dialects Across Time and Space) project, focusing on language variation and change, collects about 2000 hours of recordings from 125 survey sites (8 speakers per locality). Local dialects of the respondents, women and men from two age groups with different professional backgrounds, are recorded. The ongoing structured interviews (to be finished by summer 2021) involve prompting certain words and phrases, reading a text previously translated from Standard German into the local dialect by the speaker, semi-structured speech, and spontaneous general interaction with the interviewer. The audio recordings come with rich background information (mobility, social networks, personality, attitude, etc.), which enables the characterisation of sociolinguistic variation alongside regional variation.

The EAR [2] data contains non-intrusive recordings of spontaneous speech from healthy older individuals, mainly covering everyday interactions in Swiss German. We invite EAR participants to SDATS interviews, making it possible to match linguistic variables across the spontaneous EAR recordings and the structured and spontaneous parts of the SDATS interview with the same person.

We plan the automated phonetic transcription of the data and the alignment of the results to Standard German. With the combination of the two sources, a more realistic picture of spontaneous language use will become available which, especially when annotated with the rich metadata, can become a useful resource for Swiss German text-to-speech and speech-to-text tasks. At the conference, we plan to present the roadmap of data collection, cleaning, matching and analysis. Besides, we plan to show some sound samples along with potential future uses of the database.

Péter Jeszenszky Resume: Péter is a geographer and data scientist interested in linguistic variation and its geographic and socio-demographic causes. He finished his PhD in Geographic Information Science at the University of Zurich in 2018, mainly working with Swiss German morphosyntactic data. He was a postdoctoral researcher at Ritsumeikan University in Kyoto, Japan, on an SNF Early PostDoc.Mobility grant, where he studied spatial and historical variation of Japanese dialects. He is now at the University of Bern, working in the SDATS project as a postdoc.

Intended Audience: Stakeholders interested in the following topics: enriching their Swiss German databases with spatially and socio-demographically diverse data; machine translation and transcription of Swiss German; validating existing Swiss German databases using matched spontaneous and clearly uttered speech; generating Swiss German speech or text.

6 Deep learning and visual tools for analyzing and monitoring integrity risks

Albert Weichselbraun, Christian Hauser, Sandro Hörler and Anina Havelka

Risks jeopardizing the integrity of an organization are widespread. According to a 2018 study by PricewaterhouseCoopers, almost 40% of Swiss companies have been affected by illegal and unethical behavior, such as embezzlement, cybercrime, corruption, fraud, money laundering and anti-competitive agreements. Although the number of cases within Switzerland is relatively low, the financial impact of these incidents is still above the global average. The University of Applied Sciences of the Grisons conducts research that applies web intelligence and deep learning to the task of supporting Swiss companies in identifying and mitigating integrity risks.
Historical data is used to train an LSTM classifier that recognizes national and international media coverage of corruption. Afterwards, we apply transfer learning techniques to adapt the classifier to a wide range of integrity topics such as human rights, labor conditions and sustainability. The adapted classifier assigns scores to news articles that indicate their relevance to the topic of integrity. Sophisticated visual tools use the annotated documents for (i) tracking and visualizing past integrity management gaps and their respective impacts, (ii) identifying whether organizations have been mentioned positively or negatively in these events, and (iii) leveraging media coverage of upcoming integrity stories to predict and discover existing blind spots within a company’s governance.

7 Exploring German BERT model pre-training from scratch

Branden Chan, Stefan Schweter and Timo Möller

In this work we provide interesting insights into BERT model pre-training from scratch for German. We experiment with different corpora and subword masking techniques.

The two currently available BERT models for German (from Deepset and DBMDZ) were trained on similar amounts of data (16GB). With the availability of larger corpora, such as the OSCAR corpus with an uncompressed size of 145GB for German, and with the recently introduced whole word masking technique applied in the preprocessing step, we train BERT base and large models with different subword masking techniques and training data sizes, ranging from 16GB up to 160GB of text.

In order to show which subword masking techniques improve or harm performance, and whether larger training corpora really improve performance significantly, we perform an extensive evaluation of our models over the course of pre-training on various German downstream tasks. Our BERT large model achieves new state-of-the-art results on GermEval 2018. All trained models will be made publicly available to the research community.

Branden Chan is a Stanford graduate in computational linguistics. He now works for deepset.ai as a machine learning engineer, bringing the latest NLP techniques to industry. He is part of the team that open-sourced German BERT and a regular contributor to the transfer learning framework FARM. Currently he is experimenting with German language model pre-training with a range of different architectures.

The intended audience ranges from researchers to developers. Researchers might be interested in our detailed evaluation. Developers might be interested in the integration of our models into the Hugging Face Transformers library and Deepset’s FARM.
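To give a concrete impression of the announced Hugging Face Transformers integration, the sketch below queries a German BERT model with the fill-mask pipeline. The checkpoint name refers to the existing Deepset German BERT and merely stands in for the models described above; it is an assumption of this illustration.

# Illustrative sketch: query a German BERT model through the Hugging Face
# Transformers library. "bert-base-german-cased" is the existing German BERT;
# the models described above would be loaded the same way once released.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-german-cased")

# Print the top predictions for the masked token with their scores.
for prediction in fill_mask("Die Hauptstadt der Schweiz ist [MASK]."):
    print(f'{prediction["token_str"]:>12}  {prediction["score"]:.3f}')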
8 Speech-to-Text Insights

Manuela Hürlimann, Malgorzata Anna Ulasik, Philippe Schläpfer, Fernando Benites de Azevedo E Souza, Katsiaryna Mlynchyk, Pius von Däniken, Flurin Gishamer, Lina Scarborough, Olesya Ogorodnikova, Tracey Etheridge, Nitin Kumar, Badrudin Stanicki and Mark Cieliebak

Generating high-quality transcripts from spoken dialogues (e.g. meetings or interviews) is not a trivial task. Many different Automatic Speech Recognition (ASR) engines exist, both commercial and open source. Two key tasks need to be solved: partitioning the speech according to the different speakers (diarization), and recognizing what is being said (speech-to-text). The quality of the resulting transcript and its usability are influenced by many different factors. In this talk we present multiple insights and techniques which can improve the output quality of ASR. We will address topics such as:

• the recording setting, e.g. which microphone setup is going to give the best results?
• error analysis, e.g. what are typical errors? How can we measure only semantically meaningful errors?
• confidence scoring, e.g. how can we create more reliable confidence scores for the STT and diarization output?

The main goal of our contribution is to present best-practice approaches which can improve both the diarization and the transcription quality. Our insights are based on extensive research and experiments, including an evaluation of 10 STT engines and error analysis of more than 70 hours of transcribed speech in German and English.

9 Enabling conversational-based leadership training through advanced natural language understanding

Daniele Puccinelli, Sandra Mitrovic, Denis Broggini, Giancarlo Corti, Luca Chiarabini, Riccardo Mazza, Fabio Rinaldi and Andrea Laus

SkillGym (www.skillgym.com) is a computer-based training system that enables in-role and prospective leaders to develop their communication skills by presenting them with realistic simulations of workplace situations. SkillGym walks the end user through a sequence of videos related to a specific management situation, showing a rich set of alternatives as text boxes. SkillGym also provides extensive feedback, which enables users to review a conversation step by step and learn the implications of their behavior at each step. Feedback from SkillGym users praises its engaging training environment.

To make the simulations even more realistic, our goal is to move from the existing point-and-click interface to a voice-based interface. Achieving this goal requires cutting-edge natural language understanding to interpret the user input in the context of the ongoing flow of the simulated interaction. Our proposed solution is to carry out feature extraction based on the output of a commodity speech-to-text engine, so that a dialog state tracker can select the next video based on the user input. Notably, the user must be guided through textual hints to ensure that she provides input that is coherent with the training goals of SkillGym. Moreover, the dialog state tracker must handle all situations where the user input is not aligned with the training goals (e.g. off-topic comments, disambiguation).

Short CV: Daniele Puccinelli is a senior lecturer and researcher at SUPSI and holds a Ph.D. from the University of Notre Dame (USA). His current research interests lie in human-computer interaction.

Intended Audience: practitioners
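As a toy illustration of the dialog-state-tracking step just described (selecting the next video from the transcribed user input), the following sketch matches an utterance against scripted alternatives with TF-IDF cosine similarity. The example alternatives, the threshold and the similarity measure are assumptions of this sketch, not SkillGym’s actual feature extraction.

# Toy sketch (not the SkillGym implementation): map a transcribed user utterance
# to the closest scripted alternative so a dialog state tracker could pick the
# next video. Utterances below the similarity threshold are treated as off-topic.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

alternatives = [
    "I understand your concern and would like to hear more about it.",
    "Let's postpone this discussion until the next team meeting.",
    "I think your performance has not met the agreed objectives.",
]

def select_next_step(utterance, threshold=0.2):
    vectorizer = TfidfVectorizer().fit(alternatives + [utterance])
    vectors = vectorizer.transform(alternatives + [utterance])
    scores = cosine_similarity(vectors[-1], vectors[:-1])[0]
    best = scores.argmax()
    # Below the threshold the input is not aligned with the training goals and
    # the system would fall back to a textual hint, as described above.
    return int(best) if scores[best] >= threshold else None

print(select_next_step("tell me more, I want to understand your concern"))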
10 Interactive Poem Generation: when Language Models support Human Creativity

Andrei Popescu-Belis, Aris Xanthos, Valentin Minder, Àlex R. Atrio, Gabriel Luthier and Antonio Rodriguez

Neural language models, which are probability distributions over sequences of words or characters, have recently enabled the generation of fluent sentences and even short texts. However, controlling such models in order to convey specific meanings remains difficult. To study how language modeling can be constrained with text-level features, we have designed a system for interactive poem generation, which enables the joint writing of a poem by a human and a machine. The human first selects the intended form of the poem, e.g. a sonnet or a haiku, although internally any number of stanzas and lines is allowed. Using a general-domain character-level neural language model trained on French poems, the system generates a first draft respecting the form. The draft can be modulated according to a desired combination of specific topics (e.g. art, love, or nature) by modifying a number of words using topic-specific language models. Similarly, the draft can be modulated in terms of emotions (happiness, sadness, or aversion). To express their creativity and improve the readability of the poem, humans are allowed to edit it at any stage of the creative process. A strategy to improve rhyming patterns is currently being explored.

The system has been active since mid-February in the Digital Lyric exhibition. All poems are logged in a database, from which descriptive statistics can be extracted. The system can be demonstrated live at the conference using a large touchscreen.

Bio of the presenter: Andrei Popescu-Belis is professor of computer science at HEIG-VD / HES-SO and a lecturer at EPFL. He is a graduate of the École Polytechnique, with a PhD from the University of Paris-Sud. He has been a researcher in human language technology at the University of Geneva and at the Idiap Research Institute. His interests are in machine translation, information retrieval and human-computer interaction. He has published over 150 refereed papers and edited 12 books/proceedings.

Intended audience: This talk will be of interest to researchers and developers of language technologies, especially those using deep neural language models to generate texts. The talk will also be relevant to those interested in digital humanities and creativity support tools.

11 A conversational recommender system based on neural NLP models

Sandra Mitrović, Vani Kanjirangat, Denis Broggini, Lorenzo Cimasoni, Marco Alberti, Alessandro Antonucci and Fabio Rinaldi

Abstract: In this project, we focus on conversational recommender systems that, in contrast to traditional ones, allow users to specify their preferences through a sequence of dynamically customized interactions. In particular, we seek to improve an online recommendation platform of Stagend (stagend.com) that aims at finding the most suitable performer (”an item”) for a particular event specified by an event organizer (”a user”). In a first phase, an adaptive approach based on Bayesian methods was used to sequentially update the model given a new piece of information, e.g. a performer’s answer to an organizer’s question. However, in a real-time setting, delayed or incomplete interactions (e.g. a missing reply) can hamper the system’s efficiency.

To overcome this issue, and also to avoid unnecessary burden on the performer (in cases when the answer is already available in the performer’s biography or in previous events’ conversations), we investigate ways of enhancing the Bayesian approach with NLP methods. Specifically, we adopt a question-answering BERT-based approach to either provide a confident automated answer based on the existing information, or to indicate uncertainty and thus the necessity of contacting the performer. Additionally, given that Stagend operates in multilingual markets, we benchmark different multilingual models such as multilingual BERT and XLM-RoBERTa, and compare these with separate language models for each of the target languages (DE plus the Swiss German challenge, FR, IT, EN).
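A minimal sketch of the question-answering step just described: an extractive QA model answers the organizer’s question from the performer’s biography and falls back to contacting the performer when its confidence is low. The checkpoint, the threshold and the example texts are illustrative assumptions, not the project’s actual configuration.

# Illustrative sketch: answer an organizer's question from a performer's
# biography with an extractive QA model, and escalate to the performer when
# the model is not confident. Checkpoint and threshold are assumptions.
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

biography = (
    "We are a four-piece jazz band from Lugano. We perform in Italian and "
    "English and have played at more than 200 weddings and corporate events."
)

def answer_or_escalate(question, context, threshold=0.5):
    result = qa(question=question, context=context)
    if result["score"] >= threshold:
        return {"answer": result["answer"], "confidence": result["score"]}
    # Low confidence: the answer is not in the existing information,
    # so the performer has to be contacted directly.
    return {"answer": None, "action": "ask the performer directly"}

print(answer_or_escalate("Which languages does the band perform in?", biography))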
Short CV: Sandra Mitrović has been a postdoctoral researcher at IDSIA (Dalle Molle Institute for Artificial Intelligence) since November 2019. She has a background in Applied Mathematics and Computer Science (University of Montenegro). She did her Master’s in Data Mining and Knowledge Management at Université Pierre et Marie Curie, Paris 6, and her PhD at KU Leuven. Her research interests encompass natural language processing, representation learning, (social) network analysis and machine learning in general.

Intended Audience: project managers, developers

12 Swiss German Speech-to-Text with Kaldi

Iuliia Nigmatulina, Tannon Kew and Tanja Samardžić

Recent improvements in speech technology enable its increasing use in a range of applications, including chatbots, online speech translation and smart home devices, among others. While speech technology already achieves strong results for standardised languages, for languages without an orthography, with high regional variation and limited training resources, such as Swiss German, it remains a considerable challenge. A high degree of dialectal variability combined with a lack of standardisation leads to extremely sparse data, which decreases the quality of the alignments between the acoustic signal and its labels and, therefore, the final accuracy.

To tackle the challenge of speech-to-text for Swiss German, we built a speech recognition system using an adapted Kaldi toolkit recipe on multi-dialectal speech data from the ArchiMob corpus. The system was trained separately on two types of writing in the target texts: (a) an approximate acoustic transcription that provides a close correspondence between labels and the acoustic signal, and (b) a normalised writing that potentially reduces the lexical variability. We find that the system trained on the normalised transcriptions currently achieves better results in word error rate (40.81% vs. 54.39%) but underperforms the system trained on the acoustic transcriptions at the character level (character error rate of 23.19% vs. 22.19%). We investigate possible improvements of both approaches and present the outcomes.

CV: Iuliia Nigmatulina received her MA degree in Psycholinguistics and Phonetics from St. Petersburg State University. She is now a master’s student in Computational Linguistics and Speech Processing at the University of Zürich and is currently writing her master’s thesis on acoustic modelling for Swiss German ASR. Her research interests are in the area of automatic speech recognition, sound analysis, phonetics and human-computer interaction.

Tannon Kew: I am a master’s student in Computational Linguistics at the University of Zurich in Switzerland. I have a background in linguistics and language teaching. Throughout my studies, I have worked on multiple projects relating to the development and applicability of large parallel language corpora. In my current research project, I have focused on language representation and modelling for Swiss German speech-to-text systems, under the supervision of Dr. Tanja Samardžić.

The intended audience: developers, project managers, data specialists.
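The word and character error rates quoted in Section 12 are standard edit-distance metrics. The self-contained sketch below computes both on an invented reference/hypothesis pair; it is illustrative only and not part of the authors’ Kaldi recipe.

# Self-contained sketch of the evaluation metrics used in Section 12:
# word error rate (WER) and character error rate (CER) are Levenshtein
# distances normalised by the reference length.
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using a rolling DP row."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[len(hyp)]

def wer(reference, hypothesis):
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)

def cer(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

# Invented toy transcripts, for illustration only.
ref = "si hend das huus verchauft"
hyp = "sie hend das hus verkauft"
print(f"WER = {wer(ref, hyp):.2%}, CER = {cer(ref, hyp):.2%}")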
13 Biomedical relation extraction with state-of-the-art neural models

Vani Kanjirangat and Fabio Rinaldi

Text mining systems are typically based on the identification of mentions of relevant domain entities (entity recognition and linking) and the identification of their relationships, such as the role of genes in certain diseases, or protein-protein interactions. We investigated the efficacy of state-of-the-art neural models for extracting high-quality relations from biomedical abstracts. The transformer models BERT and its biomedical counterpart BioBERT were tested both as classification models and as sources of embedding features.

Experiments were conducted on reference datasets such as the CHEMPROT dataset (chemical-protein relations) and the CDR dataset (chemical-disease relations). Depending on the dataset used, the tasks varied from binary to multi-class classification and from intra-sentential to inter-sentential relation spans. By modelling the problem as a sentence pair classification task, we found that our approach achieved results comparable with the SOTA models and specifically improved inter-sentential results.

Our research centers on improving the relation extraction models by analyzing the features captured by the current models. Experiments are conducted on visualizing the attention flow to examine the features that existing models rely on when deciding the relations. These analyses are quite important, especially since the black-box nature of neural models is considered a main pitfall, specifically restricting their practical applications.

Short CV: I am currently working as a researcher in the Natural Language Processing (NLP) lab of IDSIA, Switzerland. I completed my PhD in NLP, which was primarily centered on integrating machine learning and NLP techniques for text plagiarism detection. Ongoing research work includes biomedical text mining, semantic shift detection and visual summary generation using NLP techniques, temporal embeddings, Transformers and other deep learning models. Alongside, I am working on projects concerned with the application of deep learning models in the financial and question answering domains.

Intended Audience: project managers, developers

14 MedMon: multilingual social media mining for disease monitoring

Joseph Cornelius, Tilia Ellendorff, Nico Colic, Lenz Furrer, Albert Weichselbraun, Raul Rodriguez-Esteban, Philipp Kuntschik, Mathias Leddin, Juergen Gottowik and Fabio Rinaldi

The MedMon project (“Monitoring of internet resources for pharmaceutical research and development”) is a collaborative Innosuisse project between the University of Zurich, the University of Applied Sciences of the Grisons, and Roche. The project aims to monitor different social platforms on the internet (e.g. Twitter, Reddit, and medical forums) to assess patients’ perception of their specific disease burden and to discover unmet medical needs. By automating the gathering of patient insights, we enable more patient-centered drug development and surveillance, particularly for rare diseases.

Bringing together various sources of multilingual micro-posts for disease monitoring has the advantage of ensuring a complete picture by integrating information from all source types. However, the monitored source types are inherently different, each posing its own challenges for computational processing. We discuss specific characteristics, advantages and disadvantages of each source type and condition (e.g. Parkinson’s disease, multiple sclerosis, Angelman syndrome) in the context of automatic medical monitoring. Using the sub-task of personal health mention recognition as an example, we showcase how we addressed these challenges in practice.

Our results give further insights into how to optimally benefit from these multilingual resources and how to integrate them into an efficient model which can be applied in the context of different disease patterns.

Additionally, in the context of this project the academic partners participated in an international challenge on social media mining for health, achieving top results in two tasks using deep-learning BERT-based models. Specific methods and results will be presented.
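Personal health mention recognition, used as the example sub-task above, is essentially binary text classification. The sketch below shows a deliberately simple bag-of-words baseline on invented posts; the project itself uses BERT-based models, so this is only a conceptual illustration.

# Toy baseline for personal health mention recognition: a bag-of-words
# classifier distinguishing posts that report the author's own condition from
# other mentions. The example posts and labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

posts = [
    "I was diagnosed with multiple sclerosis two years ago",
    "my tremor got worse since I changed my medication",
    "new study on Parkinson's disease published today",
    "donate to our Angelman syndrome awareness campaign",
]
labels = [1, 1, 0, 0]  # 1 = personal health mention, 0 = other mention

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(posts, labels)

print(clf.predict(["my MS fatigue has been terrible this week"]))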
Short CV: Joseph Cornelius works as a research assistant at the Institute of Computational Linguistics at UZH. He holds an MSc in Neural Systems and Computation from UZH and ETH. During his master’s studies, he worked on automatic text summarization. His research focuses on state-of-the-art deep learning methods (BERT, BioBERT, etc.) for NLP in the biomedical domain. He participated in scientific challenges focusing on social media mining for health, obtaining top scores in two of them.

Intended Audience: project managers, developers

15 A Methodology for Creating Question Answering Corpora Using Inverse Data Annotation

Jan Deriu, Katsiaryna Mlynchyk, Philippe Schläpfer, Alvaro Rodrigo, Dirk von Grünigen, Kurt Stockinger, Eneko Agirre and Mark Cieliebak

In this paper, we introduce a novel methodology to efficiently construct a corpus for question answering over structured data. For this, we introduce an intermediate representation that is based on the logical query plan in a database, called Operation Trees (OT). This representation allows us to invert the annotation process without losing flexibility in the types of queries that we generate. Furthermore, it allows for fine-grained alignment of query tokens to OT operations.

In our method, we randomly generate OTs from a context-free grammar. Afterwards, annotators write the natural language question that is represented by the OT. Finally, the annotators assign the tokens to the OT operations. We apply the method to create a new corpus, OTTA (Operation Trees and Token Assignment), a large semantic parsing corpus for evaluating natural language interfaces to databases. We compare OTTA to Spider and LC-QuaD 2.0 and show that our methodology more than triples the annotation speed while maintaining the complexity of the queries. Finally, we train a state-of-the-art semantic parsing model on our data and show that our corpus is a challenging dataset and that the token alignment can be leveraged to increase the performance significantly.

This work has been partially funded by the LIH-LITH project supported by the EU ERA-Net CHIST-ERA; the Swiss National Science Foundation (20CH21174237); the Agencia Estatal de Investigación (AEI, Spain) projects PCIN-2017-118 and PCIN-2017-085; and the INODE project supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 863410.
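A toy illustration of the first step of the methodology above, randomly generating operation trees from a context-free grammar. The grammar, operations and table names below are invented and far simpler than the grammar used to build OTTA.

# Toy sketch: randomly generate operation trees (OTs) from a context-free
# grammar, as in the first step of the methodology above. The grammar is
# invented and much simpler than the one used for OTTA.
import random

GRAMMAR = {
    "QUERY":   [["PROJECT", "FILTER", "TABLE"], ["COUNT", "FILTER", "TABLE"]],
    "PROJECT": [["project(name)"], ["project(population)"]],
    "COUNT":   [["count()"]],
    "FILTER":  [["filter(country = 'CH')"], ["filter(population > 100000)"]],
    "TABLE":   [["scan(cities)"]],
}

def generate(symbol="QUERY"):
    """Expand a non-terminal by choosing a random production; terminals
    (strings not in the grammar) are returned as leaf nodes."""
    if symbol not in GRAMMAR:
        return symbol
    production = random.choice(GRAMMAR[symbol])
    return {symbol: [generate(s) for s in production]}

random.seed(0)
print(generate())  # an operation tree an annotator would verbalise as a question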
16 Following, understanding, and supporting service-oriented person-to-person communications

Alexandros Paramythis, Doris Paramythis and Andreas Putzinger

Automation in enterprise service provision has proliferated in recent years. In service-based communications, such automation typically takes the form of chatbots or Interactive Voice Response systems of varying sophistication. Despite very significant improvements in the corresponding technologies, recent studies show that in the domain of service-oriented communications, person-to-person interaction remains considerably more effective and efficient. This has given rise to a new generation of products that seek to empower humans engaging in such interaction, rather than replace them.

The main prerequisites for providing support during person-to-person communication are: on the one hand, being able to observe the ongoing interaction as it happens, bringing it into a computable form in (near) real time (e.g. through automatic speech recognition); and, on the other hand, being able to semantically interpret utterances in context. The second part specifically entails natural language understanding coupled with a semantic representation of the domain of discourse that can be used for reasoning.

In this presentation we outline our experiences with applying approaches from the fields of natural language processing and ontological domain modeling to the interpretation of dialogue acts, and also to the analysis of domain-specific data (e.g. product documentation), aimed at identifying the pieces of information most relevant to an ongoing person-to-person dialogue in real time.

17 Text Mining Technologies for Animal Health Surveillance

Fabio Rinaldi, Anne Goehring, Corinne Gurtner, John Berezowski, Michele Bodmer, Irene Zuehlke and Celine Faverjon

We describe the outcomes of a collaborative project between the Vetsuisse Faculty of the University of Bern and the Institute of Computational Linguistics of the University of Zurich, aimed at exploiting text mining technologies in the analysis of pathology reports from multiple Swiss veterinary laboratories. An online tool has been developed which allows the dynamic processing of batches of reports for the extraction of relevant signals, which in turn can be used for statistical analysis in epidemiological studies. The process is based on the identification in the reports of terminological items referring to relevant domain concepts. The terminologies used in the project are sourced from several ontological resources. We have also developed a semi-automated process to cross-map our ontological resources through a reference ontology such as the UMLS.

In a first step we evaluated the completeness and validity of the necropsy data. In a second step, we combined the information extracted from the three necropsy data sources and investigated factors associated with necropsy submissions at three different levels (“national”, “farm” and “individual”) and according to age, region and time of the year. An interactive dashboard application enables data exploration. The combined pathology data from several veterinary pathology laboratories can be displayed spatially and temporally for different types of analysis. All aspects of the project have been assessed for their potential benefits for animal health surveillance.

Short CV of the presenter: Fabio Rinaldi leads the NLP research group at the Dalle Molle Institute for Artificial Intelligence Research (IDSIA). Previously he was a lecturer and senior researcher at the University of Zurich, as well as a PI in numerous research projects, which he acquired and managed. He has an academic background in computer science and more than 25 years of experience in NLP research, with a specific focus on applications in the biomedical domain, such as automatic analysis of the scientific literature, of clinical reports, and of health-related social media discussions. He has also authored more than 100 scientific papers (including more than 30 journal papers).

Intended Audience: decision makers, project managers
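The core processing step in Section 17, spotting terminological items that refer to domain concepts in report text, can be illustrated with a small dictionary matcher. The terms, concept identifiers and example sentence below are invented; the project’s terminologies are sourced from ontological resources and cross-mapped through a reference ontology such as the UMLS.

# Illustrative term spotter: find mentions of known terminology items in a
# pathology report and map them to concept identifiers. The tiny term list
# and the concept IDs are invented placeholders.
import re

TERMINOLOGY = {
    "pneumonia": "CONCEPT_001",
    "enteritis": "CONCEPT_002",
    "liver": "CONCEPT_003",
}

PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, TERMINOLOGY)) + r")\b", re.IGNORECASE)

def annotate(report):
    return [
        {"term": m.group(0),
         "concept_id": TERMINOLOGY[m.group(0).lower()],
         "start": m.start(),
         "end": m.end()}
        for m in PATTERN.finditer(report)
    ]

print(annotate("Necropsy revealed severe pneumonia and congestion of the liver."))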
18 Assigning Grant Applications to Reviewers via Text Analysis

Anne Jorstad

The Swiss National Science Foundation normally identifies the most appropriate expert reviewer for each grant application by hand. However, when the pool of reviewers is known in advance, this process can be performed more efficiently using text mining. An application can be represented by the text of its title, keywords, and abstract. Potential reviewers can be represented by similar texts from their publications. We have tested a variety of techniques to define the similarity between pairs of texts, followed by an optimization procedure to determine the final matching, given constraints on the number of applications allowed per reviewer.

The biggest challenge is that the amount of discriminatory information provided in these texts varies widely between disciplines. Humanities and social sciences texts tend to use standard vocabulary such as “law” or “urban”, while the hard sciences include very specific terminology like “SARS-CoV-2” or “latent semantic analysis”. And some expressions overlap but carry different meanings in different fields, such as “family” or “support”, which are generally not meant in the sense of “family of algorithms” or “support vector machines”.

We aim to develop a system that can appropriately assign applications to reviewers for funding schemes as multi-disciplinary as Spark (“rapid funding of unconventional ideas”) and as mono-disciplinary as our new coronavirus call. We note that this algorithm will not be applied to all funding schemes at the SNSF.

Intended Audience: Developers and decision makers, specifically those who need to pair texts from a variety of topics simultaneously. We would also like to get feedback from researchers in related fields to improve all aspects of our algorithm.

Author CV:

Professional Experience:
• Swiss National Science Foundation, Data Scientist, 2014-Present
• Ecole Polytechnique Fédérale de Lausanne (EPFL), Postdoc, 2012-Present
• Johns Hopkins Applied Physics Lab, Research Intern, 2008-2010 (summers)

Education:
• PhD, Applied Mathematics, University of Maryland, USA, 2012
• Visiting Doctoral Student, ENS Cachan, Paris, France, 2010
• Master, Mathematics, University of Wisconsin, USA, 2007
• Bachelor, Mathematics (Computer Science Concentration), Cornell University, USA, 2005
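One plausible way to implement the pipeline sketched in Section 18, though not necessarily the SNSF’s: represent applications and reviewer publication texts as TF-IDF vectors, score all pairs by cosine similarity, and solve the matching as an assignment problem. The texts below are invented, and the actual similarity measures and optimization constraints of the project are not reproduced here.

# Illustrative sketch (not the SNSF's actual system): score application/reviewer
# pairs by TF-IDF cosine similarity and pick an assignment with the Hungarian
# algorithm. Real constraints, e.g. the maximum number of applications per
# reviewer, would require a richer optimisation model.
from scipy.optimize import linear_sum_assignment
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

applications = [
    "latent semantic analysis for legal document retrieval",
    "SARS-CoV-2 spike protein structure and vaccine design",
]
reviewer_profiles = [
    "publications on virology, coronavirus vaccine design and protein structure",
    "publications on natural language processing and information retrieval",
]

vectorizer = TfidfVectorizer().fit(applications + reviewer_profiles)
A = vectorizer.transform(applications)
R = vectorizer.transform(reviewer_profiles)
similarity = cosine_similarity(A, R)

# The Hungarian algorithm minimises cost, so negate the similarities.
app_idx, rev_idx = linear_sum_assignment(-similarity)
for a, r in zip(app_idx, rev_idx):
    print(f"application {a} -> reviewer {r} (similarity {similarity[a, r]:.2f})")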
19 Named entity recognition for job description mining

Dina Wieman, Khan Ozol, Natalia Korchagina, Claudio Bonesana, Anastassia Shaitarova and Fabio Rinaldi

In a collaborative project with a major pharma company we explored named entity recognition (NER) strategies applied to job and resume mining tasks. In the project we leveraged advanced NER approaches in order to identify job titles, organization names, and geographical locations, which are the essential elements of job mining tasks such as recruiting, tracking job candidates and job recommendation. This process is currently based on the manual analysis of hundreds of CVs, often with no relevance for a specific position or profile.

Despite the existence of many commercial providers of similar services, there are no publicly available datasets to evaluate the advertised algorithms. Existing pre-trained NER models, such as the spaCy and Stanford NER models, were trained on blogs, news and media. Their performance drops significantly when applied to sentences taken from resumes, since titles, locations and organization names in a resume are often written in the manner of a heading.

We asked domain experts to manually annotate a reference dataset of free-text job title descriptions extracted from CVs, used it to train a deep-learning model, and compared the results against the reference models mentioned above. We were able to outperform both pre-trained models by a significant margin. Our NER models have been integrated into a prototype system which demonstrates more dynamic and flexible data analysis compared to baseline commercial solutions.

Short CV: Fabio Rinaldi leads the NLP research group at the Dalle Molle Institute for Artificial Intelligence Research (IDSIA). Previously he was a lecturer and senior researcher at the University of Zurich, as well as a PI in numerous research projects, which he acquired and managed. He has an academic background in computer science and more than 25 years of experience in NLP research, with a specific focus on applications in the biomedical domain, such as automatic analysis of the scientific literature, of clinical reports, and of health-related social media discussions. He has also authored more than 100 scientific papers (including more than 30 journal papers).

Intended Audience: decision makers, project managers

20 Company Name Disambiguation

Ahmad Aghaebrahimian and Mark Cieliebak

Company Name Disambiguation (CND) is a form of Named Entity Disambiguation in which different textual representations of a company name are linked to its formal name. For instance, the company ‘ArcelorMittal SA’ is often referred to as ‘Arcelor Mittal Group’, ‘Mittal Steel’, or simply ‘Mittal Co.’. The task of mapping these surface forms to the same formal company name is known as CND or, more generally, Named Entity Disambiguation (NED). NED is a crucial task in many Natural Language Processing applications such as entity linking, record linkage, knowledge base construction, or relation extraction, to name a few.

It has been shown that parameter-less models for NED do not generalize well to other domains. On the other hand, parametric learning models do not scale well with a large number of candidate names, which is often the case for CND, since the number of formal company names usually exceeds hundreds of thousands of instances. Yet another challenge is multilingual NED: while formal company names are often in English, the texts and company mentions are frequently in another language, which makes string matching impractical.

In this talk, I elaborate on the wide range of techniques we use to tackle these challenges in a proprietary CND system. I will talk about our parameterized and non-parameterized models, string normalization, and encoding and disambiguation at scale. Finally, I present the audience with the state-of-the-art results we obtained on three publicly available datasets using our CND system.
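To make the surface-form problem concrete, the following toy baseline normalises company mentions (removing legal suffixes and punctuation) and matches them against formal names with a string similarity ratio. It reflects only the string-normalisation aspect mentioned above; the actual CND system combines parameterized and non-parameterized models and handles multilingual encoding at scale.

# Toy baseline illustrating surface-form normalisation and string matching for
# company name disambiguation. The suffix list and formal-name catalogue are
# small invented examples.
import difflib
import re

LEGAL_SUFFIXES = r"\b(sa|ag|gmbh|ltd|inc|co|corp|group|holding)\b\.?"

def normalise(name):
    name = name.lower()
    name = re.sub(LEGAL_SUFFIXES, " ", name)     # drop legal-form suffixes
    name = re.sub(r"[^a-z0-9 ]", " ", name)      # drop punctuation and accents
    return " ".join(name.split())

FORMAL_NAMES = ["ArcelorMittal SA", "Nestlé SA", "Roche Holding AG"]

def disambiguate(mention):
    """Return the formal name with the highest similarity to the mention."""
    scores = {
        formal: difflib.SequenceMatcher(None, normalise(mention), normalise(formal)).ratio()
        for formal in FORMAL_NAMES
    }
    return max(scores.items(), key=lambda kv: kv[1])

print(disambiguate("Arcelor Mittal Group"))  # expected to map to 'ArcelorMittal SA'
print(disambiguate("Mittal Co."))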