As Simple as Possible: Using the R Tidyverse for
     Multilingual Information Extraction.
   IMS Unipd at CLEF eHealth 2020 Task 1

                              Giorgio Maria Di Nunzio 1,2
               1 Dept. of Information Engineering – University of Padua
                      2 Dept. of Mathematics – University of Padua
                             giorgiomaria.dinunzio@unipd.it


        Abstract. In this paper, we report the results of our participation in
        the CLEF eHealth 2020 Task on “Multilingual Information Extraction”.
        This task focuses on the coding of medical textual data in Spanish using
        the International Statistical Classification of Diseases and Related
        Health Problems (ICD). The main objective of our participation in this
        task is the study of reproducible experiments that require minimal effort
        to be set up and run and that can be used as a baseline. The contribution
        of our experiments to this task can be summarized as follows: the
        implementation of a reproducible pipeline for text analysis that uses
        universal dependency parsing; an evaluation of simple classifiers based
        on perfect matches on different morphological levels together with a
        tf-idf approach.


1     Introduction
CLEF eHealth is an evaluation challenge in the medical domain where the goal is
to provide researchers with datasets, evaluation frameworks, and events. In the
CLEF eHealth 2020 edition [1], the organizers set up two tasks to evaluate
retrieval systems on different domains. In this paper, we report the results of
our participation in the CLEF eHealth Task 1 “Multilingual Information
Extraction” [2]. The 2020 task focuses on the evaluation of systems that
automatically code clinical textual data in Spanish with ICD codes. In this
edition, we continue the line of research that we have followed over the last
two years [3, 4]: to study and share reproducible systems that require minimal
effort to run, in order to create useful baselines for the research community.
In particular, we participated in two of the three available subtasks:
subtask 1, ICD10-CM code assignment, which evaluates systems that predict
ICD10-CM codes for the classification of diseases; and subtask 2, ICD10-PCS
code assignment, which evaluates systems that predict ICD10-PCS codes for the
classification of medical procedures.
    The contribution of our experiments to this task can be summarized as
follows:

 – the implementation of a reproducible pipeline for text analysis;
 – an evaluation of simple classifiers based on perfect matches on different
   lexical levels and a tf-idf approach.

   The remainder of the paper introduces the methodology and gives a brief
summary of the experimental settings that we used to create the runs that
we submitted for the task.

    Copyright © 2020 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25
    September 2020, Thessaloniki, Greece.


2     Method

In this section, we summarize the pipeline for text pre-processing, which has
been developed over the last two years [3, 4] and has been extended and made
reproducible in this work. The source code used in these experiments will be
shared online.3 In general, our method follows the principles described by [?],
where the idea is to mine textual information from large text collections in an
efficient and effective way by means of organized workflows named pipelines.
Pipelines are an effective way to manage the sequential process of text analysis
by splitting the source code into steps, where the output of one step is the
input for the subsequent step. The R programming language has a set of packages
that follow this idea, named tidyverse,4 that we use in our experiments.
     Apart from being a tidy way of organizing software, an important advantage
of working with pipelines is that this practice promotes shareability and
reproducibility in research workflows, which is one of the main pillars of the
European Open Science Cloud (EOSC).5
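
As a minimal sketch of the pipeline idea, assuming the dplyr package from the tidyverse and a made-up collection of two documents (the data and the column names doc_id, text, text_lower and n_words are illustrative and not taken from the task data), each step below receives the output of the previous one:

```r
library(dplyr)

# Toy collection: documents and column names are invented for illustration.
docs <- data.frame(
  doc_id = c("d1", "d2"),
  text   = c("Dolor abdominal agudo", "Fractura de cadera derecha"),
  stringsAsFactors = FALSE
)

# Each step takes the data frame produced by the previous step.
summary_table <- docs %>%
  mutate(text_lower = tolower(text)) %>%                                  # step 1: lowercase
  mutate(n_words = lengths(strsplit(text_lower, " ", fixed = TRUE))) %>%  # step 2: count words
  arrange(desc(n_words))                                                  # step 3: longest first

print(summary_table)
```

Because every step is a separate function call, a step can be removed, replaced or inspected without touching the others.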


2.1   Pipeline for Data Cleaning

In order to produce a dataset ready for training a classifier, we followed the same
pipeline for data ingestion and preparation in all the experiments. Instead of
using the tidytext approach,6 in this edition we tried the Universal Dependency
Parser implementation in R, udpipe, which automatically tokenizes, lemmatizes
and annotates text.7
    The following code summarizes all these steps:
# tokenize, lemmatize and annotate each document with the Spanish model
udpipe_annotate(object = udmodel_spanish,
                x = text,
                doc_id = doc_id_x)
where udmodel_spanish is the dependency parser for Spanish, and text and doc_id_x
are the textual data and the identifier of each medical document in the dataset.

3 https://github.com/gmdn
4 https://www.tidyverse.org
5 https://www.eosc-portal.eu
6 https://www.tidytextmining.com
7 https://bnosac.github.io/udpipe/en/index.html

The idea of our approach is to transform each piece of text in order to have
three versions of it: the original tokenized version, the variant with all words
lemmatized, and the variant with all words stemmed. The following lines take the
output of the udpipe step, annotated_train, add the stemmed version of each
token, and transform all text to lowercase:

annotated_train %>%
  mutate(stem = wordStem(token, language = "spanish")) %>%
  mutate(token_lower = tolower(token)) %>%
  mutate(lemma_lower = tolower(lemma)) %>%
  mutate(stem_lower = tolower(stem))

where the %>% symbol represents the usual “pipe” symbol (the output of a
function step is the input of the next function), and we used the Spanish Snowball
stemmer.
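
The two listings above can be combined into one sketch that also shows where the model object comes from. This is an assumption-laden illustration and not the code used for the submitted runs: the choice of the "spanish-gsd" model, the helper names load_spanish_model and annotate_variants, and the argument names texts and ids are hypothetical; only the udpipe, dplyr and SnowballC functions are real.

```r
library(udpipe)
library(dplyr)
library(SnowballC)

# Assumption: the Spanish GSD model; the paper does not say which model file was used.
load_spanish_model <- function(model_dir = tempdir()) {
  info <- udpipe_download_model(language = "spanish-gsd", model_dir = model_dir)
  udpipe_load_model(file = info$file_model)
}

# Annotate raw texts, then add the stemmed and lowercased variants of each token.
annotate_variants <- function(model, texts, ids) {
  udpipe_annotate(object = model, x = texts, doc_id = ids) %>%
    as.data.frame() %>%                                       # one row per token
    mutate(stem = wordStem(token, language = "spanish")) %>%  # Spanish Snowball stemmer
    mutate(token_lower = tolower(token),
           lemma_lower = tolower(lemma),
           stem_lower  = tolower(stem)) %>%
    select(doc_id, token_lower, lemma_lower, stem_lower)
}

# Example call (downloads the model on first use, so it needs a network connection):
# udmodel_spanish <- load_spanish_model()
# annotated <- annotate_variants(udmodel_spanish, "Dolor abdominal agudo.", "d1")
```

Keeping the download step in its own function means the annotation step can be rerun on new texts without fetching the model again.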


2.2   Classification

Our simple classifier follows a memory-based approach with an additional tf-idf
weighting scheme. There is no difference between the two subtasks since the
procedure is exactly the same:

 – choose the morphological level: token, lemma, stem;
 – given a sentence that has to be classified, search for any previously classified
   document that contains that sentence;
 – add the classification label to the list of candidates;
 – assign the label with the majority of counts.
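
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the submitted implementation: the training texts, the labels code_A to code_C, the function name classify_memory and the alphabetical tie-breaking rule are all invented. The text column can hold the token, lemma or stem version of each document, depending on the morphological level chosen.

```r
library(dplyr)

# Hypothetical memory of previously classified texts (labels are placeholders).
train <- data.frame(
  text  = c("paciente con dolor abdominal agudo",
            "dolor abdominal y fiebre",
            "dolor abdominal tras la cirugia",
            "fractura de cadera derecha"),
  label = c("code_A", "code_B", "code_B", "code_C"),
  stringsAsFactors = FALSE
)

# Return the label that occurs most often among the stored texts containing
# the query as an exact substring; NA when nothing matches.
classify_memory <- function(query, memory) {
  candidates <- memory %>%
    filter(grepl(tolower(query), tolower(text), fixed = TRUE)) %>%  # perfect match
    count(label, name = "votes") %>%                                # candidate labels
    arrange(desc(votes), label)                                     # majority first
  if (nrow(candidates) == 0) {
    return(NA_character_)
  }
  candidates$label[1]
}

predicted <- classify_memory("dolor abdominal", train)
unmatched <- classify_memory("cefalea", train)
```

The second call illustrates the limitation discussed next: a sentence that never appears in the memory receives no label at all.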

   Since this approach can, in principle, assign only labels that have already
been assigned in the past, we added two more steps (a search over the ICD-10
codes and a tf-idf weighting) to include more labels. The procedure becomes:

 – choose the morphological level: token, lemma, stem;
 – given a sentence that has to be classified, search for any ICD-10 code whose
   description contains the sentence;
 – add the classification label to the list of candidates;
 – additionally, use tf-idf to weigh the importance of each word in the
   sentence;
 – assign the label with the largest weight.
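
One possible reading of the tf-idf steps is sketched below. It is an interpretation, not the submitted code: the paper does not specify how the weights are combined, so the sum of tf-idf weights of the query words, the code descriptions, the labels code_A to code_C and the function name score_labels are all assumptions made for illustration.

```r
library(dplyr)

# Hypothetical code descriptions (labels and texts invented for illustration).
codes <- data.frame(
  label       = c("code_A", "code_B", "code_C"),
  description = c("dolor abdominal agudo",
                  "dolor abdominal no especificado",
                  "dolor de cadera"),
  stringsAsFactors = FALSE
)

# One row per (label, word) pair.
words <- do.call(rbind, lapply(seq_len(nrow(codes)), function(i) {
  data.frame(
    label = codes$label[i],
    word  = strsplit(tolower(codes$description[i]), " ", fixed = TRUE)[[1]],
    stringsAsFactors = FALSE
  )
}))

n_docs <- n_distinct(words$label)

# tf = relative frequency of the word in a description,
# idf = log(number of descriptions / descriptions containing the word).
tfidf <- words %>%
  count(label, word, name = "n") %>%
  group_by(label) %>%
  mutate(tf = n / sum(n)) %>%
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(label))) %>%
  ungroup() %>%
  mutate(tf_idf = tf * idf)

# Score each label by the summed tf-idf weight of the query words it contains.
score_labels <- function(query, weights) {
  query_words <- unique(strsplit(tolower(query), " ", fixed = TRUE)[[1]])
  weights %>%
    filter(word %in% query_words) %>%
    group_by(label) %>%
    summarise(score = sum(tf_idf), .groups = "drop") %>%
    arrange(desc(score), label)
}

ranking    <- score_labels("dolor abdominal agudo", tfidf)
best_label <- ranking$label[1]
```

In this toy collection the word "dolor" occurs in every description, so its idf is zero and it does not help to separate the candidate labels; rarer words such as "agudo" dominate the score.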


3     Experiments

In this section, we briefly describe the settings of the official runs that we
submitted for this task and the preliminary results sent by the organizers
before the workshop.

Table 1. Summary of the results for the two subtasks: upper part subtask 1
(diseases), lower part subtask 2 (procedures).

      file                                              MAP    P      R      F1
      test D only token                                 0.449  0.373  0.652  0.474
      test D only token lemma stem                      0.391  0.306  0.672  0.420
      test D only token lemma stem codiesp              0.389  0.299  0.682  0.416
      test D tfidf only token lemma stem codiesp        0.395  0.079  0.699  0.143
      test D tfidf only token lemma stem tfidf codiesp  0.392  0.081  0.709  0.145

      test P only token                                 0.365  0.310  0.478  0.376
      test P only token lemma stem                      0.365  0.291  0.509  0.370
      test P only token lemma stem codiesp              0.365  0.291  0.509  0.370
      test P tfidf only token lemma stem codiesp        0.391  0.026  0.749  0.051
      test P tfidf only token lemma stem tfidf codiesp  0.390  0.026  0.747  0.051


3.1   Run Settings
The goal of our experiments is to compare the effectiveness of adding elements to
the classifier and to study the differences among the resulting classifiers in a
failure analysis (post-hoc analysis).
   We submitted five official runs for each subtask. The letter ‘X’ in the following
description of the run can be substituted with either ‘D’ or ‘P’ according to the
subtask (Disease or Procedure):
 – test X only token: this run uses only a memory-based approach with tokens
   (original words);
 – test X only token lemma stem: this run uses only a memory-based approach
   with tokens, lemmas and stems;
 – test X only token lemma stem codiesp: the same as the previous one, but we
   add the descriptions of the ICD-10 codes to the list of possible documents
   to match;
 – test X tfidf only token lemma stem codiesp: the same as the previous one,
   but we add the tf-idf weights for the token, lemma and stem representations;
 – test X tfidf only token lemma stem tfidf codiesp: the same as the previous
   one, but we also add the tf-idf weights for the token, lemma and stem
   representations of the ICD-10 descriptions.

3.2   Results
A summary of the results for the two subtasks is shown in Table 1. The
performance varies considerably with the combination of elements in both
subtasks. In general, the simplest classifier, which uses only tokens, achieves
on average the best performance across the different measures. Adding elements
to the classifier, such as lemmas, stems and tf-idf weighting, increases recall
at the expense of precision: in subtask 1, for example, recall rises from 0.652
to 0.709 while precision falls from 0.373 to 0.081.
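
The effect of this trade-off on F1 can be checked with a few lines of R. The sketch assumes that the reported F1 is the harmonic mean of the reported precision and recall; the function name f1_score is ours, and the input values are copied from Table 1.

```r
# Harmonic mean of precision and recall.
f1_score <- function(precision, recall) {
  2 * precision * recall / (precision + recall)
}

# Values copied from Table 1, subtask 1 (diseases).
f1_token <- f1_score(0.373, 0.652)  # run "test D only token"
f1_tfidf <- f1_score(0.079, 0.699)  # run "test D tfidf only token lemma stem codiesp"

print(round(c(token = f1_token, tfidf = f1_tfidf), 3))
```

Under this assumption the computed values agree with the F1 column of Table 1 up to the rounding of P and R, and they show why a small gain in recall cannot compensate for a large loss in precision.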
    The sharp drop in precision when tf-idf is used prompted further
investigation: we found a bug in the code that prevented a threshold on the
number of retrieved labels from being activated. All the source code will be
made available online.8
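
The paper does not describe how the threshold was meant to work. Purely as an illustration of what a cut-off on the number of retrieved labels can look like, the sketch below keeps at most k highest-scored labels per document; the data, the labels, the function name keep_top_k and the value k = 2 are all hypothetical.

```r
library(dplyr)

# Hypothetical scored candidates: documents, labels and scores are invented.
candidates <- data.frame(
  doc_id = c("d1", "d1", "d1", "d1", "d2", "d2"),
  label  = c("code_A", "code_B", "code_C", "code_D", "code_A", "code_E"),
  score  = c(0.9, 0.4, 0.2, 0.1, 0.3, 0.7),
  stringsAsFactors = FALSE
)

# Keep at most k labels per document, ranked by decreasing score.
keep_top_k <- function(scored, k) {
  scored %>%
    group_by(doc_id) %>%
    slice_max(score, n = k, with_ties = FALSE) %>%
    ungroup()
}

top2 <- keep_top_k(candidates, k = 2)
print(top2)
```

Without such a cut-off every candidate label is returned for each document, which is consistent with the pattern of high recall and very low precision seen in the tf-idf runs.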


4     Final Remarks and Future Work
The aim of our participation in the CLEF eHealth Task 1 was to test the
effectiveness of a simple textual pipeline implemented in R with the ‘tidyverse’
approach for the problem of classification of clinical textual data. In this
task, participants are required to label health-related documents with ICD-10
codes for diseases and procedures, with a focus on the Spanish language. We
tackled this task by focusing on reproducibility aspects, as we did in previous
years; this time, we tried a variation of our approach, moving from a
frequency-based classification approach [3, 4] to a sort of memory-based
classification that finds perfect matches with previously classified clinical
notes using different lexical variants. This variation was inspired by the
baseline produced by the organizers of the CLEF 2018 eHealth task [?]. In
addition, we included a tf-idf approach to analyze whether the inverse document
frequency can help in the classification task.
    At the time of writing, we do not have a way to compare our results with
those of the other participants, and a comparison with previous years would not
be meaningful since the collection of documents is completely different.
However, in the preliminary analysis, we found that the token-based
classification achieved the best results both in terms of classification (F1)
and retrieval (MAP) for the disease classification subtask. Interestingly, the
mixed approach with tf-idf weights performed better in terms of retrieval (MAP)
in the procedure classification subtask (0.391 against 0.365) despite a very low
classification score caused by an extremely low precision (0.026). A preliminary
failure analysis showed that the code had a bug that prevented the labels from
being weighted and selected correctly in the tf-idf approach.


5     Acknowledgements
This work was partially supported by the ExaMode Project, as a part of the
European Union Horizon 2020 Program under Grant 825292.


8 https://github.com/gmdn


References
1. Lorraine Goeuriot, Hanna Suominen, Liadh Kelly, Antonio Miranda-Escalada,
   Martin Krallinger, Zhengyang Liu, Gabriella Pasi, Gabriela Saez Gonzales,
   Marco Viviani, and Chenchen Xu. Overview of the CLEF eHealth evaluation lab
   2020. In Avi Arampatzis, Evangelos Kanoulas, Theodora Tsikrika, Stefanos
   Vrochidis, Hideo Joho, Christina Lioma, Carsten Eickhoff, Aurélie Névéol,
   Linda Cappellato, and Nicola Ferro, editors, Experimental IR Meets
   Multilinguality, Multimodality, and Interaction: Proceedings of the Eleventh
   International Conference of the CLEF Association (CLEF 2020), LNCS volume
   12260, 2020.
2. Antonio Miranda-Escalada, Aitor Gonzalez-Agirre, Jordi Armengol-Estapé, and
   Martin Krallinger. Overview of automatic clinical coding: annotations,
   guidelines, and solutions for non-English clinical cases at CodiEsp track of
   CLEF eHealth 2020. In Working Notes of Conference and Labs of the Evaluation
   (CLEF) Forum, CEUR Workshop Proceedings, 2020.
3. Giorgio Maria Di Nunzio. Classification of ICD10 codes with no resources but
   reproducible code. IMS Unipd at CLEF eHealth task 1. In Working Notes of CLEF
   2018 - Conference and Labs of the Evaluation Forum, Avignon, France,
   September 10-14, 2018.
4. Giorgio Maria Di Nunzio. Classification of animal experiments: A reproducible
   study. IMS Unipd at CLEF eHealth task 1. In Linda Cappellato, Nicola Ferro,
   David E. Losada, and Henning Müller, editors, Working Notes of CLEF 2019 -
   Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September
   9-12, 2019, volume 2380 of CEUR Workshop Proceedings. CEUR-WS.org, 2019.