
Bucharest, Romania
Email: giorgiomaria.dinunzio@unipd.it (G. M. Di Nunzio)
URL: http://github.com/gmdn (G. M. Di Nunzio)

IMS-UNIPD @ CLEF eHealth Task 1: A Memory Based Reproducible Baseline

Giorgio Maria Di Nunzio

Department of Information Engineering, University of Padova, Italy
Department of Mathematics, University of Padova, Italy

2021


In this paper, we report the results of our participation in the CLEF eHealth 2021 Task on “Multilingual Information Extraction”. This year, the task focuses on Named Entity Recognition from Spanish clinical text in the domain of radiology reports. In particular, the main objective is to classify entities into seven different classes as well as hedge cues. Our main contributions can be summarized as follows: 1) we continue the study of a minimal, reproducible pipeline for text analysis baselines using a tidyverse approach in the R language; 2) we evaluate the simplest memory based classifiers without optimization.

Keywords: classification, memory based classifier, R, tidyverse

1. Introduction

• The implementation of a reproducible pipeline for text analysis;
• An evaluation of a simple memory based classifier.

The remainder of the paper will introduce the methodology and a brief summary of the experimental settings that we used in order to create the run that we submitted for the task.

2. Method

In this section, we summarize the pipeline for text pre-processing, which has been developed over the last three years [3, 4, 5] and has since been simplified. The source code will be made available.1

2.1. Pipeline for Data Cleaning

In order to produce a dataset ready for training a classifier, we followed the same pipeline for data ingestion and preparation for all the experiments. We used the tidytext approach to automatically parse and extract the text.2

The following code summarizes the initial steps of the analysis of the documents:

    # Split the annotation file into one row per line, then into tab-separated fields.
    train_ann <- train_ann %>%
      separate_rows(text, sep = "\n") %>%
      separate(col = text, sep = "\t", into = c("id", "type", "text"))

    # Extract the character offsets into `interval` and keep only the label in `type`.
    train_ann <- train_ann %>%
      mutate(interval = str_sub(type,
                                start = str_locate(string = type,
                                                   pattern = "[0-9]+ [0-9]+(;[0-9]+ [0-9]+)*"))) %>%
      mutate(type = str_sub(string = type,
                            end = str_locate(string = type,
                                             pattern = "[0-9]+ [0-9]+(;[0-9]+ [0-9]+)*")[, 1] - 1))

With just two statements we separate each token of each document and extract its location in the text. As an additional example, with the following line we tried to reduce the chance of matching very short sequences inside longer words by adding spaces around the text (even though we may lose some matches because of these extra characters):

    ta$text_lower[nchar(ta$text_lower) < 3] <- paste0(" ", short_text, " ")
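As a worked illustration of the parsing step, the sketch below applies the same regular expression to a single annotation line using base R only. The sample line is modelled on the output reported for document 4901 in Section 3.2; the tab-separated layout, the single space inside the pattern, and all variable names are assumptions made for this example and are not taken from the released code.

```r
# Hypothetical annotation line: id <TAB> type and offsets <TAB> text.
ann_line <- "T5\tAbbreviation 34 35\tc"

# Split the line into its three tab-separated fields.
fields <- strsplit(ann_line, "\t", fixed = TRUE)[[1]]
names(fields) <- c("id", "type", "text")

# Locate the offsets: "start end", optionally repeated after ';'.
offset_pattern <- "[0-9]+ [0-9]+(;[0-9]+ [0-9]+)*"
hit <- regexpr(offset_pattern, fields[["type"]])

# The matched span is the interval; what precedes it is the entity type.
interval <- regmatches(fields[["type"]], hit)
type <- trimws(substr(fields[["type"]], 1, as.integer(hit) - 1L))

print(interval)
print(type)
```

The call to `trimws` is an addition of this sketch: cutting the string one character before the first digit would otherwise leave a trailing space after the label.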

2.2. Classification

The memory based classifier follows the idea presented in [3]:
• Choose the morphological level (in our experiments, the token level);
• Given a (multi-word) token in a sentence, search for any previously classified documents that contain that sentence;
• Add the classification label to the document.

We built the rules for the memory based system by looking at all the documents provided in the training and validation sets. No optimization was performed at any step and only one run was submitted.
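The steps above can be sketched as a literal lookup of memorised sequences in a new document. This is a minimal base R illustration under stated assumptions, not the code used for the submitted run: the `memory` table, the function `classify_from_memory`, the example report, and the 0-based offset convention are all hypothetical.

```r
# Toy "memory": sequences already labelled in the training/validation data.
# All entries are hypothetical examples.
memory <- data.frame(
  text  = c("rx", "rodilla", "no se observa"),
  label = c("Abbreviation", "Anatomical Entity", "Negation"),
  stringsAsFactors = FALSE
)

# Return one row per occurrence of a memorised sequence in `document`.
classify_from_memory <- function(document, memory) {
  doc_lower <- tolower(document)
  hits <- lapply(seq_len(nrow(memory)), function(i) {
    # fixed = TRUE: match the memorised text literally, not as a regex.
    pos <- gregexpr(memory$text[i], doc_lower, fixed = TRUE)[[1]]
    if (pos[1] == -1) return(NULL)
    data.frame(
      label = memory$label[i],
      start = as.integer(pos) - 1L,                          # 0-based start
      end   = as.integer(pos) - 1L + nchar(memory$text[i]),  # exclusive end
      text  = memory$text[i],
      stringsAsFactors = FALSE
    )
  })
  # Drop sequences that never matched before binding the rows together.
  do.call(rbind, Filter(Negate(is.null), hits))
}

report <- "Rx de rodilla: no se observa derrame."
predictions <- classify_from_memory(report, memory)
print(predictions)
```

Note that the two-character sequence "rx" is matched wherever those characters occur, including inside longer words; this is the kind of short-sequence match that the padding line in Section 2.1 tries to limit.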

3. Preliminary Results

In this section, we briefly comment on the official results sent by the organizers before the workshop.

2 https://bnosac.github.io/udpipe/en/index.html

3.1. Considerations before the official results

Our initial goal with this approach was to build a memory based approach that could capture with high precision (only known sequences) and low recall (all the sequences that are not previously seen are not recognized) some entities that were labeled by the experts in the training and validation set.

Without any optimization or evaluation on the validation set, our initial guess was a recall around 50% (we suppose that at least half of the entities of the test set are not in the training/validation set), and a precision around 70-80% (when a sequence previously labeled is found, we suppose that there is a low chance that it is categorized wrongly).
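For reference, the two measures discussed here can be computed as in the following sketch. It assumes exact matching of label and offsets between predicted and gold annotations, which may differ from the matching criteria used in the official evaluation; the annotation strings are invented for illustration.

```r
# Hypothetical gold and predicted annotations, encoded as "label start end".
gold <- c("Abbreviation 0 2", "Anatomical Entity 6 13",
          "Negation 15 28", "Uncertainty 40 46")
predicted <- c("Abbreviation 0 2", "Anatomical Entity 6 13",
               "Abbreviation 34 35")

# A prediction counts as correct only if label and offsets match exactly.
true_positives <- length(intersect(predicted, gold))

precision <- true_positives / length(unique(predicted))  # 2 of 3 predictions
recall    <- true_positives / length(unique(gold))       # 2 of 4 gold entities

print(c(precision = precision, recall = recall))
```

In this toy case a single spurious one-character match lowers precision to two thirds, while recall is limited by the gold entities that were never seen in memory.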

3.2. Considerations after the official results

Compared to the same approach used in the past years, the results on this task were surprisingly low: on one hand, recall was around the figure we expected for most of the categories; on the other hand, precision was extremely low.

These results, despite being negative, raise interesting questions about what went wrong in the implementation of the rules of the classifier. In particular, we started to analyze the runs for the first time after they were submitted, and we found some odd classifications of one- or two-character entities. For example, for document 4901 we have

    T5 Abbreviation 34 35 c
    T6 Abbreviation 38 39 c
    T7 Abbreviation 46 47 c
    T8 Abbreviation 51 52 c
    T9 Abbreviation 58 59 c
    T10 Abbreviation 96 97 c
    ...

    T212 Uncertainty 12 13 o
    T213 Uncertainty 15 16 o
    T214 Uncertainty 20 21 o
    T215 Uncertainty 29 30 o
    T216 Uncertainty 52 53 o
    ...

Therefore, we believe that in some parts of the source code (see for example the line in Section 2.1 where we try to find a solution for smaller sequences) we introduced errors that could and should be avoided.

3.3. Source code debugging

After a careful analysis of the code, we found that we had unintentionally removed the white spaces introduced around shorter character sequences. In particular, when we extract the pattern to find in the text, we also “squish” any multiple white spaces, including the trailing white spaces. In doing so, single characters like "m", "o", "s" are included as patterns; thus, the number of patterns wrongly assigned to each document increases and the precision for that category decreases.

For this reason, we modified the code to skip this step when shorter sequences are involved, and we ran the classification of the validation sets again.
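The effect of the bug and of the fix can be reproduced on a toy example. The sketch below is written in base R under stated assumptions: `squish` is a stand-in for the white space normalization step (similar in spirit to `stringr::str_squish`), while `count_matches`, `prepare_pattern`, and the example report are hypothetical and are not taken from the released code.

```r
# Base R stand-in for a "squish" step: trim both ends and collapse
# runs of white space into a single space.
squish <- function(x) gsub("\\s+", " ", trimws(x))

# Count literal occurrences of `pattern` in `text`.
count_matches <- function(pattern, text) {
  pos <- gregexpr(pattern, text, fixed = TRUE)[[1]]
  if (pos[1] == -1) 0L else length(pos)
}

report <- "no se observa derrame o lesiones oseas"
padded <- paste0(" ", "o", " ")  # short sequence protected by spaces

# Bug: squishing strips the padding, so "o" also matches inside words.
buggy_hits <- count_matches(squish(padded), report)   # 5 matches

# Fix: leave sequences shorter than 3 characters (ignoring padding) untouched.
prepare_pattern <- function(p) {
  if (nchar(trimws(p)) < 3) p else squish(p)
}
fixed_hits <- count_matches(prepare_pattern(padded), report)  # 1 match

print(c(buggy = buggy_hits, fixed = fixed_hits))
```

With the padding preserved, only the standalone "o" is matched, which is consistent with the recovery in precision reported for the short-sequence categories.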

We provide the source code to replicate these results;3 in particular, with this simple modification we could recover most of the precision initially lost. For example, for the category “Abbreviation”, precision increases from around 7% up to 40%; for “Anatomical Entity”, from 25-30% up to 50%; and for “Negation”, from 10% up to 60%.

4. Final remarks and Future Work

The aim of our participation in the CLEF 2021 eHealth Task 1 was to test the effectiveness of a simple textual pipeline implemented in R with the ‘tidyverse’ approach for the problem of classification of Spanish clinical textual data. A preliminary failure analysis showed an anomaly in the values of precision, which were too low compared to the expected effectiveness. After a careful analysis, we found a mistake in the source code and, after we fixed the error, performance increased significantly. In order to make this study reproducible, we will make the source code available. Additional analyses will be carried out to find patterns that can be easily used as a kind of knowledge base to support more advanced systems. We will provide a finer analysis on the test set when the ground truth is made available.

References

[1] Suominen, Goeuriot, Kelly, L. A. Alemany, Bassani, Brew-Sam, Cotik, Filippo, González-Sáez, Luque, Mulhem, G. Pasi, Roller, Seneviratne, Upadhyay, Vivaldi, Viviani, C. Xu, Overview of the CLEF eHealth evaluation lab 2021, in: CLEF 2021 - 12th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS), Springer, 2021.

[2] Cotik, L. A. Alemany, Luque, Roller, Vivaldi, Ayach, Carranza, L. D. Francesca, Dellanzo, M. F. Urquiza, Overview of CLEF eHealth task 1 - SpRadIE: A challenge on information extraction from Spanish radiology reports, in: CLEF 2021 Evaluation Labs and Workshop: Online Working Notes, CEUR Workshop Proceedings, 2021.

[3] G. M. Di Nunzio, As simple as possible: Using the R tidyverse for multilingual information extraction. IMS unipd ad CLEF ehealth 2020 task 1, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/paper_137.pdf.

[4] G. M. Di Nunzio, Classification of animal experiments: A reproducible study. IMS unipd at CLEF ehealth task 1, in: L. Cappellato, N. Ferro, D. E. Losada, H. Müller (Eds.), Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019, volume 2380 of CEUR Workshop Proceedings, CEUR-WS.org, 2019. URL: http://ceur-ws.org/Vol-2380/paper_104.pdf.

[5] G. Di Nunzio, Classification of ICD10 codes with no resources but reproducible code. IMS unipd at CLEF ehealth task 1, in: Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10-14, 2018, 2018. URL: http://ceur-ws.org/Vol-2125/paper_180.pdf.