<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Bucharest, Romania; giorgiomaria.dinunzio@unipd.it (G. M. Di Nunzio); http://github.com/gmdn (G. M. Di Nunzio)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>IMS-UNIPD @ CLEF eHealth Task 1: A Memory Based Reproducible Baseline</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giorgio Maria Di Nunzio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering, University of Padova</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Mathematics, University of Padova</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>In this paper, we report the results of our participation in the CLEF eHealth 2021 Task on “Multilingual Information Extraction”. This year, the task focuses on Named Entity Recognition from Spanish clinical text in the domain of radiology reports. In particular, the main objective is to classify entities into seven different classes as well as hedge cues. Our main contributions can be summarized as follows: 1) we continue the study of a minimal, reproducible pipeline for text analysis baselines using a tidyverse approach in the R language; 2) we evaluate the simplest memory based classifiers without optimization.</p>
      </abstract>
      <kwd-group>
        <kwd>classification</kwd>
        <kwd>memory based classifier</kwd>
        <kwd>R tidyverse</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Our main contributions can be summarized as follows:
• The implementation of a reproducible pipeline for text analysis;
• An evaluation of a simple memory based classifier.</p>
      <p>The remainder of the paper introduces the methodology and briefly summarizes the
experimental settings that we used to create the run submitted for the task.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <p>
        In this section, we summarize the pipeline for text pre-processing, which has been developed over
the last three years [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4, 5</xref>
        ] and then simplified. The source code will be made available.<sup>1</sup>
      </p>
      <sec id="sec-2-1">
        <title>2.1. Pipeline for Data Cleaning</title>
        <p>In order to produce a dataset ready for training a classifier, we followed the same pipeline
for data ingestion and preparation in all the experiments. We used the tidytext approach to
automatically parse and extract the text.<sup>2</sup></p>
        <p>The following code summarizes the initial steps of the analysis of the documents:</p>
        <preformat>library(dplyr)
library(tidyr)
library(stringr)

# One row per annotation line, then split the tab-separated fields
train_ann &lt;- train_ann %&gt;%
  separate_rows(text, sep = "\n") %&gt;%
  separate(col = text, sep = "\t", into = c("id", "type", "text"))

# The "type" field holds the label followed by the character offsets:
# extract the offsets into "interval" and keep what precedes them in "type"
train_ann &lt;- train_ann %&gt;%
  mutate(interval = str_sub(type, start = str_locate(string = type,
    pattern = "[0-9]+ [0-9]+(;[0-9]+ [0-9]+)*"))) %&gt;%
  mutate(type = str_sub(string = type, end = str_locate(string = type,
    pattern = "[0-9]+ [0-9]+(;[0-9]+ [0-9]+)*")[, 1] - 1))</preformat>
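        <p>As a minimal sketch of what the offset pattern in the code above does, the snippet below applies it to two toy values of the type field. The two strings are invented for illustration (they only mimic the "label start end" layout of the annotations shown in Section 3.2), and the final trimming step is an assumption added here to drop the separator space; it is not part of the original pipeline.</p>
        <preformat><![CDATA[
```r
library(stringr)

# Toy values of the "type" field (invented for illustration)
type_field <- c("Abbreviation 34 35", "Anatomical Entity 10 14;20 25")

# One or more "start end" pairs separated by ";"
offset_pattern <- "[0-9]+ [0-9]+(;[0-9]+ [0-9]+)*"

# Two-column matrix with the start and end position of each match
loc <- str_locate(type_field, offset_pattern)

# Passing the matrix to `start` extracts the matched interval
interval <- str_sub(type_field, start = loc)

# Everything before the offsets is the label; str_trim() removes the
# separator space (an extra step assumed here, not in the original code)
label <- str_trim(str_sub(type_field, end = loc[, "start"] - 1))

print(data.frame(label, interval))
```
]]></preformat>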
        <p>With just two statements we separate each token of each document and extract its location
in the text. As an additional example, with the following line we tried to reduce the chance
of spurious matches for very short sequences by adding spaces around the text (even
though we may lose some matches because of these extra characters):</p>
        <preformat>ta$text_lower[nchar(ta$text_lower) &lt; 3] &lt;- paste0(" ", short_text, " ")</preformat>
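        <p>The effect of this padding, and its drawback, can be illustrated with a small self-contained sketch. The patterns and the text are toy values invented for this example, and literal matching with fixed() is an assumption about how the patterns are searched.</p>
        <preformat><![CDATA[
```r
library(stringr)

# Toy memory entries (invented for illustration)
patterns <- c("o", "derrame pleural")

# Pad entries shorter than three characters with one space on each side
is_short <- nchar(patterns) < 3
patterns[is_short] <- paste0(" ", patterns[is_short], " ")

text <- "no se observa derrame pleural o neumotorax"

# fixed() makes stringr treat each pattern as a literal string
unpadded_hits <- str_count(text, fixed("o"))      # also counts "o" inside words
padded_hits <- str_count(text, fixed(patterns))   # one count per padded pattern

# The drawback: a short token at the very start of a text has no leading space
missed_hits <- str_count("o neumotorax", fixed(" o "))
```
]]></preformat>
        <p>In this toy text the unpadded pattern matches every "o", including those inside words, while the padded pattern matches only the isolated one; the same token at the start of a text is missed.</p>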
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Classification</title>
        <p>
          The memory based classifier follows the idea presented in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]:
• Choose the morphological level (in our experiments, the token level);
• Given a (multi-word) token in a sentence, search for any previously classified documents
that contain that sentence;
• Add the classification label to the document.
        </p>
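        <p>The steps above can be sketched as a simple lookup. This is only an illustration under stated assumptions, not the submitted system: the memory table, its entries and the function name classify_doc are hypothetical, matching is done on lower-cased literal strings, and offsets are reported with a 0-based start and an exclusive end, as the one-character examples in Section 3.2 suggest.</p>
        <preformat><![CDATA[
```r
library(stringr)

# Toy "memory": surface forms and labels as they might appear in the
# training annotations (invented for illustration)
memory <- data.frame(
  text  = c("no se observa", "higado"),
  label = c("Negation", "Anatomical Entity"),
  stringsAsFactors = FALSE
)

# Return one row per occurrence of a memorized string in `doc`
classify_doc <- function(doc, memory) {
  doc_lower <- str_to_lower(doc)
  rows <- lapply(seq_len(nrow(memory)), function(i) {
    # Matrix of start/end positions; zero rows when there is no match
    loc <- str_locate_all(doc_lower, fixed(memory$text[i]))[[1]]
    data.frame(
      label = rep(memory$label[i], nrow(loc)),
      start = loc[, "start"] - 1L,  # 0-based start offset (assumed convention)
      end   = loc[, "end"],         # exclusive end offset (assumed convention)
      text  = rep(memory$text[i], nrow(loc)),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

pred <- classify_doc("No se observa lesion en higado.", memory)
print(pred)
```
]]></preformat>
        <p>Sequences that are not in the memory are never returned, which is why such a classifier is expected to trade recall for precision.</p>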
        <p>We built the rules for the memory based system by looking at all the documents provided in
the training and validation sets. No optimization was performed at any step and only one run
was submitted.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Preliminary Results</title>
      <p>In this section, we briefly comment on the official results sent by the organizers before the workshop.</p>
      <p><sup>2</sup> https://bnosac.github.io/udpipe/en/index.html</p>
      <sec id="sec-3-1">
        <title>3.1. Considerations before the official results</title>
        <p>Our initial goal was to build a memory based approach that could capture, with high
precision (only known sequences are matched) and low recall (sequences that have not been
seen before are not recognized), some of the entities labeled by the experts in the
training and validation sets.</p>
        <p>Without any optimization or evaluation on the validation set, our initial guess was a recall
around 50% (we suppose that at least half of the entities of the test set are not in the
training/validation set), and a precision around 70-80% (when a sequence previously labeled is found, we
suppose that there is a low chance that it is categorized wrongly).</p>
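        <p>For reference, precision and recall can be computed from sets of annotations as in the sketch below. The gold and predicted annotations are toy values invented for this example, and exact matching of label and offsets is an assumption; the official evaluation of the task may use different matching criteria.</p>
        <preformat><![CDATA[
```r
# Toy annotations encoded as "label:start:end" keys (invented for illustration)
gold <- c("Negation:0:13", "Anatomical Entity:24:30",
          "Abbreviation:40:42", "Abbreviation:50:52")
pred <- c("Negation:0:13", "Anatomical Entity:24:30", "Abbreviation:60:61")

# A prediction is correct only if label and offsets match exactly (assumption)
true_positives <- length(intersect(pred, gold))

precision <- true_positives / length(pred)  # share of predictions that are correct
recall <- true_positives / length(gold)     # share of gold entities that are found
```
]]></preformat>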
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Considerations after the official results</title>
        <p>Compared to the same approach used in the past years, the results on this task were surprisingly
low: on one hand, recall was around the figure we expected for most of the categories; on the
other hand, precision was extremely low.</p>
        <p>These results, despite being negative, open interesting questions about what went wrong in
the implementation of the rules of the classifier. In particular, we have started to analyze the runs
for the first time since their submission, and we found some odd classifications of one-
or two-character entities. For example, for document 4901 we have:</p>
        <preformat>T5 Abbreviation 34 35 c
T6 Abbreviation 38 39 c
T7 Abbreviation 46 47 c
T8 Abbreviation 51 52 c
T9 Abbreviation 58 59 c
T10 Abbreviation 96 97 c
...
T212 Uncertainty 12 13 o
T213 Uncertainty 15 16 o
T214 Uncertainty 20 21 o
T215 Uncertainty 29 30 o
T216 Uncertainty 52 53 o
...</preformat>
        <p>Therefore, we believe that in some parts of the source code (see, for example, the line in Section
2.1 where we try to find a solution for smaller sequences) we introduced errors that could and
should have been avoided.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Source code debugging</title>
        <p>After a careful analysis of the code, we found that we unintentionally removed the
white spaces introduced around shorter character sequences. In particular, when we extract the
pattern to find in the text, we also “squish” any multiple white spaces, which strips the leading
and trailing white spaces as well. In doing so, single characters like "m", "o", "s" are included as
patterns; thus, the number of patterns wrongly assigned to each document increases and the
precision for that category decreases.</p>
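        <p>The behaviour described above can be reproduced with a few lines. This is a minimal sketch that assumes the squishing is done with str_squish from stringr (the paper does not name the function); the patterns and the text are toy values.</p>
        <preformat><![CDATA[
```r
library(stringr)

# Padded short patterns and one longer pattern with a doubled space (toy values)
padded <- c(" o ", " m ", "derrame  pleural")

# str_squish() collapses repeated spaces but also trims both ends,
# so the padding around the short patterns is lost
squished <- str_squish(padded)

text <- "no se observa derrame pleural"

hits_padded <- str_count(text, fixed(" o "))   # no isolated "o" in this text
hits_squished <- str_count(text, fixed("o"))   # matches inside "no" and "observa"
```
]]></preformat>
        <p>Each extra match of the squished pattern would be emitted as a one-character entity, which is consistent with the odd annotations reported in Section 3.2.</p>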
        <p>For this reason, we modified the code in order to skip this step when shorter sequences
are involved and ran the classification of the validation sets again.</p>
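        <p>One possible way to express such a fix is sketched below. It is not the actual patch: the function name squish_long_only and its min_chars parameter are hypothetical, and the threshold of three characters is taken from the padding line in Section 2.1.</p>
        <preformat><![CDATA[
```r
library(stringr)

# Squish white space only in patterns that have at least `min_chars`
# characters once trimmed, so that the padding around shorter ones survives
squish_long_only <- function(patterns, min_chars = 3) {
  is_short <- nchar(str_trim(patterns)) < min_chars
  ifelse(is_short, patterns, str_squish(patterns))
}

fixed_patterns <- squish_long_only(c(" o ", " cm ", "derrame  pleural "))
print(fixed_patterns)
```
]]></preformat>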
        <p>We provide the source code to replicate these results;<sup>3</sup> in particular, with this simple
modification we could recover most of the precision initially lost. For example, for the category
“Abbreviation”, precision increases from around 7% up to 40%; for “Anatomical Entity”, from
25-30% up to 50%; for “Negation”, from 10% up to 60%.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Final remarks and Future Work</title>
      <p>The aim of our participation in the CLEF 2021 eHealth Task 1 was to test the effectiveness of
a simple textual pipeline implemented in R with the ‘tidyverse’ approach for the problem of
classification of Spanish clinical textual data. A preliminary failure analysis showed an anomaly
in the values of precision, which were too low compared to the expected effectiveness. After a careful
analysis, we found a mistake in the source code and, after we fixed the error, the performance
increased significantly. In order to make this study reproducible, we will make the source code
available. Additional analyses will be carried out to find patterns that can be easily used as a
kind of knowledge base to support more advanced systems. We will provide a finer analysis on
the test set when the ground truth is made available.</p>
      <p>September 9-12, 2019, volume 2380 of CEUR Workshop Proceedings, CEUR-WS.org, 2019.
URL: http://ceur-ws.org/Vol-2380/paper_104.pdf.</p>
      <p>[5] G. Di Nunzio, Classification of ICD10 codes with no resources but reproducible code.
IMS unipd at CLEF ehealth task 1, in: Working Notes of CLEF 2018 - Conference and
Labs of the Evaluation Forum, Avignon, France, September 10-14, 2018. URL:
http://ceur-ws.org/Vol-2125/paper_180.pdf.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Suominen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Alemany</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bassani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Brew-Sam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Cotik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Filippo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>González-Sáez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Luque</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulhem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Roller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Seneviratne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Upadhyay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vivaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Viviani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF eHealth evaluation lab 2021</article-title>
          ,
          <source>in: CLEF 2021 - 12th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS)</source>
          , Springer,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Cotik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Alemany</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Luque</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Roller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Vivaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ayach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Carranza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. D.</given-names>
            <surname>Francesca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dellanzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Urquiza</surname>
          </string-name>
          ,
          <article-title>Overview of CLEF eHealth task 1 - SpRadIE: A challenge on information extraction from Spanish radiology reports</article-title>
          , in:
          <source>CLEF 2021 Evaluation Labs and Workshop: Online Working Notes, CEUR Workshop Proceedings</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <article-title>As simple as possible: Using the R tidyverse for multilingual information extraction. IMS unipd ad CLEF ehealth 2020 task 1</article-title>
          , in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.),
          <source>Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum</source>
          , Thessaloniki, Greece, September 22-25, 2020, volume
          <volume>2696</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2020</year>
          . URL: http://ceur-ws.org/Vol-2696/paper_137.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <article-title>Classification of animal experiments: A reproducible study. IMS unipd at CLEF ehealth task 1</article-title>
          , in: L. Cappellato, N. Ferro, D. E. Losada, H. Müller (Eds.),
          <source>Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum</source>
          , Lugano, Switzerland,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>