<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the CLEF eHealth 2019 Multilingual Information Extraction Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antje Dorendahl</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nora Leich</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benedikt Hummel</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gilbert Schönfelder</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Barbara Grune</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Charité - Universitätsmedizin Berlin, Institute of Clinical Pharmacology and Toxicology</institution>
          ,
          <addr-line>Charitéplatz 1, 10117 Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR)</institution>
          ,
          <addr-line>Diedersdorfer Weg 1, 12277, Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Non-technical summaries (NTSs) of animal experiments can be valuable resources for fostering more transparency in research conducted with animals and for better informing the community about this topic. The NTSs of planned animal experiments in Germany are publicly available and have been manually annotated with ICD-10 codes. We used these data when organizing the Multilingual Information Extraction Task (Task 1) of the CLEF eHealth challenge. For the development phase, we released a training dataset containing more than 8,000 NTSs and their corresponding codes (if any were assigned). For the test phase, we released 407 unseen NTSs for which the participants were to submit the predictions made by their systems. The best performing system obtained a precision (P), recall (R), and f-measure (FM) of 0.83, 0.77, and 0.80, respectively.</p>
      </abstract>
      <kwd-group>
        <kwd>Document indexing</kwd>
        <kwd>ICD-10 codes</kwd>
        <kwd>summaries of animal experiments</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Non-technical summaries (NTSs) are short descriptions of the planned animal
experiments to be carried out in a country and are stipulated when requesting
permission for an experiment. The European Union (EU) requires the member
states to collect these summaries and to make them available to the community in
order to foster more transparency in animal research [12]. The German Federal
Institute for Risk Assessment (BfR, its acronym in German) publishes the
German NTSs online in the AnimalTestInfo database.</p>
      <p>
        These NTSs are regularly manually annotated with ICD-10 codes to
identify the diseases that are the focus of the planned experiments.
Indexing the NTSs using terms from standard terminologies provides additional
information on the research goals of the animal experiments and supports a
detailed analysis of the data [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>We utilized our annotated NTSs in the scope of a shared task in the CLEF
eHealth challenge. Our shared task aimed to evaluate systems for the automatic
detection of ICD-10 codes in German NTSs (Task 1: https://clefehealth.imag.fr/?page_id=26).
For this purpose, we utilized the manually annotated data for building training,
development, and test datasets, which were to be used by the participants in the
shared task. Previous editions of CLEF eHealth addressed similar tasks, such as
the extraction of ICD-10 codes from death certificates in English and French [8]
and, in the following year, in French, Italian, and Hungarian [9].</p>
      <p>The remainder of the paper is structured as follows: we describe the details of the
shared task in Section 2 and the participating teams and systems in Section 3. We
present the baselines that we developed in Section 4 and the results obtained
by the participants and the baselines in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Details of the Shared Task</title>
      <p>In this section we describe the details of the challenge, including the schedule of the
event, the data that we released, and the evaluation that we carried out.</p>
      <p>Schedule. We released the training data, which is split into a training and a
development dataset, to the participants on February 1st, 2019. For three months,
the participants could utilize this data for training, tuning, and evaluating their
systems. We released the official test set on May 6th, 2019. The participants had
one week to process the test data and prepare the submission files, which had to
be uploaded to the submission system by May 13th, 2019. Each team was
allowed to submit up to three runs for their systems, i.e., different configurations
or approaches that they experimented with during the development period. Manual
(human-annotated) approaches were not allowed in the shared task.</p>
      <p>Data. Our training data consisted of a set of 8,386 manually annotated NTSs,
which was split into two datasets: 7,544 NTSs for the training dataset and 842
NTSs for the development dataset. For the test set, we released a collection of
407 unseen NTSs, i.e., NTSs that were not included in the training data. Each NTS
is divided into sections, namely, title, objectives, benefits, harms, replacement,
reduction, and refinement.</p>
      <p>Evaluation. We evaluated the predictions returned by the participating systems
based on an automatic and a manual approach. We automatically evaluated the
submissions based on the standard metrics of precision (P), recall (R), and
f-measure (FM), utilizing the Python script that we released to the
participants during the shared task (https://github.com/mariananeves/clef19ehealth-task1).
For the manual validation, one of our annotators manually checked a total of
100 NTSs originating from false positives (FPs) and false negatives (FNs)
returned by the best runs. We randomly selected 25 FPs and 25 FNs from the best
run of each of the two best-scoring teams, thus a total of 100 NTSs. During the
manual validation, our expert checked whether the wrong predictions (FP or FN)
were indeed false.</p>
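The automatic evaluation described above can be sketched as a micro-averaged computation over the gold and predicted code sets. The snippet below is only an illustrative reimplementation under that assumption, not the released evaluation script, and `micro_prf` is our own helper name.

```python
def micro_prf(gold, pred):
    """Micro-averaged precision, recall, and f-measure.
    gold/pred: dicts mapping a document id to a set of ICD-10 codes."""
    tp = fp = fn = 0
    for doc_id in gold:
        g = gold[doc_id]
        p = pred.get(doc_id, set())
        tp += len(g & p)   # codes correctly predicted
        fp += len(p - g)   # predicted but not in the gold standard
        fn += len(g - p)   # gold codes missed by the system
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fm = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, fm
```

A document with no predicted codes simply contributes its gold codes as false negatives, which is why `pred.get(doc_id, set())` defaults to the empty set.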
    </sec>
    <sec id="sec-3">
      <title>Teams and systems</title>
      <p>
        We received 14 submissions from six teams from a total of six
countries, as summarized in Table 1. We present a summary of each team and their
systems below.
      </p>
      <p>
        DEMIR [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The DEMIR team developed an approach based on two phases. In
the first phase, they utilized the Elasticsearch tool to perform k-Nearest Neighbor
(kNN) and threshold-Nearest Neighbor (tNN) retrieval. In the second phase, the codes
were selected from the top-ranked ones using two majority voting approaches based on
either the pre-defined top M codes or the similarity scores of the corresponding
NTSs. The team submitted three runs, namely, kNN based on k=5 and M=2
(run1), tNN based on T=30 and M=3 (run2), and tNN based on T=80 and an
adaptive M (run3).
      </p>
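The majority-voting step of such a two-phase approach can be sketched as below, assuming the code sets of the most similar NTSs have already been retrieved (e.g. by Elasticsearch); `knn_majority_codes` is a hypothetical helper, not the team's actual code.

```python
from collections import Counter

def knn_majority_codes(neighbor_codes, m):
    """neighbor_codes: list of ICD-10 code sets, one per retrieved
    nearest-neighbour NTS. Returns the M codes that occur most often
    among the neighbours (the pre-defined top-M voting variant)."""
    votes = Counter(code for codes in neighbor_codes for code in codes)
    return [code for code, _ in votes.most_common(m)]
```

With k=5 and M=2, for instance, the two codes most frequent among the five retrieved summaries would be assigned to the test NTS.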
      <p>
        IMS-UNIPD [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The IMS-UNIPD team experimented with three
probabilistic Naïve Bayes (NB) classifiers, following the same approach that they used
in previous editions of the Multilingual Information Extraction Task in CLEF
eHealth. All models were based on a two-dimensional representation of
probabilities. They submitted three runs based on the three NB classifiers, namely,
Bernoulli (run1), Multinomial (run2), and Poisson (run3).
      </p>
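As an illustration of the Bernoulli variant (a minimal from-scratch sketch with Laplace smoothing, not the team's actual implementation), a per-code binary classifier can be trained and applied as follows:

```python
import math

def train_bernoulli_nb(docs, labels):
    """Train a per-code Bernoulli NB classifier with Laplace smoothing.
    docs: list of token sets; labels: list of 0/1 (code absent/present).
    Assumes both classes occur in the training data."""
    vocab = sorted(set().union(*docs))
    model = {"vocab": vocab}
    for c in (0, 1):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        n = len(class_docs)
        prior = math.log(n / len(docs))
        # smoothed P(word present | class)
        probs = {w: (sum(1 for d in class_docs if w in d) + 1) / (n + 2)
                 for w in vocab}
        model[c] = (prior, probs)
    return model

def predict_bernoulli_nb(model, doc):
    """Return 1 if the code should be assigned to the tokenized NTS."""
    scores = {}
    for c in (0, 1):
        prior, probs = model[c]
        scores[c] = prior + sum(
            math.log(probs[w]) if w in doc else math.log(1.0 - probs[w])
            for w in model["vocab"])
    return max(scores, key=scores.get)
```

The Bernoulli model scores both the presence and the absence of each vocabulary word, which distinguishes it from the Multinomial variant, where only observed word counts contribute.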
      <sec id="sec-3-1">
        <p>
          MLT-DFKI [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The MLT-DFKI team tried a variety of approaches, such as
Convolutional Neural Networks (CNNs) and attention models, which are
usually used for Neural Machine Translation (NMT), among others. They obtained
the best results when relying on Bidirectional Encoder Representations from
Transformers (BERT) and, more specifically, on BioBERT, which was trained
on biomedical documents [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Since this approach is available for the
English language, the team first had to automatically translate the NTSs using
the Google Translate API. The team submitted only one run.
        </p>
        <p>
          SSN-NLP [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The SSN-NLP team developed a multi-layer Recurrent Neural
Network (RNN) with Long Short-Term Memory (LSTM) as the recurrent unit.
They experimented with two attention mechanisms, namely Normed Bahdanau
(NB) and Scaled Luong (SL), and with requiring a minimum number of
occurrences of a code as generated by the model. They submitted three runs,
namely, NB attention and a minimum of two occurrences (run1), SL attention and
a minimum of two occurrences (run2), and SL attention, a minimum of two
occurrences, and all codes if no code is repeated more than once (run3).
        </p>
        <p>TALP-UPC. The TALP-UPC team developed a simple semi-supervised
system based on Machine Translation and Named Entity Recognition (NER). In a
first step, the "Benefits" section was translated into English using the Amazon
Translate API (https://aws.amazon.com/translate/). For NER, they used MetaMap
(https://metamap.nlm.nih.gov/, online batch submission system) and considered
only the ICD-10 vocabulary source. After the identification of the entities
(codes), their parents in the ICD-10 hierarchy were also added to the
prediction list.</p>
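The parent-expansion step can be sketched as a walk up a code-to-parent map. The map below is only an illustrative fragment of the ICD-10 chapter/group hierarchy, and `expand_with_parents` is our own helper name, not the team's code.

```python
# Illustrative fragment of the ICD-10 hierarchy: group -> parent group/chapter.
ICD10_PARENT = {
    "C00-C14": "C00-C97",  # lip/oral cavity/pharynx -> malignant neoplasms
    "C00-C97": "II",       # malignant neoplasms -> chapter II (neoplasms)
    "D10-D36": "II",       # benign neoplasms -> chapter II
}

def expand_with_parents(codes, parent_map):
    """Add every ancestor of each predicted code to the prediction list."""
    expanded = set(codes)
    for code in codes:
        cur = code
        while cur in parent_map:
            cur = parent_map[cur]
            expanded.add(cur)
    return expanded
```

A single entity-level prediction such as C00-C14 would thus also yield its group C00-C97 and chapter II.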
        <p>
          WBI [11]. The WBI team utilized a multilingual BERT text encoding model
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and, as additional training data, German clinical trials
(from https://www.drks.de/drks_web/) also annotated with
ICD-10 codes. They also experimented with training various instances of the
models and ensembling the predictions based on their average or on a logistic
regression classifier. The team submitted three runs, namely, BERT multi-label
(run1), an ensemble based on the average (run2), and an ensemble based on logistic
regression (run3).
        </p>
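The averaging ensemble can be sketched as follows, assuming each trained model instance outputs a per-code probability; the helper name and the 0.5 decision threshold are our own assumptions, not details reported by the team.

```python
def ensemble_average(prob_runs, threshold=0.5):
    """prob_runs: list of dicts {code: probability}, one per trained
    model instance. Averages the per-code probabilities over all
    instances and keeps the codes whose average reaches the threshold."""
    codes = set().union(*prob_runs)
    n = len(prob_runs)
    avg = {c: sum(run.get(c, 0.0) for run in prob_runs) / n for c in codes}
    return {c for c, p in avg.items() if p >= threshold}
```

The logistic-regression variant would instead feed the per-instance probabilities as features into a second-level classifier rather than averaging them.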
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Baseline Approaches</title>
      <p>We developed some baseline systems to compare the results from the
participants to a simple text classification approach. The automatic classification of
NTSs according to ICD-10 codes is a multi-class and multi-label
problem. It is multi-class because the ICD-10-GM-2016 ontology contains a
total of 270 codes (up to level 4) that could potentially be assigned to an NTS, and it is
multi-label because more than one code can be assigned to each NTS.</p>
      <sec id="sec-4-3">
        <p>We considered only supervised learning approaches based on our training data,
i.e., codes that do not appear in the training data cannot be identified by our
baseline approaches. Given that it is a multi-label problem, during the training
phase, one classifier is trained on the training dataset for each of the 270 codes
(if training data is available). During the test phase, for the development and
test datasets, each of the above classifiers is used to decide on the assignment
of the corresponding code to each NTS. All documents were pre-processed using
the standard tokenization and TF-IDF functionality available in the Python
scikit-learn library (https://scikit-learn.org/stable/). We considered two types
of experiments, one using all sections of the summaries and one using only
the title and the benefits sections.</p>
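The TF-IDF weighting can be sketched from scratch as below. This mirrors the scikit-learn `TfidfVectorizer` step in spirit only: the raw `tf * log(N/df)` formula shown here omits the smoothing and length normalization that the library applies by default.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists (tokenized NTS sections).
    Returns one dict of tf-idf weights per document."""
    n = len(docs)
    df = Counter()                 # document frequency per token
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)          # raw term frequency
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors
```

A token that occurs in every document gets weight zero, which is why uninformative boilerplate words contribute nothing to the classifiers.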
        <p>We followed the approaches based on Support Vector Machines (SVMs) that
were previously utilized for the MIMIC II dataset [10]. The authors proposed
flat and hierarchical SVMs, in which the hierarchical structure of the ICD-10
terminology is considered in the latter. Both SVM algorithms were based on
the SVM implementation available in the Python scikit-learn library, and the
differences between the two approaches are described below.</p>
        <p>Flat. This approach does not make use of the hierarchical structure of the
terminology, neither when building the classifiers nor when classifying the NTSs
from the test set. For the flat SVM approach, we built one classifier for each
code based on the totality of the summaries in the training dataset, i.e., for each
code, the positive training examples were the NTSs that contained the particular
code, while the negative examples were all NTSs that did not contain the code.
Therefore, the classifiers were trained on very unbalanced data for those codes
that occur only seldom in our training data.</p>
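The one-vs-rest split described above can be sketched as follows; `one_vs_rest_datasets` is an illustrative helper, not the released baseline code.

```python
def one_vs_rest_datasets(ntss, annotations, codes):
    """Build the flat training set for each code: positives are the NTSs
    annotated with the code, negatives are all remaining NTSs.
    ntss: list of documents; annotations: list of code sets, aligned."""
    datasets = {}
    for code in codes:
        labels = [1 if code in anno else 0 for anno in annotations]
        datasets[code] = (ntss, labels)
    return datasets
```

For a rare code, almost all labels come out 0, which is exactly the class imbalance noted above.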
        <p>Hierarchical. In this approach, we consider the four levels of the hierarchy of
the ICD-10 ontology, as considered in our manual annotations of the NTSs. The
classifiers related to codes on level 1 were trained on the whole training data,
in which the positive examples were the ones that contained the particular code
and the negative examples were the ones that did not contain the code. Therefore,
the classifiers for level 1 are no different from the ones built in the flat approach
for these same codes. For the next levels, the classifier for a particular code was
only trained on the NTSs which belonged to the corresponding parent code. For
instance, the classifier for code C00-C97 (level 2) was trained on all NTSs that
were assigned to chapter II. The positive examples were the ones assigned to code
C00-C97, while the negative examples were all the others assigned to chapter
II but not to C00-C97, for instance, those that belong to the other codes in this
chapter, such as D00-D09, D10-D36, or D37-D48. Therefore, each classifier has
a different number of training examples, but a more balanced one with regard
to the proportion of positive and negative examples, in comparison to the flat
approach.</p>
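The restriction to the parent's NTSs can be sketched as below, using the C00-C97 / chapter II example from the text; `hierarchical_training_set` is an illustrative helper, not the released baseline code.

```python
def hierarchical_training_set(ntss, annotations, code, parent):
    """Training set for `code` in the hierarchical approach: only NTSs
    annotated with `parent` are kept; among those, positives contain
    `code` and negatives are the parent's remaining NTSs."""
    docs, labels = [], []
    for nts, codes in zip(ntss, annotations):
        if parent in codes:                  # restrict to the parent's NTSs
            docs.append(nts)
            labels.append(1 if code in codes else 0)
    return docs, labels
```

Compared to the flat split, the negatives here are only the sibling codes' NTSs, which yields the more balanced class proportions mentioned above.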
      </sec>
      <sec id="sec-4-4">
        <title>Results</title>
        <p>In this section we present the results obtained by the runs submitted by the
participating teams and by our baseline systems.
As described in Section 2, one expert manually validated a random sample of
100 FPs and FNs from the best runs of the two best-scoring teams, namely,
run1 from WBI and the only run submitted by team MLT-DFKI. The FNs and
FPs were automatically detected by our evaluation script (cf. Section 2) with
regard to our gold standard test set. Below we discuss the errors that we found
in our gold standard.</p>
        <p>FNs. Among the 25 NTSs from run1 of the WBI team, our expert found seven
NTSs in which a total of 14 FN codes were wrong (cf. Table 3). These were
not codes missed by the run, but rather codes that had been mistakenly assigned to
the NTSs in our gold standard. The same seven NTSs also contained 12 wrong
FN codes detected for the run from team MLT-DFKI. Curiously, even though
we randomly selected the FN codes, both runs had practically the same FN
codes from the same seven NTSs.</p>
        <p>FPs. Among the 25 NTSs from run1 of the WBI team, our expert found 12 NTSs
in which a total of 22 FP codes were wrong (cf. Table 4). These were codes
that the expert judged as correct but that were not originally included in our
gold standard. For the run from team MLT-DFKI, our expert judged as correct
predictions just nine codes from four NTSs, out of the total of 25 NTSs that were
manually evaluated.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>We presented the first corpus of non-technical summaries (NTSs) of animal
experiments for the German language. We annotated the NTSs with ICD-10
codes and utilized the data in the scope of a shared task in the CLEF eHealth
challenge. Runs from two of the participants obtained an f-measure above 0.80
and outperformed our baseline systems. The results obtained by the
participants show that automating this task is indeed feasible, for instance,
for the development of a semi-automatic system to support the experts in the
manual annotation of the NTSs.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgment</title>
      <p>We would like to thank all participants for their interest in our task, and
Felipe Soares for providing a description of his team's system. We would like to
acknowledge the Australian National University for supporting the submission
Web site in EasyChair.</p>
      <p>8. Névéol, A., Anderson, R.N., Cohen, K.B., Grouin, C., Lavergne, T., Rey, G.,
Robert, A., Zweigenbaum, P.: CLEF eHealth 2017 multilingual information
extraction task overview: ICD10 coding of death certificates in English and French.
In: Proc of CLEF eHealth Evaluation lab. Dublin, Ireland (September 2017)
9. Névéol, A., Robert, A., Grippo, F., Morgand, C., Orsi, C., Pelikan, L., Ramadier,
L., Rey, G., Zweigenbaum, P.: CLEF eHealth 2018 Multilingual Information
Extraction Task Overview: ICD10 Coding of Death Certificates in French, Hungarian
and Italian. In: Proc of CLEF eHealth Evaluation lab. Avignon, France (September 2018)
10. Perotte, A., Pivovarov, R., Natarajan, K., Weiskopf, N., Wood, F.,
Elhadad, N.: Diagnosis code assignment: models and evaluation metrics.
Journal of the American Medical Informatics Association 21(2), 231-237
(2014). https://doi.org/10.1136/amiajnl-2013-002159
11. Sänger, M., Weber, L., Kittner, M., Leser, U.: Classifying German Animal
Experiment Summaries with Multi-lingual BERT at CLEF eHealth 2019 Task 1. In:
CLEF (Working Notes) (2019)
12. Taylor, K., Rego, L., Weber, T.: Recommendations to improve the EU
non-technical summaries of animal experiments. ALTEX - Alternatives to animal
experimentation 35(2), 193-210 (Apr 2018). https://doi.org/10.14573/altex.1708111,
https://www.altex.org/index.php/altex/article/view/90</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><surname>Ahmed</surname>, <given-names>N.</given-names></string-name>,
          <string-name><surname>Arıbas</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Alpkocak</surname>, <given-names>A.</given-names></string-name>:
          <article-title>DEMIR at CLEF eHealth 2019: Information Retrieval based Classification of Animal Experiment Summaries</article-title>.
          <source>In: CLEF (Working Notes)</source> (<year>2019</year>)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><surname>Amin</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Neumann</surname>, <given-names>G.</given-names></string-name>,
          <string-name><surname>Dunfield</surname>, <given-names>K.</given-names></string-name>,
          <string-name><surname>Vechkaeva</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Chapman</surname>, <given-names>K.</given-names></string-name>,
          <string-name><surname>Wixted</surname>, <given-names>M.</given-names></string-name>:
          <article-title>MLT-DFKI at CLEF eHealth 2019: Multi-label Classification of ICD-10 Codes with BERT</article-title>.
          <source>In: CLEF (Working Notes)</source> (<year>2019</year>)
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><surname>Bert</surname>, <given-names>B.</given-names></string-name>,
          <string-name><surname>Dorendahl</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Leich</surname>, <given-names>N.</given-names></string-name>,
          <string-name><surname>Vietze</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Steinfath</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Chmielewska</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Hensel</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Grune</surname>, <given-names>B.</given-names></string-name>,
          <string-name><surname>Schönfelder</surname>, <given-names>G.</given-names></string-name>:
          <article-title>Rethinking 3R strategies: Digging deeper into AnimalTestInfo promotes transparency in in vivo biomedical research</article-title>.
          <source>PLOS Biology</source>
          <volume>15</volume>(<issue>12</issue>),
          <fpage>1</fpage>-<lpage>20</lpage>
          (<year>2017</year>). https://doi.org/10.1371/journal.pbio.2003217
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><surname>Devlin</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Chang</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Lee</surname>, <given-names>K.</given-names></string-name>,
          <string-name><surname>Toutanova</surname>, <given-names>K.</given-names></string-name>:
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>.
          CoRR abs/1810.04805 (<year>2018</year>), http://arxiv.org/abs/1810.04805
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><surname>Di Nunzio</surname>, <given-names>G.M.</given-names></string-name>:
          <article-title>Classification of Animal Experiments: A Reproducible Study. IMS Unipd at CLEF eHealth Task 1</article-title>.
          <source>In: CLEF (Working Notes)</source> (<year>2019</year>)
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kayalvizhi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thenmozhi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aravindan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Deep Learning Approach for Semantic Indexing of Animal Experiments Summaries in German Language</article-title>
          .
          <source>In: CLEF (Working Notes)</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name><surname>Lee</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Yoon</surname>, <given-names>W.</given-names></string-name>,
          <string-name><surname>Kim</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Kim</surname>, <given-names>D.</given-names></string-name>,
          <string-name><surname>Kim</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>So</surname>, <given-names>C.H.</given-names></string-name>,
          <string-name><surname>Kang</surname>, <given-names>J.</given-names></string-name>:
          <article-title>BioBERT: a pre-trained biomedical language representation model for biomedical text mining</article-title>.
          CoRR abs/1901.08746 (<year>2019</year>), http://arxiv.org/abs/1901.08746
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>