<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Classifying German Animal Experiment Summaries with Multi-lingual BERT at CLEF eHealth 2019 Task 1</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mario Sanger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leon Weber</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Madeleine Kittner</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ulf Leser</string-name>
          <email>leserg@informatik.hu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Humboldt-Universität zu Berlin</institution>
          ,
          <addr-line>Knowledge management in Bioinformatics, Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we present our contribution to the CLEF eHealth 2019 challenge, Task 1. The task involves the automatic annotation of German non-technical summaries of animal experiments with ICD-10 codes. We approach the task as a multi-label classification problem and leverage the multi-lingual version of the BERT text encoding model [6] to represent the summaries. The model is extended by a single output layer to produce probabilities for individual ICD-10 codes. In addition, we make use of extra training data from the German Clinical Trials Register and ensemble several model instances to improve the overall performance of our approach. We compare our model with five baseline systems, including a dictionary matching approach and single-label SVM and BERT classification models. Experiments on the development set highlight the advantage of our approach over the baselines, with an improvement of 3.6%. Our model achieves the overall best performance in the challenge, reaching an F1 score of 0.80 in the final evaluation.</p>
      </abstract>
      <kwd-group>
        <kwd>ICD-10 Classification</kwd>
        <kwd>German Animal Experiments</kwd>
        <kwd>Multi-label Classification</kwd>
        <kwd>Multi-lingual BERT Encodings</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Biomedical natural language processing (NLP) aims to support biomedical
researchers, health professionals in their daily clinical routine, as well as patients
and the public searching for disease-related information. A large part of
biomedical NLP focuses on the extraction of biomedical concepts from scientific publications
or the classification of such documents with biomedical concepts. In the past,
biomedical NLP has advanced strongly for biomedical and clinical documents in English
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Non-English biomedical NLP lags behind, since the availability of annotated
corpora and other resources (e.g. dictionaries and ontologies for biomedical
concepts) in non-English languages is limited.
      </p>
      <p>
        Since 2015, the CLEF eHealth community has addressed this issue by organising
shared tasks on non-English or multilingual information extraction. Since 2016, the
CLEF eHealth shared tasks [13-15] have included the classification of
clinical documents according to the International Classification of Diseases and
Related Health Problems (ICD-10) [17]. More precisely, the task has been the
assignment of ICD-10 codes to death certificates in French, English,
Hungarian and Italian. Among the best performing teams in 2018, the task has been
treated as a multi-label classification problem or as sequence-to-sequence
prediction leveraging neural networks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Other well-performing systems were based
on a supervised learning system using multi-layer perceptrons and a
One-vs-Rest (OVR) strategy supplemented with IR methods [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], or an ensemble model
for ICD-10 code prediction utilising word embeddings created on the training
data as well as on language-specific Wikipedia articles [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        In 2019, CLEF eHealth Evaluation Task 1 focuses on the assignment
of ICD-10 codes to health-related, non-technical summaries of animal
experiments in German [
        <xref ref-type="bibr" rid="ref10 ref16">10, 16</xref>
        ]. According to the laws of the European Union, each
member state has to publish a comprehensible, non-technical summary (NTS) of
each authorised research project involving laboratory animals, to provide greater
transparency and increase the protection of animal welfare. In Germany, the
web-based database AnimalTestInfo1 houses and publishes planned animal studies
to inform researchers and the public. To improve analysis of the database,
summaries submitted in 2014 and 2015 (roughly 5,300) were labelled by human
experts according to the German version of the ICD-10 classification system2
in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Based on this pilot study, further documents added to the database have
been labelled and used to conduct this year's CLEF eHealth challenge. The task
is to explore the automatic assignment of ICD-10 codes to the animal
experiments, i.e. given a non-technical summary, to predict the ICD-10 codes
investigated in the study.
      </p>
      <p>
        We treat the task as a multi-label classification problem and apply the
multi-lingual BERT model [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which recently achieved state-of-the-art results in eleven
different NLP tasks [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The model is extended by a single output layer to
produce probabilities for individual ICD-10 codes. Since training data in this
task is sparse, we also use summaries of clinical trials conducted in Germany
published by the German Clinical Trials Register (GCTR). We compare our
model with five baseline systems, including a dictionary matching approach and
single-label SVM and BERT classification models. The implementation of our
models is available as open-source software on GitHub3.
      </p>
    </sec>
    <sec id="sec-2">
      <title>1 https://www.animaltestinfo.de/</title>
    </sec>
    <sec id="sec-3">
      <title>2 https://www.dimdi.de/static/de/klassifikationen/icd/icd-10-gm/</title>
      <p>kode-suche/htmlgm2016/</p>
    </sec>
    <sec id="sec-4">
      <title>3 https://github.com/mariosaenger/wbi-clef19x</title>
      <sec id="sec-4-1">
        <title>Method</title>
        <p>Here we describe the corpora, terminologies and classification models we
use in the task.</p>
        <sec id="sec-4-1-1">
          <title>Corpora and Terminologies</title>
          <p>The lab organisers provided a corpus of 8,385 German non-technical summaries
of animal experiments (NTS) originating from the AnimalTestInfo database.
For each experiment a short title is given, followed by a description of the expected
benefits as well as the strains and harms to the animals. Furthermore, strategies
to prevent unnecessary harm to the animals and to improve animal welfare are
described. Each summary was labelled by experts using the German version of
the ICD-10 classification system. Depending on the level of detail of the summary,
different levels (e.g. chapter, group) of the ICD-10 ontology are used to annotate
the experiment. About two-thirds of the experiments are labelled with exactly
one disease and 10% with multiple diseases; the remainder have no annotated
disease. For each disease the complete path in the ICD-10 ontology, i.e. up to
two parent groups and the chapter of the annotated disease, is given. About
two-thirds of the summaries are annotated with 2-level paths (e.g. I | B50-B64), 20%
with 3- or 4-level paths (e.g. IV | E70-E90 | E10-E14 or II | C00-C97 | C00-C75 |
C15-C26), and less than 1% of the summaries are annotated with chapters only
(e.g. VI). The data set is divided into a stratified train and development split
(7,543 / 842) at document level. For the final evaluation, a hold-out set of 407
experiments is used by the organisers.</p>
          <p>In addition to the provided data set, we use information from the German
Clinical Trials Register (GCTR)4. The GCTR provides access to basic
information (e.g. trial title, short description, studied health condition, inclusion and
exclusion criteria) on clinical trials conducted in Germany and is also annotated
with ICD-10 codes. We downloaded all trials available through the GCTR
website. For each trial we make use of the title as well as the scientific and lay
language summary. We use the chapter and all (sub-)groups up to the third
level of the ontology of the given ICD-10 codes describing the studied health
condition as labels for the trial, similar to the ICD-10 coding in the NTS data set.
In this way we are able to extend the training set by 7,615 documents carrying
18,263 ICD-10 codes. The ICD-10 codes of each study in the GCTR data set relate
to the ICD-10 version valid at publication of the study. We did not adjust for
any differences (e.g. any potentially missing ICD-10 codes) to the 2016 version used
for the NTS corpus. The two data sets almost fully overlap with regard to the
considered health problems. Of the 233 distinct ICD-10 codes occurring in the
complete NTS corpus, 226 (97%) are mentioned in the GCTR too. Moreover, 27
other ICD-10 codes are introduced through the additional data set. Table 1
summarises the used corpora.</p>
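          <p>The label derivation described above (keeping the chapter and all (sub-)groups up to the third level of an annotated ICD-10 path) can be sketched as follows. This is an illustrative helper, not the authors' code; the path format follows the examples given in this paper (e.g. "II | C00-C97 | C00-C75 | C15-C26").</p>

```python
# Illustrative sketch (not the authors' code): turn an annotated ICD-10
# path string into training labels consisting of the chapter plus all
# (sub-)groups up to the third level below the chapter.

def path_to_labels(path: str, max_group_levels: int = 3) -> list[str]:
    nodes = [n.strip() for n in path.split("|")]
    # keep the chapter (first node) plus at most `max_group_levels` groups
    return nodes[: 1 + max_group_levels]
```

          <p>For a 2-level path such as "I | B50-B64", both the chapter and the group become labels for the document.</p>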
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4 https://www.drks.de/drks_web/setLocale_EN.do</title>
      <sec id="sec-5-1">
        <title>BERT for multi-label classification</title>
        <p>
          Our approach to the task is based on the BERT language model [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. BERT is a
text encoding model that recently achieved state-of-the-art results in many
different NLP tasks [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. It is a neural network based on the transformer
architecture of [19], which was pre-trained using two different language modelling tasks:
masked language modelling and next sentence prediction. Specifically, we use
the multilingual version of BERT-Base5, which has been pre-trained on Wikipedia
dumps of 104 different languages, including German.
        </p>
        <p>
          Given a sequence of tokens t1, ..., tL, BERT first subdivides the tokens
into sub-word tokens using WordPiece [21], yielding a new (usually longer) sequence
s1, ..., sN. Then, it produces vector representations e1, ..., eN ∈ R^768 for each
sub-word token and one vector c ∈ R^768 which is not tied to a specific token.
BERT supports sequence lengths of up to 512 sub-word tokens. We represent each
animal experiment by taking as many sub-word tokens as possible from the title
and the description of expected benefits and strains from the summary text as
model input. Following [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], we employ c as a representation of the whole token
sequence.
        </p>
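        <p>Filling BERT's 512 sub-word budget from the title and description can be sketched as follows. This is a hypothetical helper assuming pre-tokenised inputs; the [CLS]/[SEP] special tokens and the budget arithmetic follow BERT's conventions, but the function itself is not the authors' code.</p>

```python
# Illustrative sketch: build a BERT input sequence by taking the title
# first, then as much of the description as fits into the 512-token limit.
# `title_tokens` / `desc_tokens` stand in for WordPiece sub-word tokens.

def build_input(title_tokens: list[str], desc_tokens: list[str],
                max_len: int = 512) -> list[str]:
    # reserve two positions for BERT's special [CLS] and [SEP] tokens
    budget = max_len - 2
    tokens = title_tokens[:budget]
    tokens += desc_tokens[: budget - len(tokens)]
    return ["[CLS]"] + tokens + ["[SEP]"]
```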
        <p>
          We treat the assignment of ICD-10 codes as a one-versus-rest multi-label
classification problem [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], i.e. as |Y| independent binary classification tasks, where
Y is the set of all ICD-10 codes occurring in the training set. Each example
carrying the respective label is used as a positive example, while all other
examples are used as negative examples. The only connection between the individual
classification tasks is the BERT encoder, which is shared between all tasks and
which receives parameter updates from all of them. We use a single output layer
W ∈ R^(768 x |Y|) to compute the output probabilities per class as σ(cW), where σ
is the element-wise sigmoid function, and use binary cross-entropy as the loss.
        </p>
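        <p>Numerically, the output layer and loss amount to the following. This is a NumPy sketch for illustration only; the actual model is implemented in PyTorch, but the shapes and formulas mirror the description above.</p>

```python
import numpy as np

# Sketch of the classification head: probabilities sigma(c @ W) over
# |Y| labels from the pooled BERT vector c, plus binary cross-entropy.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_probs(c: np.ndarray, W: np.ndarray) -> np.ndarray:
    """c: (768,) pooled BERT vector, W: (768, n_labels) output layer."""
    return sigmoid(c @ W)

def bce_loss(probs: np.ndarray, targets: np.ndarray) -> float:
    eps = 1e-12  # numerical stability for log(0)
    return float(-np.mean(targets * np.log(probs + eps)
                          + (1 - targets) * np.log(1 - probs + eps)))
```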
        <p>
          We implement our model in PyTorch [18] using the pytorch-pretrained-BERT 6
implementation of BERT and use the included modified version of Adam [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] for
optimization. We train our model for 60 epochs on a single Nvidia V100 GPU,
which takes about nine hours. In principle, it would also be possible to train
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5 https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_</title>
      <p>H-768_A-12.zip</p>
    </sec>
    <sec id="sec-7">
      <title>6 https://github.com/huggingface/pytorch-pretrained-BERT</title>
      <p>and evaluate the model using only CPUs, but that would take considerably more
time.</p>
      <p>
        We train multiple model instances using different random seeds and ensemble
their predictions. Ensembling multiple neural network models has been shown to
be beneficial in several NLP tasks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We ensemble the models in two ways: (1)
by averaging the predictions of the different model instances and (2) by learning
a logistic regression classifier based on the model outputs on the development
set. We denote the two ensembling variants as BERT multi-label Avg
and BERT multi-label LogReg. Note that because BERT multi-label LogReg is
trained on the development set, the resulting scores on this data are no longer a
reliable estimate of out-of-sample performance and can only be fairly compared
to the other approaches on the development set.
      </p>
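      <p>The first ensembling variant (prediction averaging) can be sketched as follows. The decision threshold of 0.5 is an assumption for illustration, not a value stated in the paper.</p>

```python
import numpy as np

# Illustrative sketch of the averaging ensemble: each row of `runs`
# holds one model instance's predicted per-label probabilities for the
# same document; averaged probabilities are then binarised.

def average_ensemble(runs: list[np.ndarray], threshold: float = 0.5) -> np.ndarray:
    mean_probs = np.mean(np.stack(runs), axis=0)
    return (mean_probs >= threshold).astype(int)
```

      <p>The second variant would instead feed the stacked per-instance probabilities into a logistic regression classifier fitted on the development set.</p>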
      <sec id="sec-7-1">
        <title>Baselines</title>
        <p>To gain better insights into the performance level of our approach, we
compare it with five different baseline methods. First, we implement a
dictionary-matching approach. For this we take the concept descriptions of all codes listed
in the ICD-10 ontology as well as all given synonyms and search for
occurrences of these terms in the title and goals (lines 1 and 2) of an animal trial
summary. Dictionary matching is performed by indexing all ICD-10 concepts
using Apache Solr 7.5.07 and applying exact and fuzzy matching. Each ICD-10
concept is linked to its related path up to the chapter level, which is used for
annotation. All concepts matched by the dictionary are reported as results. We
do not perform any further post-processing, such as filtering out overlapping ICD-10
paths. For the other baselines, we transform the task into (1) a group-level or
(2) a sub-group-level classification problem, i.e. for a given trial summary we use
as gold standard the label on the second level of the ICD-10 hierarchy
(e.g. for I | C00-C97 | C00-C75 we use C00-C97) or the deepest label
(e.g. for I | C00-C97 | C00-C75 we use C00-C75), respectively. In both cases, for instances with multiple
codes originating from different branches of the ICD-10 ontology we use the first
label as gold standard. Moreover, we add a special no-class label to support
documents without any annotated ICD-10 code.</p>
        <p>
          We investigate two different classification methods for these tasks: Support
Vector Machines (SVM) and the BERT sequence classification model [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. For the former,
we build TF-IDF vectors as input representation of the trial summaries. For the
latter, the model architecture is equivalent to our multi-label model, except that
the final linear layer computes a softmax over the classes of the classification
task and hence a (single-class) cross-entropy loss is applied for training.
        </p>
        <p>For both classification baselines, we augment the predictions of the
models according to the ICD-10 hierarchy, e.g. if a group-level model predicts
C00-C97, we automatically add the parent chapter (in this case I) to the prediction.</p>
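        <p>This hierarchy completion can be sketched with a toy child-to-parent mapping. The `parent` table below is illustrative and covers only one branch, not the full ICD-10 ontology.</p>

```python
# Illustrative sketch: complete a predicted ICD-10 label with all its
# ancestors up to the chapter. `parent` is a toy child -> parent mapping.

parent = {"C00-C75": "C00-C97", "C00-C97": "I"}

def augment_with_ancestors(label: str) -> list[str]:
    path = [label]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return list(reversed(path))  # chapter first, as in "I | C00-C97 | ..."
```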
      </sec>
    </sec>
    <sec id="sec-8">
      <title>7 https://lucene.apache.org/solr/</title>
      <sec id="sec-8-1">
        <title>Results &amp; Discussion</title>
        <sec id="sec-8-1-1">
          <title>Experimental setup</title>
          <p>
            We use the training split of the provided corpus as well as the documents from
the GCTR data set to train our multi-label model as well as all baseline models. For
the BERT multi-label and the SVM classification models we perform
hyperparameter optimisation and select the best model of each approach based on its
development set performance. Regarding the SVM models, we follow [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] and test
{2^-5, 2^-3, ..., 2^15} as values for the C parameter. The best scores are reached
with C = 2 and C = 0.5 for group-level and sub-group-level classification, respectively. For
our BERT multi-label approach we only tune the learning rate. We
evaluate the values {5e-5, 4e-5, 3e-5, 2e-5, 1e-5} and find that 4e-5
achieves the highest scores. We omit hyperparameter tuning for the BERT
classification baselines due to time constraints and therefore use the default parameter
settings of the model, i.e. a learning rate of 5e-5.
          </p>
          <p>As described in Section 2.2, we train eight model instances of our approach
using different random seeds and ensemble them. The two ensemble variants are
built (a) by averaging the two best model instances and (b) by learning a logistic
regression classifier based on the output of the three models with the highest
scores. The latter is trained on the output of the individual model instances on
the development set. We opted for these settings based on preliminary experiments
on the training and development set.</p>
          <p>To gain insights into the effectiveness of the additional data, we evaluate
each model (except for the ensemble models) in two data configuration settings:
with and without the additional texts from the GCTR data set (see Section 2.1).
We use the provided evaluation script and report precision, recall and F1 scores
as evaluation metrics.</p>
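          <p>For illustration, micro-averaged precision, recall and F1 over per-document sets of gold and predicted ICD-10 codes can be computed as follows. This is a sketch; the challenge's official evaluation script may differ in its details.</p>

```python
# Illustrative micro-averaged precision/recall/F1 over per-document
# gold and predicted ICD-10 code sets.

def micro_prf(gold: list[set], pred: list[set]) -> tuple[float, float, float]:
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```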
        </sec>
        <sec id="sec-8-1-2">
          <title>Development results</title>
          <p>This highlights the effectiveness and suitability of the BERT model for this task,
since SVMs generally offer competitive performance for document classification
problems [20].</p>
          <p>The dictionary matching cannot compete with the machine learning based
solutions. Even though matching the concept terms against the trial summaries
yields the highest recall (0.894) of all evaluated approaches, the precision of
the approach is very low (0.416) due to many false positives. In particular, the
approach often predicts incorrect chapter annotations, for instance the chapter
XXI 681 times. This is due to the broad and general topics of the chapters
and their descriptions, e.g. XXI is about "Factors influencing health
status and contact with health services".</p>
          <p>Comparing the configurations with and without the GCTR documents, it can
be seen that performance increases (at least slightly) for all considered
models. Improvements range from 0.5% (SVM sub-group) to 1.9% (BERT group) for
the baseline systems with respect to their variants without the additional data.
The multi-label model benefits even more strongly from the extended
training set (+3.2%).</p>
          <p>The overall best performance is achieved by ensembling the best BERT
multi-label models. In both ensembling variants the model reaches an F1 score of 0.815.
This represents an increase of 0.6% over the single model.
We further analysed the predictions made by the different approaches. Figure 1
(left) compares the true positives of our BERT multi-label model as well as the
SVM and BERT sub-group baselines (all with the GCTR corpus as additional
training data). We exclude the dictionary matching baseline from this
investigation, since the approach predicts too optimistically and thereby distorts the
picture.</p>
          <p>First of all, it can be noted that in total 1,422 of the 1,682 gold standard
ICD-10 codes are identified by at least one of the three methods. This
corresponds to 84.5% of the complete development data set. The intersection of all
three methods consists of 1,001 true positives, representing 70.4% of all
correctly identified codes. Additionally, 1,240 (87.2%) labels are predicted by two
of the three methods. Furthermore, 110 true positives are
exclusively identified by our multi-label approach, constituting 7.7% of all
correctly found codes. In contrast, 98 codes (6.9%) were predicted by (at least)
one of the two classification baselines but not detected by our BERT multi-label
approach. We tried to investigate the differences between the multi-label and
the classification models but could not identify a clear (error) pattern.</p>
          <p>We also perform this investigation using the best ensembled version of our
approach (BERT multi-label LogReg). Figure 1 (right) highlights the results of
this comparison. Through the ensembling we are able to correctly identify
20 additional labels. Moreover, 38 ICD-10 codes that were previously predicted exclusively
by the classification baselines are now detected by the multi-label
approach too. However, when interpreting these figures one has to keep in mind
that the logistic regression model that ensembles the predictions of the
individual model instances is trained on the development set and hence may
present an over-optimistic picture.</p>
          <p>The overall best performance is accomplished by the single BERT
multi-label model. In this setting the model achieves an F1 score of 0.80. The model
shows a slightly better precision (0.83) than recall (0.77). Comparing the model
with both ensembling variants, it can be seen that all models perform almost
on par and merely exhibit slightly different precision-recall trade-offs. The
Avg-ensemble of the best models (run2) predicts more conservatively, reaching the
highest precision (0.84) of all evaluated models, but offers lower recall scores.
In contrast, the LogReg-ensemble provides well-balanced precision and recall
scores. Moreover, it has to be noted that the final evaluation scores are virtually
the same as the development scores. However, no positive effects can be observed
from ensembling multiple models (at least in the considered way).</p>
          <p>Comparing our method with the other submissions, it can be seen that our
model outperforms the other teams' approaches by a large margin. The second
best team (MLT-DFKI) reaches a higher recall (0.86) than our multi-label model
(0.77). However, their approach has a lower precision than our model
(0.64 vs. 0.83). This allows our model to achieve a 9.6% higher F1 score.</p>
        </sec>
      </sec>
      <sec id="sec-8-2">
        <title>Conclusion</title>
        <p>This paper presents our contribution to Task 1 of the CLEF eHealth 2019
challenge. The task concerns the automatic assignment of ICD-10 codes to German
non-technical summaries of animal experiments.</p>
        <p>
          We approach the task as a multi-label classification problem and leverage the
multi-lingual version of BERT [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] to represent the summaries. We extend the
model with a single output layer to predict probabilities for each ICD-10 code.
Furthermore, we utilise additional data from the German Clinical Trials
Register to build an extended training data set and thereby improve the overall
performance of the approach. Evaluation results highlight the advantage of our
proposed approach. Our model achieves the highest performance figures of all
submissions, with an F1 score of 0.80. Moreover, experiments on the development
set illustrate that the model outperforms several strong classification baselines
by a large margin.
        </p>
        <p>
          There are several research questions worth investigating following this work.
Due to the multi-lingual nature of the used BERT encoding model, it would be
interesting to evaluate our approach in a cross-lingual setup, e.g. applying the
learned model to non-German clinical documents or animal trial summaries.
For this purpose we want to use the data from the previous editions of the
CLEF eHealth challenges, i.e. Italian, English, French and Hungarian death
certificates. This is especially interesting because of the different text format of
the certificates: they are much shorter than the animal experiment summaries
and contain many abbreviations of medical terms. It is an open question how
well our trained model can be transferred to this type of text. Furthermore,
we also plan to inspect other approaches to the task, e.g. modelling the task
as a question-answering problem. Recently, versions of BERT trained on English
biomedical literature have been published [
          <xref ref-type="bibr" rid="ref12 ref3">12, 3</xref>
          ]. It would be worthwhile to
investigate whether an extension of such models to multi-lingual biomedical texts
would improve results further.
        </p>
      </sec>
      <sec id="sec-8-3">
        <title>Acknowledgments</title>
        <p>Leon Weber acknowledges the support of the Helmholtz Einstein International
Berlin Research School in Data Science (HEIBRiDS). We gratefully acknowledge
the support of NVIDIA Corporation with the donation of the Titan X Pascal
GPU used for this research.
17. World Health Organization, et al.: The ICD-10 classification of mental and behavioural
disorders: clinical descriptions and diagnostic guidelines. Geneva: World Health
Organization (1992)
18. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z.,</p>
        <p>Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch (2017)
19. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,
L., Polosukhin, I.: Attention is all you need. In: Advances in neural information
processing systems. pp. 5998-6008 (2017)
20. Wang, S., Manning, C.D.: Baselines and bigrams: Simple, good sentiment and
topic classification. In: Proceedings of the 50th annual meeting of the association
for computational linguistics: Short papers-volume 2. pp. 90-94. Association for
Computational Linguistics (2012)
21. Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun,
M., Cao, Y., Gao, Q., Macherey, K., et al.: Google's neural machine translation
system: Bridging the gap between human and machine translation. arXiv preprint
arXiv:1609.08144 (2016)</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Almagro</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montalvo</surname>
          </string-name>
          , S.,
          <string-name>
            <surname>de Ilarraza</surname>
            ,
            <given-names>A.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Mamtra-med at clef ehealth 2018: A combination of information retrieval techniques and neural networks for icd-10 coding of death certificates</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Atutxa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Casillas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ezeiza</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goenaga</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fresno</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gojenola</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martinez</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oronoz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perez-de Vinaspre</surname>
          </string-name>
          , O.:
          <article-title>Ixamed at clef ehealth 2018 task 1: Icd10 coding with a sequence-to-sequence approach</article-title>
          .
          <source>CLEF</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Beltagy</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lo</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Scibert: Pretrained contextualized embeddings for scientific text</article-title>
          . arXiv preprint arXiv:1903.10676 (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bert</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , Dorendahl,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Leich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Vietze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Steinfath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Chmielewska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Hensel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Grune</surname>
          </string-name>
          ,
          <string-name>
            <surname>B.</surname>
          </string-name>
          , Schonfelder, G.:
          <article-title>Rethinking 3R strategies: Digging deeper into AnimalTestInfo promotes transparency in in vivo biomedical research</article-title>
          .
          <source>PLoS biology</source>
          <volume>15</volume>
          (
          <issue>12</issue>
          ),
          <year>e2003217</year>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Bishop</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          :
          <source>Pattern recognition and machine learning</source>
          . Springer (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Habibi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weber</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neves</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegandt</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leser</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>Deep learning with word embeddings improves biomedical named entity recognition</article-title>
          .
          <source>Bioinformatics</source>
          <volume>33</volume>
          (
          <issue>14</issue>
          ),
          <fpage>i37</fpage>
          –
          <lpage>i48</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Hsu</surname>
            ,
            <given-names>C.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>C.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          , et al.:
          <article-title>A practical guide to support vector classification</article-title>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Jeblee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Budhkar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Milic</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinto</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pou-Prom</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vishnubhotla</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hirst</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rudzicz</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Toronto cl at clef 2018 ehealth task 1: Multi-lingual icd-10 coding using an ensemble of recurrent and convolutional neural networks</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suominen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neves</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanoulas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Azzopardi</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spijker</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zuccon</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name><surname>Scells</surname>, <given-names>H.</given-names></string-name>,
          <string-name><surname>Palotti</surname>, <given-names>J.</given-names></string-name>:
          <article-title>Overview of the CLEF eHealth evaluation lab 2019</article-title>
          . In:
          <string-name><surname>Cappellato</surname>, <given-names>L.</given-names></string-name>,
          <string-name><surname>Ferro</surname>, <given-names>N.</given-names></string-name>,
          <string-name><surname>Losada</surname>, <given-names>D.E.</given-names></string-name>,
          <string-name><surname>Müller</surname>, <given-names>H.</given-names></string-name>
          (eds.)
          <article-title>Experimental IR Meets Multilinguality, Multimodality, and Interaction</article-title>
          .
          <source>Proceedings of the Tenth International Conference of the CLEF Association (CLEF</source>
          <year>2019</year>
          ).
          <source>Lecture Notes in Computer Science</source>
          . Springer, Berlin Heidelberg, Germany (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
          </string-name>
          , J.:
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yoon</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>So</surname>
            ,
            <given-names>C.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kang</surname>
          </string-name>
          , J.:
          <article-title>Biobert: pretrained biomedical language representation model for biomedical text mining</article-title>
          .
          <source>arXiv preprint arXiv:1901.08746</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Névéol</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grouin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hamon</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lavergne</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rey</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robert</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tannier</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          , et al.:
          <article-title>Clinical information extraction at the clef ehealth evaluation lab 2016</article-title>
          . In:
          <source>Proceedings of CLEF 2016 Evaluation Labs and Workshop: Online Working Notes</source>
          . CEUR-WS (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Névéol</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robert</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anderson</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name><surname>Cohen</surname>, <given-names>K.B.</given-names></string-name>,
          <string-name><surname>Grouin</surname>, <given-names>C.</given-names></string-name>,
          <string-name><surname>Lavergne</surname>, <given-names>T.</given-names></string-name>,
          <string-name><surname>Rey</surname>, <given-names>G.</given-names></string-name>,
          <string-name><surname>Rondet</surname>, <given-names>C.</given-names></string-name>,
          <string-name><surname>Zweigenbaum</surname>, <given-names>P.</given-names></string-name>
          :
          <article-title>Clef ehealth 2017 multilingual information extraction task overview: Icd10 coding of death certificates in english and french</article-title>
          .
          <source>In: CLEF (Working Notes)</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Névéol</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robert</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grippo</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morgand</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Orsi</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pelikan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramadier</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rey</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zweigenbaum</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Clef ehealth 2018 multilingual information extraction task overview: Icd10 coding of death certificates in french, hungarian and italian</article-title>
          . In:
          <article-title>CLEF 2018 Evaluation Labs</article-title>
          and Workshop: Online Working Notes, CEUR-WS (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Neves</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Butzke</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , Dorendahl,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Leich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Hummel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            , Schonfelder, G.,
            <surname>Grune</surname>
          </string-name>
          ,
          <string-name>
            <surname>B.</surname>
          </string-name>
          :
          <article-title>Overview of the CLEF eHealth 2019 Multilingual Information Extraction</article-title>
          . In:
          <string-name><surname>Crestani</surname>, <given-names>F.</given-names></string-name>,
          <string-name><surname>Braschler</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Savoy</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Rauber</surname>, <given-names>A.</given-names></string-name>
          , et al. (eds.)
          <article-title>Experimental IR Meets Multilinguality, Multimodality, and Interaction</article-title>
          .
          <source>Proceedings of the Tenth International Conference of the CLEF Association (CLEF</source>
          <year>2019</year>
          ).
          <source>Lecture Notes in Computer Science</source>
          . Springer, Berlin Heidelberg, Germany (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>