FLE at CLEF eHealth 2020: Text Mining and Semantic Knowledge for Automated Clinical Encoding

Nuria García-Santa and Kendrick Cetina
Fujitsu Laboratories of Europe (FLE), Pozuelo de Alarcón (Madrid) 28224, Spain
{nuria.garcia, kendrick.cetina}@uk.fujitsu.com
http://www.fujitsu.com/emea/about/fle/

Abstract. In the healthcare domain, many documents are written in a narrative way, following unstructured textual formats. This is the case of discharge summaries, which are clinical texts where physicians describe the conditions of patients in natural language, making the automated processing of such texts hard and challenging. The objective of the 2020 CLEF eHealth tasks for Multilingual Information Extraction is to develop solutions that automatically annotate Spanish clinical texts with codes from the International Classification of Diseases, 10th version (ICD-10). In this paper, we present our approach, which is based on Named Entity Recognition (NER) to detect diagnoses and procedures, and on semantic linking against a Knowledge Graph to extract the ICD-10 codes. In addition, we exploit text augmentation techniques to generate synthetic input samples, and we use BERT pre-trained models and architecture to train the NER models.

Keywords: CLEF eHealth · Clinical Encoding · Text Mining · Semantic Knowledge · Named Entity Recognition (NER)

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

Automated Clinical Encoding covers multiple computer-assisted techniques to extract valuable knowledge from clinical documents written in natural language and transform such knowledge into structured information. Clinical documents usually include medical entities that correspond to diagnoses, procedures, symptoms, drugs, etc., but the use of narrative and informal language makes the automatic processing of this information challenging. Among Automated Clinical Encoding tasks, a popular one is the assisted assignment of codes from standard medical classifications, such as the International Classification of Diseases (ICD) [7], to clinical documents. Traditionally, this code assignment is done manually by healthcare professionals. Therefore, the main objective of automated approaches is to support clinicians in their daily activities by helping them save time and resources.

This year's CLEF eHealth challenge on Multilingual Information Extraction focuses on this kind of technique [6][14]. The three sub-tasks of the 2020 challenge are based on the automated assignment of diagnosis and procedure codes from the International Classification of Diseases (ICD) to Spanish clinical documents. In previous years, the challenges worked along the same line of research. For the CLEF eHealth 2017 challenge, participants provided solutions to extract ICD codes (10th version, ICD-10) from death certificates in English and French [17] and, in 2018, in French, Italian and Hungarian [18]. In the CLEF eHealth challenge of 2019, the shared task focused on the automatic detection of ICD-10 codes in German Non-Technical Summaries (NTSs), which are short descriptions of planned animal experiments [19]. In these past challenges, the best approaches were mainly based on neural network architectures. In the best approach of 2017, the authors provided sequence-to-sequence deep learning models based on Recurrent Neural Networks (RNNs) [12].
In 2018, the best solution followed a similar approach, proposing a machine learning sequence-to-sequence neural model to map input text snippets to the output ICD-10 codes [2]. And, in 2019, the best two approaches developed different neural network designs, such as CNNs and attention models [1], or logistic regression classifiers [21], but both used multilingual BERT [4].

In the literature, a wide range of approaches have been published, from semantic-based or rule-based solutions to machine learning proposals. Examples of semantic approaches are works such as Pakhomov et al. [20], where the authors presented a system that relies on a Knowledge Base obtained from manually coded data collected over 10 years, or García-Santa et al. [5], who developed a solution that automatically returns the top-k ICD codes associated with a clinical text through the exploitation of enriched Knowledge Graphs and heuristic rules. In machine learning research, Mullenbach et al. [16] proposed a method called Convolutional Attention for Multi-Label classification (CAML), which is based on a CNN and a per-label attention mechanism and includes explanations of the code assignments. Baumel et al. [3] described a Hierarchical Attention bidirectional Gated Recurrent Unit (HA-GRU) to identify the relevant sentences for each code. In this approach, the authors compared the results with an SVM-based one-vs-all model, a continuous bag-of-words (CBOW) model, and a CNN.

To address the CLEF eHealth 2020 challenge, our FLE team has developed a solution focused on Named Entity Recognition (NER) and semantic-based approaches exploiting Knowledge Graphs. Our Knowledge Graph has been enriched with the annotations coming from the training and validation sets provided by the organizers of the challenge. In addition, we extended the input datasets for the NER models by creating synthetic samples with text augmentation techniques over the train/validation sets, and we used BERT pre-trained models and architecture for the NER training.

2 Material and Methods

2.1 Problem definition

The CLEF eHealth 2020 task deals with multilingual Information Extraction (IE). Specifically, this year the challenge focuses on the automatic coding of Spanish clinical text documents to the International Classification of Diseases [7], version 10 (ICD-10)1, in its Spanish distribution (CIE-10)2. This is the first community task devoted exclusively to the automatic coding of clinical cases in Spanish [14]. The challenge is divided into three sub-tasks:

– Task 1: Automatic code assignment of Spanish texts to CIE10-CM, i.e. to specific diagnoses of the standard.
– Task 2: Automatic code assignment of Spanish texts to CIE10-PCS, i.e. to specific procedures of the standard.
– Task 3: Addition of explainability references for the two aforementioned tasks. It requires a joint automatic code assignment of diagnoses and procedures, including the positions of the key entities that justify such code assignment.

1 https://www.who.int/classifications/icd/en/
2 https://eciemaps.mscbs.gob.es/ecieMaps/browser/metabuscador.html

The output has to be a list of ICD-10 codes for each text document. In the first two sub-tasks, this list must be arranged in descending order, based on the relevance of the code to the corresponding document. In the last sub-task, order of relevance is not required, but a joint list of codes has to be presented for diagnoses and procedures, specifying the position of the related entities in the text documents. Our team has participated in the three sub-tasks through a multi-task approach.
In our proposal, the core techniques have been shared and reused to address the three sub-tasks in a unified way.

2.2 Datasets and Resources

For all the sub-tasks, a synthetic corpus of 1,000 clinical case studies has been published. The dataset was manually annotated by clinical professionals. According to the official source of the challenge, the dataset comprises 16,504 sentences and 396,988 words, with an average of 396.2 words per clinical case. This corpus is freely accessible3. There are separate directories for the train, dev and test datasets. The train set has 500 clinical cases, the dev set has 250, and the test set with gold standard annotations has 250 clinical cases. In addition, the organizers shared a background set of 2,751 clinical cases without annotations. Besides the texts of the clinical cases, the train and dev corpora include tab-separated files for each sub-task. These files contain the annotations associated with each clinical case. Figure 1 shows an excerpt of these files for task 1 (CIE10-CM) and task 3 (Explainability).

3 https://zenodo.org/record/3837305#.XtTwHjozYgx

Fig. 1: Excerpt of the training datasets. On the left, for task 1 (two columns: clinical case ID and CIE10-CM code). On the right, for task 3 (four columns: clinical case ID, entity label type, CIE-10 code, entity label and position in text).

The training dataset of task 3 (Explainability) has 9,211 annotated codes, of which 2,392 are unique. Taking into account that CIE10-CM reports 71,486 diagnoses and CIE10-PCS reports 87,170 procedures, there is a total of 158,656 codes [15]. This quantity is far larger than the number of unique annotated codes in the dataset, which means that a wide spectrum of potentially assignable codes is not covered. This makes it more difficult to provide scalable systems with supervised learning approaches. In addition, the distribution of code annotations in the dataset is very unbalanced. If this issue is not addressed, it could bias classification towards the most frequent codes. Figure 2 shows the forty most frequent codes in the dataset.

Fig. 2: Top 40 most frequent codes in the training dataset of task 3 (frequency per CIE-10 code; the most frequent codes are r52, r50.9, i10 and r69).

Aside from these core datasets of the challenge, we also tried the additional Spanish abstracts provided by the organizers4. These abstracts comprise a total of 176,294 texts and were annotated automatically [13]. After several tests, we decided to discard this resource in the final versions because it increased the noise in our annotations and affected performance negatively.

External sources such as PubMed [11] and MIMIC-III [8] have been used to support the Named Entity Recognition (NER) tasks. Annotated samples of diagnosis and procedure entities mentioned in the medical literature from PubMed (through the PubTator FTP service [22]) and in clinical notes from the MIMIC-III database are exploited to train the NER models. We translated those annotated datasets from English to Spanish. In this NER task, the pre-trained language model of Multilingual BERT5 [4] was also used.
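As a concrete illustration of these annotation files, the following minimal sketch (not part of our system; the file names and column names are assumptions based on the description of Figure 1 and may differ from the released corpus) loads the tab-separated annotations with pandas and reproduces the code-frequency analysis behind Figure 2.

```python
# Minimal sketch: loading CodiEsp-style annotation TSVs and counting code
# frequencies (cf. Figure 2). File names and column layout are assumptions.
import pandas as pd

# Task 1 annotations: clinical case ID and CIE10-CM code.
task1 = pd.read_csv("train/trainD.tsv", sep="\t", header=None,
                    names=["case_id", "code"])

# Task 3 (Explainability) annotations: case ID, label type, CIE-10 code,
# entity mention and its character position in the text.
task3 = pd.read_csv("train/trainX.tsv", sep="\t", header=None,
                    names=["case_id", "label_type", "code", "mention", "position"])

print(len(task3), "annotated codes /", task3["code"].nunique(), "unique")
print(task3["code"].value_counts().head(40))  # the 40 most frequent codes
```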
Another linguistic resource that we used is the NegEx-MES tool6, which detects negated entities in Spanish texts. This resource was exploited in several of our final versions for post-processing steps.

4 https://zenodo.org/record/3606662#.XtUVdDozYgx
5 https://github.com/google-research/bert/blob/master/multilingual.md
6 https://github.com/PlanTL-SANIDAD/NegEx-MES

Finally, we used the lists of valid Spanish ICD-10 (CIE-10) codes for diagnoses and procedures [15].

2.3 Automated Clinical Encoding Methodology

The main workflow and steps of our system are depicted in Figure 3. We followed an approach based on named entity detection from clinical texts and subsequent Knowledge Graph (KG) entity linking to CIE-10 codes. We also tested an approach based on text classification through simple Convolutional Neural Networks (CNNs). However, its performance was worse and the system was less scalable because of the imbalanced nature and low code coverage of the datasets, as explained in the previous section. For these reasons, we discarded a fully machine learning approach in our final system; instead, we developed a combination of machine learning for Named Entity Recognition (NER) and a semantic-based approach for CIE-10 entity linking through Knowledge Graph (KG) construction.

Fig. 3: Workflow of our system.

The first steps include the activities of KG Creation and Data Pre-process. For KG Creation, we take the lists of valid CIE-10 codes and build a Knowledge Graph with that information. This data contains nodes with the CIE-10 code, the description in Spanish, the description in English and hierarchical relations between the CIE-10 nodes. In this implementation we use the Neo4j framework7. In Data Pre-process, we perform Terms Population of the KG. These terms come from the annotated samples in the training and validation sets provided by the organizers. Taking the datasets annotated for task 3, we get the CIE-10 codes and their related term mentions and we create a new array attribute for each CIE-10 node in the KG, adding all the different ways to name a diagnosis or procedure (term mentions); a minimal sketch of these two steps is given below, after the list of training settings. We also perform Dataset Preparation, cleaning the clinical texts of special characters and encoding issues, and fixing several errors in the annotated samples regarding wrong labels (e.g. a diagnosis with the type label of a procedure or vice versa) and codes that belonged to other standards. In this step, we also adapt the format of the input annotated samples and clinical texts to the BIOES (also known as IOBES) format8 in order to use them later for training the NER models.

7 https://neo4j.com/
8 https://donovanong.github.io/ner/tagging-scheme-for-ner.html

In the intermediate step, we train a Named Entity Recognition (NER) model. We use the pre-trained language model of Multilingual BERT to initialize the neural network. We have performed different training runs depending on variations in the input annotated samples. Below, we list the different settings (all samples follow the BIOES format):

– Baseline: Input annotated samples from the train and dev sets provided by the organizers.
– Baseline + Abstracts: Previous samples + annotated samples from the additional Spanish abstracts resource.
– Baseline + MIMIC-III: Baseline samples + annotated samples from the MIMIC-III dataset.
– Baseline + Text Augmentation: Baseline samples + annotated samples augmented with Fujitsu's proprietary technology.

For the final versions, we used the 'Baseline + Text Augmentation' setting because it achieved the best results, as shown in the evaluation section.
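The following is a minimal sketch of the KG Creation and Terms Population steps described above. It is not our actual implementation: it assumes the official Neo4j Python driver, a local Neo4j instance, and small illustrative lists standing in for the valid-code lists [15] and the task 3 annotations.

```python
# Minimal sketch (assumptions: local Neo4j instance, official neo4j Python driver,
# illustrative in-memory data instead of the real valid-code lists and annotations).
from neo4j import GraphDatabase

cie10_entries = [  # (code, Spanish description, English description, parent code)
    ("r52", "dolor, no clasificado bajo otro concepto", "pain, unspecified", None),
]
task3_annotations = [  # (case ID, label type, CIE-10 code, mention, position), as in Figure 1
    ("caso1", "DIAGNOSTICO", "r52", "dolor abdominal", "120 135"),
]

def create_code(tx, code, desc_es, desc_en, parent):
    # One node per valid CIE-10 code with its Spanish and English descriptions.
    tx.run("MERGE (c:CIE10 {code: $code}) "
           "SET c.description_es = $es, c.description_en = $en",
           code=code, es=desc_es, en=desc_en)
    if parent:  # hierarchical relation between CIE-10 nodes
        tx.run("MATCH (c:CIE10 {code: $code}) MERGE (p:CIE10 {code: $parent}) "
               "MERGE (c)-[:CHILD_OF]->(p)", code=code, parent=parent)

def add_term_mention(tx, code, mention):
    # Terms Population: store every distinct way the train/dev annotations
    # name this diagnosis or procedure as an array attribute of the node.
    tx.run("MATCH (c:CIE10 {code: $code}) "
           "SET c.terms = [t IN coalesce(c.terms, []) WHERE t <> $m] + $m",
           code=code, m=mention.lower())

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for code, es, en, parent in cie10_entries:
        session.write_transaction(create_code, code, es, en, parent)
    for _case, _label, code, mention, _pos in task3_annotations:
        session.write_transaction(add_term_mention, code, mention)
driver.close()
```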
We trained a separate NER model for each type of label: one for diagnosis recognition and the other for procedure recognition. The neural network follows the BERT architecture. Bidirectional Encoder Representations from Transformers (BERT) [4] is a bidirectional transformer encoder whose main features are multi-headed self-attention, multi-layer feed-forward networks and positional embeddings. We fine-tuned Multilingual BERT for our supervised NER models.

Next, in the CIE-10 Linking process, the previous models were run over the test sets to extract all named entities within the clinical texts. In this way, we obtain each named entity, its label (diagnosis or procedure) and its position in the text. Once the named entities have been detected, we run the linking algorithm to extract the corresponding CIE-10 codes. For this activity, we use the Levenshtein string similarity distance [10]. We compare the named entities against the terms and descriptions of the KG, obtaining the most suitable CIE-10 codes.

In the final step, we apply Data Post-process methods to create the definitive results. For task 1 and task 2, we apply frequency-based techniques to sort the results. We assume the most frequent named entities in a text are the most relevant ones, so the CIE-10 codes of such named entities are placed in the first positions of the list for a clinical text. In those cases where the frequency is the same, we reorder based on the text position of the named entities. We give more relevance to named entities nearer to the end of the text, where the conclusions and the diagnostics usually appear, and lower relevance to entities located at the beginning of the text, where the patient history is usually exposed. For all the tasks, we created a version of the results where we removed the entities negated according to the NegEx-MES tool. We also created other versions where we analyzed the overlapping of named entities at different text positions to normalize the CIE-10 code of the longest named entity.

3 Evaluation

In this section we first describe the environment and the tools that we used. Then, we describe the experiments carried out to select the tools and methods. Finally, we present and discuss the CLEF eHealth performance evaluation of our results.

3.1 Environment setup

All the NER models described here are obtained by fine-tuning a BERT-Base Multilingual Cased architecture with 110 million trainable parameters. The weights are available on GitHub9. It supports 104 languages (Spanish included). We chose this architecture empirically from our experience in training biomedical domain models. We used the script run_ner.py, available on GitHub10, to fine-tune our NER models. This script was also previously used to train BioBERT [9].

9 https://github.com/google-research/bert
10 https://github.com/dmis-lab/biobert

For text augmentation of the NER input datasets, we trained a text generation model with Fujitsu's proprietary technology based on decentralized learning. We used a subset of the MIMIC-III [8] and PubMed [11] databases. We selected the first 10% of each database and we created sequences of sizes 40 and 50 for MIMIC-III and PubMed, respectively. We trained a total of 4,113,665 parameters with a batch size of 128 for 250 epochs. The training time for this model was 16.6 hours. This is the main version of the Fujitsu Text Generation model, but we generated a second analogous model with the training data provided by the CLEF eHealth challenge.
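Before moving to the experiments, the following minimal sketch illustrates the CIE-10 Linking and frequency/position-based sorting steps described in Section 2.3. The entity and KG data below are illustrative stand-ins; the real system compares the NER output against all terms and descriptions stored in the Knowledge Graph.

```python
# Illustrative sketch of CIE-10 linking via Levenshtein similarity and of the
# frequency/position-based ranking (Section 2.3). Data and threshold are examples.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance [10].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    return 1.0 - levenshtein(a.lower(), b.lower()) / max(len(a), len(b), 1)

# KG terms per CIE-10 code (descriptions plus term mentions from train/dev).
kg_terms = {"r52": ["dolor", "dolor no especificado"],
            "i10": ["hipertensión esencial", "hipertensión"]}

def link_entity(mention: str, threshold: float = 0.75):
    # Return the CIE-10 code whose term or description is most similar to the mention.
    best_code, best_sim = None, 0.0
    for code, terms in kg_terms.items():
        for term in terms:
            s = similarity(mention, term)
            if s > best_sim:
                best_code, best_sim = code, s
    return best_code if best_sim >= threshold else None

# NER output: (mention, start position in the text).
entities = [("hipertensión", 40), ("dolor", 310), ("dolor", 880)]

codes = {}
for mention, pos in entities:
    code = link_entity(mention)
    if code:
        freq, last_pos = codes.get(code, (0, 0))
        codes[code] = (freq + 1, max(last_pos, pos))

# Tasks 1 and 2: rank by frequency; ties broken by the latest position in the text,
# since conclusions and diagnostics tend to appear near the end of a clinical case.
ranked = sorted(codes, key=lambda c: codes[c], reverse=True)
print(ranked)  # e.g. ['r52', 'i10']
```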
In our experiments, we tested the performance of both versions. After the prediction of the NER models and the linking process to the CIE-10 codes with the help of the Knowledge Graph, we applied 4 different post-processing methods to the results. We present 4 versions of our results, one for each post-processing method applied. Table 1 lists the identifiers of each version; the post-processing methods are described below.

Table 1: Identifiers of the results presented for each task.

Task 1 | Task 2 | Task 3 | Description
CodiEspD v1 | CodiEspP v1 | CodiEspX v1 | No position overlap.
CodiEspD v2 | CodiEspP v2 | CodiEspX v2 | No position overlap. Denied entities removed.
CodiEspD v3 | CodiEspP v3 | CodiEspX v3 | No word overlap.
CodiEspD v4 | CodiEspP v4 | CodiEspX v4 | No word overlap. Denied entities removed.

– No position overlap: After locating the position of each entity found in the text, we remove entities with overlapping positions and only keep the longest entity. For instance, if we detect the entities "hipertensión ocular" and "hipertensión", the second entity is a substring of the first, that is, it occupies the same position as the word "hipertensión" within the first entity. In this case, we only keep the first entity, which is the longest one (a minimal sketch of this filter is given after this list).
– No position overlap. Denied entities removed: This follows the same approach as above. We remove overlapping entities, but we add an extra step by using the NegEx-MES tool to find instances of negated entities. This includes instances where a text states that a patient does not present a disease.
– No word overlap: This is similar to the No position overlap method but, on top of the position-level overlap, we also consider word-level overlap. We keep the code of the longest entity when word overlap exists, regardless of the position of the entity in the text. Following the previous example, if the entity "hipertensión ocular" appears in a text while the entity "hipertensión" appears in another part of the same text, then we associate the code of "hipertensión ocular" with the entity "hipertensión".
– No word overlap. Denied entities removed: This result is obtained by applying the NegEx-MES tool to the No word overlap process.
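As a simplified illustration of the 'No position overlap' filter, the following sketch assumes NER predictions given as character spans with their linked codes (the spans and codes shown are made-up examples):

```python
# Minimal sketch of the 'No position overlap' post-processing (illustrative only):
# when two detected entities overlap in the text, keep only the longest one.
# In the 'Denied entities removed' versions, entities flagged as negated by
# NegEx-MES would additionally be filtered out.
def remove_position_overlaps(entities):
    # entities: list of (start, end, mention, code); keep the longest span on overlap.
    kept = []
    for ent in sorted(entities, key=lambda e: e[1] - e[0], reverse=True):
        start, end = ent[0], ent[1]
        if all(end <= k[0] or start >= k[1] for k in kept):  # no overlap with kept spans
            kept.append(ent)
    return sorted(kept, key=lambda e: e[0])

preds = [(100, 119, "hipertensión ocular", "h40.059"),   # example spans and codes
         (100, 112, "hipertensión", "i10")]
print(remove_position_overlaps(preds))  # only the longer "hipertensión ocular" remains
```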
3.2 Experiments and Results

The performance of each experiment carried out to build the NER model is shown in Table 2.

Table 2: Results of the experiments performed to build our Named Entity Recognition model.

Experiment Identifier | Train Data | Test Data | Learned Entity | F1-Score
Baseline | Data-1 | Val Data-1 | diagnostico | 56.82
Abstracts | Data-1 + Abstracts | Val Data-1 | diagnostico | 24.05
MIMIC-III | Data-1 + MIMIC-III | Val Data-1 | diagnostico | 57.10
Fujitsu Augmentation | Data-1 + Augmentation | Val Data-1 | diagnostico | 58.03
Fujitsu Augmentation fine-tune | Data-1 + Augmentation extended | Val Data-1 | diagnostico | 58.40
Final T1 | Data-2 + Augmentation extended | Val Data-2 | diagnostico | 72.45
Final T2 | Data-2 + Augmentation extended | Val Data-2 | procedimiento | 76.70

– Baseline: First, we obtain a baseline performance by fine-tuning BERT for the entity "diagnostico". This baseline is trained with the first version of the dataset provided by the organizers (Data-1) and tested with the validation set (Val Data-1). We achieve an F1-Score of 56.82 with this experiment. Based on this result, we proceeded to adjust the training configuration.
– Abstracts: Besides the training and validation sets, the organizers provided extra text annotations from literature abstracts. In this experiment, we trained an NER model with the initial training set plus the data from the abstracts. The performance decreased, so we discarded the abstract data for the following training iterations. We believe the data imbalance and noise introduced into the training by the abstracts are responsible for this low performance.
– MIMIC-III: Similar to the previous experiment, we trained an NER model with the initial training set plus data from the MIMIC-III dataset, taking care not to outnumber the data points of the CLEF eHealth data. This experiment resulted in better performance than the baseline. Therefore, we concluded that data augmentation would be a useful tool to achieve higher performance.
– Fujitsu Augmentation: We used the Text Generation model trained with Fujitsu's proprietary Decentralized Learning technology. This Text Generation model was trained with MIMIC-III and PubMed data. The model receives a seed text as input and generates text similar to medical domain text. In our experiments, the seed text is the word or set of words that form an entity. By using our Text Generation model, we duplicate the number of samples in the initial train set, and the randomness of the generated text adds robustness to the final NER model. In this experiment we achieve an F1-Score of 58.03.
– Fujitsu Augmentation fine-tune: In this experiment we fine-tune the Text Generation model with data from the CLEF eHealth challenge. This means the Text Generation model learns from MIMIC-III, PubMed and CLEF eHealth data. Similarly to the previous experiment, we duplicate the training samples and we obtain an NER model with a performance of 58.40. This is the final methodology used to train the NER model in our system.

Using the method of the experiment "Fujitsu Augmentation fine-tune", we trained two NER models with the latest released datasets (Data-2) from task 3: one for the type "diagnostico" (Final T1) and one for the type "procedimiento" (Final T2). With this method and the updated dataset we go from 56.82 to 72.45 F1-Score for the entity "diagnostico". These NER models are the ones used alongside the linking algorithm with our Knowledge Graph to obtain our results over the test set.

After post-processing the data with the 4 methods described in Section 3.1, we submitted the results for evaluation. Table 3 shows the performance obtained after the evaluation carried out by the organizers of the CLEF eHealth challenge. From the evaluation tables we see that, for task 1, V1 achieves the highest Mean Average Precision (MAP) and F1-Score. V3 and V4 achieved the highest MAP and F1-Score for task 2 of the challenge, and V1 was the highest performing approach for task 3. We attribute the lower results in task 2 to the greater specificity regarding body parts and concrete parameters in CIE10-PCS. There are cases where our approach misses the specifications of the procedure; in those cases, we retrieve wrong code annotations, which decreases precision and recall.
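For reference, the Mean Average Precision reported in Table 3 can be computed over ranked code lists as in the following sketch. This is a standard formulation, not the organizers' official script, which may differ in details such as the treatment of invalid codes or the exact definition of MAP30.

```python
# Reference sketch of Mean Average Precision (MAP) over ranked code lists.
# Standard formulation; the official CodiEsp evaluation may differ in details.
def average_precision(predicted, relevant):
    # predicted: codes ordered by decreasing relevance; relevant: gold code set.
    hits, score = 0, 0.0
    for rank, code in enumerate(predicted, start=1):
        if code in relevant:
            hits += 1
            score += hits / rank          # precision at this rank
    return score / max(len(relevant), 1)

def mean_average_precision(runs, gold, cutoff=None):
    # runs/gold: dicts mapping document id -> ranked predictions / gold code set.
    aps = [average_precision(runs[doc][:cutoff] if cutoff else runs[doc], gold[doc])
           for doc in gold]
    return sum(aps) / len(aps)

runs = {"caso1": ["r52", "i10", "e11.9"]}        # example predictions
gold = {"caso1": {"r52", "e11.9"}}               # example gold annotations
print(mean_average_precision(runs, gold))            # MAP
print(mean_average_precision(runs, gold, cutoff=30))  # MAP with a rank cut-off of 30
```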
Table 3: Performance results provided by the organizers of the CLEF eHealth challenge for the four versions of our results over the test set.

File | MAP | MAP30 | MAP codes | MAP30 codes | P codes | R codes | F1 codes | P cat | R cat | F1 cat | P | R | F1
CodiEspD v1 | 0.519 | 0.598 | 0.519 | 0.597 | 0.732 | 0.633 | 0.679 | 0.767 | 0.699 | 0.731 | 0.802 | 0.734 | 0.766
CodiEspD v2 | 0.481 | 0.553 | 0.48 | 0.553 | 0.733 | 0.588 | 0.652 | 0.768 | 0.646 | 0.702 | 0.804 | 0.687 | 0.741
CodiEspD v3 | 0.501 | 0.576 | 0.501 | 0.576 | 0.74 | 0.604 | 0.665 | 0.775 | 0.665 | 0.716 | 0.807 | 0.714 | 0.758
CodiEspD v4 | 0.46 | 0.528 | 0.46 | 0.528 | 0.739 | 0.556 | 0.635 | 0.774 | 0.61 | 0.682 | 0.809 | 0.662 | 0.728
CodiEspP v1 | 0.434 | 0.515 | 0.433 | 0.514 | 0.587 | 0.448 | 0.508 | 0.627 | 0.539 | 0.58 | 0.665 | 0.468 | 0.549
CodiEspP v2 | 0.433 | 0.513 | 0.432 | 0.512 | 0.587 | 0.446 | 0.507 | 0.626 | 0.537 | 0.578 | 0.665 | 0.465 | 0.548
CodiEspP v3 | 0.443 | 0.525 | 0.443 | 0.525 | 0.643 | 0.428 | 0.514 | 0.692 | 0.514 | 0.59 | 0.687 | 0.462 | 0.552
CodiEspP v4 | 0.440 | 0.52 | 0.440 | 0.52 | 0.642 | 0.424 | 0.511 | 0.692 | 0.51 | 0.587 | 0.687 | 0.458 | 0.55
CodiEspX v1 | - | - | - | - | 0.669 | 0.562 | 0.611 | 0.704 | 0.634 | 0.667 | - | - | -
CodiEspX v2 | - | - | - | - | 0.667 | 0.527 | 0.589 | 0.702 | 0.592 | 0.642 | - | - | -
CodiEspX v3 | - | - | - | - | 0.687 | 0.537 | 0.603 | 0.725 | 0.604 | 0.659 | - | - | -
CodiEspX v4 | - | - | - | - | 0.685 | 0.505 | 0.581 | 0.722 | 0.566 | 0.635 | - | - | -

We can highlight the post-processing versions that did not remove negated entities as the highest-achieving approaches. We conclude, after analysis of the ground truth, that negated entities are considered part of the expected results. In the case of task 1, out of the 2,841 negated entities that we removed from our results, 2,059 appear in the ground truth. In the case of task 2, the ground truth contains 61 appearances of the 58 negated entities removed from our results. This is the reason why the versions of our results that do not remove negated entities (V1 and V3) outperform the versions that remove negated entities (V2 and V4) across all the tasks. Due to the small number of negated entities found in task 2, the performance differences between versions are negligible there.

4 Conclusions

In this paper we present the methods used in the CLEF eHealth 2020 Challenge for Automated Clinical Encoding. This challenge consisted of three tasks: CIE-10 code assignment of texts written in Spanish for diagnoses (task 1) and procedures (task 2), and identification of the positions of those entities in the texts (task 3).

We followed an approach composed of a Knowledge Graph created from the CIE-10 standard and the training data annotations. We then fine-tuned a multilingual BERT-based network for Named Entity Recognition to predict entities in the clinical texts. We used data augmentation through the Fujitsu proprietary Text Generation model, trained with Decentralized Learning on the MIMIC-III and PubMed datasets, to create synthetic samples for the NER training. With the output of our NER models and the Knowledge Graph, we developed a linking algorithm to assign a CIE-10 code to the predicted entities in the texts. Finally, we post-processed the output of our linking algorithm to remove negated entities using the NegEx-MES tool. We also analyzed the output to provide four different versions of our results, taking into account the overlapping words and positions of the predicted entities.

Our approach achieves F1-Scores of 0.67, 0.51 and 0.61 for tasks 1, 2 and 3, respectively. The versions of our results that achieve the highest performance are the ones that do not remove negated entities, namely CodiEspD v1, CodiEspP v3 and CodiEspX v1.

References
1. Amin, S., Neumann, G., Dunfield, K., Vechkaeva, A., Chapman, K.A., Wixted, M.K.: MLT-DFKI at CLEF eHealth 2019: Multi-label Classification of ICD-10 Codes with BERT. In: Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019. CEUR Workshop Proceedings, vol. 2380. CEUR-WS.org (2019), http://ceur-ws.org/Vol-2380/paper_67.pdf
2. Atutxa, A., Casillas, A., Ezeiza, N., Fresno, V., Goenaga, I., Gojenola, K., Martínez, R., Anchordoqui, M.O., Perez-de-Viñaspre, O.: IxaMed at CLEF eHealth 2018 Task 1: ICD10 Coding with a Sequence-to-Sequence Approach. In: Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10-14, 2018. CEUR Workshop Proceedings, vol. 2125. CEUR-WS.org (2018), http://ceur-ws.org/Vol-2125/paper_167.pdf
3. Baumel, T., Nassour-Kassis, J., Cohen, R., Elhadad, M., Elhadad, N.: Multi-Label Classification of Patient Notes: Case Study on ICD Code Assignment. In: The Workshops of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018. AAAI Workshops, vol. WS-18, pp. 409-416. AAAI Press (2018), https://aaai.org/ocs/index.php/WS/AAAIW18/paper/view/16881
4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
5. García-Santa, N., San-Miguel, B., Ugai, T.: The Magic of Semantic Enrichment and NLP for Medical Coding. In: The Semantic Web: ESWC 2019 Satellite Events, Portorož, Slovenia, June 2-6, 2019, Revised Selected Papers. Lecture Notes in Computer Science, vol. 11762, pp. 58-63. Springer (2019). https://doi.org/10.1007/978-3-030-32327-1_12
6. Goeuriot, L., Suominen, H., Kelly, L., Miranda-Escalada, A., Krallinger, M., Liu, Z., Pasi, G., Saez Gonzales, G., Viviani, M., Xu, C.: Overview of the CLEF eHealth Evaluation Lab 2020. In: Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S., Joho, H., Lioma, C., Eickhoff, C., Névéol, A., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020). LNCS, vol. 12260 (2020)
7. World Health Organization: ICD-10: International statistical classification of diseases and related health problems. World Health Organization, Geneva (1992)
8. Johnson, A.E., Pollard, T.J., Shen, L., Li-wei, H.L., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L.A., Mark, R.G.: MIMIC-III, a freely accessible critical care database. Scientific Data 3, 160035 (2016). https://doi.org/10.13026/C2XW26
9. Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234-1240 (2020). https://doi.org/10.1093/bioinformatics/btz682
10. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10, 707-710 (1966)
11. Lindberg, D.: Internet access to the National Library of Medicine. Effective Clinical Practice: ECP 3(5), 256 (2000)
12. Miftahutdinov, Z., Tutubalina, E.: KFU at CLEF eHealth 2017 Task 1: ICD-10 coding of English death certificates with recurrent neural networks. In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, Dublin, Ireland, September 11-14, 2017. CEUR Workshop Proceedings, vol. 1866. CEUR-WS.org (2017), http://ceur-ws.org/Vol-1866/paper_64.pdf
13. Miranda, A., Rana, A., Krallinger, M.: Abstracts from Lilacs and Ibecs with ICD10 codes (Jan 2020). https://doi.org/10.5281/zenodo.3606626. Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL)
14. Miranda-Escalada, A., Gonzalez-Agirre, A., Armengol-Estapé, J., Krallinger, M.: Overview of automatic clinical coding: annotations, guidelines, and solutions for non-English clinical cases at CodiEsp track of CLEF eHealth 2020. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings (2020)
15. Miranda-Escalada, A., Krallinger, M.: CodiEsp codes: list of valid CIE10 codes for the CodiEsp task (Jan 2020). https://doi.org/10.5281/zenodo.3706838. Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL)
16. Mullenbach, J., Wiegreffe, S., Duke, J., Sun, J., Eisenstein, J.: Explainable Prediction of Medical Codes from Clinical Text. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pp. 1101-1111. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/n18-1100
17. Névéol, A., Robert, A., Anderson, R., Cohen, K.B., Grouin, C., Lavergne, T., Rey, G., Rondet, C., Zweigenbaum, P.: CLEF eHealth 2017 Multilingual Information Extraction Task Overview: ICD10 Coding of Death Certificates in English and French. In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, Dublin, Ireland, September 11-14, 2017. CEUR Workshop Proceedings, vol. 1866. CEUR-WS.org (2017), http://ceur-ws.org/Vol-1866/invited_paper_6.pdf
18. Névéol, A., Robert, A., Grippo, F., Morgand, C., Orsi, C., Pelikan, L., Ramadier, L., Rey, G., Zweigenbaum, P.: CLEF eHealth 2018 Multilingual Information Extraction Task Overview: ICD10 Coding of Death Certificates in French, Hungarian and Italian. In: Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10-14, 2018. CEUR Workshop Proceedings, vol. 2125. CEUR-WS.org (2018), http://ceur-ws.org/Vol-2125/invited_paper_18.pdf
19. Neves, M.L., Butzke, D., Dörendahl, A., Leich, N., Hummel, B., Schönfelder, G., Grune, B.: Overview of the CLEF eHealth 2019 Multilingual Information Extraction. In: Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019. CEUR Workshop Proceedings, vol. 2380. CEUR-WS.org (2019), http://ceur-ws.org/Vol-2380/paper_251.pdf
20. Pakhomov, S.V., Buntrock, J.D., Chute, C.G.: Automating the assignment of diagnosis codes to patient encounters using example-based and machine learning techniques. Journal of the American Medical Informatics Association 13(5), 516-525 (2006)
21. Sänger, M., Weber, L., Kittner, M., Leser, U.: Classifying German Animal Experiment Summaries with Multi-lingual BERT at CLEF eHealth 2019 Task 1. In: Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019. CEUR Workshop Proceedings, vol. 2380. CEUR-WS.org (2019), http://ceur-ws.org/Vol-2380/paper_81.pdf
22. Wei, C.H., Allot, A., Leaman, R., Lu, Z.: PubTator central: automated concept annotation for biomedical full text articles. Nucleic Acids Research 47(W1), W587-W593 (2019)