LasigeBioTM at CANTEMIST: Named Entity
Recognition and Normalization of Tumour
Morphology Entities and Clinical Coding of Spanish
Health-related Documents
Pedro Ruas, Andre Neves, Vitor D.T. Andrade and Francisco M. Couto
LASIGE, Faculdade de Ciências da Universidade de Lisboa, Portugal


Abstract
The CANTEMIST track included three subtasks for the automatic assignment of codes related to
tumour morphology entities in Spanish health-related documents: CANTEMIST-NER, CANTEMIST-
NORM and CANTEMIST-CODING. For CANTEMIST-NER, we trained Spanish biomedical Flair
embeddings on PubMed abstracts and then trained a BiLSTM+CRF Named Entity Recognition tagger
on the CANTEMIST corpus using the trained embeddings. For CANTEMIST-NORM, we adapted a
graph-based model that uses the Personalized PageRank algorithm to rank the eCIE-O-3.1 candidates
for each entity mention. As for CANTEMIST-CODING, we adapted X-Transformer, a state-of-the-art
deep learning Extreme Multi-Label Classification algorithm, to a multilingual and biomedical setting
in order to classify the clinical cases with a ranked list of eCIE-O-3.1 terms. We obtained an F1-score
of 0.749 and 0.069 for the CANTEMIST-NER and CANTEMIST-NORM subtasks, respectively, and our
best scoring submission achieved a MAP score of 0.506 in the CANTEMIST-CODING subtask.

Keywords
CANTEMIST, Named Entity Recognition, Normalization, Coding, Text Mining, Natural Language Pro-
cessing, Clinical Text, Extreme Multi-Label Classification




1. Introduction
There are several benefits arising from the application of Natural Language Processing (NLP)/Text
Mining approaches to clinical text, such as the improvement of decision-making in the clinical
context. The use of electronic health records is associated with less doctor-patient interaction [1],
so tools that are able to automatically extract relevant information from clinical notes can free up
doctors to interact directly with patients. Besides, these tools have the potential to improve
biomedical [2, 3] and pharmaceutical research [4] and to democratise access to clinical information
for lay users [5].
   In the present work, we describe the participation of the LasigeBioTM team in CANTEMIST
(“CANcer TExt Mining Shared Task – tumor named entity recognition") competition [6], which


Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020)
email: psruas@fc.ul.pt (P. Ruas); aneves@lasige.di.fc.ul.pt (A. Neves); fc49005@alunos.fc.ul.pt (V.D.T. Andrade);
fcouto@di.fc.ul.pt (F.M. Couto)
orcid: 0000-0002-1293-4199 (P. Ruas); 0000-0003-0627-1496 (F.M. Couto)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org
included a corpus of Spanish health-related documents and three different subtasks with the
following goals:

    • CANTEMIST-NER: Automatically recognise and locate tumour morphology mentions.

    • CANTEMIST-NORM: Recognise and normalise all tumour morphology mentions, returning
      them along with their respective codes from the eCIE-O-3.1 (“Clasificación Internacional de
      Enfermedades para Oncología - 3ª edición, 1ª revisión"1 ) terminology.

    • CANTEMIST-CODING: Classification of clinical cases by returning a list of ranked eCIE-
      O-3.1 codes for each document.

   For the CANTEMIST-NER subtask we used the Flair framework [7] to train new Flair
embeddings over Spanish translated PubMed abstracts and to train a NER tagger with a BiL-
STM+CRF architecture on the CANTEMIST corpus, leveraging the trained embeddings. For the
CANTEMIST-NORM subtask, we used the PPR-SSM model [8] to normalise the entities recog-
nised by the NER tagger. This model builds a disambiguation graph for each document, in which
the nodes are the candidate codes retrieved from the eCIE-O-3.1 terminology for the recognised
entities and the edges are based on the hierarchy of the eCIE-O-3.1 terminology. A variation
of this model additionally retrieves candidates from other terminologies, such as CIE-10-ES and
DeCS, and extracts relations between concepts from these terminologies and the codes of the
eCIE-O-3.1 terminology to improve the edge structure of the graph. The Personalised PageRank
(PPR) algorithm assigns a weight to each candidate and the highest-scoring one is selected as
the code for the respective entity. As for the CANTEMIST-CODING subtask, we adapted
X-Transformer, a deep learning Extreme Multi-Label Classification (XMLC) algorithm, to the
multilingual biomedical setting and built a pipeline around it, so that it could process each
clinical case and classify it with the eCIE-O-3.1 terms most related to the document.

1.1. Related work
In the Named Entity Recognition (NER) task, state-of-the-art approaches usually have a BiLSTM-
CRF architecture, initially proposed by Huang et al. [9]. Long Short-Term Memory (LSTM)
networks are Recurrent Neural Networks (RNN), which means that they have a recurrent layer
connecting features at different time frames. BiLSTM networks can leverage both past and
future features for a given time frame. An input layer represents a given set of features, in this
case text tokens, at a given time, and the output layer represents the probability distribution
over the labels at that time. The CRF (Conditional Random Fields) models, in turn, focus on
sentence-level tag information. More recently, pre-trained language models, such as BERT [10]
and ELMo [11], have been fine-tuned for the NER task. BERT has a multilayer bidirectional
transformer encoder architecture and, in contrast to RNNs, employs an attention mechanism to
establish the dependency between the input features and the output. The original BERT
implementation was trained on general corpora, such as the BookCorpus and the English
Wikipedia, but since then a plethora of domain-specific versions have been proposed, including
BioBERT [12] and ClinicalBERT [13].
   1
       https://eciemaps.mscbs.gob.es/ecieMaps/browser/index_o_3.html




   The state-of-the-art approaches in Entity Normalization (also called Disambiguation or
Linking) include graph-based models [14, 15], neural network-based models [16, 17] and,
similarly to the NER task, more recently fine-tuned pre-trained language models [18, 19].
The graph-based models usually focus on building a graph containing the candidates for the
entity mentions and then on ranking the candidates according to the relevance or coherence of
each candidate in the graph. These are global models, since the disambiguation decision for a
given entity mention depends on the other disambiguations in the same graph, although some
also include a module responsible for determining the local similarity between candidates and
mentions or for candidate retrieval. The neural network-based models usually take into account
both the global coherence of the candidates and the local similarity; the candidates and the
entities are typically represented by word embeddings, which are then integrated into the
neural network. On the other hand, BERT-based models, such as the one proposed by Ji et
al. [18], focus on generating contextualised word embeddings for candidates and entities and
then on candidate ranking, which is treated as a sequence-pair classification task. In this case,
the model disambiguates each entity independently based on word representations and other
local features.
   As to XMLC, several machine learning solutions have been developed in the last decade [20],
but deep learning solutions have been applied to XMLC only more recently. One of the first
attempts was XML-CNN [21], a convolutional neural network adapted from a state-of-the-art
approach to multi-class classification [22], with changes in the network layers that allowed it to
capture features more precisely from different regions of the text. HAXMLNet [23] used a
BiLSTM RNN with a multi-label attention layer to capture the most relevant parts of the text,
along with a hierarchical clustering algorithm to divide the labels into clusters, which proved
efficient on larger datasets. Lastly, X-Transformer [24] was the first deep learning approach to
scale pre-trained Transformer models, such as BERT [10], RoBERTa [25] or XLNet [26], to
XMLC. The algorithm uses a three-stage framework: first, it semantically indexes all the possible
labels into clusters using ELMo [11]; then, using a deep learning Transformer model, it indexes
each text instance to the most relevant cluster; finally, it ranks the labels retrieved from the
previous cluster indices. X-Transformer surpassed other state-of-the-art XMLC methods on four
benchmark datasets and was also applied to a query recommendation dataset from Amazon,
where it showed improvements of more than 10% over Parabel [27], one of the most commonly
used and competitive XMLC algorithms.
   Our team had already adapted X-Transformer to the biomedical setting for the BioASQ
MESINESP competition, which took place earlier this year. In MESINESP, the goal was to index
a large dataset of biomedical articles written in Spanish using DeCS terms. In the final
scoreboard, our approach using X-Transformer achieved high scores on the precision measures,
surpassing most competing systems on those measures.




2. Methodology
2.1. CANTEMIST-NER
The goal of this task was to recognise and locate tumour morphology entities in Spanish health-
related documents. We used the Flair framework [7] to develop a Spanish biomedical NER
tagger.

2.1.1. Training of Flair embeddings
We trained new Flair contextualised embeddings [28] on Spanish biomedical text, more specifically,
on translated abstracts of PubMed articles available at https://temu.bsc.es/mesinesp/index.php/
download/translated-pubmed-articles/. We considered 4 subsets of articles, each one with
80%/10%/10% of the articles included in the train, validation and test files, respectively:
   1. 32,500 articles, 40,987,614 tokens
   2. 32,500 articles, 35,352,727 tokens
   3. 32,500 articles, 39,021,229 tokens
   4. 32,500 articles, 40,005,075 tokens
  The 4 splits contained a total of 130,000 articles and 155,366,645 tokens, of which 143,387,385
corresponded to training tokens.
  We generated a language model for each split with hidden_size = 1024, nlayers = 1, dropout
= 0.1, using the following training parameters:
    • sequence_length = 250
    • mini_batch_size = 100
    • max_epochs = 2000
    • patience = 25
   We trained forward and backward embeddings on each split using a single NVIDIA Tesla P4
GPU, since Flair did not have multi-GPU support in the version used, and interrupted the
training after a variable number of epochs, ranging from 71 (backward embeddings, split 1) to
99 (backward embeddings, split 2).
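   As an illustration, a minimal sketch of this training step, following the Flair documentation for
training custom Flair embeddings, is shown below; the corpus folder name and the output path are
assumptions, and parameter names may differ slightly between Flair versions. For the backward
embeddings, forward and is_forward_lm are simply set to False.

    from flair.data import Dictionary
    from flair.models import LanguageModel
    from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

    # character dictionary distributed with Flair
    dictionary = Dictionary.load("chars")

    # hypothetical corpus folder: a train/ directory with the split files,
    # plus valid.txt and test.txt (the 80%/10%/10% division described above)
    corpus = TextCorpus("pubmed_es_split1", dictionary, forward=True, character_level=True)

    # language model with the hyperparameters reported in the paper
    language_model = LanguageModel(dictionary, is_forward_lm=True,
                                   hidden_size=1024, nlayers=1, dropout=0.1)

    trainer = LanguageModelTrainer(language_model, corpus)
    trainer.train("resources/lm/pubmed_es_forward",
                  sequence_length=250,
                  mini_batch_size=100,
                  max_epochs=2000,
                  patience=25)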

2.1.2. Pre-processing of the CANTEMIST corpus
We converted the train, development 1 (dev1) and development 2 (dev2) sets of the CANTEMIST
corpus2 to the IOB2 format3 using the Flair sentence segmenter jointly with the Flair tokenizer.
Each token was tagged with the label “B-MOR_NEO" if it corresponded to the beginning of an
annotation, the label “I-MOR_NEO" if it corresponded to the inside of an annotation, and the
label “O" if it was outside any annotation. The content of the train, dev1 and dev2 sets
originated, respectively, the files “train.txt", “dev.txt" and “test.txt". The corpus was then
loaded into a Flair “ColumnCorpus" object to allow the subsequent training of the NER tagger.
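   For illustration, the converted files can be loaded with the Flair API as follows (the folder
name is an assumption):

    from flair.data import Corpus
    from flair.datasets import ColumnCorpus

    # column format of the IOB2 files: first column is the token, second the NER tag
    columns = {0: "text", 1: "ner"}

    # hypothetical folder containing the converted train.txt, dev.txt and test.txt
    corpus: Corpus = ColumnCorpus("cantemist_iob2", columns,
                                  train_file="train.txt",
                                  dev_file="dev.txt",
                                  test_file="test.txt")
    print(corpus)  # prints the number of train/dev/test sentences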
   2
       https://temu.bsc.es/cantemist/?p=4338
   3
       https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)




2.1.3. Training of the Spanish biomedical NER tagger
The following models were considered:
   1. “base": uses Flair embeddings (es-forward and es-backward) trained on Spanish Wikipedia
      + Spanish FastText embeddings.
   2. “medium": uses Flair embeddings (es-forward and es-backward) trained on Spanish
      Wikipedia + Spanish FastText embeddings + PubMed Flair embeddings trained in 1
      PubMed split.
   3. “large": uses Flair embeddings (es-forward and es-backward) trained on Spanish Wikipedia
      + Spanish FastText embeddings + PubMed Flair embeddings trained in 2 PubMed splits.
   4. “pubmed": uses PubMed Flair embeddings trained in 4 PubMed splits.
   We considered the default architecture for the sequence tagger: BiLSTM with a CRF decoding
layer, hidden_size = 256. The training parameters were set to:

    • learning_rate = 0.1
    • mini_batch_size = 32
    • max_epochs = 55
    • patience = 3

Due to lack of time, we only selected the “medium" model for training, as we considered it the
safest approach: leveraging the trained biomedical embeddings jointly with generally available
embeddings. After training, we applied the model to predict the labels in the 5232 documents
belonging to the background + test set of the CANTEMIST corpus and to create an annotation
file in the BRAT format for each text document.
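   A sketch of the “medium" configuration using the Flair API is given below; the paths to the
trained PubMed language models and to the output directory are assumptions, and make_tag_dictionary
refers to the Flair API available at the time (newer versions use make_label_dictionary instead).

    from flair.datasets import ColumnCorpus
    from flair.embeddings import FlairEmbeddings, StackedEmbeddings, WordEmbeddings
    from flair.models import SequenceTagger
    from flair.trainers import ModelTrainer

    # corpus built as in Section 2.1.2 (folder name is hypothetical)
    corpus = ColumnCorpus("cantemist_iob2", {0: "text", 1: "ner"},
                          train_file="train.txt", dev_file="dev.txt", test_file="test.txt")

    # "medium": general-domain Spanish Flair + FastText embeddings, plus the PubMed
    # Flair embeddings trained on one split (model paths are hypothetical)
    embeddings = StackedEmbeddings([
        WordEmbeddings("es"),            # Spanish FastText embeddings
        FlairEmbeddings("es-forward"),   # Spanish Wikipedia Flair embeddings
        FlairEmbeddings("es-backward"),
        FlairEmbeddings("resources/lm/pubmed_es_forward/best-lm.pt"),
        FlairEmbeddings("resources/lm/pubmed_es_backward/best-lm.pt"),
    ])

    tag_dictionary = corpus.make_tag_dictionary(tag_type="ner")

    # BiLSTM-CRF sequence tagger with the hyperparameters reported above
    tagger = SequenceTagger(hidden_size=256,
                            embeddings=embeddings,
                            tag_dictionary=tag_dictionary,
                            tag_type="ner",
                            use_crf=True)

    trainer = ModelTrainer(tagger, corpus)
    trainer.train("resources/taggers/cantemist_medium",
                  learning_rate=0.1,
                  mini_batch_size=32,
                  max_epochs=55,
                  patience=3)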

2.2. CANTEMIST-NORM
The goal of this task was to perform NER and the normalization, or disambiguation, of the
recognised entities against the eCIE-O-3.1 terminology. We applied the model previously developed
for the NER task to recognise the entities and adapted the PPR-SSM model [8] to assign an
eCIE-O-3.1 code to each entity.

2.2.1. Pre-processing of the NER output
For each document, the first step was to apply the NER tagger to generate the NER output files
and then to retrieve the ten best eCIE-O-3.1 candidates for each recognised entity through
string matching, more specifically, according to the edit distance. The model then built a
disambiguation graph with the eCIE-O-3.1 candidates for all the entities present in the document.
Two candidates/nodes were considered linked in the graph if they were linked in the eCIE-O-3.1
hierarchy. For each candidate, the extrinsic information content (IC) was calculated, which is a
measure of rareness: the IC of a given entity is high if that entity has few entries in an external
dataset [29]. In this case, the external dataset consisted of the train, dev1 and dev2 sets of the
CANTEMIST corpus.
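   The sketch below illustrates the candidate retrieval and IC computation under simplifying
assumptions: the terminology is a plain code-to-description dictionary, the edit distance is the
Levenshtein distance from NLTK, and add-one smoothing is applied to codes unseen in the external
dataset.

    from math import log
    from nltk import edit_distance  # Levenshtein edit distance

    def top_candidates(mention, terminology, k=10):
        """Return the k codes whose descriptions are closest to the mention
        according to the edit distance (terminology: dict code -> description)."""
        ranked = sorted(terminology.items(),
                        key=lambda item: edit_distance(mention.lower(), item[1].lower()))
        return [code for code, _ in ranked[:k]]

    def information_content(code, external_counts, total_annotations):
        """Extrinsic IC: codes that are rare in the external dataset (here, the
        annotations of the CANTEMIST train/dev1/dev2 sets) get a higher value."""
        freq = external_counts.get(code, 0) + 1  # add-one smoothing for unseen codes
        return -log(freq / total_annotations)

    # toy usage with two hypothetical eCIE-O-3.1 entries
    terminology = {"8000/6": "Neoplasia metastásica", "9731/3": "Plasmocitoma"}
    print(top_candidates("plasmocitoma", terminology, k=1))  # ['9731/3']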



2.2.2. Entity disambiguation
The model applied the Personalized PageRank (PPR) algorithm over each disambiguation graph.
PPR is a variation of PageRank [30], which was originally proposed as an algorithm to rank the
relative importance of web pages. It models the web as a graph where each node is a page with
links to other pages (forward links) and links from other pages (backlinks). The PageRank
algorithm simulates the behaviour of a "random surfer" on the web: from a given page, the surfer
can either follow one of the forward links on that page or jump to a random page belonging to
the graph. In the personalised variation, this jump is not random and instead always occurs to a
chosen page. After successive iterations, the algorithm returns the probability distribution of
reaching each node in the graph. Nodes with more links will be reached more often, so they
will have more relevance in the context of the graph. PPR has also been applied to the
normalization of entities; in this case, the web graph is replaced by the disambiguation graph
containing the candidates for all entities in a given document. PPR traverses the graph and then
assigns a weight to each candidate according to its coherence or relevance in the graph: more
connected nodes will have a higher weight. Additionally, in our model, the IC of each candidate
was also considered in the candidate ranking. The model selected the highest-scoring candidate
for each entity and added the eCIE-O-3.1 codes to the annotation files outputted by the NER
tagger.
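   For illustration only, the sketch below scores the candidates of a disambiguation graph with
the Personalized PageRank implementation from NetworkX and combines the PPR weight with the
extrinsic IC by multiplication; the combination rule and the uniform personalization vector are
assumptions, not the exact PPR-SSM formulation.

    import networkx as nx

    def rank_candidates(candidate_nodes, edges, ic_scores):
        """Rank candidate codes with Personalized PageRank over the disambiguation
        graph, weighting each PPR score by the candidate's information content."""
        graph = nx.Graph()
        graph.add_nodes_from(candidate_nodes)
        graph.add_edges_from(edges)

        # restart only on the candidate nodes (uniform personalization, an assumption)
        personalization = {node: 1.0 for node in candidate_nodes}
        ppr = nx.pagerank(graph, alpha=0.85, personalization=personalization)

        return sorted(((node, ppr[node] * ic_scores.get(node, 1.0))
                       for node in candidate_nodes),
                      key=lambda pair: pair[1], reverse=True)

    # toy graph: two candidates for one mention, one linked to a candidate of another mention
    ranking = rank_candidates(["8000/6", "9731/3", "8000/3"],
                              [("9731/3", "8000/3")],
                              {"8000/6": 0.5, "9731/3": 1.2, "8000/3": 0.8})
    print(ranking[0][0])  # the highest-scoring candidate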

2.2.3. Multiple terminologies
We also explored the use of more than one terminology in the candidate retrieval and graph
building. We considered the CIE-10-ES (“Clasificación Internacional de Enfermedades 10.ª
Revisión, Modificación Clínica"4 ) and the Spanish DeCS (“Descriptores en Ciencias de la Salud")5
terminologies. For each entity, besides the ten best eCIE-O-3.1 candidates, we also retrieved the
five best CIE-10-ES and the five best DeCS candidates through string matching and built the
disambiguation graph accordingly. We considered that an eCIE-O-3.1 candidate and a CIE-10-ES
or a DeCS candidate were linked if they were present in the same sentence of any document
belonging to the CANTEMIST corpus. We applied the Python implementation6 of MER [31]
for the fast recognition of the entities in the sentences. With these additional candidates, the
disambiguation graph was denser and contained more semantic information, which we expected
would improve the precision of the disambiguation process. After the application of the
PPR algorithm, the model only selected eCIE-O-3.1 candidates to normalise the mentions.
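   A minimal sketch of this sentence co-occurrence criterion is shown below; recognise is a
placeholder for the MER-based recognition step and is assumed to return the set of terminology
codes found in a sentence.

    from itertools import combinations

    def co_occurrence_edges(sentences, recognise):
        """Add an edge between two candidate codes (possibly from different
        terminologies) whenever both are recognised in the same sentence."""
        edges = set()
        for sentence in sentences:
            codes = recognise(sentence)  # e.g., a wrapper around merpy lexicon lookups
            for code_a, code_b in combinations(sorted(codes), 2):
                edges.add((code_a, code_b))
        return edges

    # toy usage with a hypothetical recogniser
    sentences = ["El paciente presenta un plasmocitoma con metástasis ósea."]
    print(co_occurrence_edges(sentences, lambda s: {"9731/3", "C79.51"}))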

2.2.4. Models
The following models were considered:

   1. “single-ont": this model only used candidates retrieved from the eCIE-O-3.1 terminology.
   2. “multi-ont": besides eCIE-O-3.1, this model additionally retrieved candidates from CIE-
      10-ES and DeCS terminologies to improve the disambiguation graph.
   4
     https://eciemaps.mscbs.gob.es/ecieMaps/browser/index_10_mc.html
   5
     http://decs.bvs.br/E/homepagee.htm
   6
     https://pypi.org/project/merpy/




2.3. CANTEMIST-CODING
The goal of this task was the classification of clinical cases in Spanish by returning a list of
ranked eCIE-O codes related to the content of each clinical case. To tackle this challenge, we
decided to use X-Transformer, a state-of-the-art deep learning XMLC solution, and apply it to
the multilingual biomedical setting.

2.3.1. X-Transformer modifications
Some modifications to the algorithm code were required. The first one concerned the
vectorization of the labels of the training and test sets. We chose to use all possible labels,
including the labels that were not present in the train or test sets. This change was needed since
the algorithm would fail to work correctly if the number of labels in the two sets did not match.
   Another modification was the inclusion of BETO [32]7 in the choice of models to train
X-Transformer, since we considered that using a Transformer model specifically designed for
the Spanish language could lead to improved results over the multilingual version of BERT.
Finally, we also adapted the algorithm so that it could process input data containing
diacritical marks, such as accents, which are common in the Spanish language.

2.3.2. Pipeline
A pipeline was developed for this task, as shown in Figure 1. After retrieving the data
from the competition organisers, the first step of this pipeline was merging the separate text
files that composed the training and development datasets into a single file per dataset, so
that in the end we had only two files, one for training and another for testing. Then, using
the ’.ann’ files provided for the other two tasks of the CANTEMIST competition and
associated with each clinical case, we extracted all labels attributed to each document
and appended them to the beginning of the corresponding clinical case, separated by a tab
character (’\t’). This way, X-Transformer could distinguish between the labels and the text. The
text was then stemmed using a Snowball stemmer8 .
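   The sketch below reproduces this preparation step under a few assumptions: the directory layout
and file names are hypothetical, the eCIE-O codes are taken as the last field of the ‘#’
annotator-note lines of each ‘.ann’ file, and multiple labels are joined with commas (the exact
separator expected by X-Transformer may differ).

    import os
    from nltk.stem.snowball import SnowballStemmer

    stemmer = SnowballStemmer("spanish")

    def build_dataset_file(txt_dir, ann_dir, output_path):
        """Write one line per clinical case: the eCIE-O codes from the '.ann' file,
        a tab character, and the stemmed text of the case."""
        with open(output_path, "w", encoding="utf-8") as out:
            for name in sorted(os.listdir(txt_dir)):
                if not name.endswith(".txt"):
                    continue
                with open(os.path.join(txt_dir, name), encoding="utf-8") as handle:
                    text = handle.read().replace("\n", " ")
                codes = set()
                with open(os.path.join(ann_dir, name[:-4] + ".ann"), encoding="utf-8") as handle:
                    for line in handle:
                        if line.startswith("#"):   # annotator notes carry the codes
                            codes.add(line.split()[-1])
                stemmed = " ".join(stemmer.stem(token) for token in text.split())
                out.write(",".join(sorted(codes)) + "\t" + stemmed + "\n")

    # hypothetical paths for the CANTEMIST-CODING train set
    build_dataset_file("cantemist/train/txt", "cantemist/train/ann", "train.txt")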
   The next step was creating a vocabulary file containing all labels used in the datasets, which
was required as input by X-Transformer. Each line of this file had an eCIE-O code and its
corresponding internal identifier, a number from 0 to N-1, where N is the total number of
eCIE-O codes. This internal identifier is a label standardization method that allows
X-Transformer to classify text using labels from any kind of domain. For the creation of this
file, we used a list of valid eCIE-O codes provided by the competition in its evaluation scripts
folder, from which we retrieved all codes and corresponding descriptions. We then added the
descriptions of the eCIE-O codes that were present in the ’.ann’ files but missing from the
retrieved list, appending them to the existing descriptions whenever they differed. Codes
without any description were then removed, since they would be of no use for classifying
the clinical cases with X-Transformer. In the end, our vocabulary file consisted of a total of
4360 eCIE-O codes.

    7
        https://github.com/dccuchile/beto
    8
        https://www.nltk.org/_modules/nltk/stem/snowball.html




Figure 1: Developed pipeline for CANTEMIST-CODING task. It processes the CANTEMIST data before
running it through X-Transformer. In the end, the predictions are converted to the format required by
the competition.


   In addition, we also created a label mapping file containing the correspondence between
each eCIE-O code and its numeric identifier in the vocabulary file. For example, the term
‘Células tumorales benignas’, which has the corresponding eCIE-O code ‘8001/0’, is the eleventh
element in the vocabulary file, thus its numeric identifier is ‘10’. This label mapping file
was later used to map the predictions from their numeric identifiers to the corresponding
eCIE-O codes required for the task.
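   A small sketch of how the vocabulary and label mapping files can be produced is shown below;
the exact column layout expected by X-Transformer is an assumption.

    def write_vocab_and_mapping(codes_with_descriptions, vocab_path, map_path):
        """Assign each eCIE-O code a numeric identifier (0..N-1) and write both
        the vocabulary file (identifier, code and description) and the mapping
        file (code -> identifier) used later to decode the predictions."""
        with open(vocab_path, "w", encoding="utf-8") as vocab, \
             open(map_path, "w", encoding="utf-8") as mapping:
            for identifier, (code, description) in enumerate(codes_with_descriptions):
                vocab.write(f"{identifier}\t{code} {description}\n")
                mapping.write(f"{code}\t{identifier}\n")

    # toy usage with two hypothetical entries
    entries = [("8000/0", "Neoplasia benigna"), ("8001/0", "Células tumorales benignas")]
    write_vocab_and_mapping(entries, "vocab.txt", "label_map.txt")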
   The results of X-Transformer are given in the form of sparse matrices, with a number
of rows equal to the number of clinical cases that compose the test set, and the number of
columns corresponding to the possible labels. The prediction for each clinical case was retrieved,
comprising the top K most relevant labels and their confidence values, ranging from -0.99
to 1. We used K=20 labels per clinical case. Then, using a Python script, we converted the
predicted labels from their numeric identifiers to their corresponding eCIE-O codes using the
label mapping file previously created. The script also discarded each label with a confidence
score under a threshold chosen to achieve the highest Precision, Recall or F1 scores. For each
of these measures, a ’.tsv’ file was created with the predictions for each clinical case in the
format required by the competition. A fourth ’.tsv’ file was also created for the score threshold
equal to 0, which was used as a baseline score. In the end, the files were used as input for the
evaluation script given by the CANTEMIST competition so that we could retrieve the Mean
Average Precision (MAP) scores for the predictions of our models.
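   The following sketch illustrates this post-processing step under simplifying assumptions: the
prediction matrix is a SciPy CSR matrix, the output columns are named ‘file’ and ‘code’, and the
confidence threshold is chosen beforehand.

    import numpy as np
    from scipy.sparse import csr_matrix

    def write_predictions(pred: csr_matrix, doc_ids, id_to_code, threshold, out_path, k=20):
        """Keep the top-k predicted labels per clinical case whose confidence is
        above the threshold and write them in a tab-separated file."""
        with open(out_path, "w", encoding="utf-8") as out:
            out.write("file\tcode\n")
            for row, doc_id in enumerate(doc_ids):
                start, end = pred.indptr[row], pred.indptr[row + 1]
                labels, scores = pred.indices[start:end], pred.data[start:end]
                for idx in np.argsort(-scores)[:k]:
                    if scores[idx] >= threshold:
                        out.write(f"{doc_id}\t{id_to_code[labels[idx]]}\n")

    # toy usage: one clinical case with two predicted labels and a recall-style threshold
    matrix = csr_matrix(np.array([[0.0, 0.7, -0.9]]))
    write_predictions(matrix, ["cc_onco1"], {1: "8001/0", 2: "8000/6"}, -0.8, "preds.tsv")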
   In the test and background sets given by the competition organizers, there were no eCIE-O codes
indexing the clinical cases, so we had to put a placeholder label on each document, because
X-Transformer was not prepared to run on unlabelled data. We also had to artificially adapt the
size of the given test and background sets by splitting them into a total of 48 smaller sets
of 250 clinical cases each. This procedure was necessary so that the files could have the same
size as the test sets used on the trained X-Transformer models, which had a total of 250 articles.
The first 109 lines of each of those files were composed of 109 clinical cases from the test and
background sets to classify, and the remaining 141 came from the dev1 set, which was already
classified with eCIE-O codes. These 141 clinical cases were used as an additional validation set
to define the confidence threshold values of our submissions.
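   A sketch of this splitting strategy, under the stated assumptions (lists of already formatted
lines, 109 unlabelled cases per chunk padded with 141 labelled dev1 cases), could look as follows:

    def build_test_chunks(unlabelled_lines, dev1_lines, chunk_size=109, pad_size=141):
        """Split the test+background cases into chunks of 109 lines and pad each
        chunk with 141 labelled dev1 cases so every file has 250 clinical cases."""
        chunks = []
        for start in range(0, len(unlabelled_lines), chunk_size):
            chunk = unlabelled_lines[start:start + chunk_size]
            chunks.append(chunk + dev1_lines[:pad_size])
        return chunks

    # 5232 test+background cases / 109 per chunk = 48 files of 250 cases each
    chunks = build_test_chunks([f"placeholder\tcase {i}" for i in range(5232)],
                               [f"8000/6\tdev1 case {i}" for i in range(141)])
    print(len(chunks), len(chunks[0]))  # 48 250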

2.3.3. Developed Models
In a first iteration, we trained 4 models using the 501 indexed clinical cases that composed
the train set; the dev2 set, which comprised a total of 250 indexed clinical cases, was used as the
test set. One of our models was trained using BERT Base Multilingual Cased and another was
trained using BETO, the Spanish version of BERT.
   The other two models were trained with two X-Transformer models that we had previously
developed for the Spanish biomedical domain using biomedical articles in Spanish retrieved
from the IBECS, LILACS and PubMed databases, along with a list of keywords identified for
each article using MER [31], a NER tool. The major difference between the two models was
that one of them was developed using 318,658 articles, while the other used twice as much data,
with a total of 637,316 articles. We shall call these two models Spanish Biomedical X-Transformer
and Spanish Biomedical X-Transformer large, respectively. Summarizing, the 4 models had
the following characteristics:

    • Model 1: BERT base Multilingual Cased finetuned with the clinical records.

    • Model 2: BETO finetuned with the clinical records.

    • Model 3: Spanish Biomedical X-Transformer finetuned with the clinical records.

    • Model 4: Spanish Biomedical X-Transformer large finetuned with the clinical records.

   In a second iteration, we trained 4 additional models with the same characteristics as
the previous ones, but using a larger train set composed of the 501 clinical cases of the
CANTEMIST-CODING train set plus 249 additional, already classified clinical cases from the
dev1 set. This way, we expected to achieve better results, since the models were trained with
additional clinical cases. Summarizing, these four models had the following characteristics:

    • Model 5: BERT base Multilingual Cased finetuned with 750 clinical records.

    • Model 6: BETO finetuned with 750 clinical records.

    • Model 7: Spanish Biomedical X-Transformer finetuned with 750 clinical records.

    • Model 8: Spanish Biomedical X-Transformer large finetuned with 750 clinical records.



   All models were trained using the default parameters of X-Transformer, except for the eval
and train batch sizes, which were both changed from their original values of 64 and 32 to 4 due
to hardware constraints. We also set the number of gradient accumulation steps to 2 to
compensate for the small batch size. Each model was trained for 12 epochs on a single NVIDIA
Tesla P4 GPU.

2.3.4. Preliminary Results
In order to choose which model predictions to submit, we evaluated the predictions
made by each model on the dev2 set using the evaluation script provided by the competition
organizers. As explained before, each model had 4 ’.tsv’ files as output: three files containing
the predictions above the confidence score thresholds tuned for the best precision, recall and
F1-score, respectively, and one file for the baseline, corresponding to a confidence score
threshold of 0, the middle of the X-Transformer confidence score scale. Each file was then used
as input for the evaluation script provided by the competition organizers.
   The results for each model can be seen in Table 1. The highest MAP scores are achieved when
the predictions are focused on achieving the highest recall scores, especially when using the
models trained with the Spanish Biomedical X-Transformer models. The models that use BETO
seem to achieve higher scores than the ones that used BERT Multilingual. In addition, there is
no clear benefit from using more clinical cases to train the models: some models achieved
slightly higher MAP scores when the evaluation was focused on precision or on the F1-score,
while others scored lower than the models trained with fewer articles.
   Taking this into consideration, we chose the recall-focused predictions of 5 distinct models
to submit for the CANTEMIST-CODING task, so that we could also compare their performance
in the competition. The chosen models were Models 2, 3, 4, 5 and 7. We then retrieved the
predictions made by the models for each of the 48 text files that contained the test and
background sets data. Then, for each of the resulting prediction files, the first 109 lines,
corresponding to the predictions made for the test and background sets, were stored in the ’.tsv’
files, while the other 141 lines, corresponding to the predictions for the labelled cases from
the first development set, were used to find the confidence score threshold that achieved the
best recall score and that was then used to choose which predictions would be stored in the
final ’.tsv’ file.


3. Results and discussion
The results for the CANTEMIST-NER subtask are available in Table 2.
   Our model obtained an F1-score of 0.749 in this subtask. Some errors prevented a higher
performance, such as those related to the span of some detected entities. For example, in the
document “cc_onco94.ann", the mention “linfoma" is correctly recognised by the NER tagger,
but the processing script attributed the span “1690 1697", whereas the correct span would be
“1691 1698". Besides, in some cases the NER tagger only recognised incomplete entity mentions.



              Model                                    Focus       MAP     Threshold
              Model 1                                  Baseline    0.222       0
              BERT Base Multilingual Cased             F1          0.366     -0.66
                                                       Precision   0.18       0.22
                                                       Recall      0.384     -0.81
              Model 2                                  Baseline    0.267       0
              BETO                                     F1          0.385     -0.48
                                                       Precision   0.188      0.48
                                                       Recall      0.438     -0.83
              Model 3                                  Baseline    0.293       0
              Spanish Biomedical X-Transformer         F1          0.378     -0.34
                                                       Precision   0.191      0.52
                                                       Recall      0.446     -0.82
              Model 4                                  Baseline    0.281       0
              Spanish Biomedical X-Transformer large   F1          0.396     -0.45
                                                       Precision   0.18       0.6
                                                       Recall      0.448     -0.82
              Model 5                                  Baseline    0.203       0
              BERT Base Multilingual Cased             F1          0.372     -0.46
              (750 CC)                                 Precision   0.186      0.12
                                                       Recall      0.407     -0.83
              Model 6                                  Baseline    0.224       0
              BETO                                     F1          0.371     -0.46
              (750 CC)                                 Precision   0.18       0.39
                                                       Recall      0.417     -0.82
              Model 7                                  Baseline    0.25        0
              Spanish Biomedical X-Transformer         F1          0.392     -0.4
              (750 CC)                                 Precision   0.2        0.32
                                                       Recall      0.442     -0.82
              Model 8                                  Baseline    0.268       0
              Spanish Biomedical X-Transformer large   F1          0.379     -0.43
              (750 CC)                                 Precision   0.194      0.29
                                                       Recall      0.427     -0.81
Table 1
Preliminary results of the trained models for the CODING task using the second development set.
Bold values correspond to the three highest values achieved in the Mean Average Precision (MAP) mea-
sure using the evaluation script given by the competition organizers.
Green lines correspond to the models and corresponding evaluation focus whose predictions were cho-
sen to submit for the CANTEMIST-CODING task.


For example, in the document “cc_onco89.ann", the NER tagger recognised the entity “implantes
mediastínicos", whereas the full entity mention would be “implantes mediastínicos pleurales".
   For future work, it would be interesting to train the other models besides the “medium" one
(“base", “large" and “pubmed") on the CANTEMIST corpus and to apply them to the test set to
verify whether they obtain a higher performance. Besides, we only trained the Flair embeddings
for fewer than 100 epochs, so with more epochs the performance of the NER tagger would probably
be higher. The NER tagger itself could also be trained for more epochs, up to 150, according to a



                                  Model        P        R      F1
                                  medium     0.787    0.714   0.749
Table 2
Results for the CANTEMIST-NER subtask. P, R and F1 refer, respectively, to Precision, Recall and F1-
score.

                 Model         P       R       F1      P-No-M     R-No-M   F1-No-M
              1.single-ont   0.063   0.057   0.060      0.059      0.082    0.069
              2.multi-ont    0.064   0.058   0.061      0.059      0.080     0.068
Table 3
Results for the CANTEMIST-NORM subtask. P, R and F1 refer, respectively, to Precision, Recall and
F1-score, and P-No-M, R-No-M and F1-No-M refer, respectively, to the Precision, Recall and F1-score
calculated without considering the mentions to metastasis (8000/6 code). Bold values correspond to
the highest achieved scores in our submissions.


suggestion by the authors of Flair. We will also address the errors associated with the span of
the recognised entities.
   The results for the CANTEMIST-NORM subtask are available in Table 3.
   The model “multi-ont" obtained an F1-score of 0.061, which represents a slight improvement
of +0.001 compared to the “single-ont" model. Still, the performance of the model was too
low, which is mainly related to the candidate retrieval step. Since we used string matching
for candidate retrieval and selected the top candidates according to the edit distance between
candidate and entity mention, most of the candidate lists did not contain the correct codes for
the respective entity mentions. For example, the correct code for the entity mention “cáncer"
would be the eCIE-O-3.1 code 8000/6, corresponding to the concept “Neoplasia metastásica".
However, neither the “single-ont" model nor the “multi-ont" model was able to retrieve this
candidate code for the entity mention, because the edit distance between “cáncer" and “Neoplasia
metastásica" is too high. Besides, the lack of a synonyms list in the eCIE-O-3.1 terminology
further exacerbates the problem. Another aspect worth mentioning is that the performance
of the normalization model is always dependent on the performance of the NER tagger, so an
incorrect output returned by the latter will hinder the results outputted by the former.
   In order to improve the normalization model, we will explore alternative methods for candidate
generation, such as the use of word embeddings, both for entity mentions and for the
terminology concepts. Instead of just considering that two entities in the same sentence are
related, we will also use a proper relation extraction tool to obtain more semantically meaningful
relations between concepts, either belonging to the same terminology or to different
terminologies (for example, a relation between an eCIE-O-3 and a DeCS concept), which will
densify the disambiguation graph and improve the disambiguation precision.
   The results for the CANTEMIST-CODING subtask are available in Table 4.
   Looking at our results, we can observe that, contrary to what we expected, the best scoring
model was Model 5, which, in our preliminary evaluation, had achieved the lowest MAP score
of the five models submitted for this task. The lowest MAP scores were achieved by Models
3, 4 and 7, which were trained using the Spanish Biomedical X-Transformer models and which


     Model      MAP        P       R        F1     MAP-No-M       P-No-M     R-No-M      F1-No-M
     Model 2    0.463    0.157   0.549    0.244      0.350         0.119      0.466        0.189
     Model 3    0.449    0.159   0.517    0.243      0.333         0.118      0.427        0.184
     Model 4    0.455    0.151   0.532    0.235      0.344         0.113      0.445        0.180
     Model 5    0.506    0.211   0.601    0.312      0.399         0.167      0.527       0.254
     Model 7    0.459    0.197   0.541    0.289      0.346         0.151      0.456        0.226
Table 4
Results for the CANTEMIST-CODING subtask. MAP, P, R and F1 refer, respectively, to Mean Average
Precision, Precision, Recall and F1-score, and MAP-No-M, P-No-M, R-No-M and F1-No-M refer, respec-
tively, to the Mean Average Precision, Precision, Recall and F1-score calculated without considering the
mentions to metastasis (8000/6 code).
Bold values correspond to the highest scores for each metric in our submissions.


had achieved the highest MAP scores in the preliminary evaluation. We can also notice that
the precision and F1 scores were lower than the recall scores, but that was expected, since
our submissions contained the predictions that used a confidence score threshold focused on
achieving the highest recall scores.
   As a proposal for future work, we could develop additional Transformer models, like the
Spanish Biomedical X-Transformer models, and use them to train other X-Transformer models,
aiming to increase the number of correctly identified eCIE-O codes. These models could be
trained using scientific biomedical articles in Spanish or even the clinical cases provided by the
CANTEMIST competition.
   Another possible solution would be to preprocess the clinical case files before running
them through X-Transformer, reducing the amount of text from each clinical case. This could
be achieved by using automatic text summarization tools to keep only the essential information
about each clinical case. Then, using NER tools, we could retrieve key entities and/or related
terms and include them in the summarized text. By reducing the original clinical case to a
smaller and more objective text, along with the key terms and entities identified by the NER
tools, the X-Transformer model is expected to achieve better results, since it would work with a
smaller and more concise string of words rather than a larger amount of text in which the key
topics are more diluted.


4. Conclusion
We obtained an F1-score of 0.749 and 0.069 for the CANTEMIST-NER and the CANTEMIST-NORM
subtasks, respectively, and a MAP of 0.506 for the CANTEMIST-CODING subtask. The code
to run the developed models is available in our GitHub page: https://github.com/lasigeBioTM/
CANTEMIST-Participation.
   To improve the NER tagger, we intend to resume the training of the Flair embeddings up
to 2000 epochs, to generate a larger language model, with a hidden size of 2048 instead of 1024,
and to train the tagger over more text by including the development 2 set in the training
process. For the normalization model, we intend to explore the use of word embeddings
in the candidate generation process and the use of a relation extraction tool to build better
disambiguation graphs. As to improvements in the classification model, we intend to explore
additional X-Transformer models trained with new Transformer models based on biomedical
data in Spanish, combined with other NLP techniques, such as automatic text summarization
and NER, in order to further improve the results achieved by the algorithm.


Acknowledgements
This project was supported by FCT through the funding of the DeST: Deep Semantic Tagger project,
ref. PTDC/CCI-BIO/28685/2017, and of the LASIGE Research Unit, ref. UIDB/00408/2020.


References
 [1] O. Asan, P. Smith, E. Montague, More Screen Time, Less Face time – Implications for EHR
     Design, Journal of Evaluation in Clinical Practice 20 (2014) 896–901. doi:10.1111/jep.
     12182.
 [2] P. Sfakianaki, L. Koumakis, S. Sfakianakis, G. Iatraki, G. Zacharioudakis, N. Graf, K. Marias,
     M. Tsiknakis, Semantic biomedical resource discovery : a Natural Language Processing
     framework, BMC Medical Informatics and Decision Making 15 (2015). URL: http://dx.doi.
     org/10.1186/s12911-015-0200-4. doi:10.1186/s12911-015-0200-4.
 [3] P. Ernst, A. Siu, D. Milchevski, J. Hoffart, G. Weikum, DeepLife: An Entity-aware
     Search, Analytics and Exploration Platform for Health and Life Sciences, in: Proceed-
     ings of the 54th Annual Meeting of the Association for Computational Linguistics—System
     Demonstrations, Association for Computational Linguistics, Berlin, Germany, August
     7-12, 2016, 2016, pp. 19–24. URL: http://www.aclweb.org/anthology/P16-4004.
 [4] M. Vazquez, M. Krallinger, F. Leitner, A. Valencia, Text mining for drugs and chemical
     compounds: Methods, tools and applications, Molecular Informatics 30 (2011) 506–519.
     doi:10.1002/minf.201100005.
 [5] J. He, M. de Rijke, M. Sevenster, R. van Ommering, Y. Qian, Generating Links to Background
     Knowledge: A Case Study Using Narrative Radiology Reports, in: CIKM’11, ACM, October
     24–28, 2011, Glasgow, Scotland, UK., 2011, p. 1867. doi:10.1145/2063576.2063845.
 [6] A. Miranda-Escalada, E. Farré, M. Krallinger, Named entity recognition, concept normal-
     ization and clinical coding: Overview of the cantemist track for cancer text mining in
     spanish, corpus, guidelines, methods and results, in: Proceedings of the Iberian Languages
     Evaluation Forum (IberLEF 2020), CEUR Workshop Proceedings, 2020.
 [7] A. Akbik, T. Bergmann, D. Blythe, K. Rasul, S. Schweter, R. Vollgraf, FLAIR: An easy-to-use
     framework for state-of-the-art NLP, in: NAACL HLT 2019 - 2019 Conference of the North
     American Chapter of the Association for Computational Linguistics: Human Language
     Technologies - Proceedings of the Demonstrations Session, 2019, pp. 54–59.
 [8] A. Lamurias, P. Ruas, F. M. Couto, PPR-SSM: Personalized PageRank and semantic sim-
     ilarity measures for entity linking, BMC Bioinformatics 20 (2019) 1–12. doi:10.1186/
     s12859-019-3157-y.
 [9] Z. Huang, W. Xu, K. Yu, Bidirectional LSTM-CRF Models for Sequence Tagging (2015).
     URL: http://arxiv.org/abs/1508.01991. arXiv:1508.01991.



[10] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional
     transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.
     org/abs/1810.04805. arXiv:1810.04805.
[11] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep
     contextualized word representations, in: Proceedings of the 2018 Conference of the North
     American Chapter of the Association for Computational Linguistics: Human Language
     Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Or-
     leans, Louisiana, 2018, pp. 2227–2237. URL: https://www.aclweb.org/anthology/N18-1202.
     doi:10.18653/v1/N18-1202.
[12] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT: a pre-trained biomedical
     language representation model for biomedical text mining, Bioinformatics (2019) 1–7.
     doi:10.1093/bioinformatics/btz682. arXiv:1901.08746.
[13] E. Alsentzer, J. Murphy, W. Boag, W.-H. Weng, D. Jindi, T. Naumann, M. McDermott,
     Publicly Available Clinical BERT Embeddings, in: Proceedings of the 2nd Clinical Natural
     Language Processing Workshop, Association for Computational Linguistics, Minneapolis,
     Minnesota, 2019, pp. 72–78. doi:10.18653/v1/w19-1909.
[14] M. Pershina, Y. He, R. Grishman, Personalized Page Rank for Named Entity Disambiguation,
     in: Human Language Technologies: The 2015 Annual Conference of the North American
     Chapter of the ACL, Section 4, Association for Computational Linguistics, Denver, Colorado,
     May 31 – June 5, 2015, 2015, pp. 238–243.
[15] Z. Guo, D. Barbosa, Robust named entity disambiguation with random walks, Semantic
     Web 9 (2018) 459–479. doi:10.3233/SW-170273.
[16] Y. Cao, L. Hou, J. Li, Z. Liu, Neural Collective Entity Linking (2018). URL: http://arxiv.org/
     abs/1811.08603. arXiv:1811.08603.
[17] O.-E. Ganea, T. Hofmann, Deep Joint Entity Disambiguation with Local Neural At-
     tention, in: Proceedings of the 2017 Conference on Empirical Methods in Natural
     Language Processing, Association for Computational Linguistics, Copenhagen, Den-
     mark, September 7–11, 2017, 2017, pp. 2619–2629. URL: http://arxiv.org/abs/1704.04920.
     doi:10.18653/v1/d17-1277. arXiv:1704.04920.
[18] Z. Ji, Q. Wei, H. Xu, BERT-based Ranking for Biomedical Entity Normalization (2019). URL:
     http://arxiv.org/abs/1908.03548. arXiv:1908.03548.
[19] X. Yin, Y. Huang, B. Zhou, A. Li, L. Lan, Y. Jia, Deep Entity Linking via Eliminating
     Semantic Ambiguity With BERT, IEEE Access 7 (2019) 169434–169445. doi:10.1109/
     ACCESS.2019.2955498.
[20] K. Bhatia, H. Jain, P. Kar, M. Varma, P. Jain, Sparse local embeddings for extreme multi-label
     classification, in: Advances in Neural Information Processing Systems, 2015.
[21] J. Liu, W. C. Chang, Y. Wu, Y. Yang, Deep learning for extreme multi-label text classification,
     in: SIGIR 2017 - Proceedings of the 40th International ACM SIGIR Conference on Research
     and Development in Information Retrieval, 2017. doi:10.1145/3077136.3080834.
[22] Y. Kim, Convolutional neural networks for sentence classification, in: EMNLP 2014 - 2014
     Conference on Empirical Methods in Natural Language Processing, Proceedings of the
     Conference, 2014. doi:10.3115/v1/d14-1181. arXiv:1408.5882.
[23] R. You, Z. Zhang, S. Dai, S. Zhu, Haxmlnet: Hierarchical attention network for extreme
     multi-label text classification, CoRR abs/1904.12578 (2019). URL: http://arxiv.org/abs/1904.



     12578. arXiv:1904.12578.
[24] W.-C. Chang, H.-F. Yu, K. Zhong, Y. Yang, I. Dhillon, Taming Pretrained Transformers
     for Extreme Multi-label Text Classification, 2020. URL: http://arxiv.org/abs/1905.02331v4.
     arXiv:1905.02331.
[25] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoy-
     anov, Roberta: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692
     (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.
[26] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, Q. V. Le, Xlnet: Generalized
     autoregressive pretraining for language understanding, CoRR abs/1906.08237 (2019). URL:
     http://arxiv.org/abs/1906.08237. arXiv:1906.08237.
[27] Y. Prabhu, A. Kag, S. Harsola, R. Agrawal, M. Varma, Parabel: Partitioned label trees for
     extreme classification with application to dynamic search advertising, in: 2018 World
     Wide Web Conference, International World Wide Web Conferences Steering Committee,
     International World Wide Web Conferences Steering Committee, 2018, pp. 993–1002.
[28] A. Akbik, D. Blythe, R. Vollgraf, Contextual String Embeddings for Sequence Labeling, in:
     Proceedings of the 27th International Conference on Computational Linguistics, 2018, pp.
     1638–1649. URL: https://github.com/zalandoresearch/flair.
[29] F. M. Couto, A. Lamurias, Semantic Similarity Definition, Reference Module in Life Sci-
     ences (2018) 0–16. URL: http://linkinghub.elsevier.com/retrieve/pii/B9780128096338204019.
     doi:10.1016/B978-0-12-809633-8.20401-9.
[30] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank Citation Ranking: Bringing
     Order to the Web, Technical Report, Stanford InfoLab, 1998. URL: http://ilpubs.stanford.edu:
     8090/422/.
[31] F. M. Couto, A. Lamurias, MER: a shell script and annotation server for minimal
     named entity recognition and linking, Journal of Cheminformatics 10 (2018) 58. URL:
     https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0312-9. doi:10.1186/
     s13321-018-0312-9.
[32] J. Cañete, G. Chaperon, R. Fuentes, J. Pérez, Spanish pre-trained bert model and evaluation
     data, in: to appear in PML4DC at ICLR 2020, 2020.



