<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LasigeBioTM at CANTEMIST: Named Entity Recognition and Normalization of Tumour Morphology Entities and Clinical Coding of Spanish Health-related Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pedro Ruas</string-name>
          <email>psruas@fc.ul.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andre Neves</string-name>
          <email>aneves@lasige.di.fc.ul.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vitor D.T. Andrade</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francisco M. Couto</string-name>
          <email>fcouto@di.fc.ul.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LASIGE, Faculdade de Ciências da Universidade de Lisboa</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <fpage>422</fpage>
      <lpage>437</lpage>
      <abstract>
        <p>The CANTEMIST track included three subtasks for the automatic assignment of codes related to tumour morphology entities to Spanish health-related documents: CANTEMIST-NER, CANTEMIST-NORM and CANTEMIST-CODING. For CANTEMIST-NER, we trained Spanish biomedical Flair embeddings on PubMed abstracts and then trained a BiLSTM+CRF Named Entity Recognition tagger on the CANTEMIST corpus using the trained embeddings. For CANTEMIST-NORM, we adapted a graph-based model that uses the Personalized PageRank algorithm to rank the eCIE-O-3.1 candidates for each entity mention. For CANTEMIST-CODING, we adapted X-Transformer, a state-of-the-art deep learning Extreme Multi-Label Classification algorithm, to classify the clinical cases with a ranked list of eCIE-O-3.1 terms in a multilingual biomedical setting. We obtained F1-scores of 0.749 and 0.069 in the CANTEMIST-NER and CANTEMIST-NORM subtasks, respectively, and our best-scoring submission achieved a MAP score of 0.506 in the CANTEMIST-CODING subtask.</p>
      </abstract>
      <kwd-group>
        <kwd>CANTEMIST</kwd>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Normalization</kwd>
        <kwd>Coding</kwd>
        <kwd>Text Mining</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Clinical Text</kwd>
        <kwd>Extreme Multi-Label Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Applying Natural Language Processing (NLP)/Text Mining approaches to clinical text brings
several benefits, such as improved decision-making in clinical contexts. The use of electronic
health records is associated with less doctor-patient interaction [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], so tools that can automatically extract relevant information from
clinical notes can free up doctors to interact directly with patients. Moreover, these tools have
the potential to improve biomedical [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] and pharmaceutical research [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and to democratise
access to clinical information for lay users [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        In the present work, we describe the participation of the LasigeBioTM team in the CANTEMIST
(“CANcer TExt Mining Shared Task – tumor named entity recognition”) competition [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which
included a corpus of Spanish health-related documents and three different subtasks with the
following goals:
• CANTEMIST-NER: Automatically recognising and locating tumour morphology mentions.
• CANTEMIST-NORM: Returning all tumour morphology mentions normalised to
their respective codes from the eCIE-O-3.1 (“Clasificación Internacional de
Enfermedades para Oncología - 3ª edición, 1ª revisión”<sup>1</sup>) terminology.
• CANTEMIST-CODING: Classifying clinical cases by returning a ranked list of
eCIE-O-3.1 codes for each document.
      </p>
      <p>
        For the CANTEMIST-NER subtask, we used the Flair framework [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] to train new Flair
embeddings on Spanish translations of PubMed abstracts and to train a NER tagger with a
BiLSTM+CRF architecture on the CANTEMIST corpus, leveraging the trained embeddings. For the
CANTEMIST-NORM subtask, we used the PPR-SSM model [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to normalise the entities
recognised by the NER tagger. This model builds a disambiguation graph for each document, where
the nodes are the candidate codes retrieved from the eCIE-O-3.1 terminology for the entities
present in the document and the edges are based on the hierarchy of the eCIE-O-3.1 terminology.
A variation of this model additionally retrieves candidates from other terminologies, such as
CIE-10-ES and DeCS, and extracts relations between concepts from these terminologies and the
codes from the eCIE-O-3.1 terminology to improve the edge structure of the graph. The
Personalised PageRank (PPR) algorithm assigns a weight to each candidate, and the highest-scored
candidate is selected as the code for the respective entity. For the CANTEMIST-CODING subtask,
we built a pipeline around X-Transformer, a deep learning Extreme Multi-Label Classification
(XMLC) algorithm, and adapted it to a multilingual biomedical setting, so that it could process
each clinical case and classify it with the eCIE-O-3.1 terms most related to the document.
      </p>
      <sec id="sec-1-1">
        <title>1.1. Related work</title>
        <p>
          In the Named Entity Recognition (NER) task, state-of-the-art approaches usually have a
BiLSTM-CRF architecture, which was initially proposed by Huang et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. LSTM (Long Short-Term
Memory) networks are Recurrent Neural Networks (RNN), which means that these networks
have a recurrent layer connecting different features at different time frames. BiLSTM
networks can leverage both the past and the future features for a given time frame. An
input layer represents a given set of features, in this case text tokens, at a given time and
the output layer represents the probability distribution for each label at that time. The CRF
(Conditional Random Fields) models, in turn, focus on sentence-level tag information. More
recently, pre-trained language models, like BERT [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] or ELMo [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], have been fine-tuned to the
NER task. BERT has a multilayer bidirectional transformer encoder architecture and, in contrast
to RNNs, employs an attention mechanism to establish the dependency between the input
features and the output. The original BERT implementation was trained on general corpora,
such as the BookCorpus and the English Wikipedia, but since then a plethora of domain-specific
versions have been proposed, including BioBERT [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and ClinicalBERT [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
<sup>1</sup> https://eciemaps.mscbs.gob.es/ecieMaps/browser/index_o_3.html
        </p>
        <p>
          The state-of-the-art approaches in Entity Normalization (also called Disambiguation or
Linking) include graph-based models [
          <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
          ], neural network-based models [
          <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
          ] and,
as in the NER task, more recently fine-tuned pre-trained language
models [
          <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
          ]. The graph-based models usually focus on building a graph containing the
candidates for the entity mentions and then on ranking the candidates according to the relevance
or coherence of each candidate in the graph. These are global models, since the disambiguation
decision for a given entity mention depends on the other disambiguations in the same graph,
although there is sometimes also a module responsible for candidate retrieval or for determining
the local similarity between candidates and mentions. The neural network-based models
usually take into account both the global coherence of the candidates and the local similarity.
In these models, however, the candidates and the entities are typically represented by word
embeddings, which are then integrated in the neural network. On the other hand, BERT-based models like the
one proposed by Ji et al. [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] focus on the generation of contextualised word embeddings for
candidates and entities and then on candidate ranking, which is treated as a sequence-pair
classification task. In this case, the model disambiguates each entity independently,
based on word representations and other local features.
        </p>
        <p>
          As for XMLC, several machine learning solutions have been developed in the last decade [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ],
but only more recently have deep learning solutions been applied to XMLC. One of the
first attempts was XML-CNN [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], a convolutional neural network that was adapted from
a state-of-the-art approach to a multi-class classification task [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], with some changes to the
neural network layers that allowed it to capture features more precisely from different regions
of text. There was also HAXMLNet [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], which used a BiLSTM RNN with a multi-label attention
layer to capture the most relevant parts of the text, along with a hierarchical clustering algorithm
to divide labels into clusters, which proved efficient on larger datasets. Lastly, there is
X-Transformer [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], the first deep learning approach to scale pre-trained Transformer models,
such as BERT [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], RoBERTa [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] or XLNet [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], to XMLC. The algorithm uses a three-stage
framework that first semantically indexes all the possible labels in clusters using ELMo [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
Then, using a deep learning Transformer model, it matches each text instance to the most relevant
clusters and, finally, ranks the labels retrieved from those cluster indices. X-Transformer
surpassed other state-of-the-art methods in XMLC in four benchmark datasets and it was also
applied to a query recommendation dataset from Amazon, where it showed improvements
of more than 10% over Parabel [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ], one of the most commonly used and competitive XMLC
algorithms.
        </p>
        <p>Our team has already adapted X-Transformer to the biomedical domain
in the BioASQ MESINESP competition that occurred earlier this year. In MESINESP, the goal
was to index a large dataset of biomedical articles written in Spanish using DeCS terms. In the
final scoreboard, our X-Transformer approach achieved high scores in the precision
measures, surpassing most competing systems in those measures.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>
        The goal of the CANTEMIST-NER subtask was to recognise and locate tumour morphology entities in Spanish
health-related documents. We used the Flair framework [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] to develop a Spanish biomedical NER
tagger.
      </p>
      <sec id="sec-2-1">
        <title>2.1.1. Training of Flair embeddings</title>
        <p>
          We trained new Flair contextualised embeddings [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] on Spanish biomedical text, namely
translated abstracts of PubMed articles available at
https://temu.bsc.es/mesinesp/index.php/download/translated-pubmed-articles/. We considered 4 subsets
of articles, each split 80%/10%/10% into train, validation and test files, respectively:
1. 32,500 articles, 40,987,614 tokens
2. 32,500 articles, 35,352,727 tokens
3. 32,500 articles, 39,021,229 tokens
4. 32,500 articles, 40,005,075 tokens
        </p>
        <p>The 4 splits contained a total of 130,000 articles and 155,366,645 tokens, of which 143,387,385
corresponded to training tokens.</p>
        <p>We generated a language model for each split with hidden_size = 1024, nlayers = 1, dropout
= 0.1, using the following training parameters:
• sequence_length = 250
• mini_batch_size = 100
• max_epochs = 2000
• patience = 25</p>
        <p>We trained forward and backward embeddings on each split using a single NVIDIA Tesla P4
GPU, since the current version of Flair does not have multi-GPU support, and interrupted the
training after a variable number of epochs, ranging from 71 (backward embeddings
on split 1) to 99 (backward embeddings on split 2).</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.1.2. Pre-processing of the CANTEMIST corpus</title>
        <p>We converted the train, development 1 (dev1) and development 2 (dev2) sets of the CANTEMIST
corpus<sup>2</sup> to the IOB2 format<sup>3</sup> using the Flair sentence segmenter jointly with the Flair tokenizer.
Each token was tagged with the label “B-MOR_NEO” if it corresponded to the beginning of an
annotation, the label “I-MOR_NEO” if it corresponded to the inside of an annotation, and the
label “O” if it was outside of any annotation. The content of the train, dev1 and dev2
sets was written, respectively, to the files “train.txt”, “dev.txt” and “test.txt”. The corpus was then
loaded into a Flair “ColumnCorpus” object to allow the subsequent training of the NER tagger.
<sup>2</sup> https://temu.bsc.es/cantemist/?p=4338
<sup>3</sup> https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)</p>
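        <p>The IOB2 labelling rule described above can be sketched as follows. This is a minimal illustration, not the conversion script used in this work: it assumes whitespace tokenisation instead of the Flair tokenizer, and the function name iob2_tags and the example sentence are hypothetical.</p>
        <preformat>
```python
def iob2_tags(token_spans, annotation_spans, label="MOR_NEO"):
    """Assign an IOB2 tag to each token from character-offset annotations.

    token_spans: list of (start, end) character offsets, one per token.
    annotation_spans: list of (start, end) offsets of annotated mentions.
    """
    tags = []
    for tok_start, tok_end in token_spans:
        tag = "O"
        for ann_start, ann_end in annotation_spans:
            # A token belongs to a mention if it lies inside the mention span.
            if tok_start >= ann_start and ann_end >= tok_end:
                # The first token of a mention starts at the mention offset.
                tag = ("B-" if tok_start == ann_start else "I-") + label
                break
        tags.append(tag)
    return tags


# Hypothetical example sentence (not taken from the CANTEMIST corpus).
text = "Diagnóstico de carcinoma ductal infiltrante ."
tokens = text.split()

# Recover character offsets for the whitespace-separated tokens.
spans, cursor = [], 0
for tok in tokens:
    start = text.index(tok, cursor)
    spans.append((start, start + len(tok)))
    cursor = start + len(tok)

mention_start = text.index("carcinoma")
mention_end = text.index("infiltrante") + len("infiltrante")
tags = iob2_tags(spans, [(mention_start, mention_end)])
for tok, tag in zip(tokens, tags):
    print(tok, tag)
```
        </preformat>
        <p>Each printed line pairs a token with its tag, which is the two-column layout that a column-formatted corpus file such as “train.txt” would contain.</p>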
      </sec>
      <sec id="sec-2-3">
        <title>2.1.3. Training of the Spanish biomedical NER tagger</title>
        <p>We defined four candidate models, which differ in the embeddings they use:
1. “base”: uses Flair embeddings (es-forward and es-backward) trained on Spanish Wikipedia
+ Spanish FastText embeddings.
2. “medium”: uses Flair embeddings (es-forward and es-backward) trained on Spanish
Wikipedia + Spanish FastText embeddings + PubMed Flair embeddings trained on 1
PubMed split.
3. “large”: uses Flair embeddings (es-forward and es-backward) trained on Spanish Wikipedia
+ Spanish FastText embeddings + PubMed Flair embeddings trained on 2 PubMed splits.
4. “pubmed”: uses PubMed Flair embeddings trained on 4 PubMed splits.</p>
        <p>
          We considered the default architecture for the sequence tagger: BiLSTM with a CRF decoding
layer, hidden_size = 256. The training parameters were set to:
• learning_rate = 0.1
• mini_batch_size = 32
• max_epochs = 55
• patience = 3
Owing to time constraints, we only trained the “medium” model, as we considered
it the safest approach: it leverages the trained biomedical embeddings jointly with the generally
available embeddings. After the training, we applied the model to predict the labels in the 5,232
documents belonging to the background + test set of the CANTEMIST corpus and to create an
annotation file in the BRAT format for each text document.
        </p>
        <p>
The goal of the CANTEMIST-NORM subtask was to perform NER and to normalise (disambiguate) the
recognised entities to the eCIE-O-3.1 terminology. We applied the model previously developed
for the NER task to recognise the entities and adapted the PPR-SSM model [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] to assign
an eCIE-O-3.1 code to each entity.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>2.2.1. Pre-processing of the NER output</title>
        <p>
          For each document, the first step was to apply the NER tagger to generate the NER output files
and then to retrieve the ten best eCIE-O-3.1 candidates for each recognised entity through
string matching, more concretely, according to the edit distance. The model then built a
disambiguation graph with the eCIE-O-3.1 candidates for all entities present in the document.
Two candidates (nodes) were considered linked in the graph if they were linked in the eCIE-O-3.1
hierarchy. For each candidate, we calculated the extrinsic information content (IC), which is a
measure of rareness: the IC of a given entity is high if that entity has few occurrences in an external
dataset [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]. In this case, we used the train, dev1 and dev2 sets of the
CANTEMIST corpus as the external dataset.
        </p>
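        <p>Candidate retrieval by edit distance can be illustrated with the sketch below. It is a simplified stand-in for the string matching used by the model: the function names and the four-entry toy terminology are assumptions made for the example, and ties are broken by code only to make the output deterministic.</p>
        <preformat>
```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def top_candidates(mention, terminology, k=10):
    """Return the k codes whose description is closest to the mention."""
    scored = sorted(
        terminology.items(),
        key=lambda item: (edit_distance(mention.lower(), item[1].lower()), item[0]),
    )
    return [code for code, _ in scored[:k]]


# Toy terminology for illustration only (code: description).
toy_terminology = {
    "8000/3": "neoplasia maligna",
    "8010/3": "carcinoma",
    "8140/3": "adenocarcinoma",
    "9590/3": "linfoma maligno",
}
best = top_candidates("carcinomas", toy_terminology, k=2)
print(best)
```
        </preformat>
        <p>With k = 10 and the full eCIE-O-3.1 terminology, the returned list would correspond to the ten candidates that become nodes in the disambiguation graph.</p>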
      </sec>
      <sec id="sec-2-5">
        <title>2.2.2. Entity disambiguation</title>
        <p>
          The model applied the Personalized PageRank (PPR) algorithm over each disambiguation graph.
PPR is a variation of PageRank [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ], which was originally proposed as an algorithm to rank the
relative importance of web pages. It considers the web a graph where each node is a page that
has links to other pages (forward links) and links from other pages (backlinks). The PageRank
algorithm simulates the behaviour of a "random surfer" on the web: from a given page, the surfer
can either follow one of the forward links in that page or jump to a random page belonging to
the graph. In the personalised variation, this jump is not random and instead always goes to a
chosen page. After successive iterations, the algorithm returns the probability distribution of
reaching each node in the graph. Nodes with more links will be reached more often, so
they will have more relevance in the context of the graph. PPR has also been applied to the
normalization of entities, but in this case the web graph is replaced by the disambiguation graph
containing the candidates for all entities in a given document. PPR traverses the graph and then
assigns a weight to each candidate according to its coherence or relevance to the graph: more
connected nodes will have higher weights. Additionally, in our model, the IC of each candidate
was also considered in the candidate ranking. The model selected the highest-scored candidate
for each entity and added the eCIE-O-3.1 codes to the annotation files output by the NER
tagger.
        </p>
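        <p>A minimal power-iteration sketch of Personalized PageRank over a small candidate graph is given below. It is not the PPR-SSM implementation: the damping factor of 0.85, the uniform teleport weights, the toy edges and the function name are all assumptions for illustration, and the IC term used in our ranking is left out.</p>
        <preformat>
```python
def personalized_pagerank(graph, personalization, damping=0.85, iterations=100):
    """Power iteration for Personalized PageRank on an undirected graph.

    graph: dict mapping each node to the list of its neighbours.
    personalization: dict mapping nodes to teleport weights (need not sum to 1).
    """
    nodes = list(graph)
    total = sum(personalization.get(n, 0.0) for n in nodes)
    teleport = {n: personalization.get(n, 0.0) / total for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        # Every node first receives its share of the teleport probability.
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for node in nodes:
            neighbours = graph[node]
            if neighbours:
                share = damping * rank[node] / len(neighbours)
                for neighbour in neighbours:
                    new_rank[neighbour] += share
            else:
                # A node without links sends its mass back to the teleport set.
                for n in nodes:
                    new_rank[n] += damping * rank[node] * teleport[n]
        rank = new_rank
    return rank


# Toy disambiguation graph; the edges are hypothetical, not real hierarchy links.
graph = {
    "8140/3": ["8140/6", "8010/3"],
    "8140/6": ["8140/3"],
    "8010/3": ["8140/3"],
    "9590/3": [],
}
uniform = {node: 1.0 for node in graph}
ranks = personalized_pagerank(graph, uniform)

# Candidates of one mention: keep the one with the highest weight.
mention_candidates = ["8140/3", "9590/3"]
best = max(mention_candidates, key=ranks.get)
print(best)
```
        </preformat>
        <p>With uniform teleport weights the procedure reduces to ordinary PageRank; concentrating the weights on selected nodes gives the personalised behaviour described above. The connected candidate wins over the isolated one, which mirrors the statement that more connected nodes receive higher weights.</p>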
      </sec>
      <sec id="sec-2-6">
        <title>2.2.3. Multiple terminologies</title>
        <p>
          We also explored the use of more than one terminology in the candidate retrieval and graph
building. We considered the CIE-10-ES (“Clasificación Internacional de Enfermedades 10.ª
Revisión, Modificación Clínica”<sup>4</sup>) and the Spanish DeCS (“Descriptores en Ciencias de la Salud”)<sup>5</sup>
terminologies. For each entity, besides the ten best eCIE-O-3.1 candidates, we also retrieved the
five best CIE-10-ES and the five best DeCS candidates through string matching and built the
disambiguation graph accordingly. We considered that an eCIE-O-3.1 candidate and a CIE-10-ES
or a DeCS candidate were linked if they were present in the same sentence of any document
belonging to the CANTEMIST corpus. We applied the Python implementation<sup>6</sup> of MER [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ]
for the fast recognition of the entities in the sentences. With these additional candidates, the
disambiguation graph was denser and contained more semantic information, which we expected
would improve the precision of the disambiguation process. After the application of the
PPR algorithm, the model only selected eCIE-O-3.1 candidates to normalise the mentions.
2.2.4. Models
The following models were considered:
1. “single-ont”: this model only used candidates retrieved from the eCIE-O-3.1 terminology.
2. “multi-ont”: besides eCIE-O-3.1, this model additionally retrieved candidates from the
CIE-10-ES and DeCS terminologies to improve the disambiguation graph.
<sup>4</sup> https://eciemaps.mscbs.gob.es/ecieMaps/browser/index_10_mc.html
<sup>5</sup> http://decs.bvs.br/E/homepagee.htm
<sup>6</sup> https://pypi.org/project/merpy/
        </p>
        <sec id="sec-2-6-1">
          <title>2.3. CANTEMIST-CODING</title>
          <p>The goal of this task was to classify clinical cases in Spanish by returning a ranked list of
eCIE-O codes related to the content of each clinical case. To tackle this challenge, we
decided to use X-Transformer, a state-of-the-art deep learning XMLC solution, and apply it in a
multilingual biomedical setting.</p>
        </sec>
      </sec>
      <sec id="sec-2-7">
        <title>2.3.1. X-Transformer modifications</title>
        <p>Some modifications to the algorithm code were required. The first one was made in the
vectorization of the labels of the training and test sets. We chose to use all possible labels,
including the labels that were not present in the train or test sets. This change was needed since
the algorithm would fail to work correctly if the number of labels differed between the sets.</p>
        <p>
          Another modification was the inclusion of BETO [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ]<sup>7</sup> among the models available to train
X-Transformer, since we considered that using a Transformer model specifically designed for
the Spanish language could lead to improved results over the Multilingual version of BERT.
Finally, we also adapted the algorithm so that it could process input data containing
diacritical marks, such as accents, which are common in the Spanish language.
        </p>
      </sec>
      <sec id="sec-2-8">
        <title>2.3.2. Pipeline</title>
        <p>We developed a pipeline for this task, as shown in Figure 1. After retrieving the data
from the competition organisers, the first step of this pipeline was merging the separate text
files that composed the training and development datasets into a single file for each dataset, so
that in the end we had only two files, one for train and another for test. Then, using
the ’.ann’ files that were provided for the other two tasks of the CANTEMIST competition and
associated with each clinical case, we extracted all labels that were attributed to each document
and prepended them to the corresponding clinical case, separated by a tab
character (’\t’). This way, X-Transformer could distinguish between the labels and the text. The
text was then stemmed using a Snowball stemmer<sup>8</sup>.</p>
        <p>The next step was creating a vocabulary file containing all labels used in the datasets, which
was required as input by X-Transformer. Each line of this file had an eCIE-O code and its
corresponding internal identifier, which is a number from 0 to N, where N is the
total number of eCIE-O codes minus 1. This internal identifier is a label standardization method
that allows X-Transformer to classify the text using labels of any kind or domain. To
create this file, we used a file containing a list of valid eCIE-O codes that was provided
by the competition in their evaluation scripts folder, from which we retrieved all codes and
corresponding descriptions. Then, we included the eCIE-O code descriptions that were present
in the ’.ann’ files and that were not present in the list of codes retrieved, adding them to the existing
descriptions if they differed. Then, the codes without any description were removed,
since they would be of no use to classify the clinical cases using X-Transformer. In the end, our
vocabulary file consisted of a total of 4,360 eCIE-O codes.</p>
        <p><sup>7</sup> https://github.com/dccuchile/beto
<sup>8</sup> https://www.nltk.org/_modules/nltk/stem/snowball.html</p>
        <p>In addition, we created a label mapping file that contained the correspondence
between each eCIE-O code and its numeric identifier in the vocabulary file. For example, the term
‘Células tumorales benignas’, which has the corresponding eCIE-O code ‘8001/0’, is the eleventh
element in the vocabulary file, thus its numeric identifier is ‘10’. This label mapping file
was later used to map the predictions from their numeric identifiers to the corresponding
eCIE-O codes required for the task.</p>
        <p>The results of X-Transformer are given in the form of sparse matrices, with a number
of rows equal to the number of clinical cases that compose the test set, and the number of
columns corresponding to the possible labels. The prediction for each clinical case was retrieved,
comprising a top K of most relevant labels and their confidence values ranging from -0.99
to 1. We used K=20 labels per clinical case. Then, using a Python script, we converted the
predicted labels from their numeric identifiers to their corresponding eCIE-O codes using the
label mapping file previously created. The script also discarded each label with a confidence
score under a threshold chosen to achieve the highest Precision, Recall or F1 scores. For each
of these measures, a ’.tsv’ file was created with the predictions for each clinical case in the
format required by the competition. A fourth ’.tsv’ file was also created for the score threshold
equal to 0, which was used as a baseline score. In the end, the files were used as input for the
evaluation script given by the CANTEMIST competition so that we could retrieve the Mean
Average Precision (MAP) scores for the predictions of our models.</p>
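        <p>The decoding of one prediction row into eCIE-O codes can be sketched as follows. The example represents a row of the sparse matrix as a plain dictionary; the function name, the scores and the identifiers 11 and 12 are hypothetical, and only the pair 10 / ‘8001/0’ comes from the label mapping example in the text.</p>
        <preformat>
```python
def decode_predictions(scores, id_to_code, k=20, threshold=0.0):
    """Turn one row of label scores into a ranked list of codes.

    scores: dict mapping numeric label identifiers to confidence values.
    id_to_code: dict mapping numeric identifiers to eCIE-O codes.
    """
    # Keep the k highest-scoring labels, best first.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
    # Discard labels whose confidence falls under the threshold.
    return [id_to_code[label_id] for label_id, score in ranked if score >= threshold]


# Toy label mapping and prediction row (illustrative values only).
id_to_code = {10: "8001/0", 11: "8001/1", 12: "8001/3"}
row = {10: 0.42, 11: -0.30, 12: 0.05}

baseline = decode_predictions(row, id_to_code, k=20, threshold=0.0)
print(baseline)
```
        </preformat>
        <p>Calling the function with a different threshold for each target measure would yield the contents of the corresponding ’.tsv’ files, with the threshold of 0 giving the baseline file.</p>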
        <p>In the test and background sets provided by the competition organizers, there were no eCIE-O codes
indexing the clinical cases, so we had to put a placeholder label on each document, because
X-Transformer was not prepared to run on unlabelled data. We also had to artificially adapt the
size of the test and background sets by distributing them across a total of 48 files
of 250 clinical cases each. This procedure was necessary so that the files had the same
size as the test sets used with the trained X-Transformer models, which had a total of 250 clinical cases.
The first 109 lines of each file contained 109 clinical cases from the test and
background sets to classify, and the remaining 141 came from the dev1 set, which was already
classified with eCIE-O codes. These 141 clinical cases were used as an additional validation set
to define the confidence threshold values of our submissions.</p>
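        <p>The file layout described above (48 files, each with 109 unlabelled cases followed by 141 labelled dev1 cases) can be checked with a short sketch. The identifiers are placeholders, the function name is hypothetical, and reusing the same 141 dev1 cases in every file is an assumption of this example.</p>
        <preformat>
```python
def build_batches(unlabelled, labelled, per_batch=109, batch_size=250):
    """Split unlabelled cases into fixed-size files padded with labelled cases."""
    padding = labelled[: batch_size - per_batch]
    batches = []
    for start in range(0, len(unlabelled), per_batch):
        chunk = unlabelled[start : start + per_batch]
        batches.append(chunk + padding)
    return batches


# Placeholder identifiers: 5,232 test + background cases and 141 dev1 cases.
unlabelled = [f"test_{i}" for i in range(5232)]
labelled = [f"dev1_{i}" for i in range(141)]

batches = build_batches(unlabelled, labelled)
print(len(batches), len(batches[0]))
```
        </preformat>
        <p>Since 48 × 109 = 5,232, the unlabelled documents fill the 48 files exactly, and each file reaches 250 lines once the 141 labelled cases are appended.</p>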
      </sec>
      <sec id="sec-2-9">
        <title>2.3.3. Developed Models</title>
        <p>In a first iteration, we trained 4 models using the 501 indexed clinical cases that composed
the train set, and used the dev2 set, which comprised a total of 250 indexed clinical cases, as the
test set. One of our models was trained using BERT Base Multilingual Cased and another was
trained using BETO, the Spanish version of BERT.</p>
        <p>
          The other two models were fine-tuned from two X-Transformer models that we had previously
developed for the Spanish biomedical domain using biomedical articles in Spanish retrieved
from the IBECS, LILACS and PubMed databases, along with a list of keywords identified for
each article using MER [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ], a NER tool. The major difference between the two models was
that one of them was developed using 318,658 articles, while the other one used twice as much data,
with a total of 637,316 articles. We shall call these two models Spanish Biomedical X-Transformer
and Spanish Biomedical X-Transformer large, respectively. In summary, the 4 models had
the following characteristics:
• Model 1: BERT base Multilingual Cased fine-tuned with the clinical records.
• Model 2: BETO fine-tuned with the clinical records.
• Model 3: Spanish Biomedical X-Transformer fine-tuned with the clinical records.
• Model 4: Spanish Biomedical X-Transformer large fine-tuned with the clinical records.
        </p>
        <p>In a second iteration, we trained 4 additional models with the same characteristics as
the previous ones, but using a larger train set composed of the 501 clinical cases of
the CANTEMIST-CODING train set plus 249 additional clinical cases from the dev1 set that
were already classified. This way, we expected to achieve better results, since the models
were trained with additional clinical cases. In summary, these four models had the following
characteristics:
• Model 5: BERT base Multilingual Cased fine-tuned with 750 clinical records.
• Model 6: BETO fine-tuned with 750 clinical records.
• Model 7: Spanish Biomedical X-Transformer fine-tuned with 750 clinical records.
• Model 8: Spanish Biomedical X-Transformer large fine-tuned with 750 clinical records.</p>
        <p>All models were trained using the default parameters of X-Transformer, except for the eval
and train batch sizes, which were both changed from their original values of 64 and 32 to 4 due
to hardware constraints. We also set the number of gradient accumulation steps to 2 to
compensate for the small batch size. Each model was trained for 12 epochs on a single NVIDIA
Tesla P4 GPU.</p>
      </sec>
      <sec id="sec-2-10">
        <title>2.3.4. Preliminary Results</title>
        <p>To choose which model predictions to submit, we evaluated the predictions
made by each model on the dev2 set using the evaluation script provided by the competition
organizers. As explained before, each model produced 4 ’.tsv’ files: three contained
the predictions with a confidence score above the threshold
chosen to maximise precision, recall or F1-score, respectively, and the fourth was the baseline, with the
confidence score threshold set to 0, the middle of the X-Transformer confidence
score scale. Each file was then used as input for the evaluation script.</p>
        <p>The results for each model can be seen in Table 1. The highest MAP scores
are achieved when the threshold is chosen to maximise recall, especially
for the models based on the Spanish Biomedical X-Transformer models. The
models that use BETO seem to achieve higher scores than the
ones that use BERT Multilingual. In addition, training with more clinical cases
made no clear difference: some models achieved
slightly higher MAP scores when the threshold was focused on precision or on the F1-score,
while others scored lower than their counterparts trained with fewer
clinical cases.</p>
        <p>Taking this into consideration, we decided to submit the recall-focused predictions of 5
distinct models for the CANTEMIST-CODING task, so that we could also compare their
performance in the competition. The chosen models were Models 2, 3, 4, 5 and 7. We then
retrieved the predictions made by the models for each of the 48 text files that contained the
test and background sets data. For each of the resulting prediction files, the first 109 lines,
corresponding to the predictions made for the test and background sets, were stored in the ’.tsv’
files, while the other 141 lines, which corresponded to the predictions for the labelled cases from
the first development set, were used to find the confidence score threshold that achieved the
best recall score and that would determine which predictions were stored in the
final ’.tsv’ file.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results and discussion</title>
      <p>The results for the CANTEMIST-NER subtask are available in Table 2.</p>
      <p>Table 1. Models and confidence threshold focus evaluated on the dev2 set. Model 1: BERT Base Multilingual Cased; Model 2: BETO; Model 3: Spanish Biomedical X-Transformer; Model 4: Spanish Biomedical X-Transformer large; Model 5: BERT Base Multilingual Cased (750 CC); Model 6: BETO (750 CC); Model 7: Spanish Biomedical X-Transformer (750 CC); Model 8: Spanish Biomedical X-Transformer large (750 CC). Focus evaluated for each model: Baseline, F1, Precision, Recall.</p>
      <p>Our model obtained an F1-score of 0.749 in this subtask. Some errors prevented a higher
performance, such as those related to the span of some detected entities. For example, in the
document “cc_onco94.ann”, the mention “linfoma” is correctly recognised by the NER tagger,
but the processing script attributed the span “1690 1697”, whereas the correct span would be
“1691 1698”. Besides, in some cases the NER tagger only recognised incomplete entity mentions.
For example, in the document “cc_onco89.ann”, the NER tagger recognised the entity “implantes
mediastínicos”, whereas the full entity mention would be “implantes mediastínicos pleurales”.</p>
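      <p>Off-by-one span errors such as the one in “cc_onco94.ann” can be detected with a simple consistency check. The sketch below assumes brat-style character offsets (0-based start, exclusive end) and uses hypothetical function names and an invented sentence. It does not reproduce our processing script, and the cause of the shift in that script is not determined here.</p>

```python
from typing import List, Tuple


def find_mention_spans(text: str, mention: str) -> List[Tuple[int, int]]:
    """Return every (start, end) character span where the mention occurs."""
    spans = []
    start = text.find(mention)
    while start != -1:
        spans.append((start, start + len(mention)))
        start = text.find(mention, start + 1)
    return spans


def span_matches(text: str, start: int, end: int, mention: str) -> bool:
    """Check that the text covered by the span is exactly the mention."""
    return text[start:end] == mention


# Invented sentence, not taken from the CANTEMIST corpus.
text = "Paciente con linfoma. El linfoma fue tratado."
spans = find_mention_spans(text, "linfoma")
print(spans)
print(span_matches(text, 13, 20, "linfoma"))  # correct span
print(span_matches(text, 12, 19, "linfoma"))  # span shifted by one character
```

      <p>Running such a check on every predicted annotation before writing the output files would flag shifted spans, since the text covered by a shifted span no longer equals the recognised mention.</p>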
      <p>For future work, it would be interesting to train the other models besides the “medium” one (“base”,
“large” and “pubmed”) on the CANTEMIST corpus and to apply them to the test set to verify whether
they obtain a higher performance. Besides, we trained the Flair embeddings for fewer
than 100 epochs, so the performance of the NER tagger would probably be higher with more
epochs. The NER tagger itself could also be trained for more epochs, up to 150, according to a
suggestion by the authors of Flair. We will also address the errors associated with the span of
the recognised entities.</p>
      <p>The results for the CANTEMIST-NORM subtask are available in Table 3.</p>
      <p>The model “multi-ont” obtained an F1-score of 0.061, which represents a slight improvement
of +0.001 compared with the “single-ont” model. Still, the performance of the model was too
low, which is mainly related to the candidate retrieval step. Since we used string matching
for candidate retrieval and selected the top candidates according to the edit distance between
candidate and entity mention, most of the candidate lists did not contain the correct codes for
the respective entity mentions. For example, the correct code for the entity mention “cáncer”
would be the eCIE-O-3.1 code 8000/6, corresponding to the concept “Neoplasia metastásica”.
However, neither the “single-ont” model nor the “multi-ont” model was able to retrieve this
candidate code for the entity mention, because the edit distance between “cáncer” and “Neoplasia
metastásica” is too high. Besides, the lack of a synonym list in the eCIE-O-3.1 terminology
further exacerbates the problem. Another aspect worth mentioning is that the
performance of the normalization model always depends on the performance of the NER
tagger, so an incorrect output returned by the latter will hinder the results of the
former.</p>
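      <p>The retrieval failure for “cáncer” can be reproduced with a small sketch of edit-distance candidate ranking. It uses a plain Levenshtein distance and a toy term list. Apart from “Neoplasia metastásica”, the terms are arbitrary strings chosen for illustration, and the function names are hypothetical rather than those of our implementation.</p>

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with two rolling rows."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def top_candidates(mention: str, terms: List[str], k: int = 2) -> List[str]:
    """Return the k terms closest to the mention by edit distance."""
    return sorted(terms, key=lambda term: edit_distance(mention.lower(), term.lower()))[:k]


# Toy term list: only "Neoplasia metastásica" comes from the example in the text.
terms = ["Neoplasia metastásica", "Carcinoma", "Adenoma"]
print(edit_distance("cáncer", "neoplasia metastásica"))
print(top_candidates("cáncer", terms))
```

      <p>Because the two strings differ in length by 15 characters, their edit distance is at least 15, so any shorter term in the terminology is ranked above the correct concept regardless of its meaning.</p>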
      <p>To improve the normalization model, we will explore alternative methods for candidate
generation, such as the use of word embeddings, both for entity mentions and for the
terminology concepts. Instead of just considering that two entities in the same sentence are
related, we will also use a proper relation extraction tool to get more semantically meaningful
relations between concepts, either belonging to the same terminology or belonging to different
terminologies (for example, a relation between an eCIE-O-3 concept and a DeCS concept), which will
densify the disambiguation graph and improve the disambiguation precision.</p>
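      <p>The effect of a denser disambiguation graph on the ranking can be sketched with a small power-iteration version of Personalized PageRank. This is an illustrative simplification, not the graph-based model we adapted: the node names, the uniform personalization vector, the damping factor and the handling of nodes without outgoing edges are assumptions, and every node is assumed to appear as a key of the graph.</p>

```python
from typing import Dict, List


def personalized_pagerank(graph: Dict[str, List[str]],
                          personalization: Dict[str, float],
                          damping: float = 0.85,
                          iterations: int = 50) -> Dict[str, float]:
    """Power iteration with restarts drawn from the personalization vector."""
    nodes = list(graph)
    total = sum(personalization.get(node, 0.0) for node in nodes)
    restart = {node: personalization.get(node, 0.0) / total for node in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) * restart[node] for node in nodes}
        for node in nodes:
            neighbours = graph[node]
            if neighbours:
                # Spread the damped score evenly over the outgoing edges.
                share = damping * rank[node] / len(neighbours)
                for neighbour in neighbours:
                    new_rank[neighbour] += share
            else:
                # Node without edges: return its score to the restart distribution.
                for target in nodes:
                    new_rank[target] += damping * rank[node] * restart[target]
        rank = new_rank
    return rank


uniform = {"cand_A": 1.0, "cand_B": 1.0, "context": 1.0}
# Without relations, the two candidates of a mention cannot be told apart.
sparse = personalized_pagerank({"cand_A": [], "cand_B": [], "context": []}, uniform)
# One extracted relation between cand_A and a context concept breaks the tie.
dense = personalized_pagerank(
    {"cand_A": ["context"], "cand_B": [], "context": ["cand_A"]}, uniform)
print(sparse)
print(dense)
```

      <p>In the sparse graph both candidates receive the same score, whereas a single additional relation is enough to rank “cand_A” above “cand_B”, which is the behaviour expected from adding extracted relations to the graph.</p>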
      <p>The results for the CANTEMIST-CODING subtask are available in Table 4.</p>
      <p>Looking at our results, we can observe that, contrary to what we expected, the best scoring
model was Model 5, which, in our preliminary evaluation, had achieved the lowest MAP score
of the five models submitted for this task. The lowest MAP scores were achieved by Models
3, 4 and 7, which were trained using the Spanish Biomedical X-Transformer models and
had achieved the highest MAP scores in the preliminary evaluation. We can also notice that
the precision and F1 scores were lower than the recall scores, but that was expected, since
our submissions contained the predictions that used a confidence score threshold focused on
achieving the highest recall scores.</p>
      <p>As future work, we could develop additional models with a
Transformer architecture, like the Biomedical X-Transformer models, and use them to train other
X-Transformer models, aiming to increase the number of correctly identified eCIE-O codes.
These models could be trained using scientific biomedical articles in Spanish or even the clinical
cases given by the CANTEMIST competition.</p>
      <p>Another possible solution would be to preprocess the clinical case files before running
them through X-Transformer, reducing the amount of text in each clinical case. This could
be achieved by using automatic text summarization tools to keep only the essential information
about each clinical case. Then, using NER tools, we could retrieve key entities and/or related
terms and include them in the summarized text. By reducing the original clinical case
to a smaller and more objective text, complemented with the key terms and entities identified by the
NER tools, we expect the X-Transformer model to achieve better results,
since it would receive a shorter and more concise input instead of a larger text in which the
key topics are more diluted.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>We obtained an F1-score of 0.749 and 0.069 for the CANTEMIST-NER and the CANTEMIST-NORM
subtasks, respectively, and a MAP of 0.506 for the CANTEMIST-CODING subtask. The code
to run the developed models is available on our GitHub page:
https://github.com/lasigeBioTM/CANTEMIST-Participation.</p>
      <p>To improve the NER tagger, we intend to resume the training of the Flair embeddings up
to 2000 epochs, to generate a larger language model, with a hidden size of 2048 instead
of 1024, and to train the tagger on more text by including the development 2 set in the
training process. For the normalization model, we intend to explore the use of word embeddings
in the candidate generation process and the use of a relation extraction tool to build better
disambiguation graphs. As for the classification model, we intend to explore
additional X-Transformer models trained with new language models built from biomedical data in
Spanish, and to combine them with other NLP techniques, such as automatic text summarization
and NER, in order to further improve the results achieved by the algorithm.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This project was supported by FCT through funding of the DeST: Deep Semantic Tagger project,
ref. PTDC/CCI-BIO/28685/2017, and the LASIGE Research Unit, ref. UIDB/00408/2020.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>O.</given-names>
            <surname>Asan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Montague</surname>
          </string-name>
          ,
          <article-title>More Screen Time, Less Face Time - Implications for EHR Design</article-title>
          ,
          <source>Journal of Evaluation in Clinical Practice</source>
          <volume>20</volume>
          (
          <year>2014</year>
          )
          <fpage>896</fpage>
          -
          <lpage>901</lpage>
          . doi:10.1111/jep.12182.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sfakianaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Koumakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sfakianakis</surname>
          </string-name>
          , G. Iatraki, G. Zacharioudakis,
          <string-name>
            <given-names>N.</given-names>
            <surname>Graf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Marias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tsiknakis</surname>
          </string-name>
          ,
          <article-title>Semantic biomedical resource discovery : a Natural Language Processing framework</article-title>
          ,
          <source>BMC Medical Informatics and Decision Making</source>
          <volume>15</volume>
          (
          <year>2015</year>
          ). URL: http://dx.doi.org/10.1186/s12911-015-0200-4. doi:10.1186/s12911-015-0200-4.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ernst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Siu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Milchevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoffart</surname>
          </string-name>
          , G. Weikum,
          <article-title>DeepLife: An Entity-aware Search, Analytics and Exploration Platform for Health and Life Sciences</article-title>
          ,
          <source>in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics-System Demonstrations</source>
          , Association for Computational Linguistics, Berlin, Germany, August 7-12, 2016,
          <year>2016</year>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>24</lpage>
          . URL: http://aclweb.org/anthology/P16-4004. doi:10.1111/j.1348-0421.2010.00272.x.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Vazquez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Leitner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Valencia</surname>
          </string-name>
          ,
          <article-title>Text mining for drugs and chemical compounds: Methods, tools and applications</article-title>
          ,
          <source>Molecular Informatics</source>
          <volume>30</volume>
          (
          <year>2011</year>
          )
          <fpage>506</fpage>
          -
          <lpage>519</lpage>
          . doi:10.1002/minf.201100005.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>de Rijke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sevenster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>van Ommering</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <article-title>Generating Links to Background Knowledge: A Case Study Using Narrative Radiology Reports</article-title>
          ,
          <source>in: CIKM'11</source>
          , ACM, October 24-28, 2011, Glasgow, Scotland, UK,
          <year>2011</year>
          , p.
          <fpage>1867</fpage>
          . doi:10.1145/2063576.2063845.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Miranda-Escalada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Farré</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <article-title>Named entity recognition, concept normalization and clinical coding: Overview of the cantemist track for cancer text mining in spanish, corpus, guidelines, methods and results</article-title>
          ,
          <source>in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF</source>
          <year>2020</year>
          ),
          <source>CEUR Workshop Proceedings</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Akbik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bergmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Blythe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rasul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schweter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vollgraf</surname>
          </string-name>
          ,
          <article-title>FLAIR: An easy-to-use framework for state-of-the-art NLP</article-title>
          ,
          <source>in: NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Demonstrations Session</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>54</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lamurias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ruas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Couto</surname>
          </string-name>
          , PPR-SSM:
          <article-title>Personalized PageRank and semantic similarity measures for entity linking</article-title>
          ,
          <source>BMC Bioinformatics 20</source>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          . doi:10.1186/s12859-019-3157-y.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Bidirectional LSTM-CRF Models for Sequence Tagging</article-title>
          (
          <year>2015</year>
          ). URL: http://arxiv.org/abs/1508.01991. arXiv:1508.01991.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          , CoRR abs/1810.04805 (
          <year>2018</year>
          ). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iyyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gardner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>Deep contextualized word representations</article-title>
          ,
          <source>in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)</source>
          ,
          <source>Association for Computational Linguistics</source>
          , New Orleans, Louisiana,
          <year>2018</year>
          , pp.
          <fpage>2227</fpage>
          -
          <lpage>2237</lpage>
          . URL: https://www.aclweb.org/anthology/N18-1202. doi:10.18653/v1/N18-1202.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>So</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>BioBERT: a pre-trained biomedical language representation model for biomedical text mining</article-title>
          ,
          <source>Bioinformatics</source>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . doi:10.1093/bioinformatics/btz682. arXiv:1901.08746.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E.</given-names>
            <surname>Alsentzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Murphy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Boag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-H.</given-names>
            <surname>Weng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jindi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>McDermott</surname>
          </string-name>
          ,
          <article-title>Publicly Available Clinical BERT Embeddings</article-title>
          ,
          <source>in: Proceedings of the 2nd Clinical Natural Language Processing Workshop</source>
          , Association for Computational Linguistics, Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>72</fpage>
          -
          <lpage>78</lpage>
          . doi:10.18653/v1/w19-1909.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Pershina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Grishman</surname>
          </string-name>
          ,
          <article-title>Personalized Page Rank for Named Entity Disambiguation</article-title>
          ,
          <source>in: Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL</source>
          , Section 4, Association for Computational Linguistics, Denver, Colorado, May 31 - June 5, 2015,
          <year>2015</year>
          , pp.
          <fpage>238</fpage>
          -
          <lpage>243</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Barbosa</surname>
          </string-name>
          ,
          <article-title>Robust named entity disambiguation with random walks</article-title>
          ,
          <source>Semantic Web</source>
          <volume>9</volume>
          (
          <year>2018</year>
          )
          <fpage>459</fpage>
          -
          <lpage>479</lpage>
          . doi:10.3233/SW-170273.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Neural Collective Entity Linking</article-title>
          (
          <year>2018</year>
          ). URL: http://arxiv.org/abs/1811.08603. arXiv:1811.08603.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>O.-E.</given-names>
            <surname>Ganea</surname>
          </string-name>
          , T. Hofmann,
          <article-title>Deep Joint Entity Disambiguation with Local Neural Attention</article-title>
          ,
          <source>in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Copenhagen, Denmark, September 7-11, 2017,
          <year>2017</year>
          , pp.
          <fpage>2619</fpage>
          -
          <lpage>2629</lpage>
          . URL: http://arxiv.org/abs/1704.04920. doi:10.18653/v1/d17-1277. arXiv:1704.04920.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wei</surname>
          </string-name>
          , H. Xu,
          <article-title>BERT-based Ranking for Biomedical Entity Normalization</article-title>
          (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1908.03548. arXiv:1908.03548.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <article-title>Deep Entity Linking via Eliminating Semantic Ambiguity With BERT</article-title>
          ,
          <source>IEEE Access 7</source>
          (
          <year>2019</year>
          )
          <fpage>169434</fpage>
          -
          <lpage>169445</lpage>
          . doi:10.1109/ACCESS.2019.2955498.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>K.</given-names>
            <surname>Bhatia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Varma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <article-title>Sparse local embeddings for extreme multi-label classification</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. C.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Deep learning for extreme multi-label text classification</article-title>
          ,
          <source>in: SIGIR 2017 - Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2017</year>
          . doi:10.1145/3077136.3080834.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Convolutional neural networks for sentence classification</article-title>
          ,
          <source>in: EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference</source>
          ,
          <year>2014</year>
          . doi:10.3115/v1/d14-1181. arXiv:1408.5882.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>R.</given-names>
            <surname>You</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>HAXMLNet: Hierarchical attention network for extreme multi-label text classification</article-title>
          , CoRR abs/1904.12578 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1904.12578. arXiv:1904.12578.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-F.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Dhillon</surname>
          </string-name>
          ,
          <article-title>Taming Pretrained Transformers for Extreme Multi-label Text Classification</article-title>
          ,
          <year>2020</year>
          . URL: http://arxiv.org/abs/1905.02331v4. arXiv:1905.02331.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          , CoRR abs/1907.11692 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
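          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Carbonell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,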
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>XLNet: Generalized autoregressive pretraining for language understanding</article-title>
          , CoRR abs/1906.08237 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1906.08237. arXiv:1906.08237.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Prabhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Harsola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Varma</surname>
          </string-name>
          ,
          <article-title>Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising</article-title>
          ,
          <source>in: 2018 World Wide Web Conference</source>
          , International World Wide Web Conferences Steering Committee
          ,
          <year>2018</year>
          , pp.
          <fpage>993</fpage>
          -
          <lpage>1002</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Akbik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Blythe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vollgraf</surname>
          </string-name>
          ,
          <article-title>Contextual String Embeddings for Sequence Labeling</article-title>
          ,
          <source>in: Proceedings of the 27th International Conference on Computational Linguistics</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1638</fpage>
          -
          <lpage>1649</lpage>
          . URL: https://github.com/zalandoresearch/flair.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Couto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lamurias</surname>
          </string-name>
          ,
          <article-title>Semantic Similarity Definition</article-title>
          ,
          <source>Reference Module in Life Sciences</source>
          (
          <year>2018</year>
          )
          <fpage>0</fpage>
          -
          <lpage>16</lpage>
          . URL: http://linkinghub.elsevier.com/retrieve/pii/B9780128096338204019. doi:10.1016/B978-0-12-809633-8.20401-9.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>L.</given-names>
            <surname>Page</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Motwani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Winograd</surname>
          </string-name>
          ,
          <article-title>The PageRank Citation Ranking: Bringing Order to the Web</article-title>
          ,
          <source>Technical Report</source>
          , Stanford InfoLab,
          <year>1998</year>
          . URL: http://ilpubs.stanford.edu:8090/422/.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Couto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lamurias</surname>
          </string-name>
          ,
          <article-title>MER: a shell script and annotation server for minimal named entity recognition and linking</article-title>
          ,
          <source>Journal of Cheminformatics</source>
          <volume>10</volume>
          (
          <year>2018</year>
          )
          <fpage>58</fpage>
          . URL: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0312-9. doi:10.1186/s13321-018-0312-9.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cañete</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chaperon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <article-title>Spanish pre-trained BERT model and evaluation data</article-title>
          , to appear
          <source>in: PML4DC at ICLR 2020</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>