<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>LASIGE-BioTM at MESINESP2: entity linking with semantic similarity and extreme multi-label classification on Spanish biomedical documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pedro Ruas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vitor D. T. Andrade</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francisco M. Couto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LASIGE, Faculdade de Ciências, Universidade de Lisboa</institution>
          ,
          <addr-line>Lisbon 1749-016</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
<p>Our team, LASIGE_BioTM, participated in the three sub-tracks of MESINESP2: (1) scientific literature, (2) clinical trials, and (3) patents. Our system comprises two modules: entity linking and extreme multi-label classification. The first module takes the entities recognised in the text and applies a graph-based entity linking model to link them to the DeCS vocabulary. It then applies a semantic similarity-based filter to determine the most relevant entities in each document, which are fed to the second module. The second module consists of an adapted version of the X-Transformer algorithm, an extreme multi-label classification algorithm, and is responsible for associating each document with the top-20 most relevant DeCS codes. The obtained results (micro F1-scores) were 0.2007, 0.0686, and 0.0314 for sub-tracks 1, 2, and 3, respectively. These are low values compared to those of other participants, mainly because of the limited time our team had available to train the models. All of the software used is available in an open access repository.</p>
      </abstract>
      <kwd-group>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Named Entity Linking</kwd>
        <kwd>Extreme Multi-Label Classification</kwd>
        <kwd>Multilingual</kwd>
        <kwd>Text Mining</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Automatic semantic indexing is essential to organise the growing amount of text data that is available,
which is particularly critical in scientific domains, including the biomedical one, where most of
the findings are available in text format. We can view this task as an extreme multi-label
classification (XMC) problem, in which the goal is to tag a given data point with a subset
of relevant labels from an extremely large label list. Here, the data points are the text
documents to classify, and the label list is provided by a knowledge base, such as an ontology.
Most of the proposed XMC approaches focus on datasets including Wikipedia articles or on
datasets with commercial application (e.g. dynamic search advertising), and less attention is
devoted to the biomedical domain. Additionally, multilingual approaches focusing on languages
other than English, such as Spanish, are also scarce.</p>
      <p>
        In this sense, initiatives such as BioASQ [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] are necessary to stimulate the development of
biomedical, multilingual-focused approaches. In particular, the Medical Semantic Indexing
In Spanish (MESINESP) task was first introduced in the BioASQ 2020 challenge and the goal
was to perform semantic indexing of Spanish health-related documents, like scientific articles,
clinical trials, and healthcare project summaries, with terms from the Spanish version of the
Descriptores en Ciencias de la Salud (DeCS). The second edition, the MESINESP2 shared task
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], was extended and included the following sub-tracks: MESINESP-L – Scientific Literature:
Automatic indexing with DeCS terms of Spanish abstracts from two databases, IBECS and
LILACS; MESINESP-T – Clinical Trials: Automatic indexing with DeCS terms of Spanish
clinical trials from REEC (Registro Español de Estudios Clínicos); MESINESP-P – Patents:
Automatic indexing with DeCS terms of Spanish patents extracted from Google Patents.
      </p>
      <p>
In the past, named entities have been considered important features that aid the classification
of texts. For instance, Gui et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] proposed a hierarchical text classification method that
leverages named entities as features, and, according to the conclusions of that study, these
features are responsible for the improvement of the method's performance. More recently, Anelic
and co-workers [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] have argued that named entities do not improve the performance of text
classification, and can even decrease it. However, neither of these works attempted to normalise
the recognised entities to concepts belonging to structured vocabularies: the approaches only
used the surface forms of the entities instead of the designations of the associated concepts.
Besides, not every entity recognised in a given document has the same importance, i.e., some
entities may not be related to the main topic of the document, which can be particularly
true in documents containing a large number of different entities. Therefore, we explored
the hypothesis that linking the recognised entities to concepts of a structured vocabulary and
selecting only the most relevant entities to feed the text classification algorithm improves its
performance.
      </p>
      <p>
        Following our participation in the first edition [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], this paper describes the participation of our team,
LASIGE_BioTM, in the sub-tracks of MESINESP2. We developed a pipeline based on two modules:
the first one performs entity linking, by mapping the entities recognised in the text to terms of
the DeCS vocabulary and then applying a semantic similarity-based filter to obtain the most
relevant entities in each document; the second module is based on the X-Transformer algorithm
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and is responsible for classifying each document with the most relevant DeCS terms. The
software used in the experiments is available at: https://github.com/lasigeBioTM/MESINESP2.
      </p>
      <sec id="sec-1-1">
        <title>1.1. Related work</title>
        <sec id="sec-1-1-1">
          <title>1.1.1. Entity Linking</title>
          <p>The extraction of entities is carried out through text mining. This process can be
performed by different approaches, such as rule-based methods, machine learning, and deep
learning.</p>
          <p>
            Rule-based methods include a set of terms, regular expressions or sentence constructions
defined by experts [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]. Rule-based methods also include dictionary approaches, in which a
given text is matched against a lexicon using string matching [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ].
          </p>
          <p>
            Machine learning methods in text mining are trained on training and validation datasets
to make predictions on a test dataset [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]. Deep learning is a subset of machine learning that
consists of artificial neural networks with multiple hidden layers between input and
output. An artificial neural network is composed of nodes, processing units with a function
similar to that of neurons in the brain. In text mining applications, the inputs to the nodes are
word embeddings, which are vector representations of words. According to the way the nodes
are organised, deep neural networks can be classified as Recurrent Neural Networks (RNN),
Convolutional Neural Networks (CNN), among others.
          </p>
          <p>Usually text mining approaches include the tasks of Named Entity Recognition (NER) and
Named Entity Linking (NEL). NER corresponds to the recognition of entities mentioned in the
text and NEL to the linking of the recognised entities to concepts of a given knowledge base.</p>
          <p>
For the NER task, state-of-the-art approaches usually have a bidirectional long short-term memory
- conditional random fields (BiLSTM-CRF) architecture. However, approaches that use
pre-trained language models have recently emerged and shown promising results. One of the
pre-trained language models that has been highlighted in text mining tasks is BERT [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ],
which is organised as a multilayer bidirectional Transformer encoder. This architecture is based
on an attention mechanism and allows finding dependencies between input and output
[
            <xref ref-type="bibr" rid="ref10">10</xref>
            ]. Several variations of the original BERT model have been trained on different scientific corpora,
such as BioBERT [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ], which was trained on PubMed and PMC articles, and SciBERT [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ], which
was trained on Semantic Scholar articles. After pre-training, these variations and the original
BERT model can also be fine-tuned for NEL tasks [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ].
          </p>
          <p>
            In addition to the pre-trained language models, NEL state-of-the-art approaches in the
biomedical domain also include graph-based models. Usually, these approaches build a disambiguation
graph composed of candidates for the entity mentions, which are then ranked according to their relevance
and coherence in the graph. Models that use the Personalized PageRank algorithm to determine
the relevance of the candidates in the graph have been proposed, such as that of Pershina et al. [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ].
          </p>
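          <p>As an illustration of how such graph-based ranking works, the following is a minimal Personalized PageRank sketch (a plain power iteration over an undirected adjacency list; the toy graph, damping factor, and iteration count are illustrative assumptions, not the exact setup of the cited systems):</p>

```python
def personalized_pagerank(adj, personalization, alpha=0.85, iters=50):
    """Power-iteration Personalized PageRank on an undirected adjacency dict.
    adj: node -> list of neighbours; personalization: node -> teleport weight
    (should sum to 1). Returns node -> rank."""
    nodes = list(adj)
    rank = {n: personalization.get(n, 0.0) for n in nodes}
    for _ in range(iters):
        rank = {
            # teleport back to the personalization set, plus mass from neighbours
            n: (1 - alpha) * personalization.get(n, 0.0)
               + alpha * sum(rank[m] / len(adj[m]) for m in adj[n])
            for n in nodes
        }
    return rank
```

Nodes with more incoming mass (typically better-connected candidates) end up with higher ranks, which is the intuition the disambiguation-graph approaches exploit.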
        </sec>
        <sec id="sec-1-1-2">
          <title>1.1.2. Semantic similarity</title>
          <p>
            The calculation of the relevance of the candidates in a graph normally requires a similarity
measure to compare its nodes, as was proposed by Lamurias et al. [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]. A semantic similarity
measure is a metric to compare the similarity between sets of text based on their implicit and
explicit semantics. In the present work, we measured the semantic similarity between each
entity and the remaining entities of a given document through Resnik’s metric [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ]. This metric
is based on the extrinsic information content (IC) of the most informative common ancestor
(MICA) of two given concepts [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] and is defined as:
          </p>
          <p>sim(e1, e2) = IC(MICA(e1, e2)) (1)</p>
          <p>where e1 and e2 are the two entities being compared, and MICA(e1, e2) is their most informative common ancestor.</p>
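          <p>A minimal sketch of Resnik's metric over a toy hierarchy follows; the concepts, frequencies, and is-a relations below are invented for the example, while in the actual system the IC is extrinsic, derived from concept frequencies in an annotated corpus:</p>

```python
import math

# Toy corpus frequencies for DeCS-like concepts (invented for illustration).
FREQ = {"disease": 100, "infection": 40, "covid-19": 5, "pneumonia": 10}
TOTAL = sum(FREQ.values())
# Hypothetical is-a hierarchy: child -> parent.
PARENT = {"covid-19": "infection", "pneumonia": "infection", "infection": "disease"}

def ic(concept):
    """Information content: -log p(concept); rarer concepts are more informative."""
    return -math.log(FREQ[concept] / TOTAL)

def ancestors(concept):
    """The concept itself plus all of its is-a ancestors."""
    out = {concept}
    while concept in PARENT:
        concept = PARENT[concept]
        out.add(concept)
    return out

def resnik_sim(c1, c2):
    """Resnik similarity: IC of the most informative common ancestor (MICA)."""
    common = ancestors(c1) & ancestors(c2)
    return max(ic(c) for c in common) if common else 0.0
```

Two siblings under a specific ancestor (e.g. "covid-19" and "pneumonia") come out more similar than two concepts whose only shared ancestor is a generic root.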
        </sec>
        <sec id="sec-1-1-3">
          <title>1.1.3. Extreme multi-label classification and biomedical semantic indexing</title>
          <p>
            Chang et al. [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ] divided the approaches to the XMC task into four categories: one-vs-all,
partitioning methods, embedding-based, and deep-learning-based.
          </p>
          <p>
            The Parabel algorithm [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ] follows a one-vs-all approach because it learns a separate classifier
for each label in the label list. It also applies a tree-based method, since it learns a balanced
hierarchy over the labels, which helps identify the labels most similar to a given
label, i.e. those that are present in the same leaves. It performs sub-sampling of data points by
restricting a given label's negative training examples to those examples that are annotated with
similar or confusing labels, which decreases training and prediction times from linear to
logarithmic. The approach then applies a hierarchical multi-label model, which is a generalisation of
the multi-class hierarchical softmax model. Each classifier learns a joint probability distribution
over the possible labels that is based on the data point features and on the label hierarchy. Parabel
was applied to Dynamic Search Advertising, which aims to predict the subset of search engine
queries that will lead to a click on a given ad page.
          </p>
          <p>
            The current state-of-the-art in XMC consists of approaches that leverage pre-trained deep
language models. The first approach of this type was X-BERT (BERT for eXtreme Multi-label
Text Classification) [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ], later renamed X-Transformer [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ], which fine-tunes BERT, RoBERTa,
and XLNet for the XMC task. The main challenges of applying Transformers to the XMC problem
are the extremely large set of possible labels and the label sparsity, which arises from the fact
that most labels are associated with only a few training instances. The model includes
three components: a semantic label indexer, a deep neural matcher, and a ranker. The authors
applied the developed algorithm to four datasets, Eurlex-4K, Wiki10-28K, AmazonCat-13K, and
Wiki-500K, obtaining the following precision@1 values: 86.00%, 85.75%, 95.17%, and 67.87%, respectively.
With respect to the MESINESP task, six teams participated in the first edition, including
our team, which developed a pipeline [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] based on the X-Transformer algorithm [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] and
the MER tool [20] for the named entity recognition and linking step. The approach with the best
performance was based on AttentionXML with multilingual BERT [21], which achieved a micro
F1-score of 0.4254, whereas our approach achieved a micro F1-score of 0.2507.
          </p>
          <p>Besides DeCS and MeSH vocabularies, there are also related works that focus on the
classification or coding of clinical content with codes belonging to other vocabularies, in particular
the International Classification of Diseases (ICD) terminology [22, 23, 24, 25].</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <sec id="sec-2-1">
        <title>2.1. Data description</title>
        <p>The target label list consisted of 34,046 codes belonging to the DeCS vocabulary (2020 edition),
complemented with additional COVID-related descriptors added by the organisation. Both
corpora (JSON files) and the DeCS vocabulary (TSV file) were provided by the organisation and
downloaded from the following link: https://zenodo.org/record/4634129#.YHcShxIo9an.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Entity Linking</title>
        <p>Our approach consisted of using the recognised entities from the documents of each sub-track
that were provided in the folder “Additional Data”. The entities in these files were then
linked to the respective DeCS codes through an entity linking model. This model searches for the
ten best DeCS candidates through string matching and then builds a disambiguation graph
with those candidates. The Personalized PageRank (PPR) algorithm is applied to the disambiguation
graph and estimates the coherence of each node, i.e. candidate, with the graph. The coherence is
associated with the node degree, meaning that nodes linked to a high number of other candidate
nodes are more probable candidates for their respective entities than more isolated nodes.
Besides coherence, the IC of the DeCS code associated with each node is used for ranking: nodes
associated with DeCS codes with higher IC receive higher ranking scores. The IC reflects how
specific a concept is in a corpus: if an entity is rare in a corpus, its IC will be high.
The higher the IC of a candidate, the better that candidate ranks in the graph.
After ranking all the candidates, the model selects the best-ranked candidate as the mapping for each
entity. At the end, all entities in a given document are linked to their respective DeCS concepts.</p>
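        <p>A condensed sketch of the ranking idea is shown below; the additive combination of node degree and IC is a simplification for illustration (the actual model scores coherence with PPR), and the candidate codes and IC values are placeholders:</p>

```python
def rank_candidates(candidates, edges, ic):
    """Pick the best DeCS candidate for one entity by combining graph
    coherence (node degree in the disambiguation graph) with the candidate's
    IC. The additive score below is an illustrative simplification."""
    def degree(c):
        # how many disambiguation-graph edges touch this candidate
        return sum(1 for a, b in edges if c in (a, b))
    return max(candidates, key=lambda c: degree(c) + ic.get(c, 0.0))
```

A well-connected candidate with a specific (high-IC) DeCS code wins over an isolated or generic one.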
        <p>To explore the guiding hypothesis of this work, we filtered the entities to include in
each document by applying a semantic similarity-based filter, more concretely, by selecting the
entities for which there were other similar entities recognised in the same document.</p>
        <p>After this step, the average of the several semantic similarity values obtained for an entity
corresponded to the final score of that entity. The entities were then sorted by their score. We
explored two values for the semantic similarity-based filter: 1.0 and 0.25. With
the 0.25 filter, we only included the top 25% of entities according to their score, and with the
1.0 filter, we included all the entities in the document. This way, we could determine the impact of
choosing the most relevant entities on the performance of the classifier algorithm.</p>
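        <p>The filtering step above can be sketched as follows (the similarity function is passed in, and the entity names in the usage example are placeholders):</p>

```python
def filter_entities(entities, sim, keep=0.25):
    """Score each entity by its average semantic similarity to the other
    entities in the document, sort by score, and keep the top fraction.
    keep=1.0 keeps every entity; keep=0.25 keeps the top 25%."""
    def score(e):
        others = [o for o in entities if o is not e]
        return sum(sim(e, o) for o in others) / len(others) if others else 0.0
    ranked = sorted(entities, key=score, reverse=True)
    n = len(ranked) if keep >= 1.0 else max(1, int(len(ranked) * keep))
    return ranked[:n]
```

Entities that are semantically close to many other entities in the same document (and hence more likely on-topic) survive the cut.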
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Extreme Multi-Label Classification</title>
        <p>
          We approached the sub-tracks as an Extreme Multi-Label Classification (XMC) problem. Our
starting point was a pipeline based on the X-Transformer algorithm [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] that was adapted to the
biomedical domain by our group in the context of past competitions, such as BioASQ [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and
CANTEMIST [26]. The pipeline was further adapted to the present competition, and includes
the following modules: entity linking (subsection 2.2), preprocessing, semantic label indexer,
deep neural matcher, and ranker. The main modifications were made in the entity linking and
preprocessing modules. The complete description of the entity linking component is available
in the previous subsection 2.2.
        </p>
        <p>The preprocessing module imports the retrieved dataset JSON files (train, dev, and test subsets),
the DeCS TSV file, and the JSON files with the output from the entity linking (subsection 2.2),
and, for each dataset, it generates several files:
1. vocabulary file ("label_vocab.txt"): it includes the internal numerical identifier for each
DeCS term. For example, the term "calcimicina" has the internal numerical identifier "0".
2. label correspondence file ("label_correspondence.txt"): it includes the correspondence
between the internal numerical identifiers and the respective DeCS labels and terms. For
example, "0" corresponds to "D000001", which corresponds to "calcimicina".
3. subset files ("subset.txt", "subset_raw_text.txt", "subset_raw_labels.txt"): the three
aforementioned files are generated for each subset (train, dev, and test). The file "subset.txt"
includes the DeCS labels associated with the respective documents, separated by
commas, the stemmed texts of the documents' titles, and the DeCS terms extracted from
the documents appended to the end of the stemmed titles. The file "subset_raw_text.txt"
includes only the stemmed titles, and the file "subset_raw_labels.txt" only the DeCS terms
relative to the labels associated with the documents.</p>
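        <p>Generating the two mapping files can be sketched as follows; the function name and the tab-separated layout are assumptions for illustration, while the id/code/term example follows the description above:</p>

```python
def build_label_files(decs_rows):
    """Assign a sequential internal id to each (DeCS code, term) pair and
    return the contents of "label_vocab.txt" and "label_correspondence.txt".
    The tab-separated layout is an assumed format, not the exact one."""
    vocab, corresp = [], []
    for i, (code, term) in enumerate(decs_rows):
        vocab.append(f"{i}\t{term}")               # internal id -> term
        corresp.append(f"{i}\t{code}\t{term}")     # internal id -> DeCS code -> term
    return "\n".join(vocab), "\n".join(corresp)
```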
        <p>
          We only considered the titles of the documents based on the results described by Neves et al.
[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]: the performance of models using titles is similar to that of models using abstracts, so it
is more efficient to use titles since they contain less text. The limited time that we had to train the
models also influenced our decision to use only the titles, since the required training time is lower. The
titles were stemmed using the Snowball Stemmer implementation for Spanish text provided
by the NLTK package. As the documents belonging to the test sets were unlabeled, we added
the placeholder "0" to each document in the "subset.txt" files. The module was also modified in
order to integrate extracted entities independently of the tool employed.
        </p>
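        <p>The title preprocessing can be sketched with NLTK's Snowball stemmer; a simple whitespace tokenisation is assumed here for brevity:</p>

```python
from nltk.stem.snowball import SnowballStemmer

def stem_title(title):
    """Lowercase and stem each token of a Spanish title with the Snowball
    stemmer, as done when building the "subset_raw_text.txt" files."""
    stemmer = SnowballStemmer("spanish")
    return [stemmer.stem(tok) for tok in title.lower().split()]
```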
        <p>
          The X-Transformer algorithm includes three modules: a semantic label indexer, a deep neural
matcher, and a ranker. The semantic label indexer first obtains meaningful representations for
the labels, based on embeddings of the text descriptions associated with the labels and on
Positive Instance Feature Aggregation (PIFA), a type of label embedding built from the
TF-IDF features of the instances that are relevant to each label. Then, it applies k-means clustering in
order to generate label clusters according to the semantic representations described before. The
deep neural matcher fine-tunes BERT to encode an instance embedding, which
is then used to find the most relevant clusters for the instance. At the end of this step, only a
small subset of clusters is considered for the next step, which is performed by the ranker. The
ranker determines the relevance of the labels in the chosen clusters to the instance, which is
substantially more efficient than ranking all the initial labels. For a more
complete description of the X-Transformer algorithm, please refer to the original publication by
Chang et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
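        <p>The PIFA step can be sketched as follows; the matrices in the usage example are toys, while in the real pipeline X holds the TF-IDF features of the training instances, Y the instance-label assignments, and the resulting label embeddings are then clustered with k-means:</p>

```python
import numpy as np

def pifa_embeddings(X, Y):
    """Positive Instance Feature Aggregation: each label's embedding is the
    L2-normalised sum of the feature vectors of its positive instances.
    X: (n_instances, n_features) TF-IDF matrix; Y: (n_instances, n_labels) 0/1."""
    emb = Y.T @ X                                    # aggregate positive instances per label
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    return emb / np.clip(norms, 1e-12, None)         # guard against empty labels
```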
        <p>The models developed for the different sub-tracks (LASIGE_BioTM-1 to LASIGE_BioTM-5) are shown
in Table 2, together with the corpus (L, T, or P) used for each one. We explored the
fine-tuning of different deep neural matchers. The BERT Base Multilingual Cased model was trained
on the Wikipedia dumps of the 104 largest languages in Wikipedia and has the following
characteristics: 12 layers, 768 hidden units, 12 attention heads, and 110M parameters. The X-Transformer algorithm
uses the PyTorch implementation from HuggingFace Transformers [27]. The CANTEMIST
model corresponds to Model 7 described by Ruas et al. [26]. It is also based on the BERT
Base Multilingual Cased model and was first fine-tuned on 318,658 Spanish biomedical articles
from the IBECS, LILACS, and PubMed databases, jointly with extracted entities, in the context of
the participation in the first edition of MESINESP [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Training approach</title>
        <p>We explored several training approaches according to the target corpus:
• L corpus: Fine-tuning of the model CANTEMIST using the provided training dataset of
249,474 documents and the provided test set with 10,179 documents.
• T corpus: Training of the model BERT Multilingual Base Cased using the provided training
dataset of 249,474 documents from the L corpus and a generated test set built from the
3560 clinical trials of the training set, the 147 clinical trials of the development set, and
the 8919 clinical trials of the test set (total of 12,627 documents).
• P corpus: Training of the model BERT Multilingual Base Cased using the provided training
dataset of 249,474 documents from the L corpus and a generated test set built from the
115 patents of the development set and the 68,404 patents from the test set.</p>
        <p>The training of the deep neural matcher is the time-limiting step of the algorithm.
Each model was trained for a single epoch and then evaluated on the respective test set.
Training and evaluation took approximately 2 days per model using a single NVIDIA Tesla P4
GPU. The hyper-parameter values were the following: depth = 6, train_batch_size = 4,
eval_batch_size = 4, learning_rate = 0.00005, warmup_rate = 0.1.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results and discussion</title>
      <p>The results obtained for each sub-track are shown in Table 3, alongside the baseline results
(BERTDeCS version 4 for sub-track L, and BERTDeCS version 2 for sub-tracks T and P). The official
evaluation metric of the competition was the micro F1-score (MiF). Our best models achieved a MiF of
0.2007, 0.0686, and 0.0314 in the sub-tracks L, T, and P, respectively. These results are low when
compared to the top results in each sub-track: more concretely, there is a difference of 0.2830, 0.2961,
and 0.4200 in terms of MiF in the sub-tracks L, T, and P, respectively.</p>
      <p>With respect to the initial hypothesis, the obtained results were mixed. In sub-track
L, the LASIGE_BioTM-1 model, which included all the entities recognised in the documents,
obtained slightly better results (0.2007 MiF) than the LASIGE_BioTM-2 model (0.1886
MiF), which only included the top 25% most relevant entities. However, in sub-track T, the
opposite happened: LASIGE_BioTM-4 (top 25% of entities) obtained marginally better results
(0.0686 MiF) than LASIGE_BioTM-3 (0.0679 MiF). Consequently, we cannot confirm the initial
hypothesis that feeding only the most relevant entities to the classifier algorithm improves its
performance.</p>
      <p>Assuming that there were no coding errors that may have undermined the results, there are
several possible reasons behind the relatively low performance that our models achieved in the
three sub-tracks.</p>
      <p>Arguably, the main one is related to the impossibility of carrying out an optimisation of
the hyper-parameters of the classifier algorithm, in particular the number of training epochs.
Each model was only trained or fine-tuned for one epoch on the respective training dataset,
which is not enough to accurately learn the relevant features. The limited time we had available
made it impossible to extend the training process over more epochs. Additionally, we were
not able to train the models in a multi-GPU setting due to unresolved errors, so the duration
of each training epoch was approximately two days using a single GPU. Beyond the number
of training epochs, the optimisation of other hyper-parameters, such as train_batch_size,
eval_batch_size, and learning_rate, would probably lead to a better performance.</p>
      <p>With respect to sub-track 2 and sub-track 3, the developed models were trained on
documents belonging to the L corpus (sub-track 1), and not on documents of the respective
sub-track corpora. The text present in scientific literature has different characteristics compared
with the text associated with clinical trials and patents, so models fine-tuned on a certain
type of text will necessarily have a worse performance when evaluated on a
different type of text. For sub-track 3, there was no training dataset available, but for sub-track
2 it would probably have been better to train models 3 and 4 on the training dataset
of that sub-track rather than on the training dataset of sub-track 1.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>Our approach, which includes an entity linking model and the X-Transformer algorithm, obtained a
micro F1-score of 0.2007, 0.0686, and 0.0314 in sub-tracks 1, 2, and 3, respectively, which is a
low performance compared with the top participants, and even with the baseline approaches.
In order to improve the performance, we need to perform a careful error analysis to identify
any coding errors that may have undermined the results. Next, we need to spend more time on
the training process, more concretely, by training the models for more epochs, performing
hyper-parameter optimisation, solving the problems associated with multi-GPU training,
exploring the use of summarisation tools to feed only the relevant content to the classifier, and
exploring less resource-demanding pre-trained models, such as DistilBERT. Besides, we only used
the titles of the articles, based on previous studies, but in the future we will explore the impact
of using more text on the performance of the classification algorithm.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was supported by FCT through funding of the Deep Semantic Tagger (DeST) project
(ref. PTDC/CCI-BIO/28685/2017) and the LASIGE Research Unit (ref. UIDB/00408/2020 and ref.
UIDP/00408/2020), and through funding of a PhD Scholarship, ref. 2020.05393.BD.</p>
      <p>[20] F. M. Couto, A. Lamurias, MER: a shell script and annotation server for minimal
named entity recognition and linking, Journal of Cheminformatics 10 (2018) 58. URL:
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0312-9. doi:10.1186/
s13321-018-0312-9.
[21] R. You, Z. Zhang, Z. Wang, S. Dai, H. Mamitsuka, S. Zhu, AttentionXML: Label
Tree-based Attention-Aware Deep Model for High-Performance Extreme Multi-Label Text
Classification, in: 33rd Conference on Neural Information Processing Systems (NeurIPS
2019), Vancouver, Canada, 2019, pp. 1–11. arXiv:1811.01727.
[22] P. Xie, H. Shi, M. Zhang, E. P. Xing, A Neural Architecture for Automated ICD Coding, in:
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics
(Long Papers), Association for Computational Linguistics, 2018, pp. 1066–1076.
[23] H. Shi, P. Xie, Z. Hu, M. Zhang, E. P. Xing, Towards Automated ICD Coding Using Deep</p>
      <p>Learning, Technical Report, 2017. arXiv:1711.04075v3.
[24] S. Silvestri, F. Gargiulo, M. Ciampi, G. De Pietro, Exploit Multilingual Language Model at
Scale for ICD-10 Clinical Text Classification, Proceedings - IEEE Symposium on Computers
and Communications 2020-July (2020). doi:10.1109/ISCC50000.2020.9219640.
[25] C. Sen, B. Ye, J. Aslam, A. Tahmasebi, From Extreme Multi-label to Multi-class: A
Hierarchical Approach for Automated ICD-10 Coding Using Phrase-level Attention (2021). URL:
http://arxiv.org/abs/2102.09136. arXiv:2102.09136.
[26] P. Ruas, A. Neves, V. D. Andrade, F. M. Couto, Lasigebiotm at cantemist: Named entity
recognition and normalization of tumour morphology entities and clinical coding of
Spanish health-related documents, in: Proceedings of the Iberian Languages Evaluation
Forum (IberLEF 2020), 2020, pp. 422–437. URL: http://ceur-ws.org/Vol-2664/cantemist_
paper11.pdf.
[27] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf,
M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao,
S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural
language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing: System Demonstrations, Association for Computational Linguistics,
Online, 2020, pp. 38–45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Vandorou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gasco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Paliouras</surname>
          </string-name>
          ,
          <article-title>Overview of BioASQ 2021: The ninth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering</article-title>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gasco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Estrada-Zavala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.-T.</given-names>
            <surname>Murasaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Primo-Peña</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bojo-Canales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Paliouras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <article-title>Overview of BioASQ 2021-MESINESP track. Evaluation of advance hierarchical classification techniques for scientific literature, patents and clinical trials</article-title>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Hierarchical text classification for news articles based-on named entities</article-title>
          ,
          <source>Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 7713 LNAI</source>
          (
          <year>2012</year>
          )
          <fpage>318</fpage>
          -
          <lpage>329</lpage>
          . doi:10.1007/978-3-642-35527-1_27.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Andelic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kondic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Peric</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jocic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kovacevic</surname>
          </string-name>
          ,
          <article-title>Text Classification Based on Named Entities</article-title>
          , in:
          <source>7th International Conference on Information Society and Technology ICIST 2017</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Neves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lamurias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Couto</surname>
          </string-name>
          ,
          <article-title>Extreme Multi-Label Classification applied to the Biomedical and Multilingual Panorama</article-title>
          , in: CLEF 2020 Working Notes,
          <year>2020</year>
          . URL: http://ceur-ws.org/Vol-2696/paper_67.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-F.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Dhillon</surname>
          </string-name>
          ,
          <article-title>Taming Pretrained Transformers for Extreme Multi-label Text Classification</article-title>
          (
          <year>2020</year>
          ). URL: https://doi.org/10.1145/3394486.3403368. doi:10.1145/3394486.3403368. arXiv:1905.02331v4.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lamurias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Couto</surname>
          </string-name>
          , Text Mining for Bioinformatics Using Biomedical Literature,
          <year>2019</year>
          , pp.
          <fpage>602</fpage>
          -
          <lpage>611</lpage>
          . doi:10.1016/B978-0-12-809633-8.20409-3.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Couto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lamurias</surname>
          </string-name>
          ,
          <article-title>Mer: a shell script and annotation server for minimal named entity recognition and linking</article-title>
          ,
          <source>Journal of Cheminformatics</source>
          <volume>10</volume>
          (
          <year>2018</year>
          )
          <fpage>58</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          , arXiv preprint arXiv:1810.04805 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          , in:
          <source>Advances in neural information processing systems</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>So</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>BioBERT: a pre-trained biomedical language representation model for biomedical text mining</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>36</volume>
          (
          <year>2020</year>
          )
          <fpage>1234</fpage>
          -
          <lpage>1240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <article-title>SciBERT: A pretrained language model for scientific text</article-title>
          , in:
          <source>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          , Association for Computational Linguistics, Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>3615</fpage>
          -
          <lpage>3620</lpage>
          . URL: https://www.aclweb.org/anthology/D19-1371. doi:10.18653/v1/D19-1371.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>BERT-based ranking for biomedical entity normalization</article-title>
          ,
          <source>AMIA Summits on Translational Science Proceedings</source>
          <year>2020</year>
          (
          <year>2020</year>
          )
          <fpage>269</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Pershina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Grishman</surname>
          </string-name>
          ,
          <article-title>Personalized page rank for named entity disambiguation</article-title>
          , in:
          <source>Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Association for Computational Linguistics, Denver, Colorado,
          <year>2015</year>
          , pp.
          <fpage>238</fpage>
          -
          <lpage>243</lpage>
          . URL: https://www.aclweb.org/anthology/N15-1026. doi:10.3115/v1/N15-1026.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lamurias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ruas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Couto</surname>
          </string-name>
          ,
          <article-title>PPR-SSM: Personalized PageRank and semantic similarity measures for entity linking</article-title>
          ,
          <source>BMC Bioinformatics</source>
          <volume>20</volume>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          . doi:10.1186/s12859-019-3157-y.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Resnik</surname>
          </string-name>
          ,
          <article-title>Using information content to evaluate semantic similarity in a taxonomy</article-title>
          , arXiv preprint cmp-lg/9511007 (
          <year>1995</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>F.</given-names>
            <surname>Couto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lamurias</surname>
          </string-name>
          ,
          <article-title>Semantic similarity definition</article-title>
          ,
          <source>Encyclopedia of bioinformatics and computational biology</source>
          <volume>1</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-F.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Dhillon</surname>
          </string-name>
          ,
          <article-title>X-BERT: eXtreme Multi-label Text Classification using Bidirectional Encoder Representations from Transformers</article-title>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          . URL: http://arxiv.org/abs/1905.02331. arXiv:1905.02331.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Prabhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Harsola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Varma</surname>
          </string-name>
          ,
          <article-title>Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising</article-title>
          , in:
          <source>Proceedings of the World Wide Web Conference, WWW 2018</source>
          , ACM, New York, NY, USA, April 23-27, 2018, Lyon, France,
          <year>2018</year>
          , pp.
          <fpage>993</fpage>
          -
          <lpage>1002</lpage>
          . doi:10.1145/3178876.3185998.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>