=Paper= {{Paper |id=Vol-2936/paper-24 |storemode=property |title=LASIGE-BioTM at MESINESP2: entity linking with semantic similarity and extreme multi-label classification on Spanish biomedical documents |pdfUrl=https://ceur-ws.org/Vol-2936/paper-24.pdf |volume=Vol-2936 |authors=Pedro Ruas,Vitor D. T. Andrade,Francisco M. Couto |dblpUrl=https://dblp.org/rec/conf/clef/RuasAC21 }} ==LASIGE-BioTM at MESINESP2: entity linking with semantic similarity and extreme multi-label classification on Spanish biomedical documents== https://ceur-ws.org/Vol-2936/paper-24.pdf
LASIGE-BioTM at MESINESP2: entity linking with
semantic similarity and extreme multi-label
classification on Spanish biomedical documents
Pedro Ruas¹, Vitor D. T. Andrade¹ and Francisco M. Couto¹
¹ LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon 1749-016, Portugal


Abstract
Our team, LASIGE_BioTM, participated in the three sub-tracks of MESINESP2: (1) scientific literature,
(2) clinical trials, and (3) patents. Our system comprises two modules: entity linking and extreme
multi-label classification. The first module takes the entities recognized in text and applies a graph-based
entity linking model to link them to the DeCS vocabulary. It then applies a semantic similarity-based
filter to determine the most relevant entities in each document, which are fed to the second
module. The second module consists of an adapted version of the X-Transformer algorithm, an extreme
multi-label classification algorithm, and is responsible for associating each document with the top-20
relevant DeCS codes. The obtained results (micro F1-scores) were 0.2007,
0.0686, and 0.0314 for sub-tracks 1, 2, and 3, respectively. These are low values when compared
to those of other participants, mainly because of the lack of time our team had available to train the models. All
of the software used is available in an open access repository.

Keywords
Named Entity Recognition, Named Entity Linking, Extreme Multi-Label Classification, Multilingual,
Text Mining




1. Introduction
Automatic semantic indexing is essential to organise the growing text data that is available,
which is particularly critical in scientific domains, including the biomedical one, where most of
the findings are available in the text format. We can view this task as an extreme multi-label
classification (XMC) problem, in which the goal is to tag a given data point with a subset
of relevant labels from an extremely large label list. Therefore, the data points are the text
documents to classify, and the label list provided by a knowledge base, such as an ontology.
Most of the proposed XMC approaches focus on datasets including Wikipedia articles or on
datasets with commercial application (e.g. dynamic search advertising) and less attention is
devoted to the biomedical domain. Additionally, multilingual approaches focusing on languages
other than English, such as Spanish, are also scarce.
   In this sense, initiatives such as BioASQ [1] are necessary to stimulate the development of
biomedical, multilingual-focused approaches. In particular, the Medical Semantic Indexing

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
Email: psruas@fc.ul.pt (P. Ruas); fc49005@alunos.fc.ul.pt (V. D. T. Andrade); fcouto@di.fc.ul.pt (F. M. Couto)
ORCID: 0000-0002-1293-4199 (P. Ruas); 0000-0003-0627-1496 (F. M. Couto)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org
In Spanish (MESINESP) task was first introduced in the BioASQ 2020 challenge and the goal
was to perform semantic indexing of Spanish health-related documents, like scientific articles,
clinical trials, and healthcare project summaries, with terms from the Spanish version of the
Descriptores en Ciencias de la Salud (DeCS). The second edition, the MESINESP2 shared task
[2], was extended to include the following sub-tracks: MESINESP-L – Scientific Literature:
automatic indexing with DeCS terms of Spanish abstracts from two databases, IBECS and
LILACS; MESINESP-T – Clinical Trials: automatic indexing with DeCS terms of Spanish
clinical trials from REEC (Registro Español de Estudios Clínicos); MESINESP-P – Patents:
automatic indexing with DeCS terms of Spanish patents extracted from Google Patents.
   In the past, named entities have been considered important features that aid the classification
of texts. For instance, Gui et al. [3] proposed a hierarchical text classification method that
leverages named entities as features and, according to the conclusions of that study, these
features are responsible for the improvement of the method’s performance. More recently, Andelic
and co-workers [4] have argued that named entities do not improve the performance of text
classification, and can even decrease it. However, none of these works attempted to normalise
the recognised entities to concepts belonging to structured vocabularies: the approaches only
used the surface form of the entities instead of the designations of the associated concepts.
Besides, not every entity recognised in a given document has the same importance, i.e., some
entities may not be related to the main topic of the document, which can be particularly
true in documents containing a large number of different entities. Therefore, we explored
the hypothesis that linking the recognised entities to concepts of a structured vocabulary and
selecting only the most relevant entities to feed the text classification algorithm improves its
performance.
   This paper describes the participation of our team, LASIGE_BioTM, in the sub-tracks of
MESINESP2, following our participation in the first edition [5]. We developed a pipeline based on two modules:
the first one performs entity linking, by mapping the entities recognised in text to terms of
the DeCS vocabulary and then applying a semantic similarity-based filter to obtain the most
relevant entities in each document; the second module is based on the X-Transformer algorithm
[6], and is responsible for classifying each document with the most relevant DeCS terms. The
software used in the experiments is available at: https://github.com/lasigeBioTM/MESINESP2.

1.1. Related work
1.1.1. Entity Linking
The extraction of entities is carried out through the text mining process. This process can be
executed by different approaches such as: rule-based methods, machine learning and deep
learning.
   Rule-based methods include a set of terms, regular expressions or sentence constructions
defined by experts [7]. Rule-based methods also include dictionary approaches, in which a
given text is matched against a lexicon using string matching [8].
   Machine learning methods in text mining are trained on training and validation datasets
to make predictions on a test dataset [7]. Deep learning is a subset of machine learning that
consists of artificial neural networks that include multiple hidden layers between input and
output. An artificial neural network is composed of nodes, processing units with a similar
function to the neurons in the brain. In text mining applications, the inputs to the nodes are
word embeddings, which are vector representations of words. According to the way the nodes
are organised, deep neural networks can be classified as Recurrent Neural Network (RNN),
Convolutional Neural Network (CNN), among others.
   Usually text mining approaches include the tasks of Named Entity Recognition (NER) and
Named Entity Linking (NEL). NER corresponds to the recognition of entities mentioned in the
text and NEL to the linking of the recognised entities to concepts of a given knowledge base.
   For the NER task, state-of-the-art approaches usually have a bidirectional long short-term memory
with conditional random fields (BiLSTM-CRF) architecture. However, approaches that use pre-trained
language models have recently emerged and shown promising results. One of the
pre-trained language models that has stood out in text mining tasks is BERT [9],
which is organised as a multilayer bidirectional transformer encoder. This architecture is based
on an attention mechanism, which allows it to model dependencies between input and output
[10]. Several variations of the original BERT model have been trained on different scientific corpora,
such as BioBERT [11], which was trained on PubMed and PMC articles, and SciBERT [12], which
was trained on Semantic Scholar articles. After pre-training, these variations and the original
BERT model can also be fine-tuned for NEL tasks [13].
   In addition to the pre-trained language models, state-of-the-art NEL approaches in the
biomedical domain also include graph-based models. Usually, these build a disambiguation
graph composed of the candidates for each entity mention, which are then ranked according to their relevance
and coherence in the graph. Models that use the Personalized PageRank algorithm to determine
the relevance of the candidates in the graph have been proposed, such as that of Pershina et al. [14].

1.1.2. Semantic similarity
The calculation of the relevance of the candidates in a graph normally requires a similarity
measure to compare its nodes, as was proposed by Lamurias et al. [15]. A semantic similarity
measure is a metric to compare the similarity between sets of text based on their implicit and
explicit semantics. In the present work, we measured the semantic similarity between each
entity and the remaining entities of a given document through Resnik’s metric [16]. This metric
is based on the extrinsic information content (IC) of the most informative common ancestor
(MICA) of two given concepts [17] and is defined as:

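$$\mathrm{SSM}_{\mathrm{resnik}}(e_1, e_2) = \mathrm{IC}_{\mathrm{shared}}(e_1, e_2)$$
   where $e_1$ and $e_2$ are the two entities being compared and $\mathrm{IC}_{\mathrm{shared}}$ is the IC of their MICA.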
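   As an illustration of the Resnik measure, the following minimal Python sketch computes the IC of the MICA over a toy taxonomy. The concept names, the annotation counts, and the helper functions are hypothetical and are not taken from DeCS or from our implementation; the extrinsic IC is assumed to be the negative logarithm of the relative frequency of a concept and its descendants.

```python
import math

# Hypothetical toy taxonomy: concept -> list of parents (not real DeCS data).
PARENTS = {
    "disease": [],
    "infection": ["disease"],
    "viral_infection": ["infection"],
    "bacterial_infection": ["infection"],
    "neoplasm": ["disease"],
}

# Hypothetical number of annotations of each concept in some corpus.
COUNTS = {
    "disease": 2,
    "infection": 3,
    "viral_infection": 10,
    "bacterial_infection": 4,
    "neoplasm": 1,
}


def ancestors(concept):
    """Return the concept itself plus all of its ancestors."""
    seen = {concept}
    stack = [concept]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(concept):
    """Extrinsic IC: -log of the relative frequency of the concept and its descendants."""
    total = sum(COUNTS.values())
    freq = sum(n for c, n in COUNTS.items() if concept in ancestors(c))
    return -math.log(freq / total)


def resnik(e1, e2):
    """IC of the most informative common ancestor (0.0 if nothing is shared)."""
    common = ancestors(e1) & ancestors(e2)
    return max((information_content(c) for c in common), default=0.0)
```

   In this toy example, the two infection concepts share the ancestor "infection", so their similarity is the IC of that concept, whereas "viral_infection" and "neoplasm" only share the root, whose IC is zero.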

1.1.3. Extreme multi-label classification and biomedical semantic indexing
Chang et al. [18] divided the approaches to the XMC task into four categories: one-vs-all,
partitioning methods, embedding-based, and deep-learning-based.
   The Parabel algorithm [19] follows a one-vs-all approach because it learns a separate classifier
for each label in the label list. It also applies a tree-based method, since it learns a balanced
hierarchy over labels, which helps to identify the most similar labels with respect to a given
Table 1
Number of documents in each corpus.
                       Corpus              Train     Dev               Test
               Scientific literature (L)   249,474   1,065   10,179 (500 gold standard)
                  Clinical Trials (T)       3,560     147    8,919 (250 gold standard)
                     Patents (P)              —       115    68,404 (150 gold standard)


label, i.e. those that are present in the same leaves. It performs sub-sampling of data points by
restricting a given label’s negative training examples to those examples that are annotated with
similar or confusing labels, which decreases training and prediction times from linear to
logarithmic in the number of labels. The approach then applies a hierarchical multi-label model, which is a generalisation of
the multi-class hierarchical softmax model. Each classifier learns a joint probability distribution
over the possible labels that is based on data point features and on the label hierarchy. Parabel
was applied to Dynamic Search Advertising, which aims to predict the subset of search engine
queries that will lead to a click on a given ad page.
   The current state-of-the-art in XMC consists of approaches that leverage pre-trained deep
language models. The first approach of this type was X-BERT (BERT for eXtreme Multi-label
Text Classification) [18], later renamed to X-Transformer [6], which fine-tunes BERT, RoBERTa,
and XLNet for the XMC task. The main challenges of applying Transformer models to the XMC problem
are the extremely large set of possible labels and the label sparsity, which arises from the fact
that only a few labels are associated with a large number of training instances. The model includes
three components: a semantic label indexer, a deep neural matcher, and a ranker. The authors
applied the developed algorithm to four datasets, Eurlex-4K, Wiki10-28K, AmazonCat-13K and
Wiki-500K, obtaining precision@1 values of 86.00%, 85.75%, 95.17%, and 67.87%, respectively.

1.1.4. MESINESP1
With respect to the MESINESP task, six teams participated in the first edition, including
our team, which developed a pipeline [5] based on the X-Transformer algorithm [6] and
the MER tool [20] for the named entity recognition and linking step. The best-performing
approach was based on AttentionXML with multilingual BERT [21] and achieved a micro
F-measure value of 0.4254, whereas our approach achieved a micro F1-score of 0.2507.
  Besides the DeCS and MeSH vocabularies, there are also related works that focus on the classification
or coding of clinical content with codes belonging to other vocabularies, in particular
the International Classification of Diseases (ICD) terminology [22, 23, 24, 25].
2. Methodology
2.1. Data description
The target label list consisted of 34,046 codes belonging to the DeCS vocabulary (2020 edition; see footnote 1),
complemented with additional COVID-related descriptors added by the organisation. The
corpora (JSON files) and the DeCS vocabulary (TSV file) were provided by the organisation and
downloaded from the following link: https://zenodo.org/record/4634129#.YHcShxIo9an.
The number of documents in each corpus is shown in Table 1.

2.2. Entity Linking
Our approach consisted of using the recognised entities from the documents of each sub-track
that were provided in the folder “Additional Data”. The entities in these files were then
linked to the respective DeCS codes through an entity linking model. This model retrieves the
ten best DeCS candidates for each entity through string matching and then builds a disambiguation graph
with those candidates. The Personalized PageRank (PPR) algorithm is applied to the disambiguation
graph and estimates the coherence of each node, i.e. candidate, with the graph. The coherence is
associated with the node degree, meaning that nodes linked to a high number of other candidate
nodes are more probable candidates for their respective entities than more isolated nodes.
Besides coherence, the IC of the DeCS code associated with each node is used for ranking: nodes
associated with DeCS codes with higher IC receive higher ranking scores. The IC reflects
how rare a concept is in a corpus: the less frequent the concept, the higher its IC.
After ranking all the candidates, the model selects the best-ranked candidate for each
entity. At the end, all entities in a given document are linked to their respective DeCS concepts.
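   The candidate ranking step can be sketched as follows. This is a simplified illustration and not our actual implementation: the entity mentions, candidate identifiers, graph edges, and IC values are hypothetical, and combining coherence and IC as a product is an assumption made for the example.

```python
def personalized_pagerank(edges, source, damping=0.85, iterations=50):
    """Power iteration for PageRank with all teleport mass on `source`."""
    nodes = sorted({n for edge in edges for n in edge} | {source})
    neighbours = {n: set() for n in nodes}
    for a, b in edges:  # the disambiguation graph is treated as undirected
        neighbours[a].add(b)
        neighbours[b].add(a)
    rank = {n: 1.0 if n == source else 0.0 for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) if n == source else 0.0 for n in nodes}
        for n in nodes:
            if neighbours[n]:
                share = damping * rank[n] / len(neighbours[n])
                for m in neighbours[n]:
                    nxt[m] += share
            else:  # isolated node: return its mass to the source
                nxt[source] += damping * rank[n]
        rank = nxt
    return rank


def link_entities(candidates, edges, ic):
    """Pick, for each entity, the candidate with the highest coherence * IC."""
    ppr = {c: personalized_pagerank(edges, c)
           for cands in candidates.values() for c in cands}
    best = {}
    for entity, cands in candidates.items():
        # Candidates of the other entities act as PPR sources.
        others = [c for e, cs in candidates.items() if e != entity for c in cs]

        def score(candidate):
            coherence = sum(ppr[o].get(candidate, 0.0) for o in others)
            return coherence * ic[candidate]

        best[entity] = max(cands, key=score)
    return best


# Hypothetical example: two mentions, one of them ambiguous.
candidates = {"fiebre": ["C_fever", "C_yellow_fever"], "tos": ["C_cough"]}
edges = [("C_fever", "C_cough")]  # assumed relation between the two concepts
ic = {"C_fever": 1.0, "C_yellow_fever": 1.0, "C_cough": 1.0}
linked = link_entities(candidates, edges, ic)
```

   Here the candidate "C_fever" is connected to the candidate of the other mention, so it receives a non-zero coherence and is preferred over the isolated candidate "C_yellow_fever".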
   To explore the guiding hypothesis of this work, we filtered the entities to include in
each document by applying a semantic similarity-based filter, more concretely, by selecting the
entities for which there were other similar entities recognised in the same document.
   For each entity, we calculated the semantic similarity with each of the remaining entities in
the document, and the average of these values
corresponded to the final score of that entity. The entities were then sorted by their score.
We explored two values for the semantic similarity-based filter: 1.0 and 0.25. With
the filter 0.25, we only included the top 25% of entities according to their score, and with the filter
1.0, we included all the entities in the document. This way, we could determine the impact of
choosing the most relevant entities on the performance of the classification algorithm.
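   The filter can be illustrated with the following minimal sketch. The function name, the toy similarity values, and the rounding rule (rounding up and keeping at least one entity) are assumptions made for the example; any pairwise measure, such as the Resnik metric, could be passed as `similarity`.

```python
import math


def filter_entities(entities, similarity, keep_fraction):
    """Keep the top `keep_fraction` of entities by mean similarity to the others."""
    if len(entities) < 2 or keep_fraction >= 1.0:
        return list(entities)
    scores = {}
    for entity in entities:
        others = [o for o in entities if o != entity]
        scores[entity] = sum(similarity(entity, o) for o in others) / len(others)
    ranked = sorted(entities, key=lambda e: scores[e], reverse=True)
    keep = max(1, math.ceil(keep_fraction * len(ranked)))
    return ranked[:keep]


# Hypothetical pairwise similarities between four distinct entities.
PAIRS = {frozenset("ab"): 0.9, frozenset("ac"): 0.2}


def toy_similarity(x, y):
    return PAIRS.get(frozenset((x, y)), 0.1)


top_quarter = filter_entities(["a", "b", "c", "d"], toy_similarity, 0.25)
everything = filter_entities(["a", "b", "c", "d"], toy_similarity, 1.0)
```

   With the filter 0.25 only the entity with the highest average similarity is kept in this toy document, while the filter 1.0 leaves the entity list unchanged.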

2.3. Extreme Multi-Label Classification
We approached the sub-tracks as an Extreme Multi-Label Classification (XMC) problem. Our
starting point was a pipeline based on the X-Transformer algorithm [6] that was adapted to the
biomedical domain by our group in the context of past competitions, such as BioASQ [5] and
CANTEMIST [26]. The pipeline was further adapted to the present competition, and includes
the following modules: entity linking (subsection 2.2), preprocessing, semantic label indexer,
deep neural matcher, and ranker. The main modifications were made in the entity linking and
preprocessing modules. The complete description of the entity linking component is available
in subsection 2.2.

    1
        http://red.bvsalud.org/decs/en/
   The preprocessing module imports the dataset JSON files (train, dev, and test subsets),
the DeCS TSV file, and the JSON files with the output of the entity linking module (subsection 2.2),
and, for each dataset, generates several files:

   1. vocabulary file ("label_vocab.txt"): it includes the internal numerical identifier for each
      DeCS term. For example, the term "calcimicina" has the internal numerical identifier "0".
   2. label correspondence file ("label_correspondence.txt"): it includes the correspondence
      between the internal numerical identifiers, and the respective DeCS labels and terms. For
      example, "0" corresponds to "D000001", which corresponds to "calcimicina".
   3. subset files ("subset.txt", "subset_raw_text.txt", "subset_raw_labels.txt"): these three
      files are generated for each subset (train, dev, and test). The file "subset.txt"
      includes the DeCS labels that are associated with the respective documents, separated by
      commas, the stemmed texts of the documents’ titles, and the DeCS terms that were extracted
      from the documents, appended to the end of the stemmed titles. The file "subset_raw_text.txt"
      includes only the stemmed titles, and the file "subset_raw_labels.txt" only the DeCS terms
      relative to the labels associated with the documents.
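   The generation of these files can be sketched as below. This is only an illustration under stated assumptions: the function names are hypothetical, the labels in "subset.txt" are assumed to be the internal numerical identifiers, the tab separator between labels and text is an assumption, and the second vocabulary row is a placeholder rather than a real DeCS entry.

```python
def build_label_vocab(decs_rows):
    """Map each DeCS code to a consecutive internal numerical identifier."""
    return {code: idx for idx, (code, _term) in enumerate(decs_rows)}


def format_subset_line(label_ids, stemmed_title, entity_terms):
    """Build one line: comma-separated labels, then title plus extracted terms."""
    # Unlabelled documents receive the placeholder "0".
    labels = ",".join(str(i) for i in label_ids) if label_ids else "0"
    text = " ".join([stemmed_title, *entity_terms])
    return f"{labels}\t{text}"  # the tab separator is an assumption


# First row taken from the example in the text; second row is a placeholder.
rows = [("D000001", "calcimicina"), ("DXXXXXX", "termino de ejemplo")]
vocab = build_label_vocab(rows)
line = format_subset_line([vocab["D000001"]], "titul ejempl", ["calcimicina"])
```

   The design keeps the extracted DeCS terms in the same text field as the stemmed title, so the classifier sees them as additional tokens of the document.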

   We only considered the titles of the documents based on the results described by Neves et al.
[5]: the performance of the models using titles is similar to that of models using abstracts, so it
is more efficient to use titles since they have less text. The limited time that we had to train
the models also influenced our decision to only use the titles, since they require less training time. The
titles were stemmed using the Snowball Stemmer implementation for Spanish text provided
by the NLTK package (see footnote 2). As the documents belonging to the test sets were unlabeled, we added
the placeholder "0" to each document in the "subset.txt" files. The module was also modified in
order to integrate extracted entities independently of the tool employed.
   The X-Transformer algorithm includes three modules: semantic label indexer, deep neural
matcher, and ranker. The semantic label indexer first obtains meaningful representations for
labels that are based on embeddings of the text descriptions associated with the labels, and on
Positive Instance Feature Aggregation (PIFA), which is a type of label embedding that aggregates the
TF-IDF features of the instances that are relevant for each label. Then, it applies k-means clustering in
order to generate label clusters according to these semantic representations. The
deep neural matcher fine-tunes BERT to encode an instance embedding, which
is then used to find the most relevant clusters for the instance. At the end of this step, only a
small subset of clusters is considered for the next step, which is performed by the ranker. The
ranker determines the relevance of the labels in the chosen clusters to the instance, which is
substantially more efficient than ranking all the initial labels. For a more
complete description of the X-Transformer algorithm, please refer to the original publication by
Chang et al. [6].
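   To make the PIFA label representation more concrete, the sketch below sums the feature vectors of the positive instances of each label and normalises the result to unit length. It is a simplified illustration with dense Python lists and made-up feature values, not the X-Transformer code, which operates on sparse TF-IDF matrices.

```python
import math


def pifa_embeddings(features, labels, num_labels):
    """Return one L2-normalised PIFA vector per label.

    features: one feature vector (e.g. TF-IDF) per training instance.
    labels: for each instance, the list of label identifiers assigned to it.
    """
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_labels)]
    for vector, doc_labels in zip(features, labels):
        for label in doc_labels:
            for j, value in enumerate(vector):
                sums[label][j] += value
    embeddings = []
    for vec in sums:
        norm = math.sqrt(sum(v * v for v in vec))
        # Labels without positive instances keep an all-zero vector.
        embeddings.append([v / norm for v in vec] if norm else vec)
    return embeddings


# Hypothetical toy data: three instances, two features, three labels.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
labels = [[0], [1], [0, 1]]
embeddings = pifa_embeddings(features, labels, num_labels=3)
```

   The resulting vectors are the kind of label representation that the semantic label indexer would then cluster with k-means.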
   The models developed for the different sub-tracks are shown in Table 2. We explored the fine-tuning
of different deep neural matchers. The BERT Base Multilingual Cased model was trained
on the Wikipedia dumps of the 104 languages with the largest Wikipedias and has the following
characteristics: 12-layer, 768-hidden, 12-heads, 110M parameters. The X-Transformer algorithm
uses the PyTorch implementation from HuggingFace Transformers [27]. The CANTEMIST
model corresponds to Model 7 described by Ruas et al. [26]. It is also based on the BERT
Base Multilingual Cased model and was first fine-tuned on 318,658 Spanish biomedical articles
from the IBECS, LILACS and PubMed databases, jointly with extracted entities, in the context of
the participation in the first edition of MESINESP [5].

   2
       https://www.nltk.org/

Table 2
Models used for the three sub-tracks, with the respective target datasets, thresholds (fraction of top entities to
consider according to their relevance), and deep neural matcher.
      Model           Target dataset    Threshold         Deep neural matcher
 LASIGE_BioTM-1              L               1.0                CANTEMIST
 LASIGE_BioTM-2              L              0.25                CANTEMIST
 LASIGE_BioTM-3              T               1.0       BERT Multilingual Base Cased
 LASIGE_BioTM-4              T              0.25       BERT Multilingual Base Cased
 LASIGE_BioTM-5              P               1.0       BERT Multilingual Base Cased

2.4. Training approach
We explored several training approaches according to the target corpus:

    • L corpus: Fine-tuning of the model CANTEMIST using the provided training dataset of
      249,474 documents and the provided test set with 10,179 documents.
    • T corpus: Training of the model BERT Multilingual Base Cased using the provided training
      dataset of 249,474 documents from the L corpus and a generated test set built from the
      3,560 clinical trials of the training set, the 147 clinical trials of the development set, and
      the 8,919 clinical trials of the test set (total of 12,627 documents).
    • P corpus: Training of the model BERT Multilingual Base Cased using the provided training
      dataset of 249,474 documents from the L corpus and a generated test set built from the
      115 patents of the development set and the 68,404 patents from the test set.

   The training of the deep neural matcher is the limiting step of the algorithm in terms of time.
Each model was trained for a single epoch and then evaluated on the respective test set. The
training and evaluation time was approx. 2 days for each model using a single NVIDIA Tesla P4
GPU. The values of the hyper-parameters were the following: depth=6, train_batch_size=4,
eval_batch_size=4, learning_rate=0.00005, warmup_rate=0.1.
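   To give an idea of what these values imply, the sketch below computes the number of optimisation steps in one epoch over the L training set and a learning rate schedule with warm-up. It rests on assumptions that are not stated in the text: one optimisation step per batch (no gradient accumulation), and a linear warm-up followed by a linear decay to zero. The function names are hypothetical.

```python
import math


def steps_per_epoch(num_documents, batch_size):
    """Number of batches needed to see every document once."""
    return math.ceil(num_documents / batch_size)


def learning_rate_at(step, total_steps, peak_lr=0.00005, warmup_rate=0.1):
    """Assumed schedule: linear warm-up to peak_lr, then linear decay to zero."""
    warmup_steps = int(warmup_rate * total_steps)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(total_steps - warmup_steps, 1)
    return peak_lr * max(total_steps - step, 0) / decay_steps


# One epoch over the 249,474 training documents with train_batch_size=4.
total = steps_per_epoch(249_474, 4)
```

   Under these assumptions a single epoch already corresponds to more than sixty thousand steps, which is consistent with the long training time reported for a single GPU.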


3. Results and discussion
The results obtained for each sub-track are shown in Table 3. The official evaluation metric of
the competition was the micro F1-score (MiF). Our best models achieved a MiF of 0.2007, 0.0686,
and 0.0314 in the sub-tracks L, T, and P, respectively. These results are low when compared to
Table 3
Results on test sets for the three sub-tracks. Performance for the baseline models, the best models, and
our models are shown according to the metrics: MiF-micro F1-score, MiP-micro precision, MiR-micro
recall, MaF-macro F1-score, MaP-macro precision, MaR-macro recall.
    Sub-track            Model             MiF       MiP       MiR       MaF       MaP       MaR
         L             Baseline           0.2876    0.2335    0.3746    0.3438    0.2335    0.3746
         L        BERTDeCS version 4      0.4837    0.5077    0.4618    0.3926    0.5237    0.3990
         L         LASIGE_BioTM-1         0.2007    0.1584    0.2738    0.0941    0.1016    0.1232
         L         LASIGE_BioTM-2         0.1886    0.1489    0.2573    0.0920    0.0950    0.1219
         T             Baseline           0.1288    0.0781    0.3678    0.2403    0.0977    0.3619
         T        BERTDeCS version 2      0.3640    0.3666    0.3614    0.3102    0.4177    0.3391
         T         LASIGE_BioTM-3         0.0679    0.0575    0.0828    0.0056    0.0050    0.0136
         T         LASIGE_BioTM-4         0.0686    0.0581    0.0838    0.0061    0.0054    0.0133
         P             Baseline           0.2992    0.4293    0.2296    0.2518    0.5290    0.2497
         P        BERTDeCS version 2      0.4514    0.4487    0.4541    0.4138    0.5041    0.4271
         P         LASIGE_BioTM-5         0.0314    0.0239    0.0459    0.0071    0.0060    0.0135


the top results in each sub-track: the differences in MiF between the top model and our best model are 0.2830, 0.2954, and
0.4200 in the sub-tracks L, T, and P, respectively.
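   For reference, the difference between the micro- and macro-averaged F1-scores reported in Table 3 can be illustrated with the minimal sketch below. It is not the official evaluation script: the label-based macro average over all labels that occur in the gold standard or in the predictions is an assumption, and the example documents and labels are made up.

```python
def f1(tp, fp, fn):
    """F1-score from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_macro_f1(gold, predicted):
    """gold and predicted are lists with one set of labels per document."""
    labels = set().union(*gold, *predicted)
    if not labels:
        return 0.0, 0.0
    counts = {label: [0, 0, 0] for label in labels}  # tp, fp, fn per label
    for g, p in zip(gold, predicted):
        for label in p & g:
            counts[label][0] += 1
        for label in p - g:
            counts[label][1] += 1
        for label in g - p:
            counts[label][2] += 1
    # Micro: pool the counts of all labels before computing F1.
    micro = f1(*(sum(c[i] for c in counts.values()) for i in range(3)))
    # Macro: compute F1 per label and average, so rare labels weigh as much as frequent ones.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    return micro, macro


# Hypothetical example with two documents and three labels.
gold = [{"A", "B"}, {"A"}]
predicted = [{"A"}, {"A", "C"}]
micro, macro = micro_macro_f1(gold, predicted)
```

   In this example the frequent label "A" is always predicted correctly, so the micro F1-score (2/3) is higher than the macro F1-score (1/3), which is penalised by the two rare labels. A similar effect may contribute to the gap between the MiF and MaF values of our models.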
   With respect to the initial hypothesis, the obtained results were mixed. In the sub-track
L, the LASIGE_BioTM-1 model, which included all the entities recognised in the documents,
obtained slightly better results (0.2007 MiF) than the LASIGE_BioTM-2 model (0.1886
MiF), which only included the top 25% most relevant entities. However, in the sub-track T, the
opposite happened, since LASIGE_BioTM-4 (top 25% entities) obtained marginally better results
(0.0686 MiF) than LASIGE_BioTM-3 (0.0679 MiF). Consequently, we cannot confirm the initial
hypothesis that feeding only the most relevant entities to the classifier algorithm improves its
performance.
   Assuming that there were no coding errors that may have undermined the results, there are
several possible reasons behind the relatively low performance that our models achieved in the
three sub-tracks.
   Arguably, the main one is related to the impossibility of optimising
the hyper-parameters of the classification algorithm, in particular the number of training epochs.
Each model was only trained or fine-tuned for one epoch on the respective training dataset,
which is probably not enough to accurately learn relevant features. The limited time we had available
made it impossible to extend the training process to more epochs. Additionally, we were
not able to train the models in a multi-GPU setting due to unresolved errors, so the duration
of each training epoch was approximately two days using a single GPU. Beyond the number
of training epochs, the optimisation of other hyper-parameters, such as train_batch_size,
eval_batch_size, and learning_rate, would probably lead to better performance.
   With respect to sub-track 2 and sub-track 3, the developed models were trained on
documents belonging to the L corpus (sub-track 1), and not on documents of the respective
sub-track corpora. The text present in scientific literature has different characteristics compared
with the text associated with clinical trials and patents, so models fine-tuned on one
type of text are likely to perform worse when evaluated on a
different type of text. For sub-track 3, there was no training dataset available, but for sub-track
2 it would probably have been better to train models 3 and 4 on the training dataset
of that sub-track and not on the training dataset for sub-track 1.


4. Conclusion
Our approach including an entity linking model and the X-Transformer algorithm obtained a
micro F1-score of 0.2007, 0.0686, and 0.0314 in sub-tracks 1, 2, and 3, respectively, which is a
low performance compared with the top participants, and even with the baseline approaches.
In order to improve the performance, we first need to perform a careful error analysis to identify
any coding errors that may have undermined the results. Next, we need to spend more time on
the training process, more concretely, to train the models for more epochs, to perform
hyper-parameter optimisation, to solve the problems associated with multi-GPU training, to
explore the use of summarisation tools to feed only the relevant content to the classifier, and to
explore less resource-demanding pre-trained models, such as DistilBERT. Besides, we only used
the titles of the articles based on previous studies, but in the future we will explore the impact
of using more text on the performance of the classification algorithm.


Acknowledgments
This work was supported by FCT through funding of Deep Semantic Tagger (DeST) project
(ref. PTDC/CCI-BIO/28685/2017) and LASIGE Research Unit (ref. UIDB/00408/2020 and ref.
UIDP/00408/2020); and FCT through funding of PhD Scholarship, ref. 2020.05393.BD.


References
 [1] A. Nentidis, G. Katsimpras, E. Vandorou, A. Krithara, L. Gasco, M. Krallinger, G. Paliouras,
     Overview of BioASQ 2021: The ninth BioASQ challenge on Large-Scale Biomedical Semantic
     Indexing and Question Answering (2021).
 [2] L. Gasco, A. Nentidis, A. Krithara, D. Estrada-Zavala, R.-T. Murasaki, E. Primo-Peña,
     C. Bojo-Canales, G. Paliouras, M. Krallinger, Overview of BioASQ 2021-MESINESP track.
     Evaluation of advance hierarchical classification techniques for scientific literature, patents
     and clinical trials (2021).
 [3] Y. Gui, Z. Gao, R. Li, X. Yang, Hierarchical text classification for news articles based-on
     named entities, Lecture Notes in Computer Science (including subseries Lecture Notes
     in Artificial Intelligence and Lecture Notes in Bioinformatics) 7713 LNAI (2012) 318–329.
     doi:10.1007/978-3-642-35527-1_27.
 [4] S. Andelic, M. Kondic, I. Peric, M. Jocic, A. Kovacevic, Text Classification Based on Named
     Entities, in: 7th International Conference on Information Society and Technology ICIST
     2017, 2017.
 [5] A. Neves, A. Lamurias, F. M. Couto, Extreme Multi-Label Classification applied to the
     Biomedical and Multilingual Panorama, in: CLEF 2020 Working Notes, 2020. URL: http:
     //ceur-ws.org/Vol-2696/paper_67.pdf.
 [6] W.-C. Chang, H.-F. Yu, K. Zhong, Y. Yang, I. Dhillon, Taming Pretrained Transformers
     for Extreme Multi-label Text Classification (2020). URL: https://doi.org/10.1145/3394486.
     3403368. doi:10.1145/3394486.3403368. arXiv:1905.02331v4.
 [7] A. Lamurias, F. Couto, Text Mining for Bioinformatics Using Biomedical Literature, 2019,
     pp. 602–611. doi:10.1016/B978-0-12-809633-8.20409-3.
 [8] F. M. Couto, A. Lamurias, Mer: a shell script and annotation server for minimal named
     entity recognition and linking, Journal of Cheminformatics 10 (2018) 58.
 [9] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
     transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polo-
     sukhin, Attention is all you need, in: Advances in neural information processing systems,
     2017, pp. 5998–6008.
[11] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, Biobert: a pre-trained biomedical
     language representation model for biomedical text mining, Bioinformatics 36 (2020)
     1234–1240.
[12] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained language model for scientific text, in:
     Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing
     and the 9th International Joint Conference on Natural Language Processing (EMNLP-
     IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3615–
     3620. URL: https://www.aclweb.org/anthology/D19-1371. doi:10.18653/v1/D19-1371.
[13] Z. Ji, Q. Wei, H. Xu, BERT-based ranking for biomedical entity normalization, AMIA
     Summits on Translational Science Proceedings 2020 (2020) 269.
[14] M. Pershina, Y. He, R. Grishman, Personalized page rank for named entity
     disambiguation, in: Proceedings of the 2015 Conference of the North American Chapter
     of the Association for Computational Linguistics: Human Language Technologies,
     Association for Computational Linguistics, Denver, Colorado, 2015, pp. 238–243. URL:
     https://www.aclweb.org/anthology/N15-1026. doi:10.3115/v1/N15-1026.
[15] A. Lamurias, P. Ruas, F. M. Couto, PPR-SSM: Personalized PageRank and semantic
     similarity measures for entity linking, BMC Bioinformatics 20 (2019) 1–12.
     doi:10.1186/s12859-019-3157-y.
[16] P. Resnik, Using information content to evaluate semantic similarity in a taxonomy, arXiv
     preprint cmp-lg/9511007 (1995).
[17] F. Couto, A. Lamurias, Semantic similarity definition, Encyclopedia of bioinformatics and
     computational biology 1 (2019).
[18] W.-C. Chang, H.-F. Yu, K. Zhong, Y. Yang, I. Dhillon, X-BERT: eXtreme Multi-label Text
     Classification with using Bidirectional Encoder Representations from Transformers (2019)
     1–12. URL: http://arxiv.org/abs/1905.02331. arXiv:1905.02331.
[19] Y. Prabhu, A. Kag, S. Harsola, R. Agrawal, M. Varma, Parabel: Partitioned label trees for
     extreme classification with application to dynamic search advertising, in: Proceedings of
     the World Wide Web Conference, WWW 2018, ACM, New York, NY, USA, April 23-27,
     2018, Lyon, France, 2018, pp. 993–1002. doi:10.1145/3178876.3185998.
[20] F. M. Couto, A. Lamurias, MER: a shell script and annotation server for minimal
     named entity recognition and linking, Journal of Cheminformatics 10 (2018) 58. URL:
     https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0312-9.
     doi:10.1186/s13321-018-0312-9.
[21] R. You, Z. Zhang, Z. Wang, S. Dai, H. Mamitsuka, S. Zhu, AttentionXML: Label Tree-
     based Attention-Aware Deep Model for High-Performance Extreme Multi-Label Text
     Classification, in: 33rd Conference on Neural Information Processing Systems (NeurIPS
     2019), Vancouver, Canada, 2019, pp. 1–11. arXiv:1811.01727.
[22] P. Xie, H. Shi, M. Zhang, E. P. Xing, A Neural Architecture for Automated ICD Coding, in:
     Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics
     (Long Papers), Association for Computational Linguistics, 2018, pp. 1066–1076.
[23] H. Shi, P. Xie, Z. Hu, M. Zhang, E. P. Xing, Towards Automated ICD Coding Using Deep
     Learning, Technical Report, 2017. arXiv:1711.04075v3.
[24] S. Silvestri, F. Gargiulo, M. Ciampi, G. De Pietro, Exploit Multilingual Language Model at
     Scale for ICD-10 Clinical Text Classification, Proceedings - IEEE Symposium on Computers
     and Communications 2020-July (2020). doi:10.1109/ISCC50000.2020.9219640.
[25] C. Sen, B. Ye, J. Aslam, A. Tahmasebi, From Extreme Multi-label to Multi-class: A
     Hierarchical Approach for Automated ICD-10 Coding Using Phrase-level Attention (2021). URL:
     http://arxiv.org/abs/2102.09136. arXiv:2102.09136.
[26] P. Ruas, A. Neves, V. D. Andrade, F. M. Couto, LasigeBioTM at CANTEMIST: Named entity
     recognition and normalization of tumour morphology entities and clinical coding of
     Spanish health-related documents, in: Proceedings of the Iberian Languages Evaluation
     Forum (IberLEF 2020), 2020, pp. 422–437. URL:
     http://ceur-ws.org/Vol-2664/cantemist_paper11.pdf.
[27] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf,
     M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao,
     S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural
     language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural
     Language Processing: System Demonstrations, Association for Computational Linguistics,
     Online, 2020, pp. 38–45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.