Extreme Multi-Label Classification applied to the Biomedical and Multilingual Panorama

Andre Neves, Andre Lamurias and Francisco M. Couto
LASIGE, Faculdade de Ciências, Universidade de Lisboa, Portugal
aneves@lasige.di.fc.ul.pt alamurias@lasige.di.fc.ul.pt fcouto@di.fc.ul.pt

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

Abstract. This paper presents our participation in the BioASQ tasks 8a and MESINESP. Our approach was based on X-BERT, a state-of-the-art deep learning algorithm developed for Extreme Multi-Label Classification (XMLC). We adapted this algorithm and its most recent version to the biomedical and multilingual biomedical panorama, which, to the best of our knowledge, makes this the first XMLC solution to be applied to both tasks. In addition, we combined this algorithm with a named-entity recognition tool to find DeCS and MeSH entities in the text in order to improve the assignment of labels by the XMLC algorithm. In both challenges, our results in the Micro F-Measure, which was the most relevant evaluation measure, were low. However, our submissions achieved top results in the precision measures, reaching 1st place in Micro Precision in the last two weeks of task 8a and 2nd place in MESINESP in Micro and Macro Precision, as well as in Example Based Precision.

Keywords: Extreme Multi-Label Classification, Deep Learning, Named-Entity Recognition, Semantic Indexing, Multilingual Semantic Indexing

1 Introduction

BioASQ is a series of international challenges focused on promoting the development of state-of-the-art solutions for NLP tasks, namely semantic indexing and question answering, in the biomedical domain. This year, in addition to the usual annual competitions, BioASQ presented a new task called MESINESP (Medical Semantic Indexing in Spanish) focused on the multilingual panorama, in this case the Spanish language. In this paper, we describe the approach of our submissions for both the BioASQ task 8a and the BioASQ MESINESP task.

The BioASQ task 8a consisted of indexing English biomedical abstracts retrieved from PubMed with MeSH (Medical Subject Headings) terms related to their content. The BioASQ MESINESP task was similar, but instead of MeSH it used DeCS (Health Sciences Descriptors), a hierarchy of terms closely related to MeSH that can be used to label biomedical articles in Spanish, Portuguese and English. The biomedical abstracts to index were in Spanish and were retrieved from the IBECS and LILACS databases.

In both competitions, each abstract could have more than one term indexed to it, making this a multi-label classification challenge. However, since there are more than 29,000 MeSH terms and more than 33,000 DeCS terms to choose from, our approach consisted of using an Extreme Multi-Label Classification (XMLC) algorithm to classify the abstracts, complemented by a Named Entity Recognition tool. XMLC algorithms can provide the best subset of labels to index data, choosing the labels from an extremely large label set, which can reach hundreds of thousands or even millions of labels [1].

Our approach was motivated by three reasons. The first was the fact that many XMLC algorithms have achieved high precision scores in recent years on large datasets with thousands of labels [2–5].
The second reason was the fact that, to the best of our knowledge, state-of-the-art XMLC solutions had not yet been applied to the biomedical domain or to the multilingual panorama. Finally, the third reason was to develop an XMLC model for both English and Spanish and compare their results. Therefore, we considered that both BioASQ task 8a and task MESINESP would be an appropriate setting to test our approach.

We chose the XMLC algorithm X-BERT [5], a state-of-the-art solution that achieved top results on benchmark datasets, surpassing other current XMLC algorithms. Its architecture can also be applied to other languages, which is essential for the MESINESP task. Furthermore, during the time of the competition, a new version of X-BERT was developed and made available by the authors, which allowed us to adapt it and apply it to this challenge. This way, we could compare not only the performance of an English model against a Spanish model, but also the performance of the older version of the algorithm against the newer one.

In addition, the algorithm was combined with MER (Minimal Named-Entity Recognition) [6, 7], a named-entity recognition tool that, given a lexicon and an input text, identifies the entities of the lexicon in the text, including their exact location. With this combination, we aimed at improving the assignment of labels by the XMLC algorithm through the identification of key concepts in the text.

2 Related Work

2.1 Language Models

Recently, many deep learning solutions developed for NLP tasks use pre-trained language models at their core, such as BERT (Bidirectional Encoder Representations from Transformers) [8], which has greatly changed the solutions applied to most NLP tasks. With the success of BERT, many other BERT-based models were developed for specific domains, such as SciBERT [9], a BERT model trained on scientific papers from the corpus of semanticscholar.org. SciBERT was developed with the goal of improving the performance of deep learning models on scientific NLP tasks and, thanks to its characteristics, it surpassed BERT and achieved state-of-the-art results on several NLP tasks from different scientific domains, such as the biomedical sciences.

2.2 Multilingual NLP

Pre-trained language models also constitute an important part of the resources available to develop deep learning models for multilingual NLP tasks. BERT has been applied to the multilingual panorama through three different models: BERT Base Multilingual Cased, BERT Base Multilingual Uncased and BERT Base Chinese. The largest model covers 104 languages and was trained on the text of the Wikipedia pages of each of those languages, excluding the user and talk pages. These BERT models surpassed the highest accuracy scores achieved on the XNLI dataset [10], a corpus designed for language transfer and cross-lingual sentence classification, translated into 15 languages by human experts.

Another great source of multilingual pre-trained models is the Hugging Face Transformers library (https://github.com/huggingface/transformers) [11]. Transformers is a Python library that provides several pre-trained deep learning models, such as BERT, for several NLP and language generation tasks. The library is enriched by a vast community of collaborators who adapt the existing models or contribute novel pre-trained models, many of them in non-English languages. However, multilingual biomedical models are still lacking in this library.
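As an illustration of how these resources are typically accessed, the sketch below loads the two kinds of pre-trained models discussed in this work through the Transformers library. It is a minimal sketch, not our competition code: the model identifiers are the publicly distributed checkpoints, and the example sentence is invented.

```python
# Minimal sketch: loading pre-trained language models through the Hugging Face
# Transformers library. The checkpoints named here are the public releases;
# they may differ from the exact checkpoints used in a given experiment.
from transformers import AutoTokenizer, AutoModel

# Multilingual BERT (cased), covering 104 languages, including Spanish.
mbert_tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
mbert_model = AutoModel.from_pretrained("bert-base-multilingual-cased")

# SciBERT, pre-trained on scientific text from semanticscholar.org.
scibert_tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
scibert_model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

# Example: encoding a Spanish biomedical title with the multilingual model.
inputs = mbert_tokenizer("Resistencia a los antibióticos en infecciones nosocomiales",
                         return_tensors="pt", truncation=True, max_length=512)
outputs = mbert_model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```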
Finally, another example of multilingual improvement, namely in the biomedical domain, is a recent work that aimed at developing word embeddings for the Spanish biomedical domain [12]. The authors used a combination of Spanish biomedical text retrieved from the SciELO database, along with text from specific Wikipedia topics, such as pharmacy, medicine or biology. When evaluated and compared with embeddings from a much larger but general-domain corpus, their embeddings model achieved better performance, thus showing the importance of domain-specific data.

2.3 Extreme Multi-Label Classification

As to XMLC, several machine learning solutions have been developed in the last decade [1]; however, only more recently have deep learning approaches been developed specifically for this task. One of the first attempts to apply deep learning to the XMLC task was XML-CNN [2], a convolutional neural network adapted from a state-of-the-art approach to a multi-class classification task [13]. The architecture of the neural network was extended with additional layers, one to capture features more precisely across different regions of the text and another to reduce the model size, increasing performance. The loss function was also changed so it could better rank the labels. The XML-CNN architecture was successful in applying deep learning to XMLC, surpassing most state-of-the-art algorithms on several datasets.

Another successful approach was AttentionXML [3], which used a BiLSTM (bidirectional long short-term memory) recurrent neural network with a multi-label attention layer to capture the most relevant parts of the text. However, AttentionXML could not scale well to the largest datasets, so HAXMLNet was created, adding to the same architecture a hierarchical clustering algorithm to divide the labels into smaller clusters, thus being able to work effectively on larger datasets where AttentionXML failed [4].

Lastly, one of the most recent approaches is X-BERT [5], the first deep learning approach to scale pre-trained language models, such as BERT [8], to XMLC. X-BERT uses a three-stage framework: first, it semantically indexes all the possible labels into clusters using ELMo [14]. Then, using a deep learning Transformer model, it indexes each text instance to the most relevant clusters. Finally, a linear ranker is trained to rank the labels retrieved from the previous cluster indices, modeling the relevance between each text instance and the retrieved labels and calculating the corresponding label scores. X-BERT surpassed other state-of-the-art XMLC methods on benchmark datasets such as Eurlex-4K, AmazonCat-13K and Wikipedia-500K, all of them available in the Extreme Classification Repository [15].

More recently, a newer version of X-BERT was released, renamed X-Transformer (https://github.com/OctoberChang/X-Transformer) [16]. X-Transformer includes more Transformer models, such as RoBERTa [17] and XLNet [18], and scales them to XMLC. The ranking phase was also improved by combining additional sampling strategies to reduce computational complexity and improve the algorithm's performance. It also surpasses the previous results achieved by X-BERT and other XMLC algorithms on the same benchmark datasets.
However, at the time of the BioASQ competitions, X-Transformer was still under development, so we did not have access to official results or comparisons between this version and other state-of-the-art algorithms.

3 Methods

3.1 X-BERT and X-Transformer modifications

For both tasks, modifications to the algorithm code were required. The first one was made in the vectorization of the labels of the training, test and validation sets. We chose to use all possible labels, including the labels that were not present in the training, test or validation sets. This change was needed because the algorithm would fail to work correctly if the number of labels differed between sets. Another modification was the inclusion of SciBERT in the choice of models to train X-BERT, so that we could use this model in task 8a. This inclusion was not made in X-Transformer, since we only used X-Transformer for the MESINESP task.

X-Transformer required additional changes since, at the time of the competition, the algorithm was still being developed by the X-BERT team and it could only process the four datasets on which X-BERT was tested [5]. To overcome this limitation, we modified some preprocessing scripts from the original X-BERT code and implemented them in X-Transformer, so it could process custom datasets, namely the MESINESP datasets. Finally, in addition to these modifications, both X-BERT and X-Transformer were changed so that they could process input data containing diacritical marks, such as accents, which are common in the Spanish language.

3.2 Pipeline

A common pipeline was developed for both tasks, with some changes according to the algorithm used or to the task, as can be seen in Figures 1 and 2. The first step was to retrieve the data provided by the competition organizers. A total of 318,658 article abstracts and titles were used for both task 8a and MESINESP, for comparison purposes. The articles were then split into training, test and validation sets using a 60%-20%-20% proportion to be used by X-BERT. The validation set was used for hyperparameter tuning, such as the number of training epochs, since X-BERT can perform an evaluation during training and save the hyperparameters in a checkpoint if the evaluation results surpass those of the previous checkpoint. The proportions differ when using X-Transformer, because it only requires training and test sets; the test set is also used by X-Transformer to save the model checkpoints during training evaluation. Therefore, we decided to use a 70%-30% proportion for X-Transformer. A summary of this distribution can be seen in Table 1.

Table 1. Total number of unique DeCS and MeSH term codes and the total number of articles used in the models, along with the corresponding proportion used for the train, test and validation sets. In X-BERT it was 60%-20%-20%; in X-Transformer it was 70%-30%, since there is no validation set.

System         Train Set  Test Set  Validation Set
X-BERT           191,195    63,732          63,731
X-Transformer    223,060    95,598               -
Total articles: 318,658. Unique MeSH terms: 29,640. Unique DeCS terms: 33,702.

The next step was the creation of a vocabulary file, which is required by both X-BERT and X-Transformer and contains the labels used to classify the data along with a corresponding numerical identifier.
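As an illustration of this data preparation, the sketch below shows one possible way to perform the split and to vectorize the labels over the full DeCS/MeSH vocabulary, the modification described in Section 3.1. This is a minimal sketch, not the competition code: the article list, label codes and the use of scikit-learn are illustrative assumptions.

```python
# Minimal sketch of the 60%-20%-20% split used for X-BERT and of label
# vectorization over the FULL label vocabulary, i.e. including DeCS/MeSH codes
# that never occur in any split (Section 3.1). Placeholder data only.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

# Toy articles; in the real setting there are 318,658 abstracts/titles.
articles = [{"id": f"doc{i}", "text": f"resumen de ejemplo {i}",
             "labels": ["D000818"] if i % 2 else ["D005260", "D008875"]}
            for i in range(10)]

# In practice this list holds all ~33,702 DeCS (or ~29,640 MeSH) codes.
full_label_vocab = ["D000818", "D005260", "D008875"]

texts = [a["text"] for a in articles]
labels = [a["labels"] for a in articles]

# Binarize against the complete vocabulary so that the train/test/validation
# matrices always have the same number of label columns.
binarizer = MultiLabelBinarizer(classes=full_label_vocab)
Y = binarizer.fit_transform(labels)

# 60% train, 20% test, 20% validation (X-BERT); X-Transformer used 70%-30%.
X_train, X_rest, Y_train, Y_rest = train_test_split(texts, Y, test_size=0.4, random_state=42)
X_test, X_val, Y_test, Y_val = train_test_split(X_rest, Y_rest, test_size=0.5, random_state=42)
print(len(X_train), len(X_test), len(X_val))
```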
Each line of the vocabulary file has a DeCS or MeSH code, according to the task, and its corresponding internal identifier, which is a number from 0 to N, where N is the total number of DeCS or MeSH terms minus 1. This internal numerical identifier is what allows X-BERT and X-Transformer to work with any type of label and with any language, since the algorithms use these numeric identifiers to classify the text. In addition, a label mapping file was created for each task. For MESINESP, it contains the correspondence between the DeCS term, its code and its numeric identifier in the vocabulary file, while for task 8a the file has the same structure but uses the MeSH terms instead. For example, the term 'Temefós', which has the corresponding DeCS code '2', is the first element in the vocabulary file, thus its numeric identifier is '0'. This label mapping file is later used to map the predictions from their numeric identifiers to the corresponding DeCS or MeSH codes required for the task.

Fig. 1. Developed pipeline to process the articles before running them through X-BERT and finally converting the resulting predictions into the required format for the competition. This pipeline was used for both BioASQ task 8a and task MESINESP.

After converting the codes to numeric identifiers, we used MER [6, 7], which, given a lexicon and an input text, identifies the entities of the lexicon in the text, including their exact location. For task 8a, we created a lexicon composed of the MeSH terms and their synonyms present in the MeSH 2020 XML file. For task MESINESP, we created a lexicon composed of the DeCS terms and their synonyms, which were provided by the competition, but we decided to also include their corresponding direct ancestors as an improvement over the choice made for task 8a. MER was used to recognize the lexicon terms in the article abstracts in both tasks. The identified terms were then added to the end of the titles or the abstracts, depending on the model being trained. Repeated terms were removed, and the resulting string was then stemmed using a Snowball stemmer (https://www.nltk.org/_modules/nltk/stem/snowball.html).

Fig. 2. Pipeline used in the MESINESP task to process the articles before running them through X-Transformer and finally converting the resulting predictions into the required format for the competition.

We chose to train some models using the abstracts and others using the titles in order to compare the performance of the algorithm according to the amount of text given for each article. In cases where the title of a specific article was not present, we used its abstract instead, and in cases where the abstract was not present, we used its title instead. In the end, the files given as input to X-BERT or X-Transformer contained, for each article, the list of DeCS or MeSH terms indexed to that article converted to their corresponding internal identifiers, the article's title or abstract, and the terms identified by MER.

The results of X-BERT and X-Transformer are given in the form of sparse matrices, with a number of rows equal to the number of articles that compose the test set and the columns corresponding to the possible labels. The prediction for each article is retrieved, comprising a top K of the most relevant labels and their corresponding confidence values. We used K=20 labels per article for both X-BERT and X-Transformer.
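The sketch below illustrates this last step under the assumptions stated in the comments: it reads the top-scoring labels of each row of a sparse prediction matrix and maps their internal identifiers back to DeCS or MeSH codes using the label mapping described above. The toy matrix and mapping are placeholders, not competition data.

```python
# Minimal sketch (illustrative, not the competition code) of reading the top-K
# predictions from a sparse matrix (rows = articles, columns = labels) and
# mapping internal numeric identifiers back to DeCS/MeSH codes.
import numpy as np
from scipy.sparse import csr_matrix

K = 20

# Toy prediction matrix: 2 articles x 5 labels, values are confidence scores.
predictions = csr_matrix(np.array([
    [0.00, 0.91, 0.00, 0.35, 0.10],
    [0.62, 0.00, 0.88, 0.00, 0.00],
]))

# Label mapping: internal numeric identifier -> DeCS/MeSH code (toy values).
id_to_code = {0: "D000001", 1: "D000002", 2: "D000003", 3: "D000004", 4: "D000005"}

top_k_per_article = []
for row in predictions:                      # each row is one article
    cols = row.indices                       # label identifiers with non-zero scores
    scores = row.data                        # corresponding confidence values
    order = np.argsort(scores)[::-1][:K]     # highest-scoring labels first
    top_k_per_article.append([(id_to_code[int(cols[i])], float(scores[i])) for i in order])

print(top_k_per_article[0])  # e.g. [('D000002', 0.91), ('D000004', 0.35), ('D000005', 0.1)]
```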
For a prediction to be chosen as correct, we discarded every label with a confidence value under a threshold, which is detailed in Section 4. Each label is then converted to the corresponding DeCS or MeSH term by using the previously created label mapping file, so that, in the end, the JSON file containing the predictions has the MeSH or DeCS codes for each article id.

In the test sets given by the competition organizers, there were no MeSH or DeCS terms indexing the articles, so we had to put a placeholder label on each article, because the code was not prepared to run on unlabeled data. We also had to artificially adapt the size of the given test sets by adding extra articles to them, so that they would have the same size as the ones used in the trained X-BERT or X-Transformer models; otherwise, the algorithm would fail to work. In this case, the set needed to have a total of 63,732 articles for X-BERT models and 95,598 articles for X-Transformer models, which corresponds to the number of articles used in the test set of those models. Since in both competitions the number of articles given to classify was under that value, we padded the test set with additional articles equal to the difference. For task 8a, we included indexed PubMed articles from the 14 million articles given by the competition as training data. For MESINESP, we included translated and indexed PubMed articles that came from a larger dataset given as additional material by the task organizers to the participants. The additional articles in both tasks were used as an extra validation set to define the confidence threshold values of our submissions and were not used to train or evaluate the models, so that the results would not be biased.

4 Evaluation Script

To evaluate the results achieved by our models, we developed a script to measure the Macro and Micro Precision, Recall and F1 scores, which were some of the measures used in both BioASQ competitions. For each model, two evaluations were made: one using the test set, and another using the additional validation set composed of the articles that were padded to the test sets given by the competition organizers, which acted as a preliminary evaluation for the competition.

The evaluation script was focused on choosing the confidence score threshold that achieved the highest Micro F1 (MiF) score and the highest Micro Precision (MiP). The focus on MiF was chosen since MiF was defined by the organization as the most important measure for both challenges. The focus on MiP was motivated by the improved precision of X-BERT over other XMLC algorithms [5], so we wanted to see how high it could reach with this type of data.

The confidence scores given by X-BERT to its predicted labels ranged from 0 to 1. However, in X-Transformer, we noticed that this interval changed from 0 to 1 to -0.99 to 1, thus encompassing negative confidence values. To facilitate the comparisons between X-Transformer and X-BERT, we normalized this scale to fit the interval from 0 to 1. Then, to determine the confidence score threshold that achieved the highest MiF or MiP values, we tested the values between 0.01 and 1, in increments of 0.01.
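The sketch below illustrates the core of this procedure, assuming predictions are available as per-article lists of (label, score) pairs. It is a simplified illustration, not the actual evaluation script; the function names and toy data are ours.

```python
# Minimal sketch of the threshold sweep described in Section 4: normalize
# X-Transformer scores to [0, 1], discard predicted labels below each candidate
# threshold, and keep the threshold that maximizes Micro F1 (or Micro Precision).
import numpy as np

def normalize_scores(scores, low=-0.99, high=1.0):
    """Map X-Transformer confidence scores from [low, high] to [0, 1]."""
    scores = np.asarray(scores, dtype=float)
    return (scores - low) / (high - low)

def micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F1 over per-article label sets."""
    tp = sum(len(g & p) for g, p in zip(gold, predicted))
    fp = sum(len(p - g) for g, p in zip(gold, predicted))
    fn = sum(len(g - p) for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def best_threshold(gold, scored_predictions, metric="f1"):
    """scored_predictions: one list of (label, score in [0, 1]) pairs per article."""
    best = (0.0, None)
    for threshold in np.arange(0.01, 1.01, 0.01):
        kept = [{label for label, score in preds if score >= threshold}
                for preds in scored_predictions]
        p, r, f1 = micro_prf(gold, kept)
        value = f1 if metric == "f1" else p
        if value > best[0]:
            best = (value, round(float(threshold), 2))
    return best  # (best metric value, corresponding threshold)

# Toy usage with two articles.
gold = [{"D01", "D02"}, {"D03"}]
preds = [[("D01", 0.95), ("D04", 0.20)], [("D03", 0.70), ("D02", 0.10)]]
print(best_threshold(gold, preds, metric="f1"))
```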
5 Developed Models

5.1 Titles vs Abstracts

Given the 512-token limit of the BERT models used, we considered using only the abstracts or only the titles to train the models, since combining abstracts and titles along with the entities recognized by MER would, in most cases, surpass the 512-token limit; the input would then be truncated and, consequently, relevant information and the list of entities found by MER would be lost. Therefore, we decided to check which combination could achieve better results: using the titles and the MER terms, or using the abstracts and the MER terms.

The main hypothesis was that, because the titles state the core topics of the article more directly and in fewer words than the abstract, combining them with a list of key terms and synonyms given by MER would allow the model to achieve better results, since it had a continuous string of words more closely related to the topics of the article, rather than a larger amount of words in which the key topics are more diluted. To test the hypothesis, we used X-BERT with the BERT Base Multilingual Cased model and the MESINESP dataset, which was the first to be made available by the organizers. The results are available in Table 2.

Analyzing the results of this experiment, we can see that the difference between using the titles and the abstracts is minimal. If the evaluation was focused on achieving the highest MiF value, the model that used the abstracts achieved slightly higher MiP and MiF values. However, if the evaluation was focused on attaining the highest MiP value, the model using the titles achieved better scores in all measures than its abstract counterpart, which confirms our hypothesis.

Table 2. Comparison between the results of two X-BERT models, one trained using the abstracts and the other the titles. Both models were evaluated with a focus on MiF and MiP. The threshold corresponds to the minimum confidence score that obtained the highest MiF or MiP score. Bold values correspond to the measures where the titles model surpassed the abstracts model.

Model            Focus  MiP     MiR     MiF     MaP     MaR     MaF     Threshold
Abstracts + MER  MiF    0.4780  0.4123  0.4427  0.4766  0.4094  0.4028  0.10
                 MiP    0.7117  0.1574  0.2578  0.6812  0.1642  0.2480  0.93
Titles + MER     MiF    0.4685  0.4142  0.4397  0.4704  0.4229  0.4072  0.11
                 MiP    0.7154  0.1636  0.2663  0.6901  0.1769  0.2617  0.92

5.2 BioASQ task 8a Models

Only two models were trained for the BioASQ task 8a, with the following characteristics:

– Model 1: MER in conjunction with X-BERT, using BERT Base Uncased and fine-tuned using the articles' titles.
– Model 2: MER in conjunction with X-BERT, using SciBERT and fine-tuned using the articles' titles.

Both models used the titles due to the results shown in the previous section and due to time restrictions: the competition had already begun when we started training the models, so we could only train two models. Therefore, Model 1 was trained for Batch 2 of the competition, while Model 2 was trained for Batch 3.

For both models, the default training and evaluation parameters were kept, except for the evaluation and training batch sizes, which were changed from their default values of 64 and 32 to 3 due to hardware limitations. The number of training epochs was also changed to 7, since the best checkpoint was usually achieved before this epoch. The models were trained on a single NVIDIA Tesla P4 GPU, taking one week each to be fully trained.
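Before moving to the MESINESP models, the sketch below makes the 512-token constraint from Section 5.1 concrete: it checks whether a title concatenated with deduplicated MER terms fits within BERT's maximum input length. The example text, terms and tokenizer choice are illustrative assumptions, not our exact preprocessing code.

```python
# Minimal sketch of the 512-token constraint from Section 5.1: the title (or
# abstract) concatenated with the MER terms must fit in BERT's maximum input
# length, otherwise the input is truncated and the MER terms appended at the
# end would be lost.
from transformers import AutoTokenizer

MAX_TOKENS = 512
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

title = "Resistencia a los antibióticos en unidades de cuidados intensivos"
mer_terms = ["antibiótico", "unidad de cuidados intensivos", "infección hospitalaria"]

model_input = title + " " + " ".join(sorted(set(mer_terms)))  # deduplicated MER terms
n_tokens = len(tokenizer(model_input)["input_ids"])

if n_tokens > MAX_TOKENS:
    print(f"Input has {n_tokens} tokens and would be truncated at {MAX_TOKENS}.")
else:
    print(f"Input fits: {n_tokens} of {MAX_TOKENS} tokens.")
```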
5.3 BioASQ task MESINESP Models

In total, four models were trained for the BioASQ MESINESP task, with the following characteristics:

– Model 1: MER in conjunction with X-BERT, using BERT Base Multilingual Cased and fine-tuned using the articles' abstracts.
– Model 2: MER in conjunction with X-BERT, using BERT Base Multilingual Cased and fine-tuned using the articles' titles.
– Model 3: X-Transformer using BERT Base Multilingual Cased and fine-tuned using the articles' abstracts.
– Model 4: X-Transformer using BERT Base Multilingual Cased and fine-tuned using the articles' titles.

Of these four models, we decided to train two using X-BERT and the other two using X-Transformer, since the latter was made available during the time of this competition. We also decided to train two models using the abstracts and the other two using the titles, so that we could not only compare the differences between the two versions of the algorithm, but also check whether our hypothesis from Section 5.1 would be confirmed by the results achieved in the competition.

For all models, we kept the default training and evaluation parameters, except for the evaluation and training batch sizes, due to hardware limitations. In X-BERT they were both changed from their default values of 64 and 32 to 3. In X-Transformer, both were changed to 4, and we also set the number of gradient accumulation steps to 2 to compensate for the small batch size. The X-BERT models were trained for 7 epochs, since the best checkpoint was usually achieved before this epoch, as in the BioASQ task 8a models. The X-Transformer models were trained for 3 epochs, the default value. All models were trained on a set of 4 GPUs composed of one NVIDIA Tesla P4 GPU and three NVIDIA Tesla M10 GPUs. Each model took approximately 4 to 5 days to be fully trained.

6 BioASQ task 8a Results

6.1 Preliminary Evaluation

The results of the preliminary evaluation using the models' test sets can be seen in Table 3. When focused on MiF, both Model 1 and Model 2 achieved similar results for their best scoring threshold value, which was the same for both. However, Model 2, the one trained using SciBERT, achieved slightly higher values, which was expected since SciBERT was created specifically for scientific text. MiP and Macro Precision (MaP) have the highest values, and both the Recall and F1 scores are relatively similar and higher than 0.50.

When focused on MiP, the threshold values differ, but the results of both models were again similar, with Model 2 achieving slightly higher values in all measures except Macro Recall, which was 0.01 lower than in Model 1. Both MaP and MiP increased significantly to values over 0.83, reaching a maximum of 0.89. However, this gain in precision comes at the cost of the recall and F1 values dropping by between 0.17 and 0.27.

Table 3. Results achieved by the models in the different measures using their corresponding test sets. The values of each measure correspond to the values achieved by the model using the value present in the Threshold column. Bold values correspond to the highest scores achieved in each measure.

Model    Focus  MiP     MiR     MiF     MaP     MaR     MaF     Threshold
Model 1  MiF    0.6179  0.4911  0.5472  0.6168  0.4972  0.5293  0.18
         MiP    0.8722  0.2467  0.3846  0.8650  0.2526  0.3695  0.86
Model 2  MiF    0.6270  0.5095  0.5622  0.6265  0.5175  0.5460  0.18
         MiP    0.8894  0.2408  0.3790  0.8832  0.2480  0.3664  0.89

6.2 Competition Results

We competed in both batches 2 and 3 of BioASQ task 8a.
The results achieved in each week of each batch can be found in Tables 4 and 5, where our scores are compared with the solution that achieved the highest MiF score in the same week.

In batch 2, we submitted results for weeks 3, 4 and 5. The submitted solutions were given by Model 1, the model trained using BERT Base Uncased, with the confidence score threshold that achieved the highest MiF score. The results in those weeks were suboptimal, with our model staying in the last position every week and in all evaluation measures.

For batch 3, we submitted results from weeks 2 to 5 using Model 2. In week 2, we used the same strategy as in the previous batch, focusing on achieving the highest MiF score. However, the results were very similar, with only a small increase in some precision measures. From week 3 onward, we started submitting the predictions focusing on the highest MiP value, since it was our best scoring measure in the previous submissions as well as in the preliminary evaluation. As a result, we achieved lower accuracy, recall and MiF scores, although the loss in MiF was less than 0.1. However, all precision measures increased significantly, making the model reach 3rd, 1st and again 1st place in the MiP measure in weeks 3, 4 and 5, respectively, and 1st place in Example Based Precision (EBP) in the final week. The only precision measures where the model did not have a significant increase were MaP and Lowest Common Ancestor Precision (LCA-P).

Table 4. Results achieved by our submitted models compared to the best scoring model in the Micro F-measure (MiF) in the same week. Only the flat measures are presented. Bold values correspond to values where our model surpassed the best scoring model and reached 1st place in the corresponding measure. Measures: MiF – Micro F-measure. MiP – Micro Precision. MiR – Micro Recall. MaF – Macro F-measure. MaP – Macro Precision. MaR – Macro Recall. Acc. – Accuracy. EBP – Example Based Precision. EBR – Example Based Recall. EBF – Example Based F-measure.

Batch  Week  System                         MiF    MiP    MiR    MaF    MaP    MaR    Acc.   EBP    EBR    EBF
2      3     deepmesh dmiip fdu             0.708  0.758  0.664  0.592  0.721  0.590  0.558  0.755  0.684  0.701
             X-BERT BioASQ (BERT - MiF)     0.316  0.465  0.239  0.066  0.350  0.054  0.184  0.462  0.238  0.294
       4     deepmesh dmiip fdu             0.707  0.768  0.655  0.588  0.732  0.584  0.558  0.768  0.673  0.703
             X-BERT BioASQ (BERT - MiF)     0.328  0.496  0.245  0.067  0.352  0.054  0.195  0.498  0.247  0.310
       5     deepmesh dmiip fdu             0.709  0.760  0.663  0.591  0.725  0.591  0.557  0.757  0.678  0.701
             X-BERT BioASQ (BERT - MiF)     0.320  0.477  0.240  0.066  0.343  0.054  0.187  0.480  0.239  0.399
3      2     dmiip fdu                      0.769  0.814  0.730  0.678  0.789  0.674  0.647  0.809  0.744  0.762
             X-BERT BioASQ (SciBERT - MiF)  0.330  0.494  0.248  0.067  0.345  0.055  0.195  0.495  0.247  0.309
       3     dmiip fdu                      0.779  0.830  0.735  0.677  0.810  0.669  0.664  0.825  0.751  0.774
             X-BERT BioASQ (SciBERT - MiP)  0.265  0.803  0.159  0.022  0.640  0.015  0.156  0.796  0.162  0.255
       4     deepmesh dmiip fdu             0.710  0.771  0.659  0.594  0.729  0.595  0.563  0.769  0.680  0.706
             X-BERT BioASQ (SciBERT - MiP)  0.245  0.781  0.145  0.024  0.639  0.017  0.140  0.750  0.145  0.232
       5     deepmesh dmiip fdu             0.706  0.758  0.661  0.589  0.714  0.591  0.556  0.756  0.677  0.700
             X-BERT BioASQ (SciBERT - MiP)  0.260  0.792  0.155  0.023  0.618  0.016  0.151  0.772  0.157  0.248
Table 5. Results achieved by our submitted models compared to the best scoring model in the Micro F-measure (MiF) in the same week. Only the hierarchical measures are presented. Measures: LCA-F – Lowest Common Ancestor F-measure. LCA-P – Lowest Common Ancestor Precision. LCA-R – Lowest Common Ancestor Recall. HiF – Hierarchical F-measure. HiP – Hierarchical Precision. HiR – Hierarchical Recall.

Batch  Week  System                         LCA-F  LCA-P  LCA-R  HiF    HiP    HiR
2      3     deepmesh dmiip fdu             0.565  0.608  0.550  0.782  0.839  0.762
             X-BERT BioASQ (BERT - MiF)     0.271  0.422  0.218  0.389  0.619  0.316
       4     deepmesh dmiip fdu             0.570  0.624  0.546  0.786  0.855  0.752
             X-BERT BioASQ (BERT - MiF)     0.281  0.448  0.221  0.400  0.652  0.319
       5     deepmesh dmiip fdu             0.567  0.615  0.547  0.785  0.844  0.759
             X-BERT BioASQ (BERT - MiF)     0.273  0.434  0.217  0.394  0.641  0.316
3      2     dmiip fdu                      0.657  0.696  0.641  0.829  0.877  0.809
             X-BERT BioASQ (SciBERT - MiF)  0.280  0.445  0.221  0.400  0.651  0.321
       3     dmiip fdu                      0.676  0.715  0.658  0.839  0.889  0.815
             X-BERT BioASQ (SciBERT - MiP)  0.194  0.526  0.125  0.275  0.867  0.175
       4     deepmesh dmiip fdu             0.576  0.627  0.555  0.789  0.852  0.760
             X-BERT BioASQ (SciBERT - MiP)  0.182  0.509  0.115  0.251  0.826  0.157
       5     deepmesh dmiip fdu             0.568  0.613  0.550  0.784  0.845  0.758
             X-BERT BioASQ (SciBERT - MiP)  0.189  0.510  0.121  0.262  0.840  0.165

7 BioASQ task MESINESP Results

7.1 Preliminary Evaluation

In contrast to the task 8a preliminary evaluation, for task MESINESP we added a new evaluation parameter: a confidence score threshold of 0.50. We decided to add this value since it is the middle of the confidence interval scale and should provide a baseline performance.

The results achieved in the first evaluation, using the test set, can be seen in Table 6. For Model 1 and Model 2, the scores achieved by each model were very similar, with slightly higher scores achieved by Model 2. As for Models 3 and 4, we noticed that they achieved higher values than those achieved by the older version of the algorithm. Also, and contrary to Models 1 and 2, Model 3, which used the abstracts, achieved higher values in all measures when compared with Model 4, except when focused on MiP, where Model 4 achieved higher precision values.

However, in the second evaluation, which used the additional validation set, the results achieved were different, as can be seen in Table 7. First, there was a drop in the scores, especially in the recall measures, which in turn lowered the F-scores. When focusing on MiP, the best performing model was Model 2 and not Model 4, as it was in the first evaluation. This small difference might be due to the usage of MER in Model 2, which Model 4 did not use. As to the focus on MiF, the best scoring model was Model 4, with the highest Macro and Micro recall and F-scores.

Table 6. Results achieved by the models in the different measures using their corresponding test sets. The values of each measure correspond to the values achieved by the model using the threshold value shown. Bold values correspond to the highest scores achieved in each measure.
Model    Focus  MiP     MiR     MiF     MaP     MaR     MaF     Threshold
Model 1  Base   0.6570  0.2622  0.3748  0.6295  0.2620  0.3386  0.50
         MiF    0.4780  0.4123  0.4427  0.4766  0.4094  0.4028  0.10
         MiP    0.7117  0.1574  0.2578  0.6812  0.1642  0.2480  0.93
Model 2  Base   0.6492  0.2663  0.3777  0.6307  0.2771  0.3512  0.50
         MiF    0.4685  0.4142  0.4397  0.4704  0.4229  0.4072  0.11
         MiP    0.7154  0.1636  0.2663  0.6901  0.1769  0.2617  0.92
Model 3  Base   0.7026  0.2819  0.4024  0.6890  0.2922  0.3773  0.50
         MiF    0.4826  0.4971  0.4898  0.4830  0.5046  0.4587  0.24
         MiP    0.7249  0.2539  0.3761  0.7081  0.2652  0.3555  0.55
Model 4  Base   0.6928  0.2785  0.3974  0.6836  0.2984  0.3798  0.50
         MiF    0.5137  0.4318  0.4693  0.5140  0.4513  0.4437  0.28
         MiP    0.7765  0.1465  0.2465  0.7595  0.1674  0.2567  0.79

Table 7. Results achieved by our models when using the validation set composed of additional indexed articles from PubMed translated to Spanish. Bold values correspond to the highest values achieved in the corresponding measures. Gray lines correspond to the models whose predictions were submitted to the MESINESP challenge.

Model    Focus  MiP     MiR     MiF     MaP     MaR     MaF     Threshold
Model 1  Base   0.6132  0.0787  0.1395  0.6117  0.0838  0.1400  0.50
         MiF    0.2937  0.1506  0.1991  0.3163  0.1545  0.1898  0.05
         MiP    0.6737  0.0550  0.1018  0.6627  0.0603  0.1082  0.97
Model 2  Base   0.6290  0.0759  0.1355  0.6323  0.0814  0.1377  0.50
         MiF    0.3029  0.1572  0.2070  0.3204  0.1608  0.1975  0.05
         MiP    0.6903  0.0556  0.1028  0.6823  0.0613  0.1103  0.95
Model 3  Base   0.5610  0.0824  0.1437  0.5848  0.0886  0.1437  0.50
         MiF    0.3198  0.1730  0.2245  0.3314  0.1797  0.2143  0.15
         MiP    0.6468  0.0489  0.0910  0.6471  0.0550  0.0998  0.99
Model 4  Base   0.6448  0.0832  0.1473  0.6374  0.0896  0.1499  0.50
         MiF    0.3786  0.2041  0.2652  0.3761  0.2107  0.2506  0.14
         MiP    0.6823  0.0549  0.1017  0.6752  0.0613  0.1103  0.77

After this second evaluation, we decided to submit a total of four prediction files, two from an X-BERT model and two from an X-Transformer model, so that the performance of the models could be compared in the competition. The results of Model 2 focusing on MiF and on MiP were chosen, since the focus on MiP presented the highest precision scores, and the focus on MiF had better recall and F-scores than Model 1. From Model 4, the results focusing on MiF and on the baseline were chosen: the focus on MiF achieved the highest recall and F1 scores, and the baseline was preferred because the precision gains from the MiP focus were not significant and came with heavier losses on the remaining measures.

7.2 Competition Results

In Table 8, we present the results achieved by our four submissions, the BioASQ baseline and the winning system. As in BioASQ task 8a, our results in the MiF measure were not optimal, with the four submissions staying in the bottom positions of the classification. However, in the precision measures, the scores achieved by two of our submissions were higher than those of most systems, even surpassing the winning system. The predictions of Model 4 focused on the baseline achieved 2nd place in the MiP, MaP and EBP measures, while the predictions of Model 2 focused on MiP achieved 3rd place in MiP as well as 5th place in EBP. As for the models focused on MiF, they achieved balanced scores across all measures, but these were not enough to surpass the BioASQ baseline. We can also notice that the submissions focused on MiF achieved higher accuracy scores, which was expected.

Table 8. Results achieved by the models submitted for the MESINESP task (gray lines) compared to the best scoring system in the Micro F-measure (MiF) and the BioASQ baseline. Bold values correspond to the measures where our submissions surpassed both the BioASQ baseline and the winning system. Measures: MiF – Micro F-measure. MiP – Micro Precision.
MiR – Micro Recall. MaF – Macro F-measure. MaP – Macro Precision. MaR – Macro Recall. Acc. – Accuracy. EBP – Example Based Precision. EBR – Example Based Recall. EBF – Example Based F-measure.

System                                       MiF     MiP     MiR     MaF     MaP     MaR     Acc.    EBP     EBR     EBF
Winning System                               0.4254  0.4374  0.4140  0.3194  0.3989  0.3380  0.2786  0.4382  0.4343  0.4240
BioASQ Baseline                              0.2695  0.2337  0.3182  0.2816  0.3733  0.3220  0.1659  0.2681  0.3239  0.2754
X-BERT BioASQ F1 (Model 2 focus MiF)         0.1430  0.4577  0.0847  0.0220  0.3095  0.0186  0.0787  0.5057  0.0867  0.1397
X-BERT BioASQ (Model 2 focus MiP)            0.0909  0.5449  0.0496  0.0045  0.3422  0.0036  0.0503  0.5415  0.0508  0.0916
LasigeBioTM TXMC F1 (Model 4 focus MiF)      0.2507  0.3559  0.1936  0.0858  0.3646  0.0799  0.1440  0.3641  0.1986  0.2380
LasigeBioTM TXMC P (Model 4 focus baseline)  0.1271  0.6864  0.0701  0.0104  0.6989  0.0081  0.0708  0.6609  0.0716  0.1261

8 Discussion

8.1 Achieved Results

Overall, the results achieved in both competitions showed that deep learning XMLC solutions can achieve promising results in the biomedical and multilingual panorama, especially given the high MiP values achieved in MESINESP and in the last three weeks of task 8a. However, there is also great room for improvement. For example, in both task 8a and MESINESP, the models struggled in all recall measures, with scores lower than most of the competing systems, despite achieving higher scores in the precision measures.

We think that these low recall values can possibly be explained by several factors. The amount of data used to train the models could have impacted our results. We chose to use the same number of articles in both competitions, which was the number of available articles in the pre-processed training set from task MESINESP, so that we could compare the performance of a biomedical XMLC-based model across multiple languages. There was much more data available for both task 8a and task MESINESP that we could have used to train larger models. However, this was not possible to do in time due to the hardware limitations, which led to each model requiring approximately one week to train.

Another limitation was the fact that we did not use more complex lexicons in MER to identify entities in the text. The lexicon we used in task 8a was rather simple, comprising only the MeSH terms and their synonyms. We could have used more complex lexicons, such as one including the terms' ancestors or ontology information, which could possibly have increased the final scores. In task MESINESP, the fact that we did not use MER in the X-Transformer models could also have negatively influenced our results.

Finally, another reason for our low recall values could be that XMLC algorithms tend to achieve higher precision values due to an increased focus on precision measures, namely Precision at top K (P@K). P@K is widely used in the evaluation and comparison of XMLC algorithms [1–5, 15, 16], as it considers how many of the top K labels retrieved by the algorithm for each document are relevant. The choice of this measure to evaluate the performance of these algorithms makes sense when dealing with datasets with thousands of documents, where the objective is to retrieve the most relevant labels from a set of thousands or millions of possible labels. However, this increased focus on precision can negatively affect the evaluation of the systems on other measures, such as recall and the F1 score, which were evaluated in these two tasks.
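For clarity, the sketch below shows how P@K is typically computed; the gold labels and ranked predictions are toy values, and this is not the official BioASQ evaluation code.

```python
# Minimal sketch of Precision at top K (P@K): for each article, the fraction of
# the K highest-scored labels that are actually relevant, averaged over articles.
def precision_at_k(gold_labels, ranked_predictions, k=5):
    """gold_labels: list of sets; ranked_predictions: list of lists sorted by score."""
    per_article = []
    for gold, ranked in zip(gold_labels, ranked_predictions):
        top_k = ranked[:k]
        hits = sum(1 for label in top_k if label in gold)
        per_article.append(hits / k)
    return sum(per_article) / len(per_article)

# Toy usage: two articles, P@3.
gold = [{"D01", "D02", "D07"}, {"D03", "D05"}]
ranked = [["D01", "D09", "D02", "D04"], ["D05", "D03", "D08", "D01"]]
print(precision_at_k(gold, ranked, k=3))  # (2/3 + 2/3) / 2 = 0.667 (approx.)
```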
8.2 English vs Multilingual

One of our objectives with our participation in these two tasks was to compare the performance of an English model against its multilingual counterpart in the biomedical domain. Comparing the results achieved by our submissions in BioASQ task 8a and in task MESINESP, we notice that there is a great difference between the performances of an English model and a multilingual model. In task 8a, Models 1 and 2 shared the same characteristics as MESINESP Model 2: the three of them were X-BERT models trained with the same number of articles, all used the titles, and all were combined with MER to find key terms in the abstracts. The major difference between them was the pre-trained language model used: in task 8a, the models used BERT Base Uncased and SciBERT, while in task MESINESP the model used BERT Base Multilingual Cased.

Table 9. Results of the different versions of the common model (articles' titles + terms identified by MER) used in the BioASQ competitions, when evaluated using the corresponding validation set. Bold values correspond to the highest values achieved in the corresponding measures.

Model              Focus  MiP     MiR     MiF     MaP     MaR     MaF     Threshold
BERT Base Uncased  MiF    0.6179  0.4911  0.5472  0.6168  0.4973  0.5293  0.18
                   MiP    0.8722  0.2467  0.3846  0.8650  0.2526  0.3695  0.86
SciBERT            MiF    0.6270  0.5095  0.5622  0.6265  0.5175  0.5460  0.18
                   MiP    0.8894  0.2408  0.3790  0.8832  0.2480  0.3665  0.89
BERT Multilingual  MiF    0.4685  0.4142  0.4397  0.4704  0.4229  0.4072  0.11
                   MiP    0.7154  0.1636  0.2663  0.6901  0.1769  0.2617  0.92

Analyzing the scores achieved by these three models, we notice a significant difference in the scores achieved both in our own evaluations and in the competition evaluations. The clearest signs are that, in the task 8a evaluation, our worst competition scores surpass our best MESINESP submission in the non-precision measures, and that there is an enormous difference between the scores of the top-ranked systems in task 8a and the MESINESP winner. This difference may be caused by the number of documents to classify in each task, which in MESINESP was more than 24,000 articles in total, whereas in task 8a it was about 5,000 to 7,000 articles in each week. However, in our evaluations using the models' test sets, we can also see that the English-based models achieve better results in all measures, as shown in Table 9.

The reason for the difference in the results of the evaluations is probably the amount of data used to train the pre-trained language models. The BERT models were trained mostly on text data retrieved from Wikipedia. However, the amount of Wikipedia text varies between languages. For example, in terms of words, the BERT Base model was trained with more than 3,000 million words, with about 2,500 million coming from the English Wikipedia [8]. In comparison, the number of words in the Spanish Wikipedia, as of June 2020, was less than 900 million. This enormous gap is very significant and is consequently reflected in the results achieved by the non-English models.

8.3 Future Work

We could not submit in time an X-Transformer model that used MER on the abstracts to identify relevant keywords. Therefore, in the future we will develop a model using X-Transformer in conjunction with MER for both the English and the Spanish language and check whether the usage of MER with X-Transformer can improve the results in the different evaluation measures.
In addition, we intend to add further terms to the MER lexicon and to use ontologies other than DeCS and MeSH. Afterwards, we can also run MER on both titles and abstracts, since the titles might contain relevant keywords that are not present in the abstracts.

We could also have developed a model using BioBERT [19], a BERT model trained on English biomedical text for NLP tasks, but due to time and hardware constraints, we were not able to adapt and train a model using BioBERT, which could have led to improved results in task 8a. In the future, we intend to develop an X-BERT and an X-Transformer model using BioBERT and compare their performance with the models trained using SciBERT.

Finally, as a way of further improving the results, we could use semantic similarity solutions to find which labels are most related to the abstract, hoping to reduce the number of false negatives and consequently improve the recall and F-measures.

9 Conclusion

With our participation in these two competitions, we successfully adapted and applied a state-of-the-art deep learning XMLC solution to the biomedical and multilingual biomedical panorama. As far as we know, deep learning XMLC-based solutions had not yet been applied to either of these scenarios. The results achieved in the competitions were promising, with our submissions achieving top classifications in the precision measures when compared with most competing models. Although our results were not optimal, we achieved 1st place in MiP in the last two weeks of task 8a and 2nd place in MiP, MaP and EBP in task MESINESP. However, the performance of multilingual models is still far from the performance of English models in NLP tasks, especially in the biomedical domain. Therefore, we can conclude that there is a great need for additional research and development of non-English deep learning models and corpora, especially for specific domains such as the biomedical sciences.

10 Acknowledgements

This project was supported by FCT through funding of the DeST: Deep Semantic Tagger project, ref. PTDC/CCI-BIO/28685/2017, and the LASIGE Research Unit, ref. UIDB/00408/2020.

References

1. Bhatia, K., Jain, H., Kar, P., Varma, M., and Jain, P.: Sparse local embeddings for extreme multi-label classification. Advances in Neural Information Processing Systems. pp. 730–738. (2015)
2. Liu, J., Chang, W. C., Wu, Y., and Yang, Y.: Deep learning for extreme multi-label text classification. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '17). pp. 115–124. Association for Computing Machinery, New York, NY, USA. (2017) https://doi.org/10.1145/3077136.3080834
3. You, R., Zhang, Z., Wang, Z., Dai, S., Mamitsuka, H., and Zhu, S.: AttentionXML: Label Tree-based Attention-Aware Deep Model for High-Performance Extreme Multi-Label Text Classification. Retrieved from: http://arxiv.org/abs/1811.01727 (2018)
4. You, R., Zhang, Z., Dai, S., and Zhu, S.: HAXMLNet: Hierarchical Attention Network for Extreme Multi-Label Text Classification. Retrieved from: http://arxiv.org/abs/1904.12578 (2019)
5. Chang, W. C., Yu, H. F., Zhong, K., Yang, Y., and Dhillon, I.: X-BERT: eXtreme Multi-label Text Classification with BERT. Retrieved from: https://arxiv.org/abs/1905.02331v2 (2019)
6. Couto, F. M., and Lamurias, A.: MER: a shell script and annotation server for minimal named entity recognition and linking. J Cheminform 10 (58), (2018) https://doi.org/10.1186/s13321-018-0312-9
7. Couto, F. M., Campos, L. F., and Lamurias, A.: MER: a Minimal Named-Entity Recognition Tagger and Annotation Server. In: Proceedings of the BioCreative V.5 Challenge Evaluation Workshop. pp. 130–137. (2017)
8. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Retrieved from: http://arxiv.org/abs/1810.04805 (2018)
9. Beltagy, I., Lo, K., and Cohan, A.: SciBERT: A Pretrained Language Model for Scientific Text. Retrieved from: http://arxiv.org/abs/1903.10676 (2019)
10. Conneau, A., Rinott, R., Lample, G., Williams, A., Bowman, S., Schwenk, H., and Stoyanov, V.: XNLI: Evaluating Cross-lingual Sentence Representations. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. pp. 2475–2485. Association for Computational Linguistics, Brussels, Belgium. (2018) https://doi.org/10.18653/v1/d18-1269
11. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., and Brew, J.: HuggingFace's Transformers: State-of-the-art Natural Language Processing. Retrieved from: http://arxiv.org/abs/1910.03771 (2019)
12. Soares, F., Villegas, M., Gonzalez-Agirre, A., Krallinger, M., and Armengol-Estapé, J.: Medical Word Embeddings for Spanish: Development and Evaluation. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop. pp. 124–133. Association for Computational Linguistics, Minneapolis, Minnesota, USA. (2019) https://doi.org/10.18653/v1/W19-1916
13. Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1746–1751. Association for Computational Linguistics, Doha, Qatar. (2014) https://doi.org/10.3115/v1/d14-1181
14. Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L.: Deep Contextualized Word Representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). pp. 2227–2237. Association for Computational Linguistics, New Orleans, Louisiana, USA. (2018) https://doi.org/10.18653/v1/n18-1202
15. The extreme classification repository: Multi-label datasets & code. http://manikvarma.org/downloads/XC/XMLRepository.html. Last accessed 30 Jun 2020
16. Chang, W. C., Yu, H. F., Zhong, K., Yang, Y., and Dhillon, I.: Taming Pretrained Transformers for Extreme Multi-label Text Classification. Retrieved from: http://arxiv.org/abs/1905.02331v4 (2020)
17. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. Retrieved from: http://arxiv.org/abs/1907.11692 (2019)
18. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., and Le, Q. V.: XLNet: Generalized Autoregressive Pretraining for Language Understanding. Retrieved from: http://arxiv.org/abs/1906.08237 (2019)
19. Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., and Kang, J.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36 (4), pp. 1234–1240. (2020) https://doi.org/10.1093/bioinformatics/btz682