<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Priberam at MESINESP Multi-label Classification of Medical Texts Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ruben Cardoso</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zita Marinho</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Afonso Mendes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastião Miranda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Priberam Labs</institution>
          ,
          <addr-line>Lisbon, Portugal labs.priberam.com</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Medical articles are a crucial tool to provide current state-of-the-art treatments and diagnostics to medical professionals. However, existing public databases such as MEDLINE contain over 27 million articles, making the use of efficient search engines crucial in order to navigate and provide meaningful recommendations. Classifying these articles into broader medical topics can improve retrieval of related articles [1]. The set of medical labels considered for the MESINESP task is on the order of several thousands of labels (DeCS codes), which falls under the extreme multi-label classification problem [2]. The heterogeneous and highly hierarchical structure of medical topics makes the task of manually classifying articles extremely laborious and costly. It is, therefore, crucial to automate the process of classification. Typical machine learning algorithms become computationally demanding with such a large label set, and achieving better recall remains an unsolved problem. This work presents Priberam's participation at the BioASQ task Mesinesp. We address the large multi-label classification problem through the use of four different models: a Support Vector Machine (SVM) [3], the customised search engine Priberam Search [4], a BERT based classifier [5], and an SVM-rank ensemble [6] of all the previous models. Results show that all three individual models perform well and the best performance is achieved by their ensemble, granting Priberam the 6th place in the present challenge and making it the 2nd best team.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        A growing number of medical articles is published every year, with a current
estimated rate of at least one new article every 26 seconds [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The sheer number of both documents and assigned topics renders automatic classification
algorithms a necessity in organising and providing relevant information. Search
engines have a vital role in easing the burden of accessing this information
efficiently; however, they usually rely on the manual indexing or tagging of articles,
which is a slow and burdensome process [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        The Mesinesp task consists in automatically indexing abstracts in Spanish
from two well-known medical databases, IBECS and LILACS, with tags from
a pool of 34118 hierarchically structured medical terms, the DeCS codes. This
trilingual vocabulary (English, Portuguese and Spanish) serves as a unique
vocabulary in indexing medical articles. It follows a tree structure that divides the
codes into broader classes and more refined sub-classes, respecting their
conceptual and semantic relationships [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>In this task, we tackle the extreme multi-label (XML) classification problem.
Our goal is to predict, for a given article, the most relevant subset of labels
from an extremely large label set (on the order of tens of thousands) using supervised
training.¹ Typical multi-label classification techniques are not suitable for the
XML setting due to their large computational requirements: the large number
of labels implies that both label and feature vectors are sparse and exist in
high-dimensional spaces; and to address the sparsity of label occurrence, a large
number of training instances is required. These factors make the application of
such techniques highly demanding in terms of time and memory, increasing the
requirements of computational resources.</p>
      <p>
        The Mesinesp task is even more challenging for two reasons: first, the
articles' labels must be predicted only from the abstracts and titles; and
second, all the articles to be classified are in Spanish, which prevents the use of
additional resources available only for English, such as BioBERT [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and
ClinicalBERT [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        This paper describes our participation at the BioASQ task Mesinesp. We
explore the performance of a one-vs.-rest model based on Support Vector Machines
(SVM) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] as well as that of a proprietary search engine, Priberam Search [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
which relies on inverted indexes combined with a k-nearest neighbours classifier.
Furthermore, we took advantage of BERT's contextualised embeddings [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and
tested three possible classifiers: a linear classifier; a label attention mechanism
that leverages label semantics; and a recurrent model that predicts a sequence
of labels according to their frequency. We propose the following contributions:
- Application of BERT's contextualised embeddings to the task of XML
classification, including the exploration of linear, attention-based and recurrent
classifiers. To the best of our knowledge, this work is the first to apply a
pretrained BERT model combined with a recurrent network to the XML
classification task.
- Empirical comparison of a simple one-vs.-rest SVM approach with a more
complex model combining a recurrent classifier and BERT embeddings.
- An ensemble of the previous individual methods using SVM-rank, which was
capable of outperforming them.
¹ The task of multi-label classification differs from multi-class classification in that
labels are not exclusive, which enables the assignment of several labels to the same
article, making the problem even harder [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Currently, there are two main approaches to XML: embedding based methods
and tree based methods.</p>
      <p>
        Embedding based methods deal with the problem of high dimensional feature
and label vectors by projecting them onto a lower dimensional space [
        <xref ref-type="bibr" rid="ref13 ref8">8,13</xref>
        ].
During prediction, the compressed representation is projected back onto the space of
high dimensional labels. This information bottleneck can often reduce noise and
allow for a way of regularising the problem. Although very efficient and fast, this
approach assumes that the low-dimensional space is capable of
encoding most of the original information. For real world problems, this assumption
is often too restrictive and may result in decreased performance.
      </p>
      <p>
        Tree based approaches intend to learn a hierarchy of features or labels from
the training set [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ]. Typically, a root node is initialised with the complete
set of labels and its children nodes are recursively partitioned until all the leaf
nodes contain a small number of labels. During prediction, each article is passed
along the tree and the path towards its final leaf node defines the predicted set
of labels. These methods tend to be slower than embedding based methods but
achieve better performance. However, if a partitioning error is made near the
top of the tree, its consequences are propagated to the lower levels.
      </p>
      <p>
        Furthermore, other methods are worth mentioning because their simple approaches
are capable of achieving competitive results. Among these, DiSMEC [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] should
be highlighted because it follows a one-vs.-rest approach which simply learns a
weight vector for each label. The multiplication of this weight vector with a
data point's feature vector yields a score that determines the classification of the label.
Another simple approach consists of performing a set of random projections
from the feature space towards a lower dimension space where, for each test
data point, a k-nearest neighbours algorithm performs a weighted propagation
of the neighbour's labels, based on their similarity [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>We propose two new approaches which are substantially distinct from the
ones discussed above. The first uses a search engine based on inverted
indexing and the second leverages BERT's contextualised embeddings combined
with either a linear or recurrent layer.</p>
    </sec>
    <sec id="sec-3">
      <title>XML Classification Models</title>
      <p>We explore the performance of a one-vs.-rest SVM model in §3.1, and a
customised search engine (Priberam Search) in §3.2. We further experiment with
several classifiers leveraging BERT's contextualised embeddings in §3.3. Finally,
we aggregate the predictions of all of these individual models using an
SVM-rank algorithm in §3.4.</p>
      <p>
        3.1 Support Vector Machine
Our first baseline consists of a simple Support Vector Machine (SVM) using a
one-vs.-rest strategy. We train an independent SVM classifier for each possible
label. To reduce the computational burden, we only consider labels with
frequency above a given threshold $f_{min}$. Each classifier weight $w \in \mathbb{R}^d$ measures
the importance assigned to each feature representation of a given article and is
trained to optimise the max-margin loss between the support vectors $x_i \in \mathbb{R}^d$ and the
hyper-plane [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]:
      </p>
      <p>
\[
\min_{w} \; \tfrac{1}{2} w^\top w + C \sum_{i=1}^{l} \xi(w; x_i, y_i)
\qquad \text{s.t.} \quad y_i(w^\top x_i + b) \ge 1 - \xi_i
\tag{1}
\]
where $(x_i, y_i)$ are the article-label pairs, $C$ is the regularisation parameter, $b$ is
a bias term, $\xi$ is a slack function used to penalise incorrectly
classified points, and $w$ is the vector normal to the decision hyper-plane. We used
the abstract's term frequency-inverse document frequency (tf-idf) as features to
represent $x_i$.</p>
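      <p>As an illustration only (not the authors' exact code), the one-vs.-rest pipeline can be sketched with scikit-learn; the feature-extraction settings and the handling of the decision-boundary offset below are assumptions:</p>
      <preformat>
# Minimal sketch of the one-vs.-rest SVM baseline; `abstracts` is a list of
# strings and `labels` a list of DeCS-code lists (both assumed available).
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

F_MIN = 20  # ignore labels occurring in fewer than 20 abstracts (Sec. 4.1)

# Keep only sufficiently frequent labels.
counts = Counter(code for codes in labels for code in codes)
kept = {code for code, c in counts.items() if c &gt;= F_MIN}
filtered = [[code for code in codes if code in kept] for codes in labels]

# tf-idf features for each abstract, one independent linear SVM per label
# (squared hinge slack function, C = 1.0, as in Sec. 4.1).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(abstracts)
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(filtered)
clf = OneVsRestClassifier(LinearSVC(C=1.0, loss="squared_hinge"), n_jobs=-1)
clf.fit(X, Y)

# Shifting the decision hyper-plane (Sec. 4.1) can be emulated by comparing the
# decision function against an offset instead of zero; the sign is an assumption.
scores = clf.decision_function(vectorizer.transform(["nuevo resumen biomedico"]))
predicted = binarizer.classes_[scores[0] &gt;= -0.3]
      </preformat>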
      <p>
        3.2 Priberam Search
The second model consists of a customised search engine, Priberam Search, based
on inverted indexing and retrieval using the Okapi-BM25 algorithm [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. It uses an
additional k-nearest neighbours algorithm (k-NN) to obtain the set of k indexed
articles closest to a query article in feature space. This similarity is based on
the frequency of words, lemmas and root-words, as well as label semantics and
synonyms. A score is given to each one of these articles and to each one of their
labels and label synonyms, and a weighted sum of these scores yields the final
score assigned to each label.
      </p>
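      <p>Priberam Search itself is proprietary; the sketch below only illustrates the general retrieve-then-vote idea using the open-source rank_bm25 package, which is an assumption and omits lemmas, root-words and synonym handling:</p>
      <preformat>
# Illustrative BM25 retrieval followed by weighted k-NN label voting (Sec. 3.2).
# Assumes the rank_bm25 package; this is not the Priberam Search implementation.
from collections import defaultdict
from rank_bm25 import BM25Okapi

def bm25_knn_label_scores(query, indexed_texts, indexed_labels, k=40):
    """Score the labels of the k most similar indexed articles, weighted by BM25 similarity."""
    bm25 = BM25Okapi([doc.split() for doc in indexed_texts])
    sims = bm25.get_scores(query.split())

    # k = 40 gave the best results in the authors' tuning (Sec. 4.2).
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]

    label_scores = defaultdict(float)
    for i in top:
        for label in indexed_labels[i]:
            label_scores[label] += sims[i]

    # Normalising by the top score and thresholding at 0.24 (Sec. 4.2) is an
    # assumption about how the tuned cut-off would be applied here.
    top_score = max(label_scores.values(), default=1.0)
    return {lab: s / top_score for lab, s in label_scores.items() if s / top_score &gt; 0.24}
      </preformat>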
      <p>
        3.3 XML BERT Classifier
Language model pretraining has recently advanced the state of the art in several
Natural Language Processing tasks, with the use of contextualised embeddings
such as BERT, Bidirectional Encoder Representations from Transformers [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
This model consists of 12 stacked transformer blocks and its pretraining is
performed on a very large corpus following two tasks: next sentence prediction
and masked language modelling. The nature of the pretraining tasks makes this
model ideal for representing sentence information (given by the representation
of the [CLS] token added to the beginning of each sentence). After encoding a
sentence with BERT, we apply different classifiers and fine-tune the model to
minimise a multi-label classification loss:
      </p>
      <p>
\[
\mathrm{BCELoss}(x_i, y_i) = -\sum_{j} \Big[ y_{i,j} \log \sigma(x_{i,j}) + (1 - y_{i,j}) \log\big(1 - \sigma(x_{i,j})\big) \Big]
\tag{2}
\]
where $y_{i,j}$ denotes the binary value of label $j$ of article $i$, which is 1 if the label is present
and 0 otherwise, $x_{i,j}$ represents the label prediction (logit) of article $i$ and label
$j$, and $\sigma$ is the sigmoid function.</p>
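      <p>In PyTorch this objective corresponds to BCEWithLogitsLoss applied to the label logits; a minimal sketch with illustrative tensor shapes:</p>
      <preformat>
# Sketch of the multi-label objective in Eq. 2; shapes are illustrative only
# (a batch of 2 articles over the 33702-dimensional DeCS label space).
import torch
import torch.nn as nn

num_labels = 33702
logits = torch.randn(2, num_labels, requires_grad=True)  # x_{i,j}: label logits
targets = torch.zeros(2, num_labels)                      # y_{i,j}: gold label indicators
targets[0, 5] = 1.0
targets[1, 123] = 1.0

# BCEWithLogitsLoss fuses the sigmoid with the binary cross-entropy of Eq. 2.
criterion = nn.BCEWithLogitsLoss()
loss = criterion(logits, targets)
loss.backward()  # in the full model this fine-tunes BERT and the classifier head
      </preformat>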
      <p>3.3.1 In-domain transfer knowledge Additionally, we performed an extra
step of pretraining. Starting from the original weights of BERT
pretrained on Spanish, we further pretrained the model on a masked
language modelling task over the corpus composed of all the articles in the training
set. This extra step results in more meaningful contextualised representations
for this medical corpus, whose domain-specific language might differ from the
original pretraining corpora.</p>
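      <p>A compressed sketch of such continued masked-language-model pretraining with the Transformers library follows; the file name and training hyper-parameters are illustrative, not the values of Table 1:</p>
      <preformat>
# Sketch of continued MLM pretraining on the Mesinesp abstracts; paths and
# hyper-parameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "dccuchile/bert-base-spanish-wwm-cased"  # Spanish BERT (BETO)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Assumes the training abstracts were dumped to a plain-text file, one per line.
dataset = load_dataset("text", data_files={"train": "mesinesp_abstracts.txt"})["train"]
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                      batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="beto-mesinesp-mlm", num_train_epochs=3,
                         per_device_train_batch_size=8)
Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
      </preformat>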
      <p>
        After this, we tested three different classifiers: a linear classifier in §3.3.2, a
linear classifier with label attention in §3.3.3, and a recurrent classifier in §3.3.4.
      </p>
      <p>3.3.2 XML BERT Linear Classifier The first and simplest classifier
consists of a linear layer which maps the sequence output (the 768-dimensional
embedding corresponding to the [CLS] token) to the label space, composed of
33702 dimensions corresponding to all the labels found in the training set. This
architecture is represented in Figure 1. We minimise binary cross-entropy using
sigmoid activations to allow for multiple active labels per instance, see Eq. 2.
This classifier is hereafter designated Linear.</p>
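      <p>A minimal sketch of this Linear head (module and variable names are illustrative):</p>
      <preformat>
# Sketch of the Linear classifier (Sec. 3.3.2): the 768-dimensional [CLS]
# embedding is mapped directly to the 33702-dimensional DeCS label space.
import torch.nn as nn
from transformers import AutoModel

class BertXmlLinear(nn.Module):
    def __init__(self, model_name="dccuchile/bert-base-spanish-wwm-cased",
                 num_labels=33702):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # [CLS] token representation
        return self.classifier(cls)         # logits x_{i,j} used in Eq. 2
      </preformat>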
      <p>3.3.3 XML BERT With Label Attention For the second classifier, we
assume a continuous representation with 768 dimensions for each label. We
initialise label embeddings as the pooled output embeddings (corresponding to
the [CLS] token) of a BERT model whose inputs were the string descriptors
and synonyms for each label. We consider a key-query-value attention
mechanism [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], where the query corresponds to the pooled output of the abstract's
contextualised representation and the keys and values correspond to the label
embeddings. We further consider residual connections, and a final linear layer
maps these results to the decision space of 33702 labels using a linear classifier,
as shown in Figure 2. Once again, we choose a binary cross-entropy loss (Eq. 2).
This classifier is hereafter designated Label attention.</p>
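      <p>The following sketch is one possible reading of this label-attention head (single-head attention with a residual connection); it is not the authors' exact implementation, and the label embeddings are shown as a plain learnable parameter instead of BERT-encoded descriptors:</p>
      <preformat>
# Sketch of a label-attention classifier in the spirit of Sec. 3.3.3.
import torch
import torch.nn as nn

class LabelAttentionHead(nn.Module):
    def __init__(self, hidden_size=768, num_labels=33702):
        super().__init__()
        # In the paper these are initialised from BERT [CLS] encodings of the
        # DeCS descriptors and synonyms; random initialisation is a placeholder.
        self.label_embeddings = nn.Parameter(torch.randn(num_labels, hidden_size))
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, cls_embedding):
        # Query: pooled abstract representation; keys/values: label embeddings.
        scores = cls_embedding @ self.label_embeddings.t()             # (batch, num_labels)
        weights = torch.softmax(scores / cls_embedding.size(-1) ** 0.5, dim=-1)
        attended = weights @ self.label_embeddings                     # (batch, hidden_size)
        fused = attended + cls_embedding                               # residual connection
        return self.classifier(fused)                                  # logits for Eq. 2
      </preformat>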
      <p>3.3.4 XML BERT With Gated Recurrent Unit In the last classifier, we
predict the article's labels sequentially. Before the last linear classifier used to
project the final representation onto the label space, we add a Gated Recurrent
Unit (GRU) network [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] with 768 units that sequentially predicts each label
according to label frequency. A flowchart of the architecture is shown in Figure
3. This sequential prediction is performed until the stopping
label is predicted.
      </p>
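      <p>A rough sketch of such a GRU label decoder follows; feeding the previous label's embedding back into the GRU, greedy decoding and a dedicated stop-label index are assumptions about details not stated in the text:</p>
      <preformat>
# Sketch of a sequential GRU label decoder in the spirit of Sec. 3.3.4: starting
# from the [CLS] embedding, one label is emitted per step until the stop label.
import torch
import torch.nn as nn

class GruLabelDecoder(nn.Module):
    def __init__(self, hidden_size=768, num_labels=33702):
        super().__init__()
        self.stop_label = num_labels            # extra index reserved as the stop label
        self.label_embed = nn.Embedding(num_labels + 1, hidden_size)
        self.gru = nn.GRUCell(hidden_size, hidden_size)
        self.classifier = nn.Linear(hidden_size, num_labels + 1)

    def forward(self, cls_embedding, max_steps=30):
        batch = cls_embedding.size(0)
        hidden = cls_embedding                                   # initial GRU state
        prev = self.label_embed(torch.full((batch,), self.stop_label, dtype=torch.long))
        step_logits = []
        for _ in range(max_steps):
            hidden = self.gru(prev, hidden)
            logits = self.classifier(hidden)                     # one label per step
            step_logits.append(logits)
            prev = self.label_embed(logits.argmax(dim=-1))       # greedy decoding
        return torch.stack(step_logits, dim=1)                   # (batch, steps, labels)
      </preformat>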
      <p>We consider a binary cross-entropy loss with two different approaches. In
the first approach, all labels are sequentially predicted and the loss is computed
only after the stopping label is predicted, i.e., the loss value is independent of
the order in which the labels are predicted: it only takes into account the final
set. This loss is denominated Bag of Labels loss (BOLL) and is given by:
\[
L_{BOLL} = \mathrm{BCELoss}(x_i, y_i)
\tag{3}
\]
where $x_i$ and $y_i$ are the total sets of predicted logits and gold labels for the
current article $i$, respectively. Models trained with this loss are hereafter
designated Gru Boll.</p>
      <p>The second approach uses an iterative loss which is computed at each step
of the sequential prediction of labels: each predicted label is compared with the
gold label, and the resulting loss is added to a running loss value. In this case,
the loss is denominated Iterative Label loss (ILL):
\[
L_{ILL} = \sum_{t \in T} \mathrm{BCELoss}\big(x_i(t), y_i(t)\big)
\tag{4}
\]
where $T$ is the length of the label sequence, $t$ denotes the time-steps taken by
the GRU until the "stop label" is predicted, and $x_i(t)$ and $y_i(t)$ are the predicted
logits and gold labels for time-step $t$ and article $i$, respectively. Models trained
with this loss are hereafter designated Gru Ill.</p>
      <p>Although only one of the losses accounts directly for prediction order, this
factor is always relevant because it affects the final set of predicted labels. This
way, the model must be trained and tested assuming a specific label ordering.
For this work, we used two orders: ascending and descending label frequency on
the training set, designated Gru ascend and Gru descend, respectively.</p>
      <p>Additionally, we developed a masking system to force the sequential
prediction of labels according to the chosen frequency order. This means that at each
step the output label set is reduced to all labels whose frequency falls below or
above that of the previous label, depending on the monotonically ascending or
descending order, respectively. Models in which such masking is used are designated
Gru w/ mask.</p>
      <p>
        3.4 Ensemble
Furthermore, we developed an ensemble model combining the results of the
previously described SVM, Priberam Search and BERT-GRU models. This
ensemble's main goal is to leverage the label scores yielded by these three
individual models in order to make a more informed decision regarding the relevance
of each label to the abstracts.</p>
      <p>
        We chose an ensembling method based on a SVM-rank algorithm [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] whose
features are the normalised scores yielded by the three individual models, as
well as their pairwise product and full product. These scores are the distance to
the hyper-plane in the SVM model, the k-nearest neighbours score for Priberam
Search and the label probability for the BERT model.
      </p>
      <p>An SVM-rank is a variant of the support vector machine algorithm used to
solve ranking problems [19]. It essentially leverages pair-wise ranking methods to
sort and score results based on their relevance for a specific query. This algorithm
optimises a loss analogous to the one shown in Eq. 1. This ensemble is hereafter
designated SVM-rank ensemble.</p>
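      <p>The per-label feature vector fed to SVM-rank can be sketched as below; the text does not specify the normalisation, so min-max scaling is an assumption:</p>
      <preformat>
# Sketch of the SVM-rank ensemble features (Sec. 3.4): normalised base-model
# scores for each candidate label, their pairwise products and the full product.
def minmax(scores):
    """Min-max normalise a {label: score} dictionary (normalisation choice is assumed)."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {label: (s - lo) / span for label, s in scores.items()}

def ensemble_features(svm_scores, search_scores, bert_scores):
    """Build the SVM-rank feature vector for every candidate label of one article."""
    svm, search, bert = minmax(svm_scores), minmax(search_scores), minmax(bert_scores)
    features = {}
    for label in set(svm) | set(search) | set(bert):
        s1, s2, s3 = svm.get(label, 0.0), search.get(label, 0.0), bert.get(label, 0.0)
        features[label] = [s1, s2, s3,                  # individual model scores
                           s1 * s2, s1 * s3, s2 * s3,   # pairwise products
                           s1 * s2 * s3]                # full product
    return features
      </preformat>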
    </sec>
    <sec id="sec-4">
      <title>Experimental Setup</title>
      <p>We consider the training set provided for the Mesinesp competition containing
318658 articles with at least one DeCS code and an average of 8.12 codes per
article. We trained the individual models with 95% of this data. The remaining
5% was used to train the SVM-rank algorithm. The provided smaller official
development set, with 750 samples, was used to fine-tune the individual models'
and ensemble's hyper-parameters, while the test set, with 500 samples, was used
for reporting final results. These two sets were manually annotated by experts
specifically for the MESINESP task.</p>
      <p>
        4.1 Support Vector Machine
For the SVM model we chose to ignore all labels that appeared in fewer than
20 abstracts. With this cutoff, we decrease the output label set size to 9200.
Additionally, we use a linear kernel to reduce computation time and avoid
overfitting, which is critical when training such a large number of classifiers. Regarding
regularisation, we obtained the best performance using a regularisation
parameter set to C = 1.0, and a squared hinge slack function whose penalty over the
misclassified data points is computed with an $\ell_2$ distance.</p>
      <p>Furthermore, to enable more control over the classification boundary, after
solving the optimisation problem we moved the decision hyper-plane along the
direction of $w$. We empirically determined that a distance of 0.3 from its
original position resulted in the best F1 score. This model was implemented
using scikit-learn.²</p>
      <p>
        4.2 Priberam Search
To use the Priberam Search engine, we first indexed the training set taking
into account the abstract text, title, complete set of gold DeCS codes, and also
their corresponding string descriptors along with some provided synonyms.³ We
tuned the number of neighbours k over the values [10, 20, 30, 40, 50, 60, 70, 100, 200] on the
development set for the k-NN algorithm and obtained the best results for k = 40.
To decide whether or not a label should be assigned to an article, we
fine-tuned a score threshold over the interval [0.1, 0.5] using the official development
set, obtaining a best performing value of 0.24. All labels with score above the
threshold were picked as correct labels.</p>
      <p>² scikit-learn.org
³ https://temu.bsc.es/mesinesp/wp-content/uploads/2019/12/DeCS.2019.v5.tsv.zip</p>
      <p>
        4.3 XML BERT Classifier
For all types of BERT classifiers, we used the Transformers and PyTorch Python
packages [20, 21].</p>
      <p>We initialised BERT's weights from its cased version pretrained on Spanish
corpora, bert-base-spanish-wwm-cased.⁴</p>
      <p>We further performed a pretraining on the Mesinesp dataset to obtain better
in-domain embeddings. For the pretraining and classification tasks, Table 1 shows
the training hyper-parameters.</p>
      <p>For all the experiments with BERT, the complete set of DeCS codes was
considered as the label set.</p>
      <p>[Table 1: training hyper-parameters (batch size, learning rate, warmup steps, maximum sequence length, learning rate decay, dropout probability) for the pretraining and classification tasks; the values did not survive extraction.]</p>
      <p>Our ensemble model aggregates the predictions of all the individual contenders
and produces a final predicted label set. To improve recall, we lowered the
thresholds set for each individual model until the value for which the average number
of predicted labels per abstract was approximately double the average number
of gold labels. This ensures that the SVM-rank algorithm was trained with a
balanced set, resulting in a system in which the individual models have very
high recall and the ensemble model is responsible for precision.</p>
      <p>We trained the SVM-rank model with the 5% hold-out data of the training
set. Furthermore, SVM-rank returns a score for each label in each abstract,
making it necessary to define a threshold for classification. This threshold was
fine-tuned over the interval [-0.5, 0.5] using the official Mesinesp development
set, yielding a best performing cut-off score of 0.0233.</p>
      <p>We also fine-tuned the regularisation parameter C, experimenting with the
values C = [0.01, 0.1, 0.5, 1, 5, 10] and obtaining the best performance for C = 0.1.
The current model was implemented using a Python wrapper for the dlib C++
toolkit [22].</p>
      <p>⁴ https://github.com/dccuchile/beto</p>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>Table 2 shows the micro-averaged precision, recall and F1 metrics for the best performing
models described above, evaluated on both the official development and test sets.</p>
      <p>The comparison between the scores obtained for the one-vs.-rest SVM and
Priberam Search models shows that the SVM outperforms the k-NN based
Priberam Search in terms of F1, which is mostly due to its higher recall. Note that,
although not ideal for multi-label problems, the one-vs.-rest strategy for the
SVM model was able to achieve a relatively good performance, even when the
predicted label set was significantly reduced.</p>
      <p>[Table 2: performance of the submitted models on the official development and test sets; only the header and the SVM row label survived extraction.]</p>
      <p>Table 3 shows the performance of several classifiers used with BERT. Note
that, for these models, in order to save time and computational resources, some
tests were stopped before achieving their maximum performance, while still
allowing comparison with the other models.</p>
      <p>We trained linear classifiers using the BERT model with pretraining on the
MESINESP corpus for 660k steps (≈19 epochs) and without such
pretraining (marked with *). Results show that, even with an under-trained classifier,
such pretraining is already advantageous. This pretraining was employed for all
models combining BERT embeddings with a GRU classifier. The label-attentive
BERT model (Label attention) shows a negligible impact on performance when
compared with the simple linear classifier (Linear).</p>
      <p>We consider three design choices for the BERT-GRU model: Bag of
Labels loss (Boll) or Iterative Label loss (Ill), ascending or descending label
frequency, and whether or not masking is used. Taking into account the best score
achieved, the BOLL loss performs better than the ILL loss, even with a smaller
number of training steps. For this BOLL loss, it is also evident that the ordering
of labels with ascending frequency outperforms the opposite order, and that
masking results in decreased performance.</p>
      <p>On the other hand, for the ILL loss, masking improves the achieved score
and the ordering of labels with descending frequency shows better results. The
best classifier for a BERT-based model is the GRU network trained with a Bag
of Labels loss and with labels provided in ascending frequency order (Gru Boll
ascend). This model was further trained for a total of 28 epochs, resulting in
an F1 of 0.4918 on the 5% hold-out of the training set. It is important to note
the performance drop from the 5% hold-out data to the official development set.
This drop is likely a result of the mismatch between the annotation methods
used in the two sets, given that the development set was specifically manually
annotated for this task.</p>
      <p>Surprisingly, the BERT based model shows worse performance than the SVM
on the test set. Despite their very similar F1 scores for the development set,
the BERT-GRU model suffered a considerable performance drop from the
development to the test set due to a decrease in recall. This might indicate some
over-fitting of hyper-parameters and a possible mismatch between these two
expert annotated sets.</p>
      <p>Additionally, as shown in Table 2, the ensemble combining the results
of the SVM, Priberam Search and the best performing BERT-based classifier
achieved the best performance on the development set, outperforming all the
individual models.</p>
      <p>[Table 3: BERT-based classifiers and their number of training steps (the F1 column did not survive extraction): Linear* - 220k; Linear - 250k†; Label attention* - 700k; Gru Boll ascend - 80k; Gru Boll descend - 40k; Gru Boll ascend w/ mask - 100k†; Gru Ill descend - 240k†; Gru Ill descend w/ mask - 240k†; Gru Ill ascend w/ mask - 240k†. Entries marked * use BERT without in-domain pretraining.]</p>
      <p>Finally, Table 4 shows additional classification metrics for each one of the
submitted systems, as well as their rank within the Mesinesp task. The analysis
of these results makes clear that, for the three considered averages (Micro, Macro
and per sample), the SVM model shows the best recall score. For most of the
remaining metrics, the SVM-rank ensemble is able to leverage the capabilities of
the individual models and achieve considerable performance gains, particularly
noticeable for the precision scores.</p>
      <p>[Table 4: micro- (F1, P, R), macro- (MaF1, MaP, MaR) and example-based (EbF1, EbP, EbR) metrics for each submitted system; the numerical values did not survive extraction.]</p>
      <p>This paper introduces three types of extreme multi-label classifiers: an SVM, a
k-NN based search engine and a series of BERT classifiers. Our one-vs.-rest SVM
model shows the best performance on all recall metrics. We further provide an
empirical comparison of different variants of multi-label BERT-based classifiers,
where the Gated Recurrent Unit network with the Bag of Labels loss shows the
most promising results. This model yields slightly better results than the SVM
model on the development set but, due to a drop in recall, under-performs
it on the test set. Finally, the SVM-rank ensemble is able to leverage the label
scores yielded by the three individual models and combine them into a final
ranking model with a precision gain on all metrics, achieving the
highest F1 score (being the 6th best model in the task).</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This work is supported by the Lisbon Regional Operational Programme (Lisboa
2020), under the Portugal 2020 Partnership Agreement, through the European
Regional Development Fund (ERDF), within project TRAINER (No 045347).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Yi</surname>
            <given-names>X</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Allan</surname>
            <given-names>J.</given-names>
          </string-name>
          <article-title>A comparative study of utilizing topic models for information retrieval</article-title>
          .
          <source>European conference on information retrieval</source>
          , pp.
          <fpage>29</fpage>
          -
          <lpage>41</lpage>
          . Springer (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Shen</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            <given-names>HF</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanghavi</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dhillon</surname>
            <given-names>I</given-names>
          </string-name>
          .
          <article-title>Extreme Multi-label Classification from Aggregated Labels</article-title>
          . arXiv preprint arXiv:
          <year>2004</year>
          .
          <volume>00198</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Fan</surname>
            <given-names>RE</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            <given-names>KW</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsieh</surname>
            <given-names>CJ</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            <given-names>XR</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            <given-names>CJ</given-names>
          </string-name>
          .
          <article-title>LIBLINEAR: A library for large linear classification</article-title>
          .
          <source>Journal of machine learning research</source>
          (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Miranda</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nogueira</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mendes</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vlachos</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Secker</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garrett</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitchel</surname>
            <given-names>J</given-names>
          </string-name>
          , Marinho Z.
          <source>Automated Fact Checking in the News Room. In The World Wide Web Conference</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Devlin</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            <given-names>K.</given-names>
          </string-name>
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          .
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Joachims</surname>
            <given-names>T.</given-names>
          </string-name>
          <article-title>Optimizing search engines using clickthrough data. InProceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining (</article-title>
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Garba</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ahmed</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Makama</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Odigie</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <article-title>Proliferations of scienti c medical journals: a burden or a blessing</article-title>
          .
          <source>Oman medical journal</source>
          ,
          <volume>25</volume>
          (
          <issue>4</issue>
          ), p.
          <volume>311</volume>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Zhang</surname>
            <given-names>W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            <given-names>X</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zha</surname>
            <given-names>H</given-names>
          </string-name>
          .
          <article-title>Deep extreme multi-label learning</article-title>
          .
          <source>Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>VHL</given-names>
            <surname>Network</surname>
          </string-name>
          <article-title>Portal</article-title>
          . Red.bvsalud.org.
          <year>2020</year>
          . Decs. [online] Available at: http://red.bvsalud.org/decs/en/about-decs
          <source>/ (Accessed 2 May</source>
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Babbar</surname>
            <given-names>R</given-names>
          </string-name>
          , Scholkopf B.
          <article-title>DiSMEC: Distributed Sparse Machines for Extreme Multi-label Classification</article-title>
          .
          <source>Proceedings of the Tenth ACM International Conference on Web Search and Data Mining</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Lee</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yoon</surname>
            <given-names>W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>So</surname>
            <given-names>CH</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kang</surname>
            <given-names>J.</given-names>
          </string-name>
          <article-title>BioBERT: a pre-trained biomedical language representation model for biomedical text mining</article-title>
          .
          <source>Bioinformatics</source>
          (2020 Feb).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Alsentzer</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murphy</surname>
            <given-names>JR</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boag</surname>
            <given-names>W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weng</surname>
            <given-names>WH</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jin</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naumann</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McDermott</surname>
            <given-names>M</given-names>
          </string-name>
          .
          <article-title>Publicly available clinical BERT embeddings</article-title>
          . arXiv preprint arXiv:
          <year>1904</year>
          .
          <volume>03323</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Tai</surname>
            <given-names>F</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            <given-names>HT</given-names>
          </string-name>
          .
          <article-title>Multilabel classi cation with principal label space transformation</article-title>
          .
          <source>Neural Computation</source>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Prabhu</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varma M. Fastxml</surname>
          </string-name>
          :
          <article-title>A fast, accurate and stable tree-classifier for extreme multi-label learning</article-title>
          .
          <source>Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Agrawal</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prabhu</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varma</surname>
            <given-names>M.</given-names>
          </string-name>
          <article-title>Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages</article-title>
          .
          <source>Proceedings of the 22nd international conference on World Wide Web</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Verma</surname>
            <given-names>Y.</given-names>
          </string-name>
          <article-title>An Embarrassingly Simple Baseline for eXtreme Multi-label Prediction</article-title>
          . arXiv preprint arXiv:
          <year>1912</year>
          .
          <volume>08140</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Vaswani</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shazeer</surname>
            <given-names>N</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parmar</surname>
            <given-names>N</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uszkoreit</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez</surname>
            <given-names>AN</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaiser</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polosukhin</surname>
            <given-names>I</given-names>
          </string-name>
          .
          <article-title>Attention is all you need</article-title>
          .
          <source>Advances in neural information processing systems</source>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Cho</surname>
            <given-names>K</given-names>
          </string-name>
          , van Merrienboer
          <string-name>
            <given-names>B</given-names>
            ,
            <surname>Gulcehre</surname>
          </string-name>
          <string-name>
            <given-names>C</given-names>
            ,
            <surname>Bahdanau</surname>
          </string-name>
          <string-name>
            <given-names>D</given-names>
            ,
            <surname>Bougares</surname>
          </string-name>
          <string-name>
            <given-names>F</given-names>
            ,
            <surname>Schwenk</surname>
          </string-name>
          <string-name>
            <given-names>H</given-names>
            ,
            <surname>Bengio</surname>
          </string-name>
          <string-name>
            <surname>Y</surname>
          </string-name>
          .
          <article-title>Learning Phrase Representations using RNN Encoder{Decoder for Statistical Machine Translation</article-title>
          .
          <source>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>19. Liu TY. Learning to rank for information retrieval. Springer Science &amp; Business Media (2011).</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>20. Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, Brew J. HuggingFace's Transformers: State-of-the-art Natural Language Processing. ArXiv (2019).</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>21. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. Advances in Neural Information Processing Systems 32, pp. 8024-8035 (2019).</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>22. King DE. Dlib-ml: A machine learning toolkit. The Journal of Machine Learning Research (2009).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>