=Paper=
{{Paper
|id=Vol-2696/paper_90
|storemode=property
|title=Priberam at MESINESP Multi-label Classification of Medical Texts Task
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_90.pdf
|volume=Vol-2696
|authors=Rúben Cardoso,Zita Marinho,Afonso Mendes,Sebastião Miranda
|dblpUrl=https://dblp.org/rec/conf/clef/CardosoMMM20
}}
==Priberam at MESINESP Multi-label Classification of Medical Texts Task==
Priberam at MESINESP Multi-label Classification of Medical Texts Task

Rúben Cardoso, Zita Marinho, Afonso Mendes, and Sebastião Miranda
Priberam Labs, Lisbon, Portugal
labs.priberam.com
{rac,zam,amm,ssm}@priberam.com

Abstract. Medical articles are a crucial tool to provide current state-of-the-art treatments and diagnostics to medical professionals. However, existing public databases such as MEDLINE contain over 27 million articles, making efficient search engines crucial to navigate them and provide meaningful recommendations. Classifying these articles into broader medical topics can improve the retrieval of related articles [1]. The set of medical labels considered for the MESINESP task is on the order of several thousand labels (DeCS codes), which places it in the extreme multi-label classification setting [2]. The heterogeneous and highly hierarchical structure of medical topics makes the task of manually classifying articles extremely laborious and costly. It is, therefore, crucial to automate the process of classification. Typical machine learning algorithms become computationally demanding with such a large label set, and achieving good recall remains an open problem. This work presents Priberam's participation at the BioASQ task Mesinesp. We address the large multi-label classification problem through the use of four different models: a Support Vector Machine (SVM) [3], the customised search engine Priberam Search [4], a BERT based classifier [5], and an SVM-rank ensemble [6] of all the previous models. Results show that all three individual models perform well and the best performance is achieved by their ensemble, granting Priberam the 6th place in the present challenge and making it the 2nd best team.

1 Introduction

A growing number of medical articles is published every year, with a current estimated rate of at least one new article every 26 seconds [7]. The large number of both documents and assigned topics renders automatic classification algorithms a necessity for organising and providing relevant information. Search engines have a vital role in easing the burden of accessing this information efficiently; however, they usually rely on the manual indexing or tagging of articles, which is a slow and burdensome process [8].

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

The Mesinesp task consists of automatically indexing abstracts in Spanish from two well-known medical databases, IBECS and LILACS, with tags from a pool of 34118 hierarchically structured medical terms, the DeCS codes. This trilingual vocabulary (English, Portuguese and Spanish) serves as a unique vocabulary for indexing medical articles. It follows a tree structure that divides the codes into broader classes and more refined sub-classes respecting their conceptual and semantic relationships [9]. In this task, we tackle the extreme multi-label (XML) classification problem.
Our goal is to predict, for a given article, the most relevant subset of labels from an extremely large label set (on the order of tens of thousands) using supervised training.¹ Typical multi-label classification techniques are not suitable for the XML setting due to its large computational requirements: the large number of labels implies that both label and feature vectors are sparse and live in high-dimensional spaces, and addressing the sparsity of label occurrence requires a large number of training instances. These factors make the application of such techniques highly demanding in terms of time and memory, increasing the requirements on computational resources.

The Mesinesp task is even more challenging for two reasons: first, the articles' labels must be predicted only from the abstracts and titles; and second, all the articles to be classified are in Spanish, which prevents the use of additional resources available only for English, such as BioBERT [11] and ClinicalBERT [12].

This paper describes our participation at the BioASQ task Mesinesp. We explore the performance of a one-vs.-rest model based on Support Vector Machines (SVM) [3] as well as that of a proprietary search engine, Priberam Search [4], which relies on inverted indexes combined with a k-nearest neighbours classifier. Furthermore, we take advantage of BERT's contextualised embeddings [5] and test three possible classifiers: a linear classifier; a label attention mechanism that leverages label semantics; and a recurrent model that predicts a sequence of labels according to their frequency. We propose the following contributions:

– Application of BERT's contextualised embeddings to the task of XML classification, including the exploration of linear, attention based and recurrent classifiers. To the best of our knowledge, this work is the first to apply a pretrained BERT model combined with a recurrent network to the XML classification task.
– Empirical comparison of a simple one-vs.-rest SVM approach with a more complex model combining a recurrent classifier and BERT embeddings.
– An ensemble of the previous individual methods using SVM-rank, which was capable of outperforming them.

¹ The task of multi-label classification differs from multi-class classification in that labels are not exclusive, which enables the assignment of several labels to the same article, making the problem even harder [10].

2 Related Work

Currently, there are two main approaches to XML: embedding based methods and tree based methods.

Embedding based methods deal with the problem of high-dimensional feature and label vectors by projecting them onto a lower-dimensional space [8,13]. During prediction, the compressed representation is projected back onto the space of high-dimensional labels. This information bottleneck can often reduce noise and provides a way of regularising the problem. Although very efficient and fast, this approach assumes that the low-dimensional space is capable of encoding most of the original information. For real-world problems, this assumption is often too restrictive and may result in decreased performance.

Tree based approaches learn a hierarchy of features or labels from the training set [14,15]. Typically, a root node is initialised with the complete set of labels and its children nodes are recursively partitioned until all the leaf nodes contain a small number of labels.
During prediction, each article is passed along the tree, and the path towards its final leaf node defines the predicted set of labels. These methods tend to be slower than embedding based methods but achieve better performance. However, if a partitioning error is made near the top of the tree, its consequences are propagated to the lower levels.

Other methods deserve mention because their simple approaches achieve competitive results. Among these, DiSMEC [10] should be highlighted: it follows a one-vs.-rest approach that simply learns a weight vector for each label, and multiplying this weight vector with a data point's feature vector yields a score from which the label is classified. Another simple approach performs a set of random projections from the feature space to a lower-dimensional space where, for each test data point, a k-nearest neighbours algorithm performs a weighted propagation of the neighbours' labels, based on their similarity [16].

We propose two approaches that are substantially distinct from the ones discussed above. The first uses a search engine based on inverted indexing and the second leverages BERT's contextualised embeddings combined with either a linear or a recurrent layer.

3 XML Classification Models

We explore the performance of a one-vs.-rest SVM model in §3.1 and a customised search engine (Priberam Search) in §3.2. We further experiment with several classifiers leveraging BERT's contextualised embeddings in §3.3. Finally, we aggregate the predictions of all of these individual models using an SVM-rank algorithm in §3.4.

3.1 Support Vector Machine

Our first baseline consists of a simple Support Vector Machine (SVM) using a one-vs.-rest strategy. We train an independent SVM classifier for each possible label. To reduce the computational burden we only consider labels with frequency above a given threshold f_min. Each classifier weight w ∈ R^d measures the importance assigned to each feature of a given article's representation x_i ∈ R^d and is trained to optimise the max-margin loss between the support vectors and the decision hyper-plane [3]:

\min_{w} \; \frac{1}{2} w^\top w + C \sum_{i=1}^{l} \xi(w; x_i, y_i) \quad \text{s.t.} \quad y_i \left( w^\top x_i + b \right) \geq 1 - \xi_i,    (1)

where (x_i, y_i) are the article-label pairs, C is the regularisation parameter, b is a bias term, ξ corresponds to a slack function used to penalise incorrectly classified points, and w is the vector normal to the decision hyper-plane. We used the abstract's term frequency-inverse document frequency (tf-idf) features to represent x_i.
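To make this baseline concrete, the minimal scikit-learn sketch below builds tf-idf features, drops rare labels and trains one linear SVM per label with a shifted decision threshold. The toy abstracts, the f_min value used here and all variable names are illustrative; the setting actually used (f_min = 20, C = 1.0, squared hinge loss, offset −0.3) is described in §4.1.

```python
# Minimal sketch of the one-vs.-rest tf-idf SVM baseline (§3.1), assuming the
# training data is available as lists of abstract strings and DeCS code lists.
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

abstracts = ["texto del resumen uno", "texto del resumen dos", "texto del resumen tres"]
labels = [["D001", "D002"], ["D002", "D003"], ["D001", "D003"]]  # toy DeCS codes

# Keep only labels seen at least f_min times (the paper uses f_min = 20).
f_min = 1
counts = Counter(code for codes in labels for code in codes)
kept = sorted(c for c, n in counts.items() if n >= f_min)

binarizer = MultiLabelBinarizer(classes=kept)
Y = binarizer.fit_transform(labels)           # binary indicator matrix

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(abstracts)       # tf-idf features

# One independent linear SVM per label; squared hinge loss and C = 1.0 as in §4.1.
clf = OneVsRestClassifier(LinearSVC(C=1.0, loss="squared_hinge"))
clf.fit(X, Y)

# Shifting the decision threshold below 0 mimics moving the hyper-plane along w
# (§4.1 reports -0.3 as the best offset on the development set).
scores = clf.decision_function(vectorizer.transform(["nuevo resumen de prueba"]))
predicted = [binarizer.classes_[j] for j in (scores[0] > -0.3).nonzero()[0]]
print(predicted)
```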
3.2 Priberam Search

The second model consists of a customised search engine, Priberam Search, based on inverted indexing and retrieval using the Okapi BM25 algorithm [4]. It uses an additional k-nearest neighbours algorithm (k-NN) to obtain the set of k indexed articles closest to a query article in feature space. This similarity is based on the frequency of words, lemmas and root-words, as well as label semantics and synonyms. A score is given to each of these articles, and to each of their labels and label synonyms, and a weighted sum of these scores yields the final score assigned to each label.

3.3 XML BERT Classifier

Language model pretraining has recently advanced the state of the art in several Natural Language Processing tasks, through contextualised embeddings such as BERT, Bidirectional Encoder Representations from Transformers [5]. This model consists of 12 stacked transformer blocks and its pretraining is performed on a very large corpus with two tasks: next sentence prediction and masked language modelling. The nature of the pretraining tasks makes this model well suited for representing sentence information (given by the representation of the [CLS] token added to the beginning of each sentence). After encoding a sentence with BERT, we apply different classifiers and fine-tune the model to minimise a multi-label classification loss:

BCELoss(x_i; y_i) = - \sum_{j} \left[ y_{i,j} \log \sigma(x_{i,j}) + (1 - y_{i,j}) \log(1 - \sigma(x_{i,j})) \right],    (2)

where y_{i,j} denotes the binary value of label j of article i, which is 1 if the label is present and 0 otherwise, x_{i,j} is the predicted logit for article i and label j, and σ is the sigmoid function.

3.3.1 In-domain transfer knowledge

Additionally, we performed an extra step of pretraining. Starting from the original weights of BERT pretrained in Spanish, we further pretrained the model on a masked language modelling task over the corpus composed of all the articles in the training set. This extra step results in more meaningful contextualised representations for this medical corpus, whose domain-specific language may differ from the original pretraining corpora. After this, we tested three different classifiers: a linear classifier in §3.3.2, a linear classifier with label attention in §3.3.3 and a recurrent classifier in §3.3.4.

3.3.2 XML BERT Linear Classifier

The first and simplest classifier consists of a linear layer that maps the sequence output (the 768-dimensional embedding corresponding to the [CLS] token) to the label space of 33702 dimensions, corresponding to all the labels found in the training set. This architecture is represented in figure 1. We minimise the binary cross-entropy of Eq. 2, using sigmoid activations to allow for multiple active labels per instance. This classifier is hereafter designated Linear.

Fig. 1: XML BERT Linear Classifier: flowchart representing BERT's pooled output (in blue) and the simple linear layer (W, in green) used as XML classifier.

3.3.3 XML BERT With Label Attention

For the second classifier, we assume a continuous 768-dimensional representation for each label. We initialise label embeddings as the pooled output embeddings (corresponding to the [CLS] token) of a BERT model whose inputs were the string descriptors and synonyms of each label. We consider a key-query-value attention mechanism [17], where the query corresponds to the pooled output of the abstract's contextualised representation and the keys and values correspond to the label embeddings. We further consider residual connections, and a final linear layer maps the result to the decision space of 33702 labels, as shown in figure 2. Once again, we use the binary cross-entropy loss (Eq. 2). This classifier is hereafter designated Label attention.

Fig. 2: XML BERT with Label Attention Classifier: the article's pooled output (blue) is followed by an extra step of attention over the label embeddings (red), which are finally mapped to an XML linear classifier over labels (green).
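A minimal PyTorch/Transformers sketch of the Linear classifier and the loss of Eq. 2 is shown below. The Spanish checkpoint id and the toy batch are assumptions for illustration (the checkpoint and hyper-parameters actually used are listed in §4.3); the Label attention and GRU variants described next would replace the single linear head.

```python
# Minimal sketch of the Linear classifier of §3.3.2: BERT's pooled [CLS] output
# followed by one linear layer over all DeCS labels, trained with the binary
# cross-entropy of Eq. 2. Checkpoint id and the tiny batch are illustrative.
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModel

NUM_LABELS = 33702  # labels observed in the training set

class BertXmlLinear(nn.Module):
    def __init__(self, checkpoint="dccuchile/bert-base-spanish-wwm-cased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(checkpoint)
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.bert.config.hidden_size, NUM_LABELS)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.pooler_output                    # [CLS]-based sentence representation
        return self.classifier(self.dropout(pooled))  # one logit per label

tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
model = BertXmlLinear()
loss_fn = nn.BCEWithLogitsLoss()                      # sigmoid + BCE, as in Eq. 2

batch = tokenizer(["resumen de ejemplo"], truncation=True, max_length=512,
                  padding=True, return_tensors="pt")
targets = torch.zeros(1, NUM_LABELS)                  # multi-hot gold label vector
targets[0, [10, 42]] = 1.0                            # toy article with two DeCS codes

logits = model(batch["input_ids"], batch["attention_mask"])
loss = loss_fn(logits, targets)
loss.backward()                                       # fine-tunes BERT and the linear head
```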
3.3.4 XML BERT With Gated Recurrent Unit

In the last classifier, we predict the article's labels sequentially. Before the final linear classifier used to project the representation onto the label space, we add a Gated Recurrent Unit (GRU) network [18] with 768 units that sequentially predicts each label according to label frequency. A flowchart of the architecture is shown in figure 3. This sequential prediction is performed until the stopping label is predicted.

Fig. 3: XML BERT GRU Classifier: the GRU network precedes the linear layer and sequentially predicts the labels. The symbol ++ stands for vector concatenation and l_t is the label representation predicted by the GRU at time-step t.

We consider a binary cross-entropy loss with two different approaches. In the first approach, all labels are sequentially predicted and the loss is computed only after the stopping label is predicted, i.e., the loss value is independent of the order in which the labels are predicted; it only takes into account the final set. This loss is denominated Bag of Labels loss (BOLL) and is given by:

L_{BOLL} = BCELoss(x_i; y_i),    (3)

where x_i and y_i are, respectively, the full set of predicted logits and gold labels for the current article i. The models trained with this loss are hereafter designated Gru Boll.

The second approach uses an iterative loss which is computed at each step of the sequential prediction of labels: each predicted label is compared with the corresponding gold label, the loss is computed and added to a running loss value. In this case, the loss is denominated Iterative Label loss (ILL):

L_{ILL} = \sum_{t \in T} BCELoss(x_i^{(t)}; y_i^{(t)}),    (4)

where T is the set of time-steps taken by the GRU until the "stop label" is predicted, and x_i^{(t)} and y_i^{(t)} are the predicted logits and gold labels for time-step t and article i, respectively. Models trained with this loss are hereafter designated Gru Ill.

Although only one of the losses accounts directly for prediction order, this factor is always relevant because it affects the final set of predicted labels. Therefore, the model must be trained and tested assuming a specific label ordering. For this work, we used two orders, ascending and descending label frequency on the training set, designated Gru ascend and Gru descend, respectively.

Additionally, we developed a masking system to force the sequential prediction of labels according to the chosen frequency order. This means that at each step the output label set is reduced to the labels whose frequency falls below or above that of the previously predicted label, for the monotonically ascending or descending order, respectively. Models in which such masking is used are designated Gru w/ mask.

3.4 Ensemble

Furthermore, we developed an ensemble model combining the results of the previously described SVM, Priberam Search and BERT with GRU models. This ensemble's main goal is to leverage the label scores yielded by these three individual models in order to make a more informed decision regarding the relevance of each label to an abstract.

We chose an ensembling method based on an SVM-rank algorithm [6] whose features are the normalised scores yielded by the three individual models, as well as their pairwise products and full product. These scores are the distance to the hyper-plane for the SVM model, the k-nearest neighbours score for Priberam Search, and the label probability for the BERT model. SVM-rank is a variant of the support vector machine algorithm used to solve ranking problems [19]. It essentially leverages pair-wise ranking methods to sort and score results based on their relevance for a specific query, optimising a loss analogous to the one shown in Eq. 1. This ensemble is hereafter designated SVM-rank ensemble.
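A minimal sketch of how the ensemble features and the SVM-rank model could be assembled with the dlib Python bindings (the wrapper mentioned in §4.4) follows. The score values and the relevant/non-relevant split are toy examples; only the feature recipe (three normalised scores, their pairwise products and full product) and the C and cut-off values reported in §4.4 follow the paper.

```python
# Sketch of the §3.4 feature construction and SVM-rank training with dlib.
# The three scores per (article, label) pair below are toy values.
import dlib

def ensemble_features(svm_score, search_score, bert_prob):
    """Normalised model scores plus their pairwise products and full product."""
    return dlib.vector([
        svm_score, search_score, bert_prob,
        svm_score * search_score, svm_score * bert_prob, search_score * bert_prob,
        svm_score * search_score * bert_prob,
    ])

# One ranking query per article: gold labels go to `relevant`, the rest to `nonrelevant`.
query = dlib.ranking_pair()
query.relevant.append(ensemble_features(0.8, 0.6, 0.9))      # a gold DeCS label
query.nonrelevant.append(ensemble_features(0.3, 0.1, 0.2))   # a spurious candidate

trainer = dlib.svm_rank_trainer()
trainer.c = 0.1                      # regularisation value reported in §4.4
ranker = trainer.train(query)

# At prediction time, labels scoring above a tuned cut-off (-0.0233 in §4.4) are kept.
score = ranker(ensemble_features(0.7, 0.5, 0.8))
print(score > -0.0233)
```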
4 Experimental Setup

We consider the training set provided for the Mesinesp competition, containing 318658 articles with at least one DeCS code and an average of 8.12 codes per article. We trained the individual models with 95% of this data; the remaining 5% were used to train the SVM-rank algorithm. The provided smaller official development set, with 750 samples, was used to fine-tune the individual models' and ensemble's hyper-parameters, while the test set, with 500 samples, was used for reporting final results. These two sets were manually annotated by experts specifically for the MESINESP task.

4.1 Support Vector Machine

For the SVM model we chose to ignore all labels that appeared in fewer than 20 abstracts. With this cutoff, we decrease the output label set size to ≈ 9200. Additionally, we use a linear kernel to reduce computation time and avoid overfitting, which is critical when training such a large number of classifiers. Regarding regularisation, we obtained the best performance with the regularisation parameter set to C = 1.0 and a squared hinge slack function, whose penalty over misclassified data points is computed with an ℓ2 distance.

Furthermore, to enable more control over the classification boundary, after solving the optimisation problem we moved the decision hyper-plane along the direction of w. We empirically determined that a distance of −0.3 from its original position resulted in the best µF1 score. This model was implemented using scikit-learn.²

4.2 Priberam Search

To use the Priberam Search engine, we first indexed the training set taking into account the abstract text, title, complete set of gold DeCS codes, and also their corresponding string descriptors along with some provided synonyms.³ We tuned the number of neighbours k ∈ {10, 20, 30, 40, 50, 60, 70, 100, 200} for the k-NN algorithm on the development set and obtained the best results for k = 40. To decide whether or not a label should be assigned to an article, we fine-tuned a score threshold over the interval [0.1, 0.5] using the official development set, obtaining a best performing value of 0.24. All labels with score above the threshold were picked as correct labels.

² scikit-learn.org
³ https://temu.bsc.es/mesinesp/wp-content/uploads/2019/12/DeCS.2019.v5.tsv.zip

4.3 BERT

For all types of BERT classifiers, we used the Transformers and PyTorch Python packages [20, 21]. We initialised BERT's weights from its cased version pretrained on Spanish corpora, bert-base-spanish-wwm-cased.⁴ We further performed a pretraining on the Mesinesp dataset to obtain better in-domain embeddings. Table 1 shows the training hyper-parameters used for the pretraining and classification tasks. For all the experiments with BERT, the complete set of DeCS codes was considered as the label set.

⁴ https://github.com/dccuchile/beto

Hyper-parameter       Pretraining   Classification
Batch size            4             8
Learning rate         5·10⁻⁵        2·10⁻⁵
Warmup steps          0             4000
Max seq length        512           512
Learning rate decay   -             linear
Dropout probability   0.1           0.1

Table 1: Training hyper-parameters used for BERT's pretraining and classification tasks.
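The in-domain masked language modelling step of §3.3.1, with the Table 1 hyper-parameters, could be reproduced roughly as in the sketch below using the Transformers Trainer; the checkpoint id, the toy in-memory dataset, the number of epochs and the output directory are assumptions, not the authors' exact pipeline.

```python
# Rough sketch of the extra masked-language-model pretraining of §3.3.1 with the
# Table 1 hyper-parameters. Dataset handling and paths are illustrative.
import torch
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

checkpoint = "dccuchile/bert-base-spanish-wwm-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

class AbstractDataset(torch.utils.data.Dataset):
    """Tokenised Mesinesp abstracts (here a toy in-memory list)."""
    def __init__(self, texts):
        self.encodings = tokenizer(texts, truncation=True, max_length=512)
    def __len__(self):
        return len(self.encodings["input_ids"])
    def __getitem__(self, idx):
        return {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}

train_dataset = AbstractDataset(["primer resumen de ejemplo", "segundo resumen de ejemplo"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)
args = TrainingArguments(output_dir="mesinesp-mlm",      # illustrative path
                         per_device_train_batch_size=4,  # Table 1: batch size 4
                         learning_rate=5e-5,             # Table 1: learning rate
                         warmup_steps=0,
                         num_train_epochs=1)             # illustrative; real pretraining is much longer

trainer = Trainer(model=model, args=args, data_collator=collator,
                  train_dataset=train_dataset)
trainer.train()
```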
4.4 Ensemble

Our ensemble model aggregates the predictions of all the individual contenders and produces a final predicted label set. To improve recall, we lowered the thresholds of each individual model until the average number of predicted labels per abstract was approximately double the average number of gold labels. This ensures that the SVM-rank algorithm was trained with a balanced set, resulting in a system in which the individual models have very high recall and the ensemble model is responsible for precision. We trained the SVM-rank model with the 5% hold-out data of the training set.

Furthermore, SVM-rank returns a score for each label of each abstract, making it necessary to define a threshold for classification. This threshold was fine-tuned over the interval [−0.5, 0.5] using the official Mesinesp development set, yielding a best performing cut-off score of −0.0233. We also fine-tuned the regularisation parameter C over the values {0.01, 0.1, 0.5, 1, 5, 10}, obtaining the best performance for C = 0.1. This model was implemented using a Python wrapper for the dlib C++ toolkit [22].

5 Results

Table 2 shows the µ-precision, µ-recall and µ-F1 metrics for the best performing models described above, evaluated on both the official development and test sets. The comparison between the scores obtained for the one-vs.-rest SVM and Priberam Search models shows that the SVM outperforms the k-NN based Priberam Search in terms of µF1, mostly due to its higher recall. Note that, although not ideal for multi-label problems, the one-vs.-rest strategy for the SVM model was able to achieve a relatively good performance, even when the predicted label set was significantly reduced.

                       Development set             Test set
Model                  µP      µR      µF1         µP      µR      µF1
SVM                    0.4216  0.3740  0.3964      0.4183  0.3789  0.3976
Priberam Search        0.4471  0.3017  0.3603      0.4571  0.2700  0.3395
Bert-Gru Boll ascend   0.4130  0.3823  0.3971      0.4293  0.3314  0.3740
SVM-rank ensemble      0.5056  0.3456  0.4105      0.5336  0.3320  0.4093

Table 2: Micro precision (µP), micro recall (µR) and micro F1 (µF1) obtained with the 4 submitted models for both the development and test sets. For each metric, the best performing model is identified in bold.
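For reference, the micro-averaged metrics reported in Table 2 pool true positives, false positives and false negatives over every (article, label) pair before averaging; a minimal scikit-learn sketch with toy indicator matrices (not the actual predictions) is:

```python
# Micro-averaged precision/recall/F1 over multi-hot label matrices, as in Table 2.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

y_true = np.array([[1, 0, 1, 0],     # gold DeCS codes, one row per article
                   [0, 1, 1, 0]])
y_pred = np.array([[1, 0, 0, 0],     # predicted codes after thresholding
                   [0, 1, 1, 1]])

# 'micro' pools TP/FP/FN across every (article, label) cell before averaging.
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="micro")
print(f"µP={p:.4f} µR={r:.4f} µF1={f1:.4f}")
```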
Table 3 shows the performance of several classifiers used with BERT. Note that, for these models, some runs were stopped before reaching their maximum performance in order to save time and computational resources, while still allowing a comparison with the other models.

We trained linear classifiers using the BERT model with pretraining on the MESINESP corpus for 660k steps (≈ 19 epochs) and without such pretraining (marked with *). Results show that, even with an under-trained classifier, this pretraining is already advantageous, and it was therefore employed for all models combining BERT embeddings with a GRU classifier. The label-attentive BERT model (Label attention) shows negligible impact on performance when compared with the simple linear classifier (Linear).

We consider three varying choices in the Bert-Gru architecture: Bag of Labels loss (BOLL) or Iterative Label loss (ILL), ascending or descending label frequency, and the use or not of masking. Taking into account the best score achieved, the BOLL loss performs better than the ILL loss, even with a smaller number of training steps. For the BOLL loss, the ordering of labels by ascending frequency outperforms the opposite order, and masking results in decreased performance. For the ILL loss, on the other hand, masking improves the achieved score and the ordering of labels by descending frequency shows better results. The best BERT-based classifier is the GRU network trained with the Bag of Labels loss and with labels provided in ascending frequency order (Gru Boll ascend). This model was further trained for a total of 28 epochs, resulting in a µF1 = 0.4918 on the 5% hold-out of the training set.

It is important to note the performance drop from the 5% hold-out data to the official development set. This drop is likely a result of the mismatch between the annotation methods used in the two sets, given that the development set was manually annotated specifically for this task. Surprisingly, the BERT based model shows worse performance than the SVM on the test set. Despite their very similar µF1 scores on the development set, the BERT-GRU model suffered a considerable performance drop from the development to the test set due to a decrease in recall. This might indicate some over-fitting of hyper-parameters and a possible mismatch between these two expert-annotated sets. Additionally, as made explicit in table 2, the ensemble combining the results of the SVM, Priberam Search and the best performing BERT based classifier achieved the best performance on the development set, outperforming all the individual models.

BERT classifier              Training steps   µF1
Linear*                      220k             0.4476
† Linear                     250k             0.4504
Label attention*             700k             0.4460
Gru Boll ascend              80k              0.4759
Gru Boll descend             40k              0.4655
† Gru Boll ascend w/ mask    100k             0.4352
† Gru Ill descend            240k             0.4258
† Gru Ill descend w/ mask    240k             0.4526
† Gru Ill ascend w/ mask     240k             0.4459

Table 3: µF1 metric evaluated on the 5% hold-out of the training set. All models have been pretrained on the Mesinesp corpus, except for those duly marked. BOLL: Bag of Labels loss. ILL: Iterative Label loss. *: not pretrained on the Mesinesp corpus. †: training stopped before the maximum µF1 was reached.

Finally, table 4 shows additional classification metrics for each of the submitted systems, as well as their rank within the Mesinesp task. These results make clear that, for the three considered averages (micro, macro and per sample), the SVM model shows the best recall score. For most of the remaining metrics, the SVM-rank ensemble is able to leverage the capabilities of the individual models and achieve considerable performance gains, particularly noticeable for the precision scores.

Metric   SVM             Priberam Search   BERT-GRU Boll ascend   SVM-rank ensemble
µF1      0.3976 (7th)    0.3395 (13th)     0.3740 (9th)           0.4093 (6th)
µP       0.4183 (17th)   0.4571 (10th)     0.4293 (15th)          0.5336 (6th)
µR       0.3789 (6th)    0.2700 (13th)     0.3314 (8th)           0.3320 (7th)
MaF1     0.4183 (8th)    0.1776 (13th)     0.2009 (11th)          0.2115 (10th)
MaP      0.4602 (9th)    0.4971 (8th)      0.4277 (11th)          0.5944 (3rd)
MaR      0.2609 (8th)    0.1742 (16th)     0.2002 (11th)          0.2024 (10th)
EbF1     0.3976 (7th)    0.3393 (13th)     0.3678 (9th)           0.4031 (6th)
EbP      0.4451 (15th)   0.4582 (12th)     0.4477 (14th)          0.5465 (3rd)
EbR      0.3904 (6th)    0.2824 (13th)     0.3463 (8th)           0.3452 (8th)

Table 4: Micro (µ), macro (Ma) and per sample (Eb) averages of the precision, recall and F1 scores, followed by the rank of each score within the Mesinesp task. For each metric, the best performing model is identified in bold.

6 Conclusions

This paper introduces three types of extreme multi-label classifiers: an SVM, a k-NN based search engine and a series of BERT classifiers. Our one-vs.-rest SVM model shows the best performance on all recall metrics. We further provide an empirical comparison of different variants of multi-label BERT based classifiers, among which the Gated Recurrent Unit network with the Bag of Labels loss shows the most promising results.
This model yields slightly better results than the SVM model on the development set; however, due to a drop in recall, it under-performs the SVM on the test set. Finally, the SVM-rank ensemble is able to leverage the label scores yielded by the three individual models and combine them into a final ranking model with a precision gain on all metrics, achieving the highest µF1 score (the 6th best model in the task).

7 Acknowledgements

This work is supported by the Lisbon Regional Operational Programme (Lisboa 2020), under the Portugal 2020 Partnership Agreement, through the European Regional Development Fund (ERDF), within project TRAINER (No. 045347).

References

1. Yi X, Allan J. A comparative study of utilizing topic models for information retrieval. European Conference on Information Retrieval, pp. 29-41. Springer (2009).
2. Shen Y, Yu HF, Sanghavi S, Dhillon I. Extreme Multi-label Classification from Aggregated Labels. arXiv preprint arXiv:2004.00198 (2020).
3. Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research (2008).
4. Miranda S, Nogueira D, Mendes A, Vlachos A, Secker A, Garrett R, Mitchel J, Marinho Z. Automated Fact Checking in the News Room. In The World Wide Web Conference (2019).
5. Devlin J, Chang M, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (2019).
6. Joachims T. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2002).
7. Garba S, Ahmed A, Mai A, Makama G, Odigie V. Proliferations of scientific medical journals: a burden or a blessing. Oman Medical Journal, 25(4), p. 311 (2010).
8. Zhang W, Yan J, Wang X, Zha H. Deep extreme multi-label learning. Proceedings of the 2018 ACM International Conference on Multimedia Retrieval (2018).
9. VHL Network Portal. DeCS. [online] Available at: http://red.bvsalud.org/decs/en/about-decs/ (Accessed 2 May 2020).
10. Babbar R, Schölkopf B. DiSMEC: Distributed Sparse Machines for Extreme Multi-label Classification. Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (2017).
11. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics (2020).
12. Alsentzer E, Murphy JR, Boag W, Weng WH, Jin D, Naumann T, McDermott M. Publicly available clinical BERT embeddings. arXiv preprint arXiv:1904.03323 (2019).
13. Tai F, Lin HT. Multilabel classification with principal label space transformation. Neural Computation (2012).
14. Prabhu Y, Varma M. FastXML: A fast, accurate and stable tree-classifier for extreme multi-label learning. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2014).
15. Agrawal R, Gupta A, Prabhu Y, Varma M. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. Proceedings of the 22nd International Conference on World Wide Web (2013).
16. Verma Y. An Embarrassingly Simple Baseline for eXtreme Multi-label Prediction. arXiv preprint arXiv:1912.08140 (2019).
17. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I. Attention is all you need. Advances in Neural Information Processing Systems, pp. 5998-6008 (2017).
18. Cho K, van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (2014).
19. Liu TY. Learning to rank for information retrieval. Springer Science & Business Media (2011).
20. Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, Brew J. HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv (2019).
21. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. Advances in Neural Information Processing Systems 32, pp. 8024-8035 (2019).
22. King DE. Dlib-ml: A machine learning toolkit. The Journal of Machine Learning Research (2009).