<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Priberam at MESINESP Multi-label Classification of Medical Texts Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ruben Cardoso</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zita Marinho</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Afonso Mendes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastião Miranda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Priberam Labs</institution>
          ,
          <addr-line>Lisbon, Portugal labs.priberam.com</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Medical articles are a crucial tool to provide current state-of-the-art treatments and diagnostics to medical professionals. However, existing public databases such as MEDLINE contain over 27 million articles, making the use of efficient search engines crucial in order to navigate and provide meaningful recommendations. Classifying these articles into broader medical topics can improve retrieval of related articles [1]. The set of medical labels considered for the MESINESP task is on the order of several thousands of labels (DeCS codes), which falls under the extreme multi-label classification problem [2]. The heterogeneous and highly hierarchical structure of medical topics makes the task of manually classifying articles extremely laborious and costly. It is, therefore, crucial to automate the process of classification. Typical machine learning algorithms become computationally demanding with such a large label set, and achieving better recall remains an unsolved problem. This work presents Priberam's participation at the BioASQ task Mesinesp. We address the large multi-label classification problem through the use of four different models: a Support Vector Machine (SVM) [3], the customised search engine Priberam Search [4], a BERT based classifier [5], and an SVM-rank ensemble [6] of all the previous models. Results show that all three individual models perform well and the best performance is achieved by their ensemble, granting Priberam the 6th place in the present challenge and making it the 2nd best team.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        A growing number of medical articles is published every year, with a current
estimated rate of at least one new article every 26 seconds [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The sheer number of both documents and assigned topics renders automatic classification
algorithms a necessity in organising and providing relevant information. Search
engines have a vital role in easing the burden of accessing this information
efficiently; however, they usually rely on the manual indexing or tagging of articles,
which is a slow and burdensome process [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        The Mesinesp task consists in automatically indexing abstracts in Spanish
from two well-known medical databases, IBECS and LILACS, with tags from
a pool of 34118 hierarchically structured medical terms, the DeCS codes. This
trilingual vocabulary (English, Portuguese and Spanish) serves as a unique
vocabulary in indexing medical articles. It follows a tree structure that divides the
codes into broader classes and more refined sub-classes, respecting their
conceptual and semantic relationships [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>In this task, we tackle the extreme multi-label (XML) classification problem.
Our goal is to predict, for a given article, the most relevant subset of labels
from an extremely large label set (on the order of tens of thousands) using supervised
training.¹ Typical multi-label classification techniques are not suitable for the
XML setting due to their large computational requirements: the large number
of labels implies that both label and feature vectors are sparse and exist in
high-dimensional spaces; and to address the sparsity of label occurrence, a large
number of training instances is required. These factors make the application of
such techniques highly demanding in terms of time and memory, increasing the
requirements of computational resources.</p>
      <p>
        The Mesinesp task is even more challenging for two reasons: first, the
articles' labels must be predicted only from the abstracts and titles; and
second, all the articles to be classified are in Spanish, which prevents the use of
additional resources available only for English, such as BioBERT [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and
ClinicalBERT [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        This paper describes our participation at the BioASQ task Mesinesp. We
explore the performance of a one-vs.-rest model based on Support Vector Machines
(SVM) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] as well as that of a proprietary search engine, Priberam Search [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
which relies on inverted indexes combined with a k-nearest neighbours classifier.
Furthermore, we took advantage of BERT's contextualised embeddings [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and
tested three possible classifiers: a linear classifier; a label attention mechanism
that leverages label semantics; and a recurrent model that predicts a sequence
of labels according to their frequency. We propose the following contributions:
- Application of BERT's contextualised embeddings to the task of XML
classification, including the exploration of linear, attention-based and recurrent
classifiers. To the best of our knowledge, this work is the first to apply a
pretrained BERT model combined with a recurrent network to the XML
classification task.
- Empirical comparison of a simple one-vs.-rest SVM approach with a more
complex model combining a recurrent classifier and BERT embeddings.
- An ensemble of the previous individual methods using SVM-rank, which was
capable of outperforming them.
¹ The task of multi-label classification differs from multi-class classification in that
labels are not exclusive, which enables the assignment of several labels to the same
article, making the problem even harder [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Currently, there are two main approaches to XML: embedding based methods
and tree based methods.</p>
      <p>
        Embedding based methods deal with the problem of high dimensional feature
and label vectors by projecting them onto a lower dimensional space [
        <xref ref-type="bibr" rid="ref13 ref8">8,13</xref>
        ].
During prediction, the compressed representation is projected back onto the space of
high dimensional labels. This information bottleneck can often reduce noise and
allow for a way of regularising the problem. Although very efficient and fast, this
approach assumes that the low-dimensional space is capable of
encoding most of the original information. For real world problems, this assumption
is often too restrictive and may result in decreased performance.
      </p>
      <p>
        Tree based approaches intend to learn a hierarchy of features or labels from
the training set [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ]. Typically, a root node is initialised with the complete
set of labels and its children nodes are recursively partitioned until all the leaf
nodes contain a small number of labels. During prediction, each article is passed
along the tree and the path towards its final leaf node defines the predicted set
of labels. These methods tend to be slower than embedding based methods but
achieve better performance. However, if a partitioning error is made near the
top of the tree, its consequences are propagated to the lower levels.
      </p>
      <p>
        Furthermore, other methods are worth mentioning because their simple approaches
are capable of achieving competitive results. Among these, DiSMEC [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] should
be highlighted because it follows a one-vs.-rest approach which simply learns a
weight vector for each label. The multiplication of this weight vector with a
data point's feature vector yields a score that determines the classification of the label.
Another simple approach consists of performing a set of random projections
from the feature space towards a lower dimension space where, for each test
data point, a k-nearest neighbours algorithm performs a weighted propagation
of the neighbour's labels, based on their similarity [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>We propose two new approaches which are substantially distinct from the
ones discussed above. The first uses a search engine based on inverted
indexing and the second leverages BERT's contextualised embeddings combined
with either a linear or recurrent layer.</p>
    </sec>
    <sec id="sec-3">
      <title>XML Classification Models</title>
      <p>We explore the performance of a one-vs.-rest SVM model in §3.1, and a
customised search engine (Priberam Search) in §3.2. We further experiment with
several classifiers leveraging BERT's contextualised embeddings in §3.3. Finally,
we aggregate the predictions of all of these individual models using an
SVM-rank algorithm in §3.4.</p>
      <p>
        3.1 Support Vector Machine
Our first baseline consists of a simple Support Vector Machine (SVM) using a
one-vs.-rest strategy. We train an independent SVM classifier for each possible
label. To reduce the computational burden, we only consider labels with
frequency above a given threshold $f_{min}$. Each classifier weight $w \in \mathbb{R}^d$ measures
the importance assigned to each feature representation of a given article and is
trained to optimise the max-margin loss between the support vectors $x_i \in \mathbb{R}^d$ and the
hyper-plane [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]:
      </p>
      <p>
\[
\min_{w} \; \tfrac{1}{2} w^\top w + C \sum_{i=1}^{l} \xi(w; x_i, y_i)
\qquad \text{s.t.} \quad y_i(w^\top x_i + b) \ge 1 - \xi_i
\tag{1}
\]
where $(x_i, y_i)$ are the article-label pairs, $C$ is the regularisation parameter, $b$ is
a bias term, $\xi$ is a slack function used to penalise incorrectly
classified points, and $w$ is the vector normal to the decision hyper-plane. We used
the abstract's term frequency-inverse document frequency (tf-idf) as features to
represent $x_i$.</p>
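      <p>As an illustration only (not the authors' exact code), the one-vs.-rest pipeline can be sketched with scikit-learn; the feature-extraction settings and the handling of the decision-boundary offset below are assumptions:</p>
      <preformat>
# Minimal sketch of the one-vs.-rest SVM baseline; `abstracts` is a list of
# strings and `labels` a list of DeCS-code lists (both assumed available).
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

F_MIN = 20  # ignore labels occurring in fewer than 20 abstracts (Sec. 4.1)

# Keep only sufficiently frequent labels.
counts = Counter(code for codes in labels for code in codes)
kept = {code for code, c in counts.items() if c &gt;= F_MIN}
filtered = [[code for code in codes if code in kept] for codes in labels]

# tf-idf features for each abstract, one independent linear SVM per label
# (squared hinge slack function, C = 1.0, as in Sec. 4.1).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(abstracts)
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(filtered)
clf = OneVsRestClassifier(LinearSVC(C=1.0, loss="squared_hinge"), n_jobs=-1)
clf.fit(X, Y)

# Shifting the decision hyper-plane (Sec. 4.1) can be emulated by comparing the
# decision function against an offset instead of zero; the sign is an assumption.
scores = clf.decision_function(vectorizer.transform(["nuevo resumen biomedico"]))
predicted = binarizer.classes_[scores[0] &gt;= -0.3]
      </preformat>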
      <p>
        3.2 Priberam Search
The second model consists of a customised search engine, Priberam Search, based
on inverted indexing and retrieval using the Okapi-BM25 algorithm [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. It uses an
additional k-nearest neighbours algorithm (k-NN) to obtain the set of k indexed
articles closest to a query article in feature space. This similarity is based on
the frequency of words, lemmas and root-words, as well as label semantics and
synonyms. A score is given to each one of these articles and to each one of their
labels and label synonyms, and a weighted sum of these scores yields the final
score assigned to each label.
      </p>
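      <p>Priberam Search itself is proprietary; the sketch below only illustrates the general retrieve-then-vote idea using the open-source rank_bm25 package, which is an assumption and omits lemmas, root-words and synonym handling:</p>
      <preformat>
# Illustrative BM25 retrieval followed by weighted k-NN label voting (Sec. 3.2).
# Assumes the rank_bm25 package; this is not the Priberam Search implementation.
from collections import defaultdict
from rank_bm25 import BM25Okapi

def bm25_knn_label_scores(query, indexed_texts, indexed_labels, k=40):
    """Score the labels of the k most similar indexed articles, weighted by BM25 similarity."""
    bm25 = BM25Okapi([doc.split() for doc in indexed_texts])
    sims = bm25.get_scores(query.split())

    # k = 40 gave the best results in the authors' tuning (Sec. 4.2).
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]

    label_scores = defaultdict(float)
    for i in top:
        for label in indexed_labels[i]:
            label_scores[label] += sims[i]

    # Normalising by the top score and thresholding at 0.24 (Sec. 4.2) is an
    # assumption about how the tuned cut-off would be applied here.
    top_score = max(label_scores.values(), default=1.0)
    return {lab: s / top_score for lab, s in label_scores.items() if s / top_score &gt; 0.24}
      </preformat>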
      <p>
        3.3 XML BERT Classifier
Language model pretraining has recently advanced the state of the art in several
Natural Language Processing tasks, with the use of contextualised embeddings
such as BERT, Bidirectional Encoder Representations from Transformers [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
This model consists of 12 stacked transformer blocks and its pretraining is
performed on a very large corpus following two tasks: next sentence prediction
and masked language modelling. The nature of the pretraining tasks makes this
model ideal for representing sentence information (given by the representation
of the [CLS] token added to the beginning of each sentence). After encoding a
sentence with BERT, we apply different classifiers and fine-tune the model to
minimise a multi-label classification loss:
      </p>
      <p>
\[
\mathrm{BCELoss}(x_i, y_i) = -\sum_{j} \Big[ y_{i,j} \log \sigma(x_{i,j}) + (1 - y_{i,j}) \log\big(1 - \sigma(x_{i,j})\big) \Big]
\tag{2}
\]
where $y_{i,j}$ denotes the binary value of label $j$ of article $i$, which is 1 if the label is present
and 0 otherwise, $x_{i,j}$ represents the label prediction (logit) of article $i$ and label
$j$, and $\sigma$ is the sigmoid function.</p>
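      <p>In PyTorch this objective corresponds to BCEWithLogitsLoss applied to the label logits; a minimal sketch with illustrative tensor shapes:</p>
      <preformat>
# Sketch of the multi-label objective in Eq. 2; shapes are illustrative only
# (a batch of 2 articles over the 33702-dimensional DeCS label space).
import torch
import torch.nn as nn

num_labels = 33702
logits = torch.randn(2, num_labels, requires_grad=True)  # x_{i,j}: label logits
targets = torch.zeros(2, num_labels)                      # y_{i,j}: gold label indicators
targets[0, 5] = 1.0
targets[1, 123] = 1.0

# BCEWithLogitsLoss fuses the sigmoid with the binary cross-entropy of Eq. 2.
criterion = nn.BCEWithLogitsLoss()
loss = criterion(logits, targets)
loss.backward()  # in the full model this fine-tunes BERT and the classifier head
      </preformat>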
      <p>3.3.1 In-domain transfer knowledge Additionally, we performed an extra
step of pretraining. Starting from the original weights of BERT
pretrained on Spanish, we further pretrained the model on a masked
language modelling task over the corpus composed of all the articles in the training
set. This extra step results in more meaningful contextualised representations
for this medical corpus, whose domain-specific language might differ from the
original pretraining corpora.</p>
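      <p>A compressed sketch of such continued masked-language-model pretraining with the Transformers library follows; the file name and training hyper-parameters are illustrative, not the values of Table 1:</p>
      <preformat>
# Sketch of continued MLM pretraining on the Mesinesp abstracts; paths and
# hyper-parameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "dccuchile/bert-base-spanish-wwm-cased"  # Spanish BERT (BETO)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Assumes the training abstracts were dumped to a plain-text file, one per line.
dataset = load_dataset("text", data_files={"train": "mesinesp_abstracts.txt"})["train"]
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                      batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="beto-mesinesp-mlm", num_train_epochs=3,
                         per_device_train_batch_size=8)
Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
      </preformat>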
      <p>
        After this, we tested three different classifiers: a linear classifier in §3.3.2, a
linear classifier with label attention in §3.3.3, and a recurrent classifier in §3.3.4.
      </p>
      <p>3.3.2 XML BERT Linear Classifier The first and simplest classifier
consists of a linear layer which maps the sequence output (the 768-dimensional
embedding corresponding to the [CLS] token) to the label space, composed of
33702 dimensions corresponding to all the labels found in the training set. This
architecture is represented in Figure 1. We minimise binary cross-entropy using
sigmoid activations to allow for multiple active labels per instance, see Eq. 2.
This classifier is hereafter designated Linear.</p>
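      <p>A minimal sketch of this Linear head (module and variable names are illustrative):</p>
      <preformat>
# Sketch of the Linear classifier (Sec. 3.3.2): the 768-dimensional [CLS]
# embedding is mapped directly to the 33702-dimensional DeCS label space.
import torch.nn as nn
from transformers import AutoModel

class BertXmlLinear(nn.Module):
    def __init__(self, model_name="dccuchile/bert-base-spanish-wwm-cased",
                 num_labels=33702):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # [CLS] token representation
        return self.classifier(cls)         # logits x_{i,j} used in Eq. 2
      </preformat>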
      <p>3.3.3 XML BERT With Label Attention For the second classifier, we
assume a continuous representation with 768 dimensions for each label. We
initialise label embeddings as the pooled output embeddings (corresponding to
the [CLS] token) of a BERT model whose inputs were the string descriptors
and synonyms for each label. We consider a key-query-value attention
mechanism [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], where the query corresponds to the pooled output of the abstract's
contextualised representation and the keys and values correspond to the label
embeddings. We further consider residual connections, and a final linear layer
maps these results to the decision space of 33702 labels using a linear classifier,
as shown in Figure 2. Once again, we choose a binary cross-entropy loss (Eq. 2).
This classifier is hereafter designated Label attention.</p>
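      <p>The following sketch is one possible reading of this label-attention head (single-head attention with a residual connection); it is not the authors' exact implementation, and the label embeddings are shown as a plain learnable parameter instead of BERT-encoded descriptors:</p>
      <preformat>
# Sketch of a label-attention classifier in the spirit of Sec. 3.3.3.
import torch
import torch.nn as nn

class LabelAttentionHead(nn.Module):
    def __init__(self, hidden_size=768, num_labels=33702):
        super().__init__()
        # In the paper these are initialised from BERT [CLS] encodings of the
        # DeCS descriptors and synonyms; random initialisation is a placeholder.
        self.label_embeddings = nn.Parameter(torch.randn(num_labels, hidden_size))
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, cls_embedding):
        # Query: pooled abstract representation; keys/values: label embeddings.
        scores = cls_embedding @ self.label_embeddings.t()             # (batch, num_labels)
        weights = torch.softmax(scores / cls_embedding.size(-1) ** 0.5, dim=-1)
        attended = weights @ self.label_embeddings                     # (batch, hidden_size)
        fused = attended + cls_embedding                               # residual connection
        return self.classifier(fused)                                  # logits for Eq. 2
      </preformat>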
      <p>3.3.4 XML BERT With Gated Recurrent Unit In the last classifier, we
predict the article's labels sequentially. Before the last linear classifier used to
project the final representation onto the label space, we add a Gated Recurrent
Unit (GRU) network [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] with 768 units that sequentially predicts each label
according to label frequency. A flowchart of the architecture is shown in Figure
3. This sequential prediction is performed until the stopping
label is predicted.
      </p>
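      <p>A rough sketch of such a GRU label decoder follows; feeding the previous label's embedding back into the GRU, greedy decoding and a dedicated stop-label index are assumptions about details not stated in the text:</p>
      <preformat>
# Sketch of a sequential GRU label decoder in the spirit of Sec. 3.3.4: starting
# from the [CLS] embedding, one label is emitted per step until the stop label.
import torch
import torch.nn as nn

class GruLabelDecoder(nn.Module):
    def __init__(self, hidden_size=768, num_labels=33702):
        super().__init__()
        self.stop_label = num_labels            # extra index reserved as the stop label
        self.label_embed = nn.Embedding(num_labels + 1, hidden_size)
        self.gru = nn.GRUCell(hidden_size, hidden_size)
        self.classifier = nn.Linear(hidden_size, num_labels + 1)

    def forward(self, cls_embedding, max_steps=30):
        batch = cls_embedding.size(0)
        hidden = cls_embedding                                   # initial GRU state
        prev = self.label_embed(torch.full((batch,), self.stop_label, dtype=torch.long))
        step_logits = []
        for _ in range(max_steps):
            hidden = self.gru(prev, hidden)
            logits = self.classifier(hidden)                     # one label per step
            step_logits.append(logits)
            prev = self.label_embed(logits.argmax(dim=-1))       # greedy decoding
        return torch.stack(step_logits, dim=1)                   # (batch, steps, labels)
      </preformat>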
      <p>We consider a binary cross-entropy loss with two different approaches. In
the first approach, all labels are sequentially predicted and the loss is computed
only after the stopping label is predicted, i.e., the loss value is independent of
the order in which the labels are predicted: it only takes into account the final
set. This loss is denominated Bag of Labels loss (BOLL) and is given by:
\[
L_{BOLL} = \mathrm{BCELoss}(x_i, y_i)
\tag{3}
\]
where $x_i$ and $y_i$ are the total sets of predicted logits and gold labels for the
current article $i$, respectively. Models trained with this loss are hereafter
designated Gru Boll.</p>
      <p>The second approach uses an iterative loss which is computed at each step
of the sequential prediction of labels: each predicted label is compared with the
gold label, and the resulting loss is added to a running loss value. In this case,
the loss is denominated Iterative Label loss (ILL):
\[
L_{ILL} = \sum_{t \in T} \mathrm{BCELoss}\big(x_i(t), y_i(t)\big)
\tag{4}
\]
where $T$ is the length of the label sequence, $t$ denotes the time-steps taken by
the GRU until the "stop label" is predicted, and $x_i(t)$ and $y_i(t)$ are the predicted
logits and gold labels for time-step $t$ and article $i$, respectively. Models trained
with this loss are hereafter designated Gru Ill.</p>
      <p>Although only one of the losses accounts directly for prediction order, this
factor is always relevant because it affects the final set of predicted labels. This
way, the model must be trained and tested assuming a specific label ordering.
For this work, we used two orders: ascending and descending label frequency on
the training set, designated Gru ascend and Gru descend, respectively.</p>
      <p>Additionally, we developed a masking system to force the sequential
prediction of labels according to the chosen frequency order. This means that at each
step the output label set is reduced to all labels whose frequency falls below or
above that of the previous label, depending on the monotonically ascending or
descending order, respectively. Models in which such masking is used are designated
Gru w/ mask.</p>
      <p>
        3.4 Ensemble
Furthermore, we developed an ensemble model combining the results of the
previously described SVM, Priberam Search and BERT-GRU models. This
ensemble's main goal is to leverage the label scores yielded by these three
individual models in order to make a more informed decision regarding the relevance
of each label to the abstracts.</p>
      <p>
        We chose an ensembling method based on a SVM-rank algorithm [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] whose
features are the normalised scores yielded by the three individual models, as
well as their pairwise product and full product. These scores are the distance to
the hyper-plane in the SVM model, the k-nearest neighbours score for Priberam
Search and the label probability for the BERT model.
      </p>
      <p>An SVM-rank is a variant of the support vector machine algorithm used to
solve ranking problems [19]. It essentially leverages pair-wise ranking methods to
sort and score results based on their relevance for a specific query. This algorithm
optimises a loss analogous to the one shown in Eq. 1. This ensemble is hereafter
designated SVM-rank ensemble.</p>
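      <p>The per-label feature vector fed to SVM-rank can be sketched as below; the text does not specify the normalisation, so min-max scaling is an assumption:</p>
      <preformat>
# Sketch of the SVM-rank ensemble features (Sec. 3.4): normalised base-model
# scores for each candidate label, their pairwise products and the full product.
def minmax(scores):
    """Min-max normalise a {label: score} dictionary (normalisation choice is assumed)."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {label: (s - lo) / span for label, s in scores.items()}

def ensemble_features(svm_scores, search_scores, bert_scores):
    """Build the SVM-rank feature vector for every candidate label of one article."""
    svm, search, bert = minmax(svm_scores), minmax(search_scores), minmax(bert_scores)
    features = {}
    for label in set(svm) | set(search) | set(bert):
        s1, s2, s3 = svm.get(label, 0.0), search.get(label, 0.0), bert.get(label, 0.0)
        features[label] = [s1, s2, s3,                  # individual model scores
                           s1 * s2, s1 * s3, s2 * s3,   # pairwise products
                           s1 * s2 * s3]                # full product
    return features
      </preformat>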
    </sec>
    <sec id="sec-4">
      <title>Experimental Setup</title>
      <p>We consider the training set provided for the Mesinesp competition containing
318658 articles with at least one DeCS code and an average of 8.12 codes per
article. We trained the individual models with 95% of this data. The remaining
5% was used to train the SVM-rank algorithm. The provided smaller official
development set, with 750 samples, was used to fine-tune the individual models'
and ensemble's hyper-parameters, while the test set, with 500 samples, was used
for reporting final results. These two sets were manually annotated by experts
specifically for the MESINESP task.</p>
      <p>
        4.1 Support Vector Machine
For the SVM model we chose to ignore all labels that appeared in fewer than
20 abstracts. With this cutoff, we decrease the output label set size to 9200.
Additionally, we use a linear kernel to reduce computation time and avoid
overfitting, which is critical when training such a large number of classifiers. Regarding
regularisation, we obtained the best performance using a regularisation
parameter set to C = 1.0, and a squared hinge slack function whose penalty over the
misclassified data points is computed with an $\ell_2$ distance.</p>
      <p>Furthermore, to enable more control over the classification boundary, after
solving the optimisation problem we moved the decision hyper-plane along the
direction of $w$. We empirically determined that a distance of 0.3 from its
original position resulted in the best F1 score. This model was implemented
using scikit-learn.²</p>
      <p>
        4.2 Priberam Search
To use the Priberam Search engine, we first indexed the training set taking
into account the abstract text, title, complete set of gold DeCS codes, and also
their corresponding string descriptors along with some provided synonyms.³ We
tuned the number of neighbours k over the values [10, 20, 30, 40, 50, 60, 70, 100, 200] on the
development set for the k-NN algorithm and obtained the best results for k = 40.
To decide whether or not a label should be assigned to an article, we
fine-tuned a score threshold over the interval [0.1, 0.5] using the official development
set, obtaining a best performing value of 0.24. All labels with score above the
threshold were picked as correct labels.</p>
      <p>² scikit-learn.org
³ https://temu.bsc.es/mesinesp/wp-content/uploads/2019/12/DeCS.2019.v5.tsv.zip</p>
      <p>
        4.3 XML BERT Classifier
For all types of BERT classifiers, we used the Transformers and PyTorch Python
packages [20, 21].</p>
      <p>We initialised BERT's weights from its cased version pretrained on Spanish
corpora, bert-base-spanish-wwm-cased.⁴</p>
      <p>We further performed a pretraining on the Mesinesp dataset to obtain better
in-domain embeddings. For the pretraining and classification tasks, Table 1 shows
the training hyper-parameters.</p>
      <p>For all the experiments with BERT, the complete set of DeCS codes was
considered as the label set.</p>
      <p>[Table 1: training hyper-parameters (batch size, learning rate, warmup steps, maximum sequence length, learning rate decay, dropout probability) for the pretraining and classification tasks; the values did not survive extraction.]</p>
      <p>Our ensemble model aggregates the predictions of all the individual contenders
and produces a final predicted label set. To improve recall, we lowered the
thresholds set for each individual model until the value for which the average number
of predicted labels per abstract was approximately double the average number
of gold labels. This ensures that the SVM-rank algorithm was trained with a
balanced set, resulting in a system in which the individual models have very
high recall and the ensemble model is responsible for precision.</p>
      <p>We trained the SVM-rank model with the 5% hold-out data of the training
set. Furthermore, SVM-rank returns a score for each label in each abstract,
making it necessary to define a threshold for classification. This threshold was
fine-tuned over the interval [-0.5, 0.5] using the official Mesinesp development
set, yielding a best performing cut-off score of 0.0233.</p>
      <p>We also fine-tuned the regularisation parameter C, experimenting with the
values C = [0.01, 0.1, 0.5, 1, 5, 10] and obtaining the best performance for C = 0.1.
The current model was implemented using a Python wrapper for the dlib C++
toolkit [22].</p>
      <p>⁴ https://github.com/dccuchile/beto</p>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>Table 2 shows the micro-averaged precision, recall and F1 metrics for the best performing
models described above, evaluated on both the official development and test sets.</p>
      <p>The comparison between the scores obtained for the one-vs.-rest SVM and
Priberam Search models shows that the SVM outperforms the k-NN based
Priberam Search in terms of F1, which is mostly due to its higher recall. Note that,
although not ideal for multi-label problems, the one-vs.-rest strategy for the
SVM model was able to achieve a relatively good performance, even when the
predicted label set was significantly reduced.</p>
      <p>[Table 2: performance of the submitted models on the official development and test sets; only the header and the SVM row label survived extraction.]</p>
      <p>Table 3 shows the performance of several classifiers used with BERT. Note
that, for these models, in order to save time and computational resources, some
tests were stopped before achieving their maximum performance, while still
allowing comparison with the other models.</p>
      <p>We trained linear classifiers using the BERT model with pretraining on the
MESINESP corpus for 660k steps (≈19 epochs) and without such
pretraining (marked with *). Results show that, even with an under-trained classifier,
such pretraining is already advantageous. This pretraining was employed for all
models combining BERT embeddings with a GRU classifier. The label-attentive
BERT model (Label attention) shows a negligible impact on performance when
compared with the simple linear classifier (Linear).</p>
      <p>We consider three design choices for the BERT-GRU model: Bag of
Labels loss (Boll) or Iterative Label loss (Ill), ascending or descending label
frequency, and whether or not masking is used. Taking into account the best score
achieved, the BOLL loss performs better than the ILL loss, even with a smaller
number of training steps. For this BOLL loss, it is also evident that the ordering
of labels with ascending frequency outperforms the opposite order, and that
masking results in decreased performance.</p>
      <p>On the other hand, for the ILL loss, masking improves the achieved score
and the ordering of labels with descending frequency shows better results. The
best classifier for a BERT-based model is the GRU network trained with a Bag
of Labels loss and with labels provided in ascending frequency order (Gru Boll
ascend). This model was further trained for a total of 28 epochs, resulting in
an F1 of 0.4918 on the 5% hold-out of the training set. It is important to note
the performance drop from the 5% hold-out data to the official development set.
This drop is likely a result of the mismatch between the annotation methods
used in the two sets, given that the development set was specifically manually
annotated for this task.</p>
      <p>Surprisingly, the BERT based model shows worse performance than the SVM
on the test set. Despite their very similar F1 scores for the development set,
the BERT-GRU model suffered a considerable performance drop from the
development to the test set due to a decrease in recall. This might indicate some
over-fitting of hyper-parameters and a possible mismatch between these two
expert annotated sets.</p>
      <p>Additionally, as shown in Table 2, the ensemble combining the results
of the SVM, Priberam Search and the best performing BERT-based classifier
achieved the best performance on the development set, outperforming all the
individual models.</p>
      <p>[Table 3: BERT-based classifiers and their number of training steps (the F1 column did not survive extraction): Linear* - 220k; Linear - 250k†; Label attention* - 700k; Gru Boll ascend - 80k; Gru Boll descend - 40k; Gru Boll ascend w/ mask - 100k†; Gru Ill descend - 240k†; Gru Ill descend w/ mask - 240k†; Gru Ill ascend w/ mask - 240k†. Entries marked * use BERT without in-domain pretraining.]</p>
      <p>Finally, Table 4 shows additional classification metrics for each one of the
submitted systems, as well as their rank within the Mesinesp task. The analysis
of these results makes clear that, for the three considered averages (Micro, Macro
and per sample), the SVM model shows the best recall score. For most of the
remaining metrics, the SVM-rank ensemble is able to leverage the capabilities of
the individual models and achieve considerable performance gains, particularly
noticeable for the precision scores.</p>
      <p>[Table 4: micro- (F1, P, R), macro- (MaF1, MaP, MaR) and example-based (EbF1, EbP, EbR) metrics for each submitted system; the numerical values did not survive extraction.]</p>
      <p>This paper introduces three types of extreme multi-label classifiers: an SVM, a
k-NN based search engine and a series of BERT classifiers. Our one-vs.-rest SVM
model shows the best performance on all recall metrics. We further provide an
empirical comparison of different variants of multi-label BERT-based classifiers,
where the Gated Recurrent Unit network with the Bag of Labels loss shows the
most promising results. This model yields slightly better results than the SVM
model on the development set but, due to a drop in recall, under-performs
it on the test set. Finally, the SVM-rank ensemble is able to leverage the label
scores yielded by the three individual models and combine them into a final
ranking model with a precision gain on all metrics, achieving the
highest F1 score (being the 6th best model in the task).</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This work is supported by the Lisbon Regional Operational Programme (Lisboa
2020), under the Portugal 2020 Partnership Agreement, through the European
Regional Development Fund (ERDF), within project TRAINER (No 045347).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Yi</surname>
            <given-names>X</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Allan</surname>
            <given-names>J.</given-names>
          </string-name>
          <article-title>A comparative study of utilizing topic models for information retrieval</article-title>
          .
          <source>European conference on information retrieval</source>
          , pp.
          <fpage>29</fpage>
          -
          <lpage>41</lpage>
          . Springer (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Shen</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            <given-names>HF</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanghavi</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dhillon</surname>
            <given-names>I</given-names>
          </string-name>
          .
          <article-title>Extreme Multi-label Classification from Aggregated Labels</article-title>
          . arXiv preprint arXiv:
          <year>2004</year>
          .
          <volume>00198</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Fan</surname>
            <given-names>RE</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            <given-names>KW</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsieh</surname>
            <given-names>CJ</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            <given-names>XR</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            <given-names>CJ</given-names>
          </string-name>
          .
          <article-title>LIBLINEAR: A library for large linear classification</article-title>
          .
          <source>Journal of machine learning research</source>
          (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Miranda</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nogueira</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mendes</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vlachos</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Secker</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garrett</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitchel</surname>
            <given-names>J</given-names>
          </string-name>
          , Marinho Z.
          <source>Automated Fact Checking in the News Room. In The World Wide Web Conference</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Devlin</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            <given-names>K.</given-names>
          </string-name>
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          .
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Joachims</surname>
            <given-names>T.</given-names>
          </string-name>
          <article-title>Optimizing search engines using clickthrough data. InProceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining (</article-title>
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Garba</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ahmed</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Makama</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Odigie</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <article-title>Proliferations of scienti c medical journals: a burden or a blessing</article-title>
          .
          <source>Oman medical journal</source>
          ,
          <volume>25</volume>
          (
          <issue>4</issue>
          ), p.
          <volume>311</volume>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Zhang</surname>
            <given-names>W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            <given-names>X</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zha</surname>
            <given-names>H</given-names>
          </string-name>
          .
          <article-title>Deep extreme multi-label learning</article-title>
          .
          <source>Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>VHL</given-names>
            <surname>Network</surname>
          </string-name>
          <article-title>Portal</article-title>
          . Red.bvsalud.org.
          <year>2020</year>
          . Decs. [online] Available at: http://red.bvsalud.org/decs/en/about-decs
          <source>/ (Accessed 2 May</source>
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Babbar</surname>
            <given-names>R</given-names>
          </string-name>
          , Scholkopf B.
          <article-title>DiSMEC: Distributed Sparse Machines for Extreme Multi-label Classification</article-title>
          .
          <source>Proceedings of the Tenth ACM International Conference on Web Search and Data Mining</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Lee</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yoon</surname>
            <given-names>W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>So</surname>
            <given-names>CH</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kang</surname>
            <given-names>J.</given-names>
          </string-name>
          <article-title>BioBERT: a pre-trained biomedical language representation model for biomedical text mining</article-title>
          .
          <source>Bioinformatics</source>
          (2020 Feb).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Alsentzer</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murphy</surname>
            <given-names>JR</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boag</surname>
            <given-names>W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weng</surname>
            <given-names>WH</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jin</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naumann</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McDermott</surname>
            <given-names>M</given-names>
          </string-name>
          .
          <article-title>Publicly available clinical BERT embeddings</article-title>
          . arXiv preprint arXiv:
          <year>1904</year>
          .
          <volume>03323</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Tai</surname>
            <given-names>F</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            <given-names>HT</given-names>
          </string-name>
          .
          <article-title>Multilabel classi cation with principal label space transformation</article-title>
          .
          <source>Neural Computation</source>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Prabhu</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varma M. Fastxml</surname>
          </string-name>
          :
          <article-title>A fast, accurate and stable tree-classifier for extreme multi-label learning</article-title>
          .
          <source>Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Agrawal</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prabhu</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varma</surname>
            <given-names>M.</given-names>
          </string-name>
          <article-title>Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages</article-title>
          .
          <source>Proceedings of the 22nd international conference on World Wide Web</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Verma</surname>
            <given-names>Y.</given-names>
          </string-name>
          <article-title>An Embarrassingly Simple Baseline for eXtreme Multi-label Prediction</article-title>
          . arXiv preprint arXiv:
          <year>1912</year>
          .
          <volume>08140</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Vaswani</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shazeer</surname>
            <given-names>N</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parmar</surname>
            <given-names>N</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uszkoreit</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez</surname>
            <given-names>AN</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaiser</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polosukhin</surname>
            <given-names>I</given-names>
          </string-name>
          .
          <article-title>Attention is all you need</article-title>
          .
          <source>Advances in neural information processing systems</source>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Cho</surname>
            <given-names>K</given-names>
          </string-name>
          , van Merrienboer
          <string-name>
            <given-names>B</given-names>
            ,
            <surname>Gulcehre</surname>
          </string-name>
          <string-name>
            <given-names>C</given-names>
            ,
            <surname>Bahdanau</surname>
          </string-name>
          <string-name>
            <given-names>D</given-names>
            ,
            <surname>Bougares</surname>
          </string-name>
          <string-name>
            <given-names>F</given-names>
            ,
            <surname>Schwenk</surname>
          </string-name>
          <string-name>
            <given-names>H</given-names>
            ,
            <surname>Bengio</surname>
          </string-name>
          <string-name>
            <surname>Y</surname>
          </string-name>
          .
          <article-title>Learning Phrase Representations using RNN Encoder{Decoder for Statistical Machine Translation</article-title>
          .
          <source>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>19. Liu TY. Learning to rank for information retrieval. Springer Science &amp; Business Media (2011).</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>20. Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, Brew J. HuggingFace's Transformers: State-of-the-art Natural Language Processing. ArXiv (2019).</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>21. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. Advances in Neural Information Processing Systems 32, pp. 8024-8035 (2019).</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>22. King DE. Dlib-ml: A machine learning toolkit. The Journal of Machine Learning Research (2009).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>