Zero-Shot and Few-Shot Classification of Biomedical Articles in Context of the COVID-19 Pandemic Simon Lupart1 , Benoit Favre2 , Vassilina Nikoulina1 and Salah Ait-Mokhtar1 1 Naver Labs Europe, 6 Chem. de Maupertuis, 38240, Meylan, France 2 Aix Marseille University, CNRS, LIS / Marseille, France Abstract MeSH (Medical Subject Headings) is a large thesaurus created by the National Library of Medicine and used for fine-grained indexing of publications in the biomedical domain. In the context of the COVID-19 pandemic, MeSH descriptors have emerged in relation to articles published on the corresponding topic. Zero-shot classification is an adequate response for timely labeling of the stream of papers with MeSH categories. In this work, we hypothesise that rich semantic information available in MeSH has potential to improve BioBERT representations and make them more suitable for zero-shot/few-shot tasks. We frame the problem as determining if MeSH term definitions, concatenated with paper abstracts are valid instances or not, and leverage multi-task learning to induce the MeSH hierarchy in the representations thanks to a seq2seq task. Results establish a baseline on the MedLine and LitCovid datasets, and probing shows that the resulting representations convey the hierarchical relations present in MeSH. Keywords Text Classification, Transfer Domain Adaptation Multi-Task Learning, Healthcare Medicine & Wellness 1. Introduction With the outbreak of the COVID-19 disease, the biomedi- cal domain has evolved: new concepts have emerged, and old ones have been revised. In that context, scientific pa- pers are typically manually or automatically labelled with MeSH terms, Medical Subject Headings [1], which helps routing them to the best target audience. It is crucial for the community to be able to react swiftly to events like pandemics, and manual efforts to annotate large numbers of publications may not be timely. To automate that task, it is difficult to use typical classification methods because of the lack of data for some classes, we therefore consider this problem as a zero-shot/few-shot documents classifi- Figure 1: Distribution of the MeSH descriptors: 2853 MeSH cation problem. Formally, in zero-shot learning, at test terms appear only once in the train dataset, 3565 between time a learner (the model) observes documents of classes 2 and 10 times, while a minority of them appear very fre- quently (out of 19,125 annotated documents with 8,140 MeSH that were not seen during training, and respectively in descriptors). few-shot learning, the model will have seen only a small number of documents with these classes. Class distri- butions from our medline-derived dataset are plotted in Figure 1. As shown on the histogram, lots of classes single MeSH descriptors [3]. are annotated in only one document, which makes them In this work we rely on BioBERT [4] to extract repre- difficult to learn. sentations from paper abstracts and classify them. Such Another obstacle (independent from the pandemic) is model is pretrained with masked language modeling ob- the scale of the MeSH thesaurus, as there are thousand jectives on data from the biomedical domain, and we of MeSH descriptors. State-of-the-art on MeSH classi- assume that BioBERT encodes some semantic knowledge fication thus uses IR techniques [2], or focuses on only related to the biomedical domain. However, it has been shown that this pretraining might not be optimal for SDU@AAAI22: AAAI-22 Workshop on Scientific Document Understanding, March 1, 2022, Online tasks such as NER or NLI [5]. Envelope-Open simon.lupart@naverlabs.com (S. Lupart); benoit.favre@lis-lab.fr We formulate the zero-shot task as an “open input” (B. Favre); vassilina.nikoulina@naverlabs.com (V. Nikoulina); problem where we take both the class and the text as salah.ait-mokhtar@naverlabs.com (S. Ait-Mokhtar) an input, and output a matching score between the two. © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). The motivation behind this formulation is that the model CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) efit from very large datasets they can be trained on in a self-supervised way. Such pretraining allows them to learn rich semantic representation of the text, and perform knowledge transfer on other lower-resourced tasks. Those pre-trained models can be used in a zero- shot setting, by creating representations for the given document and each of the the different classes, and then computing similarity scores based on those representa- tions. Chalkidis et al. [9] proposed for example to com- pute the similarity score with an attention mechanism between classes and documents representations. Rios and Figure 2: Example of the MeSH hierarchy. COVID-19 and Kavuluru [10] proposed in addition to the attention mech- Bronchitis have three common ancestors (Diseases category, anism to include hierarchical knowledge using GCNN, Infections and Respiratory Tract infections) and then split in but they do not handle the case where the hierarchy is distinct MeSH descriptors. The hierarchy is all included in the Tree Numbers of the MeSH terms. only available during training. Wohlwend et al. [11] also worked on the representation space using prototypical network and hyperbolic distances and they showed that there was still possible improvements in metric learning. would learn to use the semantics of the class labels, and thus will be able to extend the semantic knowledge en- Fine-grained biomedical classification. BioASQ coded by pretrained model (eg. BioBERT) to new classes. challenge is one of the reference on fine-grained clas- Therefore, our assumption is that those models host and sification of biomedical articles; however the challenge can make use of a good representation of the semantics does not focus on zero-shot adaptation, which is the sce- underlying the MeSH hierarchy, including unobserved nario we consider in this work. Mylonas et al. [3] have terms. tried to perform zero-shot classification across MeSH In order to improve the semantic representations of descriptors, but their testing settings considered only pretrained model, we propose a multi-task training frame- a small number of MeSH descriptors. In our work we work, where an additional decoder block predicts the try to perform a larger scale evaluation in context of position in the hierarchy of the input MeSH term. MeSH the pandemic. Finally, [12] proposed an architecture for descriptors have a position in the hierarchy that is de- hierarchical classification tasks that is able to learn the fined by their Tree Numbers (see Figure 2), so the goal hierarchy by generating the sequence from the hierarchy of this secondary module would be to generate those tree (using an encoder/decoder architecture). Our work Tree Numbers during training. Model learnt with this considers similar architecture in a zero-shot scenario. additional task should better encode MeSH hierarchical structure and hopefully improve zero-shot and few-shot capacities of thus learnt representations. Enforcing that Probing. Probing models [13, 14] are lightweight clas- semantic knowledge is embedded in the model also guar- sifiers plugged on top of pretrained representations. They antees a degree of explainability, an important feature in allow to assess the amount of “knowledge” encoded in the the medical domain. pretrained representations. Alghanmi et al. [15], Jin et al. The main findings of the work are: (a) Our multi-task [5] introduced frameworks in the biomedical domain for framework improves precision on some datasets, and disease knowledge evaluation focusing on relation types thus the F1-score, but it is not systematic. (b) Probing (Symptom-Disease, Test-Disease, etc.), while we are will- tasks show that performance increases are directly linked ing to assess how well a hierarchical structure is encoded to a better knowledge of the MeSH hierarchy. Still, in- in representations. We rely on the structural probing cluding hierarchical information on a large scale dataset framework [16] that we compare against the hierarchical (Medline) is difficult, especially in few-shot and zero-shot structure encoded by MeSH thesaurus. settings. 3. Proposed approach 2. Related work First, we will explain the architectures we explored to Zero-shot classification. There is a large literature address the zero-shot classification problem, and more on zero-shot learning, which consists in classifying precisely the multi-task learning framework. Then, we samples with labels not seen in training [6, 7]. Pre- will focus on the design of the probing tasks, that we used trained models such as BERT [8] or BioBERT [4] are to analyse to what extent the hierarchical knowledge was central to zero-shot learning in NLP. These models ben- encoded by the different representations. Figure 3: Architecture BioBERT MTL. The encoder (blue) creates a representation of the MeSH, then the decoder (red) generates the Tree Number from this representation. GRU cells are used in the decoder as well as an attention mechanism to better handle long sequences. The binary output (green) is the matching score. 3.1. Zero-shot Architecture (768,). On line (4), the + operator corresponds to a sum of the two vectors (both of shape (768,)) in each of their di- BioBERT Single-Task Learning (STL). The first mensions. For the word generation, 𝑜𝑢𝑡𝑗+1 is then passed model is a BioBERT encoder, followed by a dense layer through a dense layer and a logsoftmax function. on the [ C L S ] token. Input of BioBERT is composed Note that 𝑏𝑒𝑟𝑡_ℎ is formed from the output tokens of of the MeSH term, the MeSH description and a doc- BioBERT corresponding to the MeSH description (by ument abstract: [ C L S ] M e S H t e r m : M e S H d e s c r i p t i o n applying the MASK to all other tokens). We also apply [ S E P ] A b s t r a c t , and output is a single neuron, that goes “teacher forcing”, to reduce error accumulation. through a sigmoid activation function. The original problem is thus transformed in a multi- As an example, input for the MeSH term Infections tasks problem, where the two losses (binary cross en- is [ C L S ] I n f e c t i o n s : I n v a s i o n o f t h e h o s t o r g a n i s m tropy and negative log likelihood losses) are then jointly by microorganisms or their toxins or by parasites learned: that can cause pathological conditions or diseases. 1 1 [SEP] Abstract. 𝑙𝑜𝑠𝑠𝑡𝑜𝑡 = 2 𝑙𝑜𝑠𝑠1 + 2 𝑙𝑜𝑠𝑠2 + log(𝜎1 𝜎2 ) (6) 2𝜎1 2𝜎2 BioBERT Multi-Tasks Learning (MTL). The second where both 𝜎1 and 𝜎2 are learnable parameters included architecture is similar to the BioBERT STL, but in ad- in the models parameters [17, 18], to allow the model dition to the binary classification task it learns simulta- to balance between the binary and tree number gener- neously an additional task of MeSH term hierarchical ation losses. The last regularization term is only here position generation. The motivation behind this addi- to prevent the model to learn the naive solution of just tional task is that the learnt representations would better increasing 𝜎1 and 𝜎2 to reduce the loss. encode hierarchy of MeSH terms and hopefully better The output vocabulary of the decoder is composed of deal with zero-shot classification or fine-grained classifi- the tree numbers tokens: A - Z letters, 0 0 - 9 9 digits, and 0 0 0 - cation problems. 9 9 9 digits. All together, the vocabulary size is around Figure 3 describes the architecture of this model. It 1100, on which we apply an embedding layer to trans- uses an additional decoder block, as a secondary task to form discrete tree numbers to a continuous embedding predict the tree number of the given input. The gener- space. As this vocabulary is completely new, embedding ation step 𝑗 + 1 is defined as (without considering the is learnt from scratch using back-propagation. Note that batch size): for MeSH descriptors that have multiple tree numbers, we just duplicate the inputs to learn the multiple posi- 𝑎𝑡𝑡𝑗 = 𝑏𝑒𝑟𝑡_ℎ × ℎ𝑗 (1) tions in the hierarchy. In both architectures, the full set of parameters is ̂𝑗 = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥(𝑎𝑡𝑡𝑗 ) 𝑎𝑡𝑡 (2) updated during training. Thus, each model provides 𝑇 new representations of the input text and labels, that ̂𝑗 × 𝑏𝑒𝑟𝑡_ℎ 𝑎𝑡𝑡𝑛_𝑎𝑝𝑝𝑙𝑖𝑒𝑑𝑗 = 𝑎𝑡𝑡 (3) are further evaluated through zero-shot classification or 𝑖𝑛𝑝𝑢𝑡𝑗 = 𝑒𝑚𝑏𝑒𝑑𝑗 + 𝑎𝑡𝑡𝑛_𝑎𝑝𝑝𝑙𝑖𝑒𝑑𝑗 (4) through probing. ℎ𝑗+1 , 𝑜𝑢𝑡𝑗+1 = 𝐺𝑅𝑈 (ℎ𝑗 , 𝑖𝑛𝑝𝑢𝑡𝑗 ) (5) 3.2. Hierarchical Probing Task where 𝑏𝑒𝑟𝑡_ℎ is the output of BERT, of shape (512, 768), ℎ𝑗 the hidden state of the GRU cell, of shape (768,), and To better understand the capacity of pretrained repre- 𝑒𝑚𝑏𝑒𝑑𝑗 the embedding of the current word, also of shape sentations to encode hierarchical relations of biomedical terms, we considered two probing tasks, adapted from 4. Experimental settings [16]. The objective is to test whether the representations learned by the model are linearly separable with regards In this section we present the datasets we used to train the to the probing tasks. models and evaluate the corresponding representations. We define the probing task from a main model, taking We also explain how we construct a zero-shot dataset out as input a MeSH descriptor m𝑖 and returning its internal of the Medline dataset with MeSH annotations. representation h𝑖 . We then recall that it is possible to define a scalar product h𝑇 𝐴h from any positive semi 4.1. Datasets definite symmetric matrix 𝐴 ∈ S𝑚×𝑚+ , and more generally from any matrix 𝐵 ∈ R𝑘×𝑚 by taking 𝐴 = 𝐵𝑇 𝐵. Using the Medline/MeSH. Medline is the US National Library metric distance corresponding to this scalar product we of Medical/Biomedical dataset2 , containing millions of then define a distance from any matrix : biomedical scientific documents, and built around the MeSH thesaurus (Medical Subject Heading). This the- 𝑑𝐵 (h𝑖 , h𝑗 ) = (𝐵(h𝑖 − h𝑗 ))𝑇 (𝐵(h𝑖 − h𝑗 )) saurus contains about 30,000 MeSH descriptors, updated every year, and used for semantic indexing of the doc- with h𝑖 and h𝑗 the representations of two MeSH descrip- uments. These MeSH terms also define a hierarchy: tors (more details on representations in section 5.2). Our the first level separates the MeSH terms into 16 main model has as parameter the matrix 𝐵, which is trained branches, then each MeSH term is the child of another to reconstruct a gold distance from one MeSH term to more general MeSH term, and this over up to a depth another. More specifically, the task aims to approximate of fifteen. The sequence of nodes traversed to reach a by gradient descent : MeSH term from the root is called Tree Number. 𝑚𝑖𝑛 ∑ |𝑑𝑇 (h𝑖 , h𝑗 ) − 𝑑𝐵 (h𝑖 , h𝑗 )2 | An example of the hierarchy is shown in Figure 2 with 𝐵 𝑖,𝑗 the two MeSH COVID-19 and Bronchitis. For example here, Covid-19 has the Tree Number C 0 1 . 7 4 8 . 2 1 4 (C be- where 𝑑𝐵 is the predicted distance and 𝑑𝑇 the gold dis- ing the main branch Disease, then we have 3 sub-levels: tance. We also note that, as in the original paper, we add C 0 1 for Infections, C 0 1 . 7 4 8 for Respiratory Tract Infections, a squared term on the predicted distance. Concerning then C 0 1 . 7 4 8 . 2 1 4 for Covid-19). the dimensions of 𝐴 and 𝐵, 𝑚 is the dimension of the rep- The majority of MeSH descriptors from the hierarchy resentation space (same than h), and 𝑘 is the dimension have multiple Tree Numbers, so the hierarchy follows of the linear transformation which we will take equal to a directed acyclic graph structure. Also, the annotation 512. We did not further experiment on the dimension of scientific documents with ancestors of a MeSH is not of the linear transformation (see the original paper by always explicit. For example, a document can be indexed [16] for discussion on both the squared distance and the with term Covid-19, but not necessarily with terms Infec- linear transformation dimension). tions or Respiratory Tract Infections. There are on average 13 annotated MeSH descriptors Gold distance. The only difference with the original per document, where 2 or 3 will be annotated as major paper is the definition of the gold distance. We have MeSH to indicate that the document deals more specifi- evaluated two probes: cally with these topics. In our work, we use the whole 1. Shortest-Path Probe: given two MeSH descrip- set of major and non-major MeSH descriptors. In addi- tors, we ask the model to predict the distance tion to the MeSH annotation and hierarchy, the Medline between the two MeSH terms, as the length of database provides a description for each MeSH term, used the shortest path in the graph defining the MeSH by our models as specified in section 3.1. hierarchy; 2. Common-Ancestors Probe: model predicts LitCovid. LitCovid is a subset of the Medline database whether two MeSH descriptors have k common [19, 20], where extraction is done via PubMed (search ancestors. For this second task we thus define engine of Medline), using the keywords: “coronavirus”, multiples binary probe models that predict if the “ncov”, “cov”, “2019-nCoV”, “COVID-19” and “SARS-CoV- two MeSH terms have at least k common ances- 2”. Using this subset of articles allow us to work more tors or not (for k between 1 and 3). In this partic- specifically on COVID-19 related articles, with also a sub- ular case of a binary probe task, we thus add a sig- set of 9,000 COVID-19 related MeSH descriptors (instead moid function on the predicted distance (where of the full set of MeSH descriptors). The LitCovid dataset the sigmoid function is not centered on zero, but on a positive constant, as distances are always dicting the number of common ancestors given two MeSH descrip- tors, but the regressor was unable to train from the representations, positive)1 . hence the use of binary tasks. 1 2 We have also tried to cast the probe as a regression directly pre- https://www.ncbi.nlm.nih.gov/mesh/ also contains its own categorization, composed of only also add all the ancestors of the annotated MeSH 8 classes: Case report, Diagnosis, Forecasting, General, terms as positive labels to overcome annotation Mechanism, Prevention, Transmission and Treatment, incompleteness problem stated above. Adding with as for the MeSH a short description for each of the ancestors increases the size of the dataset by them. an important factor, so this is why this configura- All our experiments are made on the LitCovid dataset, tion is used for evaluation only. with 27,321 articles (Train-Val-Test split: 19,125 / 2,732 / 5,464) that have both LitCovid and MeSH annotations The choice of the negatives is crucial in metric learning, (several of the 8 classes from LitCovid + avg 13.5 MeSH/ar- and there have been lots of efforts given on developing ticle out of around 9,000 COVID-19 related MeSH descrip- techniques to find “hard negatives”. In our case, the tors from Medline). S i b l i n g s configuration creates by its nature negatives Training and evaluation relies on MeSH annotations that are difficult to distinguish from actual positives. Also (semantically richer), with results that reflects both few- we made the choice to use a binary classification layer, shot for low frequency terms, and zero-shot results for but losses like hinge loss or triplet loss could have been 747 held-out MeSH descriptors. In a second step we also interesting in this particular case. evaluate on LitCovid categories to test a transfer learning For LitCovid we consider all the {document, label} pairs scenario where we change the categorization at test time. since we do not have scaling problem (only 8 labels). 4.2. Evaluation 4.3. Training parameters We present in this section the adaptation of annotations The losses (binary cross entropy for the binary task and for the “open input” architectures. The objective is to negative log-likelihood for the hierarchy generation task) create pairs ({𝑙𝑎𝑏𝑒𝑙, 𝑑𝑜𝑐𝑢𝑚𝑒𝑛𝑡}, 𝑏𝑜𝑜𝑙𝑒𝑎𝑛) from the original are optimised using the AdamW and Adam algorithm annotation. with a learning rate of 2E-5 and 5E-4 respectively. Train- ing is done over 4 epochs, with a save every 0.25 epochs. Zero-shot dataset creation. Inputs in zero-shot are Best model is selected based on the validation loss. We different due to new class appearances: a document 𝑑1 used a batch size of 16, and performed 3 runs for each being associated with two labels (𝑙1 and 𝑙2 ) will thus be model (see standard deviation in the section 5). transformed into two inputs ({𝑙1 , 𝑑1 }, positive) and ({𝑙2 , 𝑑1 }, The BioBERT pretrained model we used was positive). When a new label 𝑙𝑛𝑒𝑤 appears, it is enough to m o n o l o g g / b i o b e r t _ v 1 . 1 _ p u b m e d from Hugging Face. This create the input ({𝑙𝑛𝑒𝑤 , 𝑑1 }) to predict whether this label model accepts an input sequence of up to 512 tokens, 𝑙𝑛𝑒𝑤 is positive or not for the document 𝑑1 . therefore extra tokens were truncated. To make the task meaningful, we also add negatives Concerning probing tasks, each MeSH-to-MeSH pair to both train and test datasets. However, in the case of requires a gold distance (see section 3.2). For the MeSH classification, using all the negatives is not pos- Shortest-Path Probe they are computed using the sible, for two reasons: (i) scalability problem: there are Floyd-Warshall algorithm, while for the Common- more than 9,000 labels and 27,321 documents, that results Ancestors Probe, they were deducted from Tree Num- in hundreds of millions of combinations. (ii) data balanc- bers. Optimizer of the probe task is AdamW, with a ing problem: 9,000 negatives for 13 positives on avg. We 2.5E-5 learning rate. We also only focused on the “N”, therefore use following configurations: “E”, “C”, “D” and “G” branches of the MeSH hiearchy, corresponding resp. to “Health Care Category”, “Analyti- • B a l a n c e d : one random negative pair is added for cal, Diagnostic and Therapeutic Techniques and Equip- each positive pair, to ensure a balanced distribu- ment Category”, “Diseases Category”, “Chemicals and tion. So, given a document, we would always Drugs Category” and “Phenomena and Processes Cate- have the same number of positive and negative gory” (they are the most representative of the dataset). pairs. The negatives are sampled from all the From all possible MeSH-to-MeSH pairs, we have ran- possible negatives, based on the MeSH terms dis- domly selected 10% of them to reduce computation time. tributions to ensure that, given a MeSH term, we Validation and evaluation are performed on 30% of the would also have the same number of positive and MeSH descriptors, that we held out from probe training. negative pairs. • S i b l i n g s : This configuration is only used in eval- uation and aims at better disentangle errors due 5. Results and discussion to “incompleteness of MeSH annotations” from real indexing errors. In this configuration, sib- We present in this section the results in zero-shot and lings according to MeSH hierarchy of the pos- few-shot on both MeSH descriptors and LitCovid labels itive pairs are added with negative labels. We from our two models BioBERT STL and MTL. We then F1-score (std) Precision Recall F1-score (std) Precision Recall Balanced ZSC Balanced BioBERT STL 0.894 (0.001) 0.875 0.913 0.760 (0.004) 0.849 0.688 BioBERT MTL 0.873 (0.003) 0.868 0.878 0.754 (0.006) 0.856 0.674 Siblings ZSC Siblings BioBERT STL 0.390 (0.002) 0.300 0.562 0.281 (0.005) 0.558 0.470 BioBERT MTL 0.402 (0.005) 0.312 0.570 0.285 (0.005) 0.594 0.455 Table 1 Results on Medline/MeSH with different evaluation configurations. Training has been done on the b a l a n c e d dataset, while here we test on both b a l a n c e d and s i b l i n g s dataset. In z s c b a l a n c e d , all MeSH descriptors are zero-shot, while in z s c s i b l i n g s it is a generalized zero-shot setting (mixture of both zero-shot and non zero-shot MeSH descriptors). discuss the results of the probing tasks, the architectures, the quality of annotations and also how we approached the problem of large scale zero-shot classification. 5.1. Zero/Few-shot classification Medline/MeSH. The models have been trained on the b a l a n c e d configuration, and then tested on b a l a n c e d and s i b l i n g s . Table 1 compares results on those different test set configurations both in non zero-shot and zero- shot settings. Note, that as highlighted in section 4.1, the s i b l i n g s configuration is more difficult than the b a l a n c e d one, which explains a high gap between the F1-score from the two configurations. This is mainly due to a lower precision as the model tends to wrongly predict the siblings from the positive MeSH terms as positives ex- amples (while we consider them as negatives in s i b l i n g s Figure 4: F1-score depending on the number of occurrences of settings). On the b a l a n c e d test set, BioBERT STL has the MeSH term in the train set. Evaluation dataset is s i b l i n g s , a better F1-score both in zero-shot and non zero-shot training dataset is b a l a n c e d . settings, while on the s i b l i n g s one, the BioBERT MTL model performs better. This difference is due to the high precision of BioBERT MTL model in both settings. More precisely, the BioBERT MTL model seems to be better 5 shows F1-scores with respect to the deepness of the on difficult pairs, like in the s i b l i n g s settings where you MeSH descriptors both for BioBERT STL and MTL mod- may have very close negative pairs (for example Breast els. As shown in the figure, F1-score tends to decrease Cyst positive and Bronchogenic Cyst negative). for more general terms (first 4 levels of hierarchy), but Figure 4 plots the F1-score with respect to the number increases for more specific terms (after 4th level of hierar- of occurrences of the MeSH descriptors in the train set chy). We believe this could be due to the incompleteness thus allowing to evaluate few-shot learning quality. First of annotations used during training of the MeSH descrip- we note a clear increase of F1-score with the number tors. Recall that training is performed in the b a l a n c e d of occurrences of the MeSH terms in the dataset (up to setting, and therefore ancestor MeSH descriptors are not 0.7) on both models. This indicates, as one would expect, always explicitly annotated, which could result in low that the models are really struggling with difficult pairs performance on high hierarchy levels. This graph is also that contains rare MeSH descriptors. In comparison with difficult to interpret since some branches from the MeSH Table 1, the F1-score in zero-shot on the balanced pairs hierarchy are deeper than others, therefore “specificity” is of 0.76, while the F1-score for the rare MeSH terms is of term with respect to its absolute depth may be differ- much lower. We also note, that the BioBERT MTL model ent; the only information on depth is that deeper MeSH allows to slightly improve performance in low resources terms are in general more specific one. settings (for the terms occurring less then 10 times in the training data: 1 and (1, 10] bins in the figure). LitCovid. Table 2 reports results on LitCovid dataset Concerning the MeSH descriptors themselves, Figure in zero-shot setting for the representations obtained with ZSC LitCovid F1-score Precision Recall Baseline IsIn 0.329 0.520 0.241 Baseline Cos Sim 0.308 0.228 0.471 BioBERT STL 0.512 0.444 0.604 BioBERT MTL 0.465 0.401 0.553 Table 2 Results in zero-shot on LitCovid, where the model has been trained on the MeSH descriptors (b a l a n c e d ). F1-score is the best from three runs. are based on the mean error on the predicted distance with respect to the gold distance (shortest path length between two MeSH terms), while for the Common- Ancestors Probe we report F1-score on binary tasks for each 𝑘 (evaluating whether the two MeSH have at Figure 5: Comparison of F1-score depending on the deep- least 𝑘 common ancestors). ness of the MeSH descriptors on both BioBERT STL and MTL models. Test set is s i b l i n g s , and depth is computed from the MeSH representations. We use C L S token of respec- length of the MeSH descriptor Tree Numbers (average depth tive models as representation of different MeSH in our when MeSH terms have multiple Tree Numbers). experiments. We have compared this representation to both average pooling over the MeSH descriptors tokens and max pooling in our preliminary experiments, and STL and MTL models. On LitCovid, the STL model is bet- observed that C L S token was leading to the best perfor- ter. We believe this may be due to LitCovid categories be- mance on the probe tasks. ing very general in comparison to the MeSH descriptors, As a baseline, we give in table 3 two other representa- therefore the MTL model could not take advantage of its tions: BioBERT vanilla and Random. In BioBERT vanilla, better precision on more specialised pairs. In addition, representations are an average pooling of the MeSH out- as previously, this could be due to the incompleteness put tokens provided by the BioBERT pretrained model of MeSH annotations, where only most specific MeSH without any finetuning (avg pooling was in this only case terms are present in training data, while LitCovid relies better than the C L S token3 ), and for Random, MeSH rep- on more generic labels. resentations are random representations sampled from a We report two simple baselines for LitCovid dataset normal distribution. in Table 2: • Baseline IsIn where an abstract is associated with Comparison STL/MTL. Table 3 indicates that both a label that appears in the abstract itself (both STL and MTL model encode hierarchical structure of lower-cased); MeSH terms better than random baseline, but also bet- • Baseline Cos Sim, where we take the C L S token ter that BioBERT vanilla baseline. More specifically, the representations of all labels (through a vanilla Common-Ancestors Probe implies that between two BioBERT), same for all abstracts, and then com- MeSH descriptors from the same categories, we have en- pute the cosine similarity between each pair, with coded a common base, and there is a projection where a threshold defined on the validation set. MeSH descriptors from the same categories are closer to each others. Concerning the Shortest-Path Probe, re- We note that both BioBERT-based models perform signif- sults shows that there is also a projection where distances icantly better compared to those naive baselines, which (as shortest paths in the hierarchy graph) are respected. indicates that the models are able to exploit the semantics From the results, we also see that the additional task, of the label to some extent, and goes beyond simple label MTL model with the decoder block, is able to encode lookup in the abstract. We also see that the increase of even more hierarchical information in the C L S token, and F1-score is mainly due to a better recall which implies may be a hit to the better precision in zero-shot and better coverage of our models. few-shot results. Also, it is interesting to see that in the BioBERT vanilla model, there is already some good knowl- 5.2. Probing hierarchy knowledge 3 possibly because in BioBERT vanilla model C L S token represen- Finally, Table 3 reports the results of probing the learnt tations has not been finetuned for any task as opposed to our learnt representations. Results for the Shortest-Path Probe models. Shortest-Path Probe Distance Error (std) as we need to create too many pairs for a given docu- BioBERT vanilla 1.494 (-) ment. Our approach was to work on a balanced subset Random 2.597 (-) of the possible pairs or a coherent one (resp. balanced BioBERT STL 1.462 (0.053) and siblings configurations) for training and evaluation, BioBERT MTL 1.323 (0.013) however, this technique does not adapt to real world Common-Ancestors Probe F1 (k=1) k=2 k=3 applications. BioBERT vanilla 0.855 0.521 0.538 An interesting future direction could be combining BioBERT STL 0.864 0.546 0.541 our “open input” architecture with the high-coverage BioBERT MTL 0.933 0.659 0.576 retrieval-like step which would first pre-select a subset of possible MeSH terms, and therefore restrict the search Table 3 space for the second “open input” classification step. For Results of the probe tasks. For comparison of the shortest- example of the first step, a ColBERT model could compute path task, the avg MeSH to MeSH distance is 10.033, with a std of 3.016. For the common-ancestors task, k is the representations of abstracts and classes independently number of common ancestors. instead of creating representations of pairs, and so reduce computational cost. Another possibility could be to use simple algorithms like BM25. edge about this hierarchical structure, which makes sense Other techniques in metric learning also exist, like as this hierarchy is constructed on the semantics of the triplet loss learning or hinge loss. Using triplets with biomedical terms. “hard negatives” may help to learn a better representation space. 5.3. Limitations and possible future directions 6. Conclusion Multi-Tasks-Learning. When dealing with MTL, the In this work, we try to address the problem of zero-shot main difficulty comes from convergence speed of the classification that we defined as an open-input problem. different losses. In our framework, the main loss (clas- We compare a simple BioBERT model with a multi-tasks sification loss) converges faster than the secondary loss learning architecture that includes hierarchical seman- (decoder loss), so we are not able to take full advantage tic knowledge. In zero-shot and few-shot settings, the of the decoder architecture. In a perfect scenario, the multi-tasks framework does not increase performances main task should be the harder one, but here it was not significantly. Still, we observe good results on precision the case, so we were forced to stop training earlier even and on structural probing tasks, which implies that the when using different coefficients and learning rates for addition of the seq2seq task has some beneficial effect in the two losses. Another possible future direction to ex- the ability of trained models to capture semantics. In par- plore is to start training the decoder block before training ticular, the model is able to build a representation space the classification layer. where MeSH descriptors that have common ancestors are closer to each other, and where the overall hierarchi- Annotations. When dealing with transfer learning cal organisation of the MeSH is respected. It would be across different datasets, the question of the quality of interesting to further investigate additional tasks to take the annotation needs to be taken into account. Different even better advantage of hierarchical knowledge encoded annotation systems (even when documents are manually in medical terminologies, and thus improve quality and annotated) may have labels that have different coverage, robustness of models representations. and overlapping, which adds some bias in results. As an example, when we train our model on the Medline annotations and then test in zero-shot on LitCovid labels, References results are difficult to interpret, because the scale and [1] F. B. Rogers, Medical subject headings., Bulletin of the coverage is completely different. [21] have studied the Medical Library Association 51 (1963) 114–116. the semantic interoperability of different biomedical an- [2] Y. Mao, Z. lu, Mesh now: Automatic mesh index- notation tools across multiple countries and databases, ing at pubmed scale via learning to rank, Journal and they show that this was a real issue, that needs to be of Biomedical Semantics 8 (2017) 15. doi:1 0 . 1 1 8 6 / considered when dealing with such terminologies. s13326-017-0123-3. [3] N. Mylonas, S. Karlos, G. Tsoumakas, Zero-shot Large scale Zero-shot. “Open input” architectures classification of biomedical articles with emerg- are not adapted to very large scale zero-shot problems, ing mesh descriptors, in: 11th Hellenic Confer- ence on Artificial Intelligence, SETN 2020, Asso- ciation for Computing Machinery, New York, NY, ings of the 2019 Conference of the North American USA, 2020, p. 175–184. URL: https://doi.org/10.1145/ Chapter of the Association for Computational Lin- 3411408.3411414. doi:1 0 . 1 1 4 5 / 3 4 1 1 4 0 8 . 3 4 1 1 4 1 4 . guistics: Human Language Technologies, Volume [4] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. 1 (Long and Short Papers), Association for Compu- So, J. Kang, Biobert: a pre-trained biomedi- tational Linguistics, Minneapolis, Minnesota, 2019, cal language representation model for biomed- pp. 4129–4138. URL: https://aclanthology.org/N19- ical text mining, Bioinformatics (2019). URL: 1419. doi:1 0 . 1 8 6 5 3 / v 1 / N 1 9 - 1 4 1 9 . http://dx.doi.org/10.1093/bioinformatics/btz682. [17] T. Gong, T. Lee, C. Stephenson, V. Renduchin- doi:1 0 . 1 0 9 3 / b i o i n f o r m a t i c s / b t z 6 8 2 . tala, S. Padhy, A. Ndirango, G. Keskin, O. H. Eli- [5] Q. Jin, B. Dhingra, W. W. Cohen, X. Lu, Prob- bol, A comparison of loss weighting strategies ing biomedical embeddings from language models, for multi task learning in deep neural networks, NAACL HLT 2019 (2019) 82. IEEE Access 7 (2019) 141627–141632. doi:1 0 . 1 1 0 9 / [6] W. Wang, V. W. Zheng, H. Yu, C. Miao, A survey ACCESS.2019.2943604. of zero-shot learning: Settings, methods, and appli- [18] A. Kendall, Y. Gal, R. Cipolla, Multi-task learning us- cations, ACM Transactions on Intelligent Systems ing uncertainty to weigh losses for scene geometry and Technology (TIST) 10 (2019) 1–37. and semantics, 2018. a r X i v : 1 7 0 5 . 0 7 1 1 5 . [7] J. Chen, Y. Geng, Z. Chen, I. Horrocks, J. Z. [19] Q. Chen, A. Allot, Z. Lu, Keep up with the latest Pan, H. Chen, Knowledge-aware zero-shot learn- coronavirus research, Nature 579 (2020) 193. URL: ing: Survey and perspective, arXiv preprint https://www.ncbi.nlm.nih.gov/pubmed/32157233. arXiv:2103.00070 (2021). doi:1 0 . 1 0 3 8 / d 4 1 5 8 6 - 0 2 0 - 0 0 6 9 4 - 1 . [8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: [20] Q. Chen, A. Allot, Z. Lu, Litcovid: an open database Pre-training of deep bidirectional transformers for of covid-19 literature, Nucleic Acids Research language understanding, 2019. a r X i v : 1 8 1 0 . 0 4 8 0 5 . (2020). [9] I. Chalkidis, M. Fergadiotis, S. Kotitsas, P. Malaka- [21] J. A. Miñarro-Giménez, R. Cornet, M. Jaulent, siotis, N. Aletras, I. Androutsopoulos, An empir- H. Dewenter, S. Thun, K. R. Gøeg, D. Karlsson, ical study on large-scale multi-label text classifi- S. Schulz, Quantitative analysis of manual anno- cation including few and zero-shot labels, 2020. tation of clinical text samples, International Jour- arXiv:2010.01653. nal of Medical Informatics 123 (2019) 37–48. URL: [10] A. Rios, R. Kavuluru, Few-shot and zero-shot multi- https://www.sciencedirect.com/science/article/pii/ label learning for structured label spaces, in: Pro- S1386505618305446. doi:h t t p s : / / d o i . o r g / 1 0 . 1 0 1 6 / ceedings of the 2018 Conference on Empirical Meth- j.ijmedinf.2018.12.011. ods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 3132–3142. URL: https://aclanthology.org/ D18-1352. doi:1 0 . 1 8 6 5 3 / v 1 / D 1 8 - 1 3 5 2 . [11] J. Wohlwend, E. R. Elenberg, S. Altschul, S. Henry, T. Lei, Metric learning for dynamic text classifica- tion, 2019. a r X i v : 1 9 1 1 . 0 1 0 2 6 . [12] J. Risch, S. Garda, R. Krestel, Hierarchical document classification as a sequence generation task, 2020. doi:1 0 . 1 1 4 5 / 3 3 8 3 5 8 3 . 3 3 9 8 5 3 8 . [13] A. Conneau, G. Kruszewski, G. Lample, L. Barrault, M. Baroni, What you can cram into a single vector: Probing sentence embeddings for linguistic proper- ties, 2018. a r X i v : 1 8 0 5 . 0 1 0 7 0 . [14] I. Tenney, P. Xia, B. Chen, A. Wang, A. Poliak, R. T. McCoy, N. Kim, B. V. Durme, S. R. Bowman, D. Das, E. Pavlick, What do you learn from context? prob- ing for sentence structure in contextualized word representations, 2019. a r X i v : 1 9 0 5 . 0 6 3 1 6 . [15] I. Alghanmi, L. Espinosa-Anke, S. Schockaert, Prob- ing pre-trained language models for disease knowl- edge, 2021. a r X i v : 2 1 0 6 . 0 7 2 8 5 . [16] J. Hewitt, C. D. Manning, A structural probe for finding syntax in word representations, in: Proceed-