Zero-Shot and Few-Shot Classification of Biomedical
Articles in Context of the COVID-19 Pandemic
Simon Lupart1 , Benoit Favre2 , Vassilina Nikoulina1 and Salah Ait-Mokhtar1
1
    Naver Labs Europe, 6 Chem. de Maupertuis, 38240, Meylan, France
2
    Aix Marseille University, CNRS, LIS / Marseille, France


                                             Abstract
                                             MeSH (Medical Subject Headings) is a large thesaurus created by the National Library of Medicine and used for fine-grained
                                             indexing of publications in the biomedical domain. In the context of the COVID-19 pandemic, MeSH descriptors have emerged
                                             in relation to articles published on the corresponding topic. Zero-shot classification is an adequate response for timely
                                             labeling of the stream of papers with MeSH categories. In this work, we hypothesise that rich semantic information available
                                             in MeSH has potential to improve BioBERT representations and make them more suitable for zero-shot/few-shot tasks. We
                                             frame the problem as determining if MeSH term definitions, concatenated with paper abstracts are valid instances or not, and
                                             leverage multi-task learning to induce the MeSH hierarchy in the representations thanks to a seq2seq task. Results establish a
                                             baseline on the MedLine and LitCovid datasets, and probing shows that the resulting representations convey the hierarchical
                                             relations present in MeSH.

                                             Keywords
                                             Text Classification, Transfer Domain Adaptation Multi-Task Learning, Healthcare Medicine & Wellness


1. Introduction
With the outbreak of the COVID-19 disease, the biomedi-
cal domain has evolved: new concepts have emerged, and
old ones have been revised. In that context, scientific pa-
pers are typically manually or automatically labelled with
MeSH terms, Medical Subject Headings [1], which helps
routing them to the best target audience. It is crucial for
the community to be able to react swiftly to events like
pandemics, and manual efforts to annotate large numbers
of publications may not be timely. To automate that task,
it is difficult to use typical classification methods because
of the lack of data for some classes, we therefore consider
this problem as a zero-shot/few-shot documents classifi-                                                              Figure 1: Distribution of the MeSH descriptors: 2853 MeSH
cation problem. Formally, in zero-shot learning, at test                                                              terms appear only once in the train dataset, 3565 between
time a learner (the model) observes documents of classes                                                              2 and 10 times, while a minority of them appear very fre-
                                                                                                                      quently (out of 19,125 annotated documents with 8,140 MeSH
that were not seen during training, and respectively in
                                                                                                                      descriptors).
few-shot learning, the model will have seen only a small
number of documents with these classes. Class distri-
butions from our medline-derived dataset are plotted in
Figure 1. As shown on the histogram, lots of classes                                                                  single MeSH descriptors [3].
are annotated in only one document, which makes them                                                                     In this work we rely on BioBERT [4] to extract repre-
difficult to learn.                                                                                                   sentations from paper abstracts and classify them. Such
    Another obstacle (independent from the pandemic) is                                                               model is pretrained with masked language modeling ob-
the scale of the MeSH thesaurus, as there are thousand                                                                jectives on data from the biomedical domain, and we
of MeSH descriptors. State-of-the-art on MeSH classi-                                                                 assume that BioBERT encodes some semantic knowledge
fication thus uses IR techniques [2], or focuses on only                                                              related to the biomedical domain. However, it has been
                                                                                                                      shown that this pretraining might not be optimal for
SDU@AAAI22: AAAI-22 Workshop on Scientific Document
Understanding, March 1, 2022, Online                                                                                  tasks such as NER or NLI [5].
Envelope-Open simon.lupart@naverlabs.com (S. Lupart); benoit.favre@lis-lab.fr                                            We formulate the zero-shot task as an “open input”
(B. Favre); vassilina.nikoulina@naverlabs.com (V. Nikoulina);                                                         problem where we take both the class and the text as
salah.ait-mokhtar@naverlabs.com (S. Ait-Mokhtar)                                                                      an input, and output a matching score between the two.
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative
                                       Commons License Attribution 4.0 International (CC BY 4.0).                     The motivation behind this formulation is that the model
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
                                                                efit from very large datasets they can be trained on in
                                                                a self-supervised way. Such pretraining allows them
                                                                to learn rich semantic representation of the text, and
                                                                perform knowledge transfer on other lower-resourced
                                                                tasks. Those pre-trained models can be used in a zero-
                                                                shot setting, by creating representations for the given
                                                                document and each of the the different classes, and then
                                                                computing similarity scores based on those representa-
                                                                tions. Chalkidis et al. [9] proposed for example to com-
                                                                pute the similarity score with an attention mechanism
                                                                between classes and documents representations. Rios and
Figure 2: Example of the MeSH hierarchy. COVID-19 and Kavuluru [10] proposed in addition to the attention mech-
Bronchitis have three common ancestors (Diseases category, anism to include hierarchical knowledge using GCNN,
Infections and Respiratory Tract infections) and then split in
                                                                but they do not handle the case where the hierarchy is
distinct MeSH descriptors. The hierarchy is all included in the
Tree Numbers of the MeSH terms.
                                                                only available during training. Wohlwend et al. [11] also
                                                                worked on the representation space using prototypical
                                                                network and hyperbolic distances and they showed that
                                                                there was still possible improvements in metric learning.
would learn to use the semantics of the class labels, and
thus will be able to extend the semantic knowledge en-
                                                                Fine-grained biomedical classification. BioASQ
coded by pretrained model (eg. BioBERT) to new classes.
                                                                challenge is one of the reference on fine-grained clas-
Therefore, our assumption is that those models host and
                                                                sification of biomedical articles; however the challenge
can make use of a good representation of the semantics
                                                                does not focus on zero-shot adaptation, which is the sce-
underlying the MeSH hierarchy, including unobserved
                                                                nario we consider in this work. Mylonas et al. [3] have
terms.
                                                                tried to perform zero-shot classification across MeSH
   In order to improve the semantic representations of
                                                                descriptors, but their testing settings considered only
pretrained model, we propose a multi-task training frame-
                                                                a small number of MeSH descriptors. In our work we
work, where an additional decoder block predicts the
                                                                try to perform a larger scale evaluation in context of
position in the hierarchy of the input MeSH term. MeSH
                                                                the pandemic. Finally, [12] proposed an architecture for
descriptors have a position in the hierarchy that is de-
                                                                hierarchical classification tasks that is able to learn the
fined by their Tree Numbers (see Figure 2), so the goal
                                                                hierarchy by generating the sequence from the hierarchy
of this secondary module would be to generate those
                                                                tree (using an encoder/decoder architecture). Our work
Tree Numbers during training. Model learnt with this
                                                                considers similar architecture in a zero-shot scenario.
additional task should better encode MeSH hierarchical
structure and hopefully improve zero-shot and few-shot
capacities of thus learnt representations. Enforcing that Probing. Probing models [13, 14] are lightweight clas-
semantic knowledge is embedded in the model also guar- sifiers plugged on top of pretrained representations. They
antees a degree of explainability, an important feature in allow to assess the amount of “knowledge” encoded in the
the medical domain.                                             pretrained representations. Alghanmi et al. [15], Jin et al.
   The main findings of the work are: (a) Our multi-task [5] introduced frameworks in the biomedical domain for
framework improves precision on some datasets, and disease knowledge evaluation focusing on relation types
thus the F1-score, but it is not systematic. (b) Probing (Symptom-Disease, Test-Disease, etc.), while we are will-
tasks show that performance increases are directly linked ing to assess how well a hierarchical structure is encoded
to a better knowledge of the MeSH hierarchy. Still, in- in representations. We rely on the structural probing
cluding hierarchical information on a large scale dataset framework [16] that we compare against the hierarchical
(Medline) is difficult, especially in few-shot and zero-shot structure encoded by MeSH thesaurus.
settings.
                                                               3. Proposed approach
2. Related work                                                First, we will explain the architectures we explored to
Zero-shot classification. There is a large literature          address the zero-shot classification problem, and more
on zero-shot learning, which consists in classifying           precisely the multi-task learning framework. Then, we
samples with labels not seen in training [6, 7]. Pre-          will focus on the design of the probing tasks, that we used
trained models such as BERT [8] or BioBERT [4] are             to analyse to what extent the hierarchical knowledge was
central to zero-shot learning in NLP. These models ben-        encoded by the different representations.
Figure 3: Architecture BioBERT MTL. The encoder (blue) creates a representation of the MeSH, then the decoder (red)
generates the Tree Number from this representation. GRU cells are used in the decoder as well as an attention mechanism to
better handle long sequences. The binary output (green) is the matching score.


3.1. Zero-shot Architecture                                                            (768,). On line (4), the + operator corresponds to a sum of
                                                                                       the two vectors (both of shape (768,)) in each of their di-
BioBERT Single-Task Learning (STL). The first
                                                                                       mensions. For the word generation, 𝑜𝑢𝑡𝑗+1 is then passed
model is a BioBERT encoder, followed by a dense layer
                                                                                       through a dense layer and a logsoftmax function.
on the [ C L S ] token. Input of BioBERT is composed
                                                                                           Note that 𝑏𝑒𝑟𝑡_ℎ is formed from the output tokens of
of the MeSH term, the MeSH description and a doc-
                                                                                       BioBERT corresponding to the MeSH description (by
ument abstract: [ C L S ] M e S H t e r m : M e S H d e s c r i p t i o n
                                                                                       applying the MASK to all other tokens). We also apply
[ S E P ] A b s t r a c t , and output is a single neuron, that goes
                                                                                       “teacher forcing”, to reduce error accumulation.
through a sigmoid activation function.
                                                                                           The original problem is thus transformed in a multi-
    As an example, input for the MeSH term Infections
                                                                                       tasks problem, where the two losses (binary cross en-
is [ C L S ] I n f e c t i o n s : I n v a s i o n o f t h e h o s t o r g a n i s m
                                                                                       tropy and negative log likelihood losses) are then jointly
by microorganisms or their toxins or by parasites
                                                                                       learned:
that can cause pathological conditions or diseases.
                                                                                                            1          1
[SEP] Abstract.                                                                                 𝑙𝑜𝑠𝑠𝑡𝑜𝑡 = 2 𝑙𝑜𝑠𝑠1 + 2 𝑙𝑜𝑠𝑠2 + log(𝜎1 𝜎2 )              (6)
                                                                                                          2𝜎1         2𝜎2
BioBERT Multi-Tasks Learning (MTL). The second                                         where both 𝜎1 and 𝜎2 are learnable parameters included
architecture is similar to the BioBERT STL, but in ad-                                 in the models parameters [17, 18], to allow the model
dition to the binary classification task it learns simulta-                            to balance between the binary and tree number gener-
neously an additional task of MeSH term hierarchical                                   ation losses. The last regularization term is only here
position generation. The motivation behind this addi-                                  to prevent the model to learn the naive solution of just
tional task is that the learnt representations would better                            increasing 𝜎1 and 𝜎2 to reduce the loss.
encode hierarchy of MeSH terms and hopefully better                                        The output vocabulary of the decoder is composed of
deal with zero-shot classification or fine-grained classifi-                           the tree numbers tokens: A - Z letters, 0 0 - 9 9 digits, and 0 0 0 -
cation problems.                                                                       9 9 9 digits. All together, the vocabulary size is around
   Figure 3 describes the architecture of this model. It                               1100, on which we apply an embedding layer to trans-
uses an additional decoder block, as a secondary task to                               form discrete tree numbers to a continuous embedding
predict the tree number of the given input. The gener-                                 space. As this vocabulary is completely new, embedding
ation step 𝑗 + 1 is defined as (without considering the                                is learnt from scratch using back-propagation. Note that
batch size):                                                                           for MeSH descriptors that have multiple tree numbers,
                                                                                       we just duplicate the inputs to learn the multiple posi-
                   𝑎𝑡𝑡𝑗 = 𝑏𝑒𝑟𝑡_ℎ × ℎ𝑗                                          (1)     tions in the hierarchy.
                                                                                           In both architectures, the full set of parameters is
                   ̂𝑗 = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥(𝑎𝑡𝑡𝑗 )
                   𝑎𝑡𝑡                                                         (2)
                                                                                       updated during training. Thus, each model provides
                                               𝑇                                       new representations of the input text and labels, that
                                   ̂𝑗 × 𝑏𝑒𝑟𝑡_ℎ
                   𝑎𝑡𝑡𝑛_𝑎𝑝𝑝𝑙𝑖𝑒𝑑𝑗 = 𝑎𝑡𝑡                                         (3)
                                                                                       are further evaluated through zero-shot classification or
                   𝑖𝑛𝑝𝑢𝑡𝑗 = 𝑒𝑚𝑏𝑒𝑑𝑗 + 𝑎𝑡𝑡𝑛_𝑎𝑝𝑝𝑙𝑖𝑒𝑑𝑗                             (4)     through probing.
                   ℎ𝑗+1 , 𝑜𝑢𝑡𝑗+1 = 𝐺𝑅𝑈 (ℎ𝑗 , 𝑖𝑛𝑝𝑢𝑡𝑗 )                          (5)
                                                                                       3.2. Hierarchical Probing Task
where 𝑏𝑒𝑟𝑡_ℎ is the output of BERT, of shape (512, 768),
ℎ𝑗 the hidden state of the GRU cell, of shape (768,), and To better understand the capacity of pretrained repre-
𝑒𝑚𝑏𝑒𝑑𝑗 the embedding of the current word, also of shape sentations to encode hierarchical relations of biomedical
terms, we considered two probing tasks, adapted from                        4. Experimental settings
[16]. The objective is to test whether the representations
learned by the model are linearly separable with regards                    In this section we present the datasets we used to train the
to the probing tasks.                                                       models and evaluate the corresponding representations.
   We define the probing task from a main model, taking                     We also explain how we construct a zero-shot dataset out
as input a MeSH descriptor m𝑖 and returning its internal                    of the Medline dataset with MeSH annotations.
representation h𝑖 . We then recall that it is possible to
define a scalar product h𝑇 𝐴h from any positive semi                        4.1. Datasets
definite symmetric matrix 𝐴 ∈ S𝑚×𝑚+ , and more generally
from any matrix 𝐵 ∈ R𝑘×𝑚 by taking 𝐴 = 𝐵𝑇 𝐵. Using the                      Medline/MeSH. Medline is the US National Library
metric distance corresponding to this scalar product we                     of Medical/Biomedical dataset2 , containing millions of
then define a distance from any matrix :                                    biomedical scientific documents, and built around the
                                                                            MeSH thesaurus (Medical Subject Heading). This the-
             𝑑𝐵 (h𝑖 , h𝑗 ) = (𝐵(h𝑖 − h𝑗 ))𝑇 (𝐵(h𝑖 − h𝑗 ))                   saurus contains about 30,000 MeSH descriptors, updated
                                                                            every year, and used for semantic indexing of the doc-
with h𝑖 and h𝑗 the representations of two MeSH descrip-                     uments. These MeSH terms also define a hierarchy:
tors (more details on representations in section 5.2). Our                  the first level separates the MeSH terms into 16 main
model has as parameter the matrix 𝐵, which is trained                       branches, then each MeSH term is the child of another
to reconstruct a gold distance from one MeSH term to                        more general MeSH term, and this over up to a depth
another. More specifically, the task aims to approximate                    of fifteen. The sequence of nodes traversed to reach a
by gradient descent :                                                       MeSH term from the root is called Tree Number.
                 𝑚𝑖𝑛 ∑ |𝑑𝑇 (h𝑖 , h𝑗 ) − 𝑑𝐵 (h𝑖 , h𝑗 )2 |                        An example of the hierarchy is shown in Figure 2 with
                   𝐵   𝑖,𝑗                                                  the two MeSH COVID-19 and Bronchitis. For example
                                                                            here, Covid-19 has the Tree Number C 0 1 . 7 4 8 . 2 1 4 (C be-
where 𝑑𝐵 is the predicted distance and 𝑑𝑇 the gold dis-                     ing the main branch Disease, then we have 3 sub-levels:
tance. We also note that, as in the original paper, we add                  C 0 1 for Infections, C 0 1 . 7 4 8 for Respiratory Tract Infections,
a squared term on the predicted distance. Concerning                        then C 0 1 . 7 4 8 . 2 1 4 for Covid-19).
the dimensions of 𝐴 and 𝐵, 𝑚 is the dimension of the rep-                       The majority of MeSH descriptors from the hierarchy
resentation space (same than h), and 𝑘 is the dimension                     have multiple Tree Numbers, so the hierarchy follows
of the linear transformation which we will take equal to                    a directed acyclic graph structure. Also, the annotation
512. We did not further experiment on the dimension                         of scientific documents with ancestors of a MeSH is not
of the linear transformation (see the original paper by                     always explicit. For example, a document can be indexed
[16] for discussion on both the squared distance and the                    with term Covid-19, but not necessarily with terms Infec-
linear transformation dimension).                                           tions or Respiratory Tract Infections.
                                                                                There are on average 13 annotated MeSH descriptors
Gold distance. The only difference with the original                        per document, where 2 or 3 will be annotated as major
paper is the definition of the gold distance. We have                       MeSH to indicate that the document deals more specifi-
evaluated two probes:                                                       cally with these topics. In our work, we use the whole
    1. Shortest-Path Probe: given two MeSH descrip-                         set of major and non-major MeSH descriptors. In addi-
       tors, we ask the model to predict the distance                       tion to the MeSH annotation and hierarchy, the Medline
       between the two MeSH terms, as the length of                         database provides a description for each MeSH term, used
       the shortest path in the graph defining the MeSH                     by our models as specified in section 3.1.
       hierarchy;
    2. Common-Ancestors Probe: model predicts                               LitCovid. LitCovid is a subset of the Medline database
       whether two MeSH descriptors have k common                           [19, 20], where extraction is done via PubMed (search
       ancestors. For this second task we thus define                       engine of Medline), using the keywords: “coronavirus”,
       multiples binary probe models that predict if the                    “ncov”, “cov”, “2019-nCoV”, “COVID-19” and “SARS-CoV-
       two MeSH terms have at least k common ances-                         2”. Using this subset of articles allow us to work more
       tors or not (for k between 1 and 3). In this partic-                 specifically on COVID-19 related articles, with also a sub-
       ular case of a binary probe task, we thus add a sig-                 set of 9,000 COVID-19 related MeSH descriptors (instead
       moid function on the predicted distance (where                       of the full set of MeSH descriptors). The LitCovid dataset
       the sigmoid function is not centered on zero, but
       on a positive constant, as distances are always                      dicting the number of common ancestors given two MeSH descrip-
                                                                            tors, but the regressor was unable to train from the representations,
       positive)1 .
                                                                            hence the use of binary tasks.
   1                                                                             2
       We have also tried to cast the probe as a regression directly pre-          https://www.ncbi.nlm.nih.gov/mesh/
also contains its own categorization, composed of only                          also add all the ancestors of the annotated MeSH
8 classes: Case report, Diagnosis, Forecasting, General,                        terms as positive labels to overcome annotation
Mechanism, Prevention, Transmission and Treatment,                              incompleteness problem stated above. Adding
with as for the MeSH a short description for each of                            the ancestors increases the size of the dataset by
them.                                                                           an important factor, so this is why this configura-
   All our experiments are made on the LitCovid dataset,                        tion is used for evaluation only.
with 27,321 articles (Train-Val-Test split: 19,125 / 2,732
/ 5,464) that have both LitCovid and MeSH annotations                     The choice of the negatives is crucial in metric learning,
(several of the 8 classes from LitCovid + avg 13.5 MeSH/ar-           and there have been lots of efforts given on developing
ticle out of around 9,000 COVID-19 related MeSH descrip-              techniques to find “hard negatives”. In our case, the
tors from Medline).                                                   S i b l i n g s configuration creates by its nature negatives
   Training and evaluation relies on MeSH annotations                 that are difficult to distinguish from actual positives. Also
(semantically richer), with results that reflects both few-           we made the choice to use a binary classification layer,
shot for low frequency terms, and zero-shot results for               but losses like hinge loss or triplet loss could have been
747 held-out MeSH descriptors. In a second step we also               interesting in this particular case.
evaluate on LitCovid categories to test a transfer learning               For LitCovid we consider all the {document, label} pairs
scenario where we change the categorization at test time.             since we do not have scaling problem (only 8 labels).


4.2. Evaluation                                                       4.3. Training parameters
We present in this section the adaptation of annotations              The losses (binary cross entropy for the binary task and
for the “open input” architectures. The objective is to               negative log-likelihood for the hierarchy generation task)
create pairs ({𝑙𝑎𝑏𝑒𝑙, 𝑑𝑜𝑐𝑢𝑚𝑒𝑛𝑡}, 𝑏𝑜𝑜𝑙𝑒𝑎𝑛) from the original           are optimised using the AdamW and Adam algorithm
annotation.                                                           with a learning rate of 2E-5 and 5E-4 respectively. Train-
                                                                      ing is done over 4 epochs, with a save every 0.25 epochs.
Zero-shot dataset creation. Inputs in zero-shot are                   Best model is selected based on the validation loss. We
different due to new class appearances: a document 𝑑1                 used a batch size of 16, and performed 3 runs for each
being associated with two labels (𝑙1 and 𝑙2 ) will thus be            model (see standard deviation in the section 5).
transformed into two inputs ({𝑙1 , 𝑑1 }, positive) and ({𝑙2 , 𝑑1 },       The BioBERT pretrained model we used was
positive). When a new label 𝑙𝑛𝑒𝑤 appears, it is enough to             m o n o l o g g / b i o b e r t _ v 1 . 1 _ p u b m e d from Hugging Face. This
create the input ({𝑙𝑛𝑒𝑤 , 𝑑1 }) to predict whether this label         model accepts an input sequence of up to 512 tokens,
𝑙𝑛𝑒𝑤 is positive or not for the document 𝑑1 .                         therefore extra tokens were truncated.
   To make the task meaningful, we also add negatives                     Concerning probing tasks, each MeSH-to-MeSH pair
to both train and test datasets. However, in the case of              requires a gold distance (see section 3.2). For the
MeSH classification, using all the negatives is not pos-              Shortest-Path Probe they are computed using the
sible, for two reasons: (i) scalability problem: there are            Floyd-Warshall algorithm, while for the Common-
more than 9,000 labels and 27,321 documents, that results             Ancestors Probe, they were deducted from Tree Num-
in hundreds of millions of combinations. (ii) data balanc-            bers. Optimizer of the probe task is AdamW, with a
ing problem: 9,000 negatives for 13 positives on avg. We              2.5E-5 learning rate. We also only focused on the “N”,
therefore use following configurations:                               “E”, “C”, “D” and “G” branches of the MeSH hiearchy,
                                                                      corresponding resp. to “Health Care Category”, “Analyti-
      • B a l a n c e d : one random negative pair is added for       cal, Diagnostic and Therapeutic Techniques and Equip-
        each positive pair, to ensure a balanced distribu-            ment Category”, “Diseases Category”, “Chemicals and
        tion. So, given a document, we would always                   Drugs Category” and “Phenomena and Processes Cate-
        have the same number of positive and negative                 gory” (they are the most representative of the dataset).
        pairs. The negatives are sampled from all the                 From all possible MeSH-to-MeSH pairs, we have ran-
        possible negatives, based on the MeSH terms dis-              domly selected 10% of them to reduce computation time.
        tributions to ensure that, given a MeSH term, we              Validation and evaluation are performed on 30% of the
        would also have the same number of positive and               MeSH descriptors, that we held out from probe training.
        negative pairs.
      • S i b l i n g s : This configuration is only used in eval-
        uation and aims at better disentangle errors due 5. Results and discussion
        to “incompleteness of MeSH annotations” from
        real indexing errors. In this configuration, sib- We present in this section the results in zero-shot and
        lings according to MeSH hierarchy of the pos- few-shot on both MeSH descriptors and LitCovid labels
        itive pairs are added with negative labels. We from our two models BioBERT STL and MTL. We then
                                          F1-score (std)        Precision        Recall      F1-score (std)        Precision       Recall
                                                           Balanced                                        ZSC Balanced
                   BioBERT STL             0.894 (0.001)          0.875          0.913        0.760 (0.004)          0.849          0.688
                   BioBERT MTL             0.873 (0.003)          0.868          0.878        0.754 (0.006)          0.856          0.674
                                                            Siblings                                        ZSC Siblings
                   BioBERT STL             0.390 (0.002)          0.300          0.562        0.281 (0.005)          0.558          0.470
                   BioBERT MTL             0.402 (0.005)          0.312          0.570        0.285 (0.005)          0.594          0.455
Table 1
Results on Medline/MeSH with different evaluation configurations. Training has been done on the b a l a n c e d dataset, while here
we test on both b a l a n c e d and s i b l i n g s dataset. In z s c b a l a n c e d , all MeSH descriptors are zero-shot, while in z s c s i b l i n g s it is
a generalized zero-shot setting (mixture of both zero-shot and non zero-shot MeSH descriptors).


discuss the results of the probing tasks, the architectures,
the quality of annotations and also how we approached
the problem of large scale zero-shot classification.

5.1. Zero/Few-shot classification
Medline/MeSH. The models have been trained on the
b a l a n c e d configuration, and then tested on b a l a n c e d and
s i b l i n g s . Table 1 compares results on those different
test set configurations both in non zero-shot and zero-
shot settings. Note, that as highlighted in section 4.1, the
s i b l i n g s configuration is more difficult than the b a l a n c e d
one, which explains a high gap between the F1-score
from the two configurations. This is mainly due to a
lower precision as the model tends to wrongly predict
the siblings from the positive MeSH terms as positives ex-
amples (while we consider them as negatives in s i b l i n g s
                                                                                 Figure 4: F1-score depending on the number of occurrences of
settings). On the b a l a n c e d test set, BioBERT STL has
                                                                                 the MeSH term in the train set. Evaluation dataset is s i b l i n g s ,
a better F1-score both in zero-shot and non zero-shot
                                                                                 training dataset is b a l a n c e d .
settings, while on the s i b l i n g s one, the BioBERT MTL
model performs better. This difference is due to the high
precision of BioBERT MTL model in both settings. More
precisely, the BioBERT MTL model seems to be better                              5 shows F1-scores with respect to the deepness of the
on difficult pairs, like in the s i b l i n g s settings where you               MeSH descriptors both for BioBERT STL and MTL mod-
may have very close negative pairs (for example Breast                           els. As shown in the figure, F1-score tends to decrease
Cyst positive and Bronchogenic Cyst negative).                                   for more general terms (first 4 levels of hierarchy), but
    Figure 4 plots the F1-score with respect to the number                       increases for more specific terms (after 4th level of hierar-
of occurrences of the MeSH descriptors in the train set                          chy). We believe this could be due to the incompleteness
thus allowing to evaluate few-shot learning quality. First                       of annotations used during training of the MeSH descrip-
we note a clear increase of F1-score with the number                             tors. Recall that training is performed in the b a l a n c e d
of occurrences of the MeSH terms in the dataset (up to                           setting, and therefore ancestor MeSH descriptors are not
0.7) on both models. This indicates, as one would expect,                        always explicitly annotated, which could result in low
that the models are really struggling with difficult pairs                       performance on high hierarchy levels. This graph is also
that contains rare MeSH descriptors. In comparison with                          difficult to interpret since some branches from the MeSH
Table 1, the F1-score in zero-shot on the balanced pairs                         hierarchy are deeper than others, therefore “specificity”
is of 0.76, while the F1-score for the rare MeSH terms is                        of term with respect to its absolute depth may be differ-
much lower. We also note, that the BioBERT MTL model                             ent; the only information on depth is that deeper MeSH
allows to slightly improve performance in low resources                          terms are in general more specific one.
settings (for the terms occurring less then 10 times in the
training data: 1 and (1, 10] bins in the figure).                                LitCovid. Table 2 reports results on LitCovid dataset
    Concerning the MeSH descriptors themselves, Figure                           in zero-shot setting for the representations obtained with
                                                                           ZSC LitCovid            F1-score      Precision      Recall
                                                                           Baseline IsIn             0.329         0.520        0.241
                                                                           Baseline Cos Sim          0.308         0.228        0.471
                                                                           BioBERT STL               0.512         0.444        0.604
                                                                           BioBERT MTL               0.465         0.401        0.553
                                                                       Table 2
                                                                       Results in zero-shot on LitCovid, where the model has been
                                                                       trained on the MeSH descriptors (b a l a n c e d ). F1-score is the
                                                                       best from three runs.


                                                        are based on the mean error on the predicted distance
                                                        with respect to the gold distance (shortest path length
                                                        between two MeSH terms), while for the Common-
                                                        Ancestors Probe we report F1-score on binary tasks
                                                        for each 𝑘 (evaluating whether the two MeSH have at
Figure 5: Comparison of F1-score depending on the deep- least 𝑘 common ancestors).
ness of the MeSH descriptors on both BioBERT STL and MTL
models. Test set is s i b l i n g s , and depth is computed from the   MeSH representations. We use C L S token of respec-
length of the MeSH descriptor Tree Numbers (average depth              tive models as representation of different MeSH in our
when MeSH terms have multiple Tree Numbers).                           experiments. We have compared this representation to
                                                                       both average pooling over the MeSH descriptors tokens
                                                                       and max pooling in our preliminary experiments, and
STL and MTL models. On LitCovid, the STL model is bet-                 observed that C L S token was leading to the best perfor-
ter. We believe this may be due to LitCovid categories be-             mance on the probe tasks.
ing very general in comparison to the MeSH descriptors,                   As a baseline, we give in table 3 two other representa-
therefore the MTL model could not take advantage of its                tions: BioBERT vanilla and Random. In BioBERT vanilla,
better precision on more specialised pairs. In addition,               representations are an average pooling of the MeSH out-
as previously, this could be due to the incompleteness                 put tokens provided by the BioBERT pretrained model
of MeSH annotations, where only most specific MeSH                     without any finetuning (avg pooling was in this only case
terms are present in training data, while LitCovid relies              better than the C L S token3 ), and for Random, MeSH rep-
on more generic labels.                                                resentations are random representations sampled from a
   We report two simple baselines for LitCovid dataset                 normal distribution.
in Table 2:
      • Baseline IsIn where an abstract is associated with  Comparison STL/MTL. Table 3 indicates that both
        a label that appears in the abstract itself (both   STL and MTL model encode hierarchical structure of
        lower-cased);                                       MeSH terms better than random baseline, but also bet-
      • Baseline Cos Sim, where we take the C L S token     ter that BioBERT vanilla baseline. More specifically, the
        representations of all labels (through a vanilla    Common-Ancestors Probe implies that between two
        BioBERT), same for all abstracts, and then com-     MeSH descriptors from the same categories, we have en-
        pute the cosine similarity between each pair, with  coded a common base, and there is a projection where
        a threshold defined on the validation set.          MeSH descriptors from the same categories are closer to
                                                            each others. Concerning the Shortest-Path Probe, re-
We note that both BioBERT-based models perform signif- sults shows that there is also a projection where distances
icantly better compared to those naive baselines, which (as shortest paths in the hierarchy graph) are respected.
indicates that the models are able to exploit the semantics From the results, we also see that the additional task,
of the label to some extent, and goes beyond simple label MTL model with the decoder block, is able to encode
lookup in the abstract. We also see that the increase of even more hierarchical information in the C L S token, and
F1-score is mainly due to a better recall which implies may be a hit to the better precision in zero-shot and
better coverage of our models.                              few-shot results. Also, it is interesting to see that in the
                                                            BioBERT vanilla model, there is already some good knowl-
5.2. Probing hierarchy knowledge
                                                                            3
                                                                              possibly because in BioBERT vanilla model C L S token represen-
Finally, Table 3 reports the results of probing the learnt             tations has not been finetuned for any task as opposed to our learnt
representations. Results for the Shortest-Path Probe                   models.
 Shortest-Path Probe                 Distance Error (std)       as we need to create too many pairs for a given docu-
 BioBERT vanilla                          1.494 (-)             ment. Our approach was to work on a balanced subset
 Random                                   2.597 (-)             of the possible pairs or a coherent one (resp. balanced
 BioBERT STL                            1.462 (0.053)           and siblings configurations) for training and evaluation,
 BioBERT MTL                            1.323 (0.013)           however, this technique does not adapt to real world
 Common-Ancestors Probe          F1 (k=1)     k=2       k=3     applications.
 BioBERT vanilla                   0.855     0.521      0.538      An interesting future direction could be combining
 BioBERT STL                       0.864     0.546      0.541   our “open input” architecture with the high-coverage
 BioBERT MTL                       0.933     0.659      0.576   retrieval-like step which would first pre-select a subset
                                                                of possible MeSH terms, and therefore restrict the search
Table 3
                                                                space for the second “open input” classification step. For
Results of the probe tasks. For comparison of the shortest-
                                                                example of the first step, a ColBERT model could compute
path task, the avg MeSH to MeSH distance is 10.033, with
a std of 3.016. For the common-ancestors task, k is the         representations of abstracts and classes independently
number of common ancestors.                                     instead of creating representations of pairs, and so reduce
                                                                computational cost. Another possibility could be to use
                                                                simple algorithms like BM25.
edge about this hierarchical structure, which makes sense          Other techniques in metric learning also exist, like
as this hierarchy is constructed on the semantics of the        triplet loss learning or hinge loss. Using triplets with
biomedical terms.                                               “hard negatives” may help to learn a better representation
                                                                space.

5.3. Limitations and possible future
     directions                                                 6. Conclusion
Multi-Tasks-Learning. When dealing with MTL, the           In this work, we try to address the problem of zero-shot
main difficulty comes from convergence speed of the        classification that we defined as an open-input problem.
different losses. In our framework, the main loss (clas-   We compare a simple BioBERT model with a multi-tasks
sification loss) converges faster than the secondary loss  learning architecture that includes hierarchical seman-
(decoder loss), so we are not able to take full advantage  tic knowledge. In zero-shot and few-shot settings, the
of the decoder architecture. In a perfect scenario, the    multi-tasks framework does not increase performances
main task should be the harder one, but here it was not    significantly. Still, we observe good results on precision
the case, so we were forced to stop training earlier even  and on structural probing tasks, which implies that the
when using different coefficients and learning rates for   addition of the seq2seq task has some beneficial effect in
the two losses. Another possible future direction to ex-   the ability of trained models to capture semantics. In par-
plore is to start training the decoder block before training
                                                           ticular, the model is able to build a representation space
the classification layer.                                  where MeSH descriptors that have common ancestors
                                                           are closer to each other, and where the overall hierarchi-
Annotations. When dealing with transfer learning cal organisation of the MeSH is respected. It would be
across different datasets, the question of the quality of interesting to further investigate additional tasks to take
the annotation needs to be taken into account. Different even better advantage of hierarchical knowledge encoded
annotation systems (even when documents are manually in medical terminologies, and thus improve quality and
annotated) may have labels that have different coverage, robustness of models representations.
and overlapping, which adds some bias in results. As
an example, when we train our model on the Medline
annotations and then test in zero-shot on LitCovid labels, References
results are difficult to interpret, because the scale and
                                                             [1] F. B. Rogers, Medical subject headings., Bulletin of
the coverage is completely different. [21] have studied
                                                                 the Medical Library Association 51 (1963) 114–116.
the semantic interoperability of different biomedical an-
                                                             [2] Y. Mao, Z. lu, Mesh now: Automatic mesh index-
notation tools across multiple countries and databases,
                                                                 ing at pubmed scale via learning to rank, Journal
and they show that this was a real issue, that needs to be
                                                                 of Biomedical Semantics 8 (2017) 15. doi:1 0 . 1 1 8 6 /
considered when dealing with such terminologies.
                                                                 s13326-017-0123-3.
                                                             [3] N. Mylonas, S. Karlos, G. Tsoumakas, Zero-shot
Large scale Zero-shot. “Open input” architectures                classification of biomedical articles with emerg-
are not adapted to very large scale zero-shot problems,          ing mesh descriptors, in: 11th Hellenic Confer-
                                                                 ence on Artificial Intelligence, SETN 2020, Asso-
     ciation for Computing Machinery, New York, NY,                              ings of the 2019 Conference of the North American
     USA, 2020, p. 175–184. URL: https://doi.org/10.1145/                        Chapter of the Association for Computational Lin-
     3411408.3411414. doi:1 0 . 1 1 4 5 / 3 4 1 1 4 0 8 . 3 4 1 1 4 1 4 .        guistics: Human Language Technologies, Volume
 [4] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H.                              1 (Long and Short Papers), Association for Compu-
     So, J. Kang, Biobert: a pre-trained biomedi-                                tational Linguistics, Minneapolis, Minnesota, 2019,
     cal language representation model for biomed-                               pp. 4129–4138. URL: https://aclanthology.org/N19-
     ical text mining, Bioinformatics (2019). URL:                               1419. doi:1 0 . 1 8 6 5 3 / v 1 / N 1 9 - 1 4 1 9 .
     http://dx.doi.org/10.1093/bioinformatics/btz682.                       [17] T. Gong, T. Lee, C. Stephenson, V. Renduchin-
     doi:1 0 . 1 0 9 3 / b i o i n f o r m a t i c s / b t z 6 8 2 .             tala, S. Padhy, A. Ndirango, G. Keskin, O. H. Eli-
 [5] Q. Jin, B. Dhingra, W. W. Cohen, X. Lu, Prob-                               bol, A comparison of loss weighting strategies
     ing biomedical embeddings from language models,                             for multi task learning in deep neural networks,
     NAACL HLT 2019 (2019) 82.                                                   IEEE Access 7 (2019) 141627–141632. doi:1 0 . 1 1 0 9 /
 [6] W. Wang, V. W. Zheng, H. Yu, C. Miao, A survey                              ACCESS.2019.2943604.
     of zero-shot learning: Settings, methods, and appli-                   [18] A. Kendall, Y. Gal, R. Cipolla, Multi-task learning us-
     cations, ACM Transactions on Intelligent Systems                            ing uncertainty to weigh losses for scene geometry
     and Technology (TIST) 10 (2019) 1–37.                                       and semantics, 2018. a r X i v : 1 7 0 5 . 0 7 1 1 5 .
 [7] J. Chen, Y. Geng, Z. Chen, I. Horrocks, J. Z.                          [19] Q. Chen, A. Allot, Z. Lu, Keep up with the latest
     Pan, H. Chen, Knowledge-aware zero-shot learn-                              coronavirus research, Nature 579 (2020) 193. URL:
     ing: Survey and perspective, arXiv preprint                                 https://www.ncbi.nlm.nih.gov/pubmed/32157233.
     arXiv:2103.00070 (2021).                                                    doi:1 0 . 1 0 3 8 / d 4 1 5 8 6 - 0 2 0 - 0 0 6 9 4 - 1 .
 [8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert:                    [20] Q. Chen, A. Allot, Z. Lu, Litcovid: an open database
     Pre-training of deep bidirectional transformers for                         of covid-19 literature, Nucleic Acids Research
     language understanding, 2019. a r X i v : 1 8 1 0 . 0 4 8 0 5 .             (2020).
 [9] I. Chalkidis, M. Fergadiotis, S. Kotitsas, P. Malaka-                  [21] J. A. Miñarro-Giménez, R. Cornet, M. Jaulent,
     siotis, N. Aletras, I. Androutsopoulos, An empir-                           H. Dewenter, S. Thun, K. R. Gøeg, D. Karlsson,
     ical study on large-scale multi-label text classifi-                        S. Schulz, Quantitative analysis of manual anno-
     cation including few and zero-shot labels, 2020.                            tation of clinical text samples, International Jour-
     arXiv:2010.01653.                                                           nal of Medical Informatics 123 (2019) 37–48. URL:
[10] A. Rios, R. Kavuluru, Few-shot and zero-shot multi-                         https://www.sciencedirect.com/science/article/pii/
     label learning for structured label spaces, in: Pro-                        S1386505618305446. doi:h t t p s : / / d o i . o r g / 1 0 . 1 0 1 6 /
     ceedings of the 2018 Conference on Empirical Meth-                          j.ijmedinf.2018.12.011.
     ods in Natural Language Processing, Association
     for Computational Linguistics, Brussels, Belgium,
     2018, pp. 3132–3142. URL: https://aclanthology.org/
     D18-1352. doi:1 0 . 1 8 6 5 3 / v 1 / D 1 8 - 1 3 5 2 .
[11] J. Wohlwend, E. R. Elenberg, S. Altschul, S. Henry,
     T. Lei, Metric learning for dynamic text classifica-
     tion, 2019. a r X i v : 1 9 1 1 . 0 1 0 2 6 .
[12] J. Risch, S. Garda, R. Krestel, Hierarchical document
     classification as a sequence generation task, 2020.
     doi:1 0 . 1 1 4 5 / 3 3 8 3 5 8 3 . 3 3 9 8 5 3 8 .
[13] A. Conneau, G. Kruszewski, G. Lample, L. Barrault,
     M. Baroni, What you can cram into a single vector:
     Probing sentence embeddings for linguistic proper-
     ties, 2018. a r X i v : 1 8 0 5 . 0 1 0 7 0 .
[14] I. Tenney, P. Xia, B. Chen, A. Wang, A. Poliak, R. T.
     McCoy, N. Kim, B. V. Durme, S. R. Bowman, D. Das,
     E. Pavlick, What do you learn from context? prob-
     ing for sentence structure in contextualized word
     representations, 2019. a r X i v : 1 9 0 5 . 0 6 3 1 6 .
[15] I. Alghanmi, L. Espinosa-Anke, S. Schockaert, Prob-
     ing pre-trained language models for disease knowl-
     edge, 2021. a r X i v : 2 1 0 6 . 0 7 2 8 5 .
[16] J. Hewitt, C. D. Manning, A structural probe for
     finding syntax in word representations, in: Proceed-