Universal Passage Weighting Mechanism (UPWM) in BioASQ 9b

Tiago Almeida1, Sérgio Matos1
1 University of Aveiro, IEETA

Abstract
This paper presents the participation of the University of Aveiro Biomedical Informatics and Technologies group (BIT) in the ninth edition of the BioASQ challenge for document and snippet retrieval. Our proposed systems follow a two-stage retrieval pipeline, similar to our BioASQ 8B submissions. However, we completely rebuilt our neural ranking model, maintaining the key ideas of its inception while improving its computational efficiency and adding interoperability with the transformer architecture. This resulted in a novel universal passage weighting mechanism (UPWM), which offers a more powerful way to derive a document relevance score from the combination of its sentences. More concretely, we built two variants that use our passage mechanism, the lightning UPWM and the transformer UPWM. The first uses a shallow interaction model and the second uses a BERT model. Additionally, we propose an effective pairwise joint training mechanism that combines document retrieval with snippet retrieval. Our systems achieved competitive results, scoring at or close to the top for all batches, with MAP values ranging from 0.3573 to 0.4236 in the document retrieval task. Although we only submitted to the snippet retrieval task in the last two batches, our system scored the top position in the last batch by using reciprocal rank fusion of pointwise and pairwise joint training approaches. Code to reproduce our submissions is available at https://github.com/bioinformatics-ua/BioASQ9b.

Keywords
Neural ranking, Sentence aggregation, Document Retrieval, Snippet Retrieval, BioASQ 9B

1. Introduction
The BioASQ [1] challenge is an annual competition on document classification, retrieval and question answering applied to the biomedical domain.
This competition is notable for continuously fostering research in automatic and intelligent retrieval systems over the biomedical literature. A recent example is the difficulty that researchers and biomedical experts have in searching the growing literature about the 2019 novel coronavirus, showing that new and more powerful methods are still needed. More concretely, the BioASQ challenge is divided into two tasks (A and B) that are isolated challenges. Task A addresses the biomedical annotation problem and is concerned with automatic document labelling with terms from the MeSH hierarchy. On the other hand, task B is further subdivided into phase A and phase B, the first addressing the information retrieval problem and the second addressing the answer extraction and answer generation problems.

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania

In more detail, in phase A the objective is to retrieve, from the PubMed baseline, the most relevant articles and/or document snippets that answer a given biomedical question written in English. In contrast, the objective in phase B is to extract or generate ideal answers from the information retrieved in phase A. This paper describes the participation of the Biomedical Informatics and Technologies (BIT) group of the University of Aveiro in the BioASQ task B phase A challenge. Our approach is a direct evolution of the document retrieval approach used in our previous participation [2].
This year we focused on a ground-up rebuild of the previous model, maintaining the key ideas of its inception while improving the computation flow for efficiency purposes and adding interoperability with the transformer architecture. In the end, we devised a universal passage weighting mechanism (UPWM) that offers a novel way to combine sentence relevance scores in order to derive the final document score. This mechanism tries to solve the usually overlooked problem of sentence aggregation. More precisely, it is usually unfeasible to feed an entire document to a neural IR model, so a common practice is to perform sentence splitting followed by simple sentence-level score aggregation [3, 4]. However, little attention has been given to this important step. We built two variants that use our passage mechanism, the lightning UPWM (L-UPWM) and the transformer UPWM (T-UPWM). The first uses a shallow interaction model, with only 597 trainable parameters, and the latter uses a BERT model. These were the cornerstone models of our submissions to this year's document and snippet retrieval challenge. Our submissions achieved top or close-to-top positions in all document retrieval batches. While we only participated in the last two batches for snippet retrieval, we still achieved a top scoring position in the last batch. Finally, our transformer UPWM consistently outperformed the lightning UPWM, bringing to evidence the efficacy vs efficiency trade-off. In the remainder of the paper we start by describing our universal passage weighting mechanism and detailing the two concrete implementations, the lightning UPWM and the transformer UPWM. We then describe the submissions and present and discuss the results obtained.

2. UPWM
The universal passage weighting mechanism (UPWM) is, as the name suggests, a high-level mechanism that provides other neural architectures with the capability to perform sentence score aggregation.
In other words, this mechanism can be viewed as a wrapper that encapsulates existing models, offering them the capability to better combine the individual sentence scores in order to derive the final document score. The UPWM is inspired by the human judgement process of selecting relevant articles, as proposed in previous works [5, 6]. More precisely, when searching for some information, a person usually scans a document looking for relevant sentences, signalled by the presence of keywords that are similar to their information need. The relevance of the entire document is then judged based only on the selected sentences. The purpose of the UPWM is therefore to mimic this judgement process and combine it with the neural relevance signal extracted by another model. This human judgement process can thus be viewed as a heuristic to guide neural models during the sentence aggregation step. Another aspect taken into consideration by the UPWM is that query terms are not equally relevant. Similarly, in the human judgement process, when scanning a document only a few keywords, the most representative terms, are considered. Based on this observation, we also compute the importance that each query term carries and weigh each sentence by this importance, thus boosting the sentences that contain the most important query terms.

Figure 1: A general overview of the universal passage weighting mechanism.

Figure 1 presents an overview of the main architectural concepts of the UPWM, which is divided into two major blocks, the sentence relevance block and the document score block. The idea is that the first block produces the individual sentence scores that carry the sentence relevance. The document score block then selects the most representative features from these sentence scores to derive the final document relevance with respect to the query.

2.1.
Sentence Relevance Block
The sentence relevance block has two parallel layers, the interaction model layer and the a priori layer. Both layers produce scores for each sentence, which are then linearly combined. The intuition here is that the interaction model layer focuses on thoroughly analysing the sentence information, while the a priori layer acts as a heuristic by mimicking the quick human judgement process of finding relevance. In other words, the main idea of the a priori layer is to produce sentence scores without thoroughly processing the sentences, hence its name, since it gives an “a priori” score without analysing the sentence in depth. The final sentence score then comes from the linear combination of the a priori scores with the interaction model scores. In this way, the a priori score acts as a gating mechanism, deciding which scores from the interaction model should be considered for the document ranking. Another interpretation can be gained by considering the types of signals that the two layers extract. Specifically, the interaction model layer focuses on analysing the meaning, context and sentence semantics, hence carrying a more semantic interaction signal, while the a priori layer only focuses on a more exact-matching type of signal, weighted by the importance of each individual query term.

2.1.1. Interaction Model Layer
The interaction model layer acts as a placeholder for neural models that are specialised in analysing the relevance between the query and a sentence. The models best suited for this layer are those that derive the final relevance score by taking into consideration the meaning, context and semantic relations between the query and the sentence. For example, ARC-II [7] is an earlier and simpler candidate. However, a more powerful transformer-based model, like BERT [8], can also be adopted as the interaction model in this architecture.

2.1.2.
A Priori Layer
The a priori layer implementation is summarised in Figure 2 and has two major steps, the exact matching signal extraction and the query term importance weighting.

Figure 2: The main concepts behind the inner workings of the a priori layer.

To better understand the layer's inner workings, let us first define a query as a sequence of tokens $q = \{u_0, u_1, ..., u_Q\}$, where $u_i$ is the $i$-th token of the query and $Q$ the size of the query; a document passage as $p = \{v_0, v_1, ..., v_S\}$, where $v_j$ is the $j$-th token of the passage and $S$ the size of the passage; and a document as a sequence of passages $D = \{p_0, p_1, ..., p_N\}$, where $N$ is the total number of passages in the document. To further simplify the explanation, let us consider that this layer is only applied to one query–passage pair, since the extension to $N$ pairs is trivially achieved by replicating this procedure for each pair.

Regarding the exact matching signal extraction, an interaction matrix is first created by applying an interaction function, $f_{interaction}(q, p)$, to the query and the passage. This function performs a pairwise combination of all the query terms with all the passage terms. The output is a matrix, $I \in [0,1]^{Q \times S}$, where the rows correspond to the query terms and the columns to the passage terms. We adopt two types of implementations for this function, an exact interaction and a semantic interaction. The first directly uses the token index to create the matrix $I$, and can be defined as follows:

$$I_{ij} = \begin{cases} 1 & u_i = v_j \\ 0 & otherwise \end{cases} \quad (1)$$

The second approach computes the cosine similarity between the embeddings of the terms, described as follows:

$$I_{ij} = \frac{\vec{u}_i^{\,T} \cdot \vec{v}_j}{\|\vec{u}_i\|_2 \times \|\vec{v}_j\|_2} \quad (2)$$

In either case, the exact matching signal will always be captured, since it corresponds to the matrix entries with a value of 1 or close to it. The next step is to filter these exact matching signals.
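As an illustrative sketch, the two interaction functions (Equations 1 and 2) can be written in plain Python; the function names and the list-based representation are our own illustrative choices, not the paper's actual implementation:

```python
import math

def exact_interaction(query_tokens, passage_tokens):
    # Eq. 1: I[i][j] = 1 if the i-th query token equals the j-th passage token
    return [[1.0 if u == v else 0.0 for v in passage_tokens]
            for u in query_tokens]

def cosine_interaction(query_vecs, passage_vecs):
    # Eq. 2: I[i][j] = cosine similarity between the term embeddings
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    return [[cos(u, v) for v in passage_vecs] for u in query_vecs]
```

Both functions return a matrix with one row per query term and one column per passage term, which is the input of the filtering step described next.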
For that, we defined a matching threshold that, when applied over the interaction matrix, returns a matrix with all matching terms between the query and the passage, i.e. a matrix where each entry defines whether a query term is present in the passage or not, as described in Equation 3:

$$I_{ij} = \begin{cases} 1 & if\ I_{ij} \geq threshold \\ 0 & if\ I_{ij} < threshold \end{cases} \quad (3)$$

Note that it only makes sense to apply this equation over the semantic interaction, since the exact interaction directly indicates all the exact matches. Then, since we only care about the presence or absence of a term, we collapse the column dimension of $I$, transforming it into a vector, $\vec{I}$, where each entry indicates whether a query term is present in the passage. Note that each entry in $\vec{I}$ is a boolean value; we also tried different alternatives, such as using the number of times a query term appeared in a sentence, but this did not improve the results.

In parallel with this step, this layer also computes the importance of each query term, since different terms in a query can carry different importance regarding the information need, as addressed by [6, 9]. To accomplish this, we compute a probability distribution over the query terms, as shown in Equation 4:

$$c_{u_i} = \vec{w}^{\,T} \cdot \vec{u}_i, \qquad a_{u_i} = \frac{e^{c_{u_i}}}{\sum_{u_w \in Q} e^{c_{u_w}}} \quad (4)$$

Here, we use the standard softmax operation to compute the probability distribution over a linear combination of each query term embedding, $\vec{u}_i$, and a trainable vector, $\vec{w}$. Finally, $a_{u_i}$ corresponds to the probabilistic importance of the query term $u_i$ with respect to the entire query.

The a priori sentence score then arises from the linear combination of the query-importance distribution and the presence vector, $\vec{I}$, as described in Equation 5:

$$a_p = \sum_{i \in Q} a_{u_i} \times \vec{I}_i \quad (5)$$

Given this formulation, $a_p$ represents the importance of each sentence following the human judgement heuristic. Let us consider a passage containing all the query terms as an intuitive example.
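Before working through that example, the full a priori computation (Equations 3 to 5) can be sketched in plain Python; the function signature, the default threshold and the flat-list embedding representation are illustrative choices, with `w` standing for the trainable vector of Equation 4:

```python
import math

def a_priori_score(I, query_embs, w, threshold=0.7):
    # Eq. 3 + column collapse: is each query term present anywhere in the passage?
    presence = [1.0 if any(s >= threshold for s in row) else 0.0 for row in I]
    # Eq. 4: softmax over c_i = w . u_i gives the query-term importance a_i
    c = [sum(wk * uk for wk, uk in zip(w, u)) for u in query_embs]
    z = sum(math.exp(ci) for ci in c)
    a = [math.exp(ci) / z for ci in c]
    # Eq. 5: a_p = sum_i a_i * presence_i
    return sum(ai * pi for ai, pi in zip(a, presence))
```

When every query term is present, the presence vector is all ones and, since the softmax weights sum to 1, the score is exactly 1.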
Under this condition the a priori sentence score will be 1; conversely, if a passage does not contain any query term, it will have a score of zero. Similarly, if a sentence contains the important query terms, it will have a score close to 1, and close to zero in the opposite case. An important note concerns the role of the matching threshold. If it is too high, e.g. $threshold = 0.99$, it will only consider exact match signals, i.e. terms that appear verbatim in the passage. For lower values, however, it will also include semantically similar terms; e.g. a $threshold = 0.7$ can be beneficial since it dynamically includes semantically similar terms, in some cases synonyms.

2.2. Document Score Block
The document score block aims to produce the final document score based on the passage scores previously obtained. To avoid creating a bias towards longer documents, since they have more scores, we employ a feature selection step. More precisely, we construct a feature vector that contains the max sentence score, the normalised summation over the scores, the average over the scores and several top $k$-max-average scores. Finally, the document score is computed by combining this feature vector using a multi-layer perceptron.

3. UPWM Models
As previously mentioned, the UPWM is intended to be combined with a neural model that is capable of computing the relevance between the query and a sentence. In this section we present the two concrete implementations that we adopted for the BioASQ challenge, the lightning UPWM and the transformer UPWM.

3.1. Lightning UPWM
The lightning UPWM uses a neural interaction model similar to our last year's submission [2], which was already a fast and lightweight model [10]. However, with this much cleaner architecture we achieved a 4x speed-up, making it a lightning-fast model when compared to current transformer-based models.

Figure 3: A simplistic overview of the neural architecture for the interaction model.
We show in Figure 3 an overview of the neural architecture. Very briefly, an interaction matrix is built using Equation 2. Then 3-by-3 convolution kernels are applied over this matrix to learn context patterns, which are then extracted by a pooling layer. This layer is applied over the filter dimension and combines max, average and k-max average pooling.

3.2. Transformer UPWM
The transformer UPWM, as the name suggests, uses a transformer-based model in the interaction model layer. More precisely, we chose PubMedBERT [11], which is a BERT model trained from scratch using abstracts from PubMed and full-text articles from PubMed Central. This model keeps the biomedical-specific terms that would be decomposed into subwords by other BERT models. This is an important aspect since we can directly use, in the a priori layer, the tokens produced by the PubMedBERT model. To produce the relevance score between the query and the sentence, we adopted the usual strategy [12], described in Figure 4, of concatenating the query tokens with the sentence tokens, separated by the [SEP] token. We then feed this to the BERT model, which outputs a sequence of contextualised embeddings, one per token. Finally, we feed the [CLS] embedding to a linear layer in order to produce the sentence score.

Figure 4: Diagram of the interaction model as a BERT model.

4. Overall Architecture
This section addresses the complete system architecture that we adopted for the BioASQ 9b challenge. Similarly to our last year's submission, we used a two-stage reranking mechanism, as presented in Figure 5. For the first stage we adopted a traditional retrieval model, BM25 [13], to efficiently select the top-100 scientific articles for each biomedical question. This set of documents is then fully reranked by our neural model, in this case the lightning UPWM or the transformer UPWM.
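The two-stage pipeline can be sketched as follows, where `bm25_search` and `neural_score` are hypothetical stand-ins for the first-stage retriever and the UPWM reranker:

```python
def rerank_pipeline(question, bm25_search, neural_score, k=100):
    # Stage 1: cheap lexical retrieval of the top-k candidate articles
    candidates = bm25_search(question, k)
    # Stage 2: fully rerank the candidate set with the neural model
    scored = [(doc, neural_score(question, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in scored]
```

Only the k candidates returned by BM25 are ever scored by the neural model, which keeps the expensive second stage bounded regardless of collection size.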
Given this pipeline and the operation of the UPWM as described above, we can observe that the UPWM performs a detailed analysis of the exact match signals captured by BM25. This makes sense because, according to the UPWM, the interaction model will only evaluate sentences that carry some exact match signal, which is the only type of signal used by BM25. Therefore, we can say that, in this pipeline, the UPWM will “look” more in depth at the signals that contributed to the BM25 score.

Figure 5: Overview of our two-stage retrieval system.

4.1. Training
Both neural models were trained with exactly the same data collection from BioASQ. In terms of document ranking, similarly to last year's submission, we trained both models using a pairwise cross-entropy loss, described in Equation 6:

$$L_{doc}(q, d^+, d^-) = -log\left(\frac{e^{score(q,d^+)}}{e^{score(q,d^+)} + e^{score(q,d^-)}}\right) \quad (6)$$

Here, the loss is computed as a function over a triplet that contains a query, a positive document and a negative document. Since the BioASQ data only provides positive examples, we sampled the negative examples from the documents in the BM25 ranking order that are not labelled as positives. Additionally, given that the gold standard was built as a concatenation of the judged documents from different years, we decided to restrict the BM25 search by year so that only the documents available at that time are used for the training of the model.

In addition to the document-level training, we also tried to perform joint training by using the snippet feedback data available in the training data. Equation 7 describes how the overall loss is computed in the joint training approach, namely as a weighted average between the document loss and the snippet loss:

$$L = \gamma L_{doc} + (1 - \gamma) L_{snippet} \quad (7)$$

Regarding the snippet loss, we experimented with pointwise and pairwise variants.
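A minimal sketch of these losses, assuming scalar relevance scores; the default γ = 0.5 is an illustrative value, not the one used in our experiments:

```python
import math

def pairwise_loss(score_pos, score_neg):
    # Eq. 6: softmax cross entropy over a (positive, negative) score pair
    return -math.log(math.exp(score_pos) /
                     (math.exp(score_pos) + math.exp(score_neg)))

def pointwise_loss(y, y_hat, eps=1e-7):
    # pointwise variant: standard binary cross entropy; y may be a soft label
    y_hat = min(max(y_hat, eps), 1 - eps)   # clamp for numerical stability
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def joint_loss(loss_doc, loss_snippet, gamma=0.5):
    # Eq. 7: weighted average of the document and snippet losses
    return gamma * loss_doc + (1 - gamma) * loss_snippet
```

Note that the pairwise loss equals log 2 when the two scores tie and decreases as the positive document is scored above the negative one.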
Furthermore, we only applied the snippet loss to the sentences from the positive documents that had feedback. Following the ideas of joint training presented in [14], we further augmented our UPWM to produce new snippet scores, as presented in Figure 6. Note that the original architecture already produces snippet scores in the sentence relevance block. However, these do not depend on the final document score, and therefore the snippet loss would not contribute, through backpropagation, to the training of the document score block.

Figure 6: Pointwise snippet loss.

We thus added an MLP that computes a new snippet score from the concatenation of the interaction model score, the sentence score and the document score. This way, the new snippet score depends on the final document score, making the approach more jointly trained. For the pointwise loss, we adopted the binary cross-entropy loss described in Equation 8:

$$L_{snippet}(y, \hat{y}) = -(y\, log(\hat{y}) + (1 - y)\, log(1 - \hat{y})) \quad (8)$$

Here, $y$ corresponds to the true relevance of a snippet, 1 for positive and 0 for negative, and $\hat{y}$ corresponds to the probability of a snippet being positive, as assigned by the model. From this definition, we also explored label smoothing by considering a wider range of positive and negative values for $y$. For the pairwise loss, we adopted the pairwise cross-entropy loss already described in Equation 6. This pairwise loss is computed between all the positive snippets and all the negative snippets that belong to the same document. As before, the snippet loss is only computed over the snippets of positive documents.

5. Submission
In this section, we start by describing the data collection and some pre-processing steps that are common to our official submissions for all the batches. We then describe each run that was submitted to the BioASQ 9b phase A challenge.

5.1. Collection and Pre-processing
In this year's edition, the document collection was the 2020 PubMed/MEDLINE annual baseline, consisting of almost 31 million articles.
However, all the articles with a missing abstract were discarded, meaning that around 21 million articles were indexed with ElasticSearch using the English text analyser, which performs tokenisation, stemming and stopword filtering. Regarding the neural model, we built a simple regex-based tokeniser and trained 200-dimensional word embeddings using the GenSim [15] implementation of the word2vec algorithm [16] over the 21 million abstracts from the baseline. Furthermore, when using the UPWM, each document is split into a set of individual sentences through the Punkt algorithm [17].

For the training data we used the BioASQ dataset, with the exception of last year's test set, which we used for validation purposes. With respect to the joint training, given that the snippet gold standard does not respect sentence boundaries, e.g. a snippet can be composed of several sentences, we consider a sentence to be relevant (hard label of 1) if its text matches the text of a positive snippet. When using soft labels, we instead use the overlap between the sentence text and the gold snippet text as the value of the soft label. The intuition is that a sentence that only partially belongs to a gold snippet should not be considered fully relevant, and therefore should not have a label of 1.

Regarding the first stage of the pipeline, we fine-tuned the BM25 parameters, $k_1$ and $b$, for each batch by performing an extensive grid search. The validation data used for this process corresponds to last year's test set for each corresponding batch.

5.2. Runs
This year, the document/snippet retrieval challenge received, on average across all batches, 27 submissions from seven teams. Our group submitted five runs for each batch, which are identified by the prefix “bioinfo” in the official results¹.
Table 1 presents a summarised description of the systems used in each run, where L-UPWM stands for lightning UPWM, T-UPWM stands for transformer UPWM, JT corresponds to joint training and “-> RRF” means that we made an ensemble of several runs using the reciprocal rank fusion (RRF) method [18].

Table 1
Summary of the submitted runs for each round of the 2021 BioASQ 9B phase A.

Run name  | Batches 1, 2 and 3 | Batch 4                       | Batch 5
bioinfo-0 | L-UPWM             | L-UPWM -> RRF                 | L-UPWM -> RRF
bioinfo-1 | L-UPWM -> RRF      | L-UPWM + JT(Pointwise) -> RRF | L-UPWM + JT(Point/Pairwise) -> RRF
bioinfo-2 | T-UPWM             | T-UPWM -> RRF                 | T-UPWM -> RRF
bioinfo-3 | T-UPWM -> RRF      | T-UPWM + JT(Pointwise) -> RRF | T-UPWM + JT(Point/Pairwise) -> RRF
bioinfo-4 | RRF of all runs    | T-UPWM + JT(Pointwise)        | T-UPWM + JT(Pointwise)

¹ http://participants-area.bioasq.org/results/9b/phaseA/

In more detail, for the first, second and third batches, as observable in Table 1, we submitted the same base system configuration for each run. In some cases, we made an ensemble run that combined several models trained with slight differences in their hyperparameters. In these three batches we did not employ the joint training technique and therefore did not submit any run for snippet retrieval. For the fourth batch, the main difference was the addition of the joint training methodology with the pointwise snippet loss. In the fifth batch we additionally used the pairwise snippet loss for some of the jointly trained models. More precisely, runs “bioinfo-1” and “bioinfo-3” correspond to RRF ensembles of two models, one jointly trained using the pointwise and the other using the pairwise snippet loss. Regarding snippet retrieval, the only runs for which we submitted snippets were the ones using joint training. Furthermore, to produce the ranked snippet list we followed a heuristic whereby the sentences present in the top documents should have a higher probability of being relevant.
Therefore, during snippet ranking we preserve this document order and only extract the snippets that score above a specific threshold. More precisely, for the fourth batch we retrieved snippets from the top-10 retrieved documents, while for the fifth batch we only considered the top-1 retrieved document.

6. Results and Discussion
In this section, we address the document results and the snippet results separately, since we only submitted snippets for the last two batches. Note that at the time of writing only the preliminary results regarding the systems' performance were available.

6.1. Document Retrieval
The overall results for the document retrieval task are shown in Table 2, together with the median of all submissions, the top performing system in each batch (apart from our own submissions) and the baseline score obtained with BM25, corresponding to the ranking order without applying neural reranking. The results are organised in terms of Mean Average Precision at ten (MAP@10), which was the official measure adopted by the organisers to rank all the submissions. There were a total of 16, 30, 29, 27 and 28 submissions, respectively, for each batch.

Table 2
Summary of the results obtained for the document retrieval task (Rank and MAP@10 per batch).

Run name        | Batch 1    | Batch 2    | Batch 3    | Batch 4    | Batch 5
                | Rank MAP   | Rank MAP   | Rank MAP   | Rank MAP   | Rank MAP
bioinfo-0       | 15   31.73 | 18   33.80 | 16   35.50 | 12   38.45 | 21   30.09
bioinfo-1       | 8    32.96 | 15   35.37 | 13   36.22 | 11   38.49 | 15   31.86
bioinfo-2       | 3    35.15 | 6    37.84 | 10   38.24 | 1    42.36 | 1    35.86
bioinfo-3       | 1    35.73 | 4    38.13 | 8    39.35 | 7    40.42 | 8    34.40
bioinfo-4       | 2    35.25 | 5    37.87 | 9    38.65 | 8    40.42 | 7    34.61
Baseline (BM25) |      31.03 |      33.20 |      35.94 |      36.86 |      32.89
Median          |      32.93 |      34.85 |      35.64 |      38.13 |      32.03
Top competitor  | 4    34.60 | 1    39.90 | 1    40.40 | 2    41.92 | 2    35.37

Looking at the results presented in Table 2, from a general perspective our transformer UPWM model achieved top performing results, outscoring every other system in terms of MAP@10 in three batches while remaining competitive in the other two.
On the opposite side, the lightning UPWM model achieved results closer to the median. This observation brings to evidence the trade-off between efficacy and efficiency, since our lightning UPWM model is 63 times faster than the transformer approach on a GPU (K80) [10]. So, although it did not achieve the same retrieval performance as its transformer counterpart, the lightning UPWM model may still be a viable or even preferable solution depending on the requirements. To better understand the effects of the neural ranking solution, we added the row “Baseline” to Table 2, which represents the BM25 ranking order that all of our neural solutions reranked. We present in Figure 7 the comparison between all the submission scores and the baseline score, to visualise the gains achieved by the neural reranking strategy.

Figure 7: Neural reranking gains when compared to the BM25 baseline.

From an overall point of view, neural reranking seems to be beneficial, achieving small increases in MAP@10 percentage points. The lightning UPWM model on the last batch was the only case where the neural reranking negatively influenced the baseline ranking order. This may be a consequence of using only the top 100 documents in the reranking phase. Therefore, we leave as future work the analysis of the impact of $k$ in the top-$k$ document selection for reranking. Figure 7 also clearly shows the difference in performance between the L-UPWM and the T-UPWM, since the gains of the solutions that used the T-UPWM are clearly higher than those of the submissions that used the L-UPWM, as previously mentioned. For better context regarding the overall ranking positions, we show in Figure 8 the difference between our best system submission in each batch and the median score in that batch. As observable, our best submission achieved results that are clearly superior to the median for all the batches.
This is true even when our best submission was not in the top 5 (batch 3), which shows that the top scores were highly competitive and close.

6.2. Snippet Retrieval
The main results for the snippet retrieval task are shown in Table 3, together with the median of all submissions and the top performing system in each batch. Moreover, since we only submitted experimental runs for the fourth and fifth batches, we applied the bioinfo-3 system submitted for the fifth batch to the remaining batches, to gain an idea of the performance that our best snippet solution could have achieved. The results are organised according to the harmonic mean of precision and recall at ten (F1@10), which was the official measure adopted by the organisers. There were a total of 11, 19, 17, 18 and 19 submissions, respectively, for each batch.

Figure 8: MAP@10 difference between our best run at each batch and the median score at that batch.

Table 3
Summary of the results obtained for the snippet retrieval task (Rank and F1@10 per batch).

Run name                   | Batch 1    | Batch 2    | Batch 3    | Batch 4    | Batch 5
                           | Rank F1    | Rank F1    | Rank F1    | Rank F1    | Rank F1
bioinfo-1                  |            |            |            | 18   10.67 | 5    15.57
bioinfo-3                  |            |            |            | 14   14.25 | 1    17.68
bioinfo-4                  |            |            |            | 13   15.03 | 2    16.83
Post-challenge (bioinfo-3) | (3)  16.78 | (1)  19.71 | (3)  20.02 | (2)  19.24 |
Median                     |      11.32 |      14.86 |      16.84 |      17.18 |      14.28
Top result                 | 1    18.45 | 1    18.16 | 1    20.42 | 1    20.61 | 3    16.73

As shown in Table 3, our last approach, in batch five, achieved a top scoring result. Additionally, when applying this model to the remaining batches the performance is consistent, with results in the top 3. In Figure 9, we show a comparison between the performance of the bioinfo-3 system (batch 5) in all batches and the median snippet results. Again, these results show the consistent performance of our system.

Figure 9: F1@10 difference between our best run at each batch and the median score of that batch.

6.3. Joint Training Benefits
In order to assess the effect of the joint training methodology, we present in Figure 10 the difference between the same model with and without the joint training approach. Looking at the lightning UPWM (L-UPWM), joint training led to a marginal improvement on the fourth batch and a clearly more positive impact on the fifth batch. In contrast, the transformer UPWM (T-UPWM) implementations clearly underperform when using the joint training methodology. From these results, joint training seems only to be beneficial to the smaller model. The main reason for this behaviour may be related to the amount of training data available. Since we built the training set to always output documents that had positive snippets, this had the consequence of reducing the amount of training data available when compared with the version that is only trained with document feedback. More precisely, the joint training approach uses 18% fewer training pairs when compared to the document-only training.

Figure 10: Comparison of both UPWM implementations with and without the joint training methodology.

7. Conclusion
In this paper we proposed a novel sentence aggregation technique, named universal passage weighting mechanism (UPWM), that can be combined with neural interaction-based models. We demonstrated two implementations: a fast and lightweight variant called lightning UPWM (L-UPWM) and a larger one, relying on the success of the transformer architecture, the transformer UPWM (T-UPWM). The first uses a simple CNN and pooling architecture, while the latter uses the BERT model as the interaction-based model in the UPWM architecture. We submitted runs with both implementations to the BioASQ 9b phase A challenge, addressing the document and snippet retrieval tasks. For the document task our best solution, the T-UPWM, was able to outperform all the other systems in three of the five batches, while remaining competitive in the others. The L-UPWM showed inferior performance, which was expected given that it is a much lighter model that is 63 times faster than the T-UPWM.
In both cases, neural reranking was generally beneficial, judging by the performance gains over the BM25 baseline. Finally, we also propose a joint training approach that showed encouraging results, leaving a clear open path for future work.

Acknowledgments

This work has received support from the EU/EFPIA Innovative Medicines Initiative 2 Joint Undertaking under grant agreement No 806968 and from National Funds through the FCT - Foundation for Science and Technology, in the context of the grant 2020.05784.BD.

References

[1] G. Tsatsaronis, G. Balikas, P. Malakasiotis, I. Partalas, M. Zschunke, M. Alvers, D. Weißenborn, A. Krithara, S. Petridis, D. Polychronopoulos, Y. Almirantis, J. Pavlopoulos, N. Baskiotis, P. Gallinari, T. Artieres, A.-C. Ngonga Ngomo, N. Heino, E. Gaussier, L. Barrio-Alvers, G. Paliouras, An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition, BMC Bioinformatics 16 (2015) 138. doi:10.1186/s12859-015-0564-6.
[2] T. Almeida, S. Matos, BIT.UA at BioASQ 8: Lightweight neural document ranking with zero-shot snippet retrieval, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/paper_161.pdf.
[3] X. Liu, W. B. Croft, Passage retrieval based on language models, in: Proceedings of the Eleventh International Conference on Information and Knowledge Management, CIKM '02, Association for Computing Machinery, New York, NY, USA, 2002, pp. 375–382. URL: https://doi.org/10.1145/584792.584854. doi:10.1145/584792.584854.
[4] Z. Dai, J. Callan, Deeper text understanding for IR with contextual neural language modeling, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 985–988. URL: https://doi.org/10.1145/3331184.3331303. doi:10.1145/3331184.3331303.
[5] L. Pang, Y. Lan, J. Guo, J. Xu, J. Xu, X. Cheng, DeepRank: A new deep architecture for relevance ranking in information retrieval, CoRR abs/1710.05649 (2017). URL: http://arxiv.org/abs/1710.05649. arXiv:1710.05649.
[6] T. Almeida, S. Matos, Calling attention to passages for biomedical question answering, in: J. M. Jose, E. Yilmaz, J. Magalhães, P. Castells, N. Ferro, M. J. Silva, F. Martins (Eds.), Advances in Information Retrieval, Springer International Publishing, Cham, 2020, pp. 69–77.
[7] B. Hu, Z. Lu, H. Li, Q. Chen, Convolutional neural network architectures for matching natural language sentences, in: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, MIT Press, Cambridge, MA, USA, 2014, pp. 2042–2050.
[8] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
[9] J. Guo, Y. Fan, Q. Ai, W. B. Croft, A deep relevance matching model for ad-hoc retrieval, in: Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM '16, Association for Computing Machinery, New York, NY, USA, 2016, pp. 55–64. doi:10.1145/2983323.2983769.
[10] T. Almeida, S. Matos, Benchmarking a transformer-FREE model for ad-hoc retrieval, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics, Online, 2021, pp. 3343–3353. URL: https://www.aclweb.org/anthology/2021.eacl-main.293.
[11] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, H. Poon, Domain-specific language model pretraining for biomedical natural language processing, 2020. arXiv:2007.15779.
[12] R. Nogueira, K. Cho, Passage re-ranking with BERT, CoRR abs/1901.04085 (2019). URL: http://arxiv.org/abs/1901.04085. arXiv:1901.04085.
[13] S. Robertson, H. Zaragoza, The probabilistic relevance framework: BM25 and beyond, Found. Trends Inf. Retr. 3 (2009) 333–389. doi:10.1561/1500000019.
[14] D. Pappas, R. McDonald, G.-I. Brokos, I. Androutsopoulos, AUEB at BioASQ 7: Document and snippet retrieval, in: P. Cellier, K. Driessens (Eds.), Machine Learning and Knowledge Discovery in Databases, Springer International Publishing, Cham, 2020, pp. 607–623.
[15] R. Řehůřek, P. Sojka, Software framework for topic modelling with large corpora, in: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, ELRA, Valletta, Malta, 2010, pp. 45–50. URL: http://is.muni.cz/publication/884893/en.
[16] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, Curran Associates Inc., Red Hook, NY, USA, 2013, pp. 3111–3119.
[17] T. Kiss, J. Strunk, Unsupervised multilingual sentence boundary detection, Computational Linguistics 32 (2006) 485–525. URL: https://www.aclweb.org/anthology/J06-4003. doi:10.1162/coli.2006.32.4.485.
[18] G. V. Cormack, C. L. A. Clarke, S. Buettcher, Reciprocal rank fusion outperforms Condorcet and individual rank learning methods, in: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '09, Association for Computing Machinery, New York, NY, USA, 2009, pp. 758–759. URL: https://doi.org/10.1145/1571941.1572114. doi:10.1145/1571941.1572114.