       BIT.UA at BioASQ 8: Lightweight neural
       document ranking with zero-shot snippet
                      retrieval

    Tiago Almeida1 [0000-0002-4258-3350] and Sérgio Matos2 [0000-0003-1941-3983]
                          1
                            University of Aveiro, IEETA
                               tiagomeloalmeida@ua.pt
                          2
                            University of Aveiro, DETI/IEETA
                                  aleixomatos@ua.pt


        Abstract. This paper presents the participation of the University of
        Aveiro Biomedical Informatics and Technologies (BIT) group in the eighth
        edition of the BioASQ challenge for the document and snippet retrieval
        tasks. Our system follows a two-stage retrieval pipeline, where a group
        of candidate documents is retrieved based on BM25 and reranked by
        a lightweight interaction-based model that uses the context of exact
        matches to refine the ranking. Additionally, we also show a zero-shot
        setup for snippet retrieval based on the architecture of our interaction
        based model. Our system achieved competitive results scoring at the top
        or close to the top for all the batches, with MAP values ranging from
        33.98% to 48.42% in the document retrieval task, although being less
        effective on snippet retrieval.


1     Introduction
Last year (2019), PubMed indexed almost one and a half million articles, which
is equivalent to almost three new articles indexed every minute.3 As a consequence,
it is increasingly time consuming for a biomedical expert to successfully search this
unprecedented amount of available information. Given the current artificial
intelligence (AI) revolution, it is clear that AI systems can be exploited to aid in
this search task and ultimately help researchers rapidly find consistent information
about their research topic.
     The BioASQ [25] challenge provides annual competitions on document clas-
sification, retrieval and question-answering applied to the biomedical domain.
These competitions are notable for continuously pushing the development of
intelligent systems capable of tackling the problem described above.
     This paper describes the participation of the Biomedical Informatics and
Technologies (BIT) group in the eighth edition of the BioASQ challenge, specifi-
cally in the document and snippet retrieval tasks of BioASQ 8b Phase A. More
  Copyright © 2020 for this paper by its authors. Use permitted under Creative Com-
  mons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 Septem-
  ber 2020, Thessaloniki, Greece.
3
  https://www.nlm.nih.gov/bsd/medline_pubmed_production_stats.html
precisely, the objective is to retrieve, from the PubMed/MEDLINE collection,
the most relevant articles and document snippets for a given biomedical ques-
tion written in English.
    Our approach is an evolution of a previous work [1] that develops and ap-
plies a two-stage retrieval system to the biomedical searching problem. More
concretely, it uses the Elasticsearch engine with the BM25 weighting scheme to re-
duce the search space and then applies a neural ranking model in this smaller
space to produce a final ranking order. In this work, we focus on improving the
neural ranking model by simplifying the previous architecture and by adopting
some modifications based on new assumptions. Furthermore, one of the enhance-
ments enables us to directly extract the importance that the model assigns to
each document passage without the need of training the model on this specific
task, which makes it a zero-shot learner. In other words, the neural ranking
model is only trained to predict the relevance of an entire document for a given
question.
    The final neural ranking model, presented here, has only 620 trainable pa-
rameters, making it an extremely lightweight approach when compared to
transformer-based models, which are the current state-of-the-art for NLP related tasks.
    Our submissions achieved top or near-top positions in every document retrieval
batch and also showed interesting results in all of the snippet retrieval batches.
These are insightful results that show the potential of our lightweight neural
ranking model and demonstrate a zero-shot learning setup that can be easily
extended to a snippet retrieval task.
The full network configuration is publicly available at https://github.com/
bioinformatics-ua/BioASQ_CLEF, together with code for replicating the re-
sults presented in this paper.


2   Background

In classical IR methods, a ranking function is parameterized by a set of hand-
crafted features to score the relevance of a query-document pair. Nowadays,
recent works on the application of deep learning methods to IR, and question-
answering in particular, have shown very good results. In this new perspective,
commonly referred to as neural IR, the ranking function is approximated by a
neural network that learns its parameters from large data collections. In the lit-
erature, neural models are usually subdivided into two categories based on their
architecture. In one category, the models learn a semantic representation of the
texts and use a similarity measure to score each query-document pair. Examples
in this representation-based category include the Deep Structured Seman-
tic Model (DSSM) [7] and the Convolutional Latent Semantic Model (CLSM)
[23]. On the other hand, in interaction-based approaches, query and document
matching signals are captured and then fed to a neural network that produces a
ranking score based on the extracted matching patterns over these signals. Ex-
amples include the Deep Relevance Matching Model (DRMM) [6] and DeepRank
[17].
    Since 2018, transformer-based architectures, like GPT [20] and BERT [5],
have been revolutionizing the NLP field, showing outstanding performance in
the majority of tasks. These are large models that explore transfer learning
techniques by leveraging the knowledge learned from enormous text collections.
Following this trend, some promising works show positive results when applying
this type of model to the ad-hoc retrieval task [3, 4, 12]. However, despite the
indisputable performance of these architectures, it is also undeniable that their
size is a major drawback that makes it almost impossible for some institutions
to deploy or even use these models, given their demanding computational costs.
    Driven by the annual BioASQ competition, biomedical IR has attracted a
wide range of different solutions, based on traditional IR, neural IR, or a
combination of both. For example, the system proposed by the USTB PRIR
team [9] uses query enrichment strategies, Sequential Dependence Models (SDM)
and pseudo-relevance feedback to obtain a list of relevant documents. This
traditional approach scored in the top positions between the third and fifth
editions, which highlights the early challenges of applying neural models to this task.
this task. The system proposed by the AUEB team [2] was the first to show
some evidence that deep neural models are capable of outscoring the traditional
models by scoring at the top positions in the sixth and seventh editions. Their
system uses a variation of DRMM [14] or BERT [5] to rerank the top 100 docu-
ments recovered using the BM25 scheme [22]. The importance of the reranking
step is evidenced by comparing the results to another work that submitted the
top documents directly retrieved based on BM25 [13].



3   Base architecture




                Fig. 1. Overview of our two-stage retrieval system
3.1   Phase-I
The main objective of this phase is to reduce the enormous search space by
selecting only the top-N potentially relevant documents for a given ques-
tion. Given the large dimension of the article collection (approximately 30 mil-
lion scientific articles), it is important to consider an efficient solution capable
of handling this growing collection. With this in mind, we decided to rely on
ElasticSearch (ES) with the BM25 weighting scheme described in Equation 1.
As mentioned before, only the exact matching signals are considered during this
retrieval phase.

\[
\begin{aligned}
IDF(q_i) &= \ln\Big(1 + \frac{C - f(q_i) + 0.5}{f(q_i) + 0.5}\Big), \\
weight(q_i, D) &= IDF(q_i) \times \frac{f(q_i, D) \times (k_1 + 1)}{f(q_i, D) + k_1\big(1 - b + b \times \frac{|D|}{avg_l(D)}\big)}.
\end{aligned}
\tag{1}
\]
    Equation 1 presents the weighting scheme of each query term q_i with respect
to a document D, where C corresponds to the total number of documents in the
collection, f(q_i) represents the number of documents that contain the term q_i,
f(q_i, D) represents the frequency of term q_i in document D, |D| corresponds to
the total number of terms in document D, i.e., its length, avg_l(D) represents the
average length of the documents in the collection, and k_1, b are hyperparameters
that should be finetuned for the collection.
    Finally, given the weight of each query term with respect to a document,
weight(q_i, D), the final query-document score is computed by summing the
individual query term weights, as shown in Equation 2.
\[
score(Q, D) = \sum_{q_i \in Q} weight(q_i, D). \tag{2}
\]
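    To make the scheme concrete, the sketch below re-implements Equations 1 and 2 in
plain Python. In our pipeline these scores are computed internally by Elasticsearch, so
the function and its toy statistics are purely illustrative; the default k1 and b values
correspond to the best configuration found during validation (Table 1).

    import math

    def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
                   k1=0.4, b=0.4):
        """Illustrative re-implementation of Equations 1 and 2 (Elasticsearch
        computes the same scheme internally during phase-I)."""
        score = 0.0
        for qi in query_terms:
            n_qi = doc_freq.get(qi, 0)                   # documents containing qi
            idf = math.log(1 + (num_docs - n_qi + 0.5) / (n_qi + 0.5))
            f_qi_d = doc_terms.count(qi)                 # frequency of qi in D
            norm = f_qi_d + k1 * (1 - b + b * len(doc_terms) / avg_doc_len)
            score += idf * (f_qi_d * (k1 + 1)) / norm    # summation of Equation 2
        return score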


3.2   Phase-II
The second phase has the objective of reranking the previously retrieved top-N
documents by taking into consideration additional matching signals to produce
the final ranking order. The rationale here is that the previous step only considers
the exact matching signals, i.e., only the words that appear in both the query
and the document are taken into account and weighted to produce the phase-I
ranking. A more powerful neural solution may therefore be able to learn how to
better explore the context in which these exact matches occur.
   More precisely, our model is inspired by the DeepRank [17] architecture and
represents a direct enhancement of our previous work [1], with the following
major differences:

 • Passages no longer follow the query-centric assumption and now corre-
   spond directly to entire document sentences;
 • The detection network and the measure network were simplified and now
   form the interaction network;
    • The passage position input was dropped;
    • The contributions of each passage to the final document score are now as-
      sumed to be independent, replacing the self-attention proposed in [1];
    • The pooling step now receives more operators, namely average and average
      over k-max;
    • Calculation of the passage relevance score was simplified.

    The intuition behind this model is to make a thorough evaluation of the
document passages where the exact matches occur, by taking into consideration
their context. More precisely, this model explores the interactions presented in
the entire passage of each exact match and makes a more refined judgment of
the passage relevance based on that.
    The updated architecture is depicted in Figure 2 and described here in detail
in order to keep this paper self-contained. First, let us define a query as a
sequence of terms q = {u_0, u_1, ..., u_Q}, where u_i is the i-th term of the query and
Q the size of the query; a document passage as p = {v_0, v_1, ..., v_T}, where v_k
is the k-th term of the passage and T the size of the passage; and a document
as a sequence of passages D = {p_0, p_1, ..., p_N}.




Fig. 2. Overview of the neural ranking model with a tensor representation of the data
flow.


     From the architecture presented in Figure 2 it is observable that a document
is first split into individual sentences, i.e., a sequence of passages. In this step,
we rely on the nltk.PunktSentenceTokenizer4, which implements an unsupervised
algorithm for sentence splitting and shows good results on the majority of European
languages. Then, passages are grouped with each query term occurring in the
passage, and the resulting structure is fed to the interaction network together
with the full query to calculate relevance scores for each passage. The final
document score is produced in the aggregation network taking into consideration
each passage score and the relative importance of each query term.
    In more detail, the Grouping by q-term block associates each passage with
each query term that appears in the passage. Formally, this step produces a set of
document passages aggregated by each query term as D(u_i) = {p_{i,0}, p_{i,1}, ..., p_{i,P}},
where p_{i,j} corresponds to the j-th passage with respect to the query term u_i.
4
    https://kite.com/python/docs/nltk.tokenize.punkt.PunktSentenceTokenizer
This aggregated flow facilitates considering the weight of each query term in
downstream calculations in a straightforward way, as proposed in DRMM [6].
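    As a minimal sketch, and assuming whitespace tokenization for clarity (the actual
system uses the custom tokenizer of Section 4.1), the splitting and grouping steps could
look as follows; the function name and the max_passages default are illustrative only.

    from collections import defaultdict
    import nltk  # requires the pretrained Punkt model: nltk.download('punkt')

    def group_passages_by_qterm(query, document, max_passages=5):
        """Split a document into sentences (Punkt) and group them by the query
        terms they contain, producing D(u_i) for each query term u_i."""
        passages = nltk.sent_tokenize(document)          # Punkt-based splitting
        query_terms = set(query.lower().split())
        grouped = defaultdict(list)
        for passage in passages:
            passage_terms = set(passage.lower().split())
            for term in query_terms & passage_terms:     # exact matches only
                if len(grouped[term]) < max_passages:    # at most P passages per q-term
                    grouped[term].append(passage)
        return grouped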
    The Interaction network was designed to independently evaluate each
query-passage interaction, producing a final relevance score per sentence. In detail,
it receives as input the query q and the aggregated set of passages D(u_i) and creates
for each query-passage pair a similarity tensor (interaction matrix)
$S \in [-1, 1]^{Q \times T}$, where each entry $S_{ij}$ corresponds to the cosine
similarity between the embeddings of the i-th query term and the j-th passage term,
$S_{ij} = \frac{\vec{u}_i^{\,T} \cdot \vec{v}_j}{\|\vec{u}_i\| \times \|\vec{v}_j\|}$.
Next, an x by y convolution followed by a concatenation of the global max, average
and average k-max pooling operations is applied to each similarity tensor, to capture
multiple local relevance signals from each feature map, as described in Equation 3,

\[
\begin{aligned}
h^m_{i,j} &= \sum_{s=0}^{x} \sum_{t=0}^{y} w^m_{s,t} \times S_{i+s,j+t} + b^m, \\
h^m_{max} &= \max(h^m), \quad m = 1, ..., M, \\
h^m_{avg} &= avg(h^m), \quad m = 1, ..., M, \\
h^m_{avg\text{-}kmax} &= avg(k\text{-}max(h^m)), \quad m = 1, ..., M, \\
\vec{h} &= \{h_{max}; h_{avg}; h_{avg\text{-}kmax}\}.
\end{aligned}
\tag{3}
\]

    Here, w and b are trainable parameters, the symbol ';' represents the con-
catenation operator, M corresponds to the total number of filters, and the vector
$\vec{h} \in \mathbb{R}^{3M \times 1}$ encodes the local relevance between each
query-passage pair, as extracted by these pooling operations. At this point, the
aggregated set of passages D(u_i) is now represented by the respective vectors
$\vec{h}$, i.e., $D(u_i) = \{\vec{h}_{p_0}, \vec{h}_{p_1}, ..., \vec{h}_{p_P}\}$.
    The final step of the interaction network is to convert these passage repre-
sentations $\vec{h}$ to a final relevance score, for which we employed a fully connected
layer with sigmoid activation, Equation 4,
\[
\underset{P \times 1}{r_{u_i}} = \sigma\big( \underset{P \times 3M}{h_{u_i}} \cdot \underset{3M \times 1}{\vec{w}} + \underset{1 \times 1}{b} \big). \tag{4}
\]


     The aim here is to derive a relevance score, relevant (1) or irrelevant (0),
directly from the information that was extracted by the pooling operators. So,
after this stage the aggregated set of passages D(u_i) is represented by this rele-
vance score, i.e., $D(u_i) = \{r_{p_0}, r_{p_1}, ..., r_{p_P}\} = \vec{r}_{u_i}$.
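    A minimal TensorFlow sketch of this interaction network is shown below, assuming
Keras layers and one query term's batch of P passages; the layer hyperparameters mirror
the best values from Table 1 (except the activation, chosen here for availability), but the
function is illustrative rather than the exact implementation available in the repository.

    import tensorflow as tf

    # M = 16 filters with a 3 by 3 kernel and a sigmoid scoring layer (Equation 4)
    conv = tf.keras.layers.Conv2D(filters=16, kernel_size=3, activation='selu')
    dense = tf.keras.layers.Dense(1, activation='sigmoid')

    def interaction_scores(query_emb, passage_emb, k=2):
        """query_emb: (Q, E); passage_emb: (P, T, E), the P passages grouped under
        one query term. Returns one relevance score per passage (Equation 4)."""
        q = tf.math.l2_normalize(query_emb, axis=-1)
        p = tf.math.l2_normalize(passage_emb, axis=-1)
        S = tf.einsum('qe,pte->pqt', q, p)                 # cosine similarities (P, Q, T)
        fmap = conv(S[..., tf.newaxis])                    # local matching patterns (P, Q', T', M)
        flat = tf.reshape(fmap, (tf.shape(fmap)[0], -1, fmap.shape[-1]))
        h_max = tf.reduce_max(flat, axis=1)                # global max pooling (P, M)
        h_avg = tf.reduce_mean(flat, axis=1)               # average pooling (P, M)
        top_k = tf.math.top_k(tf.transpose(flat, (0, 2, 1)), k=k).values
        h_avg_kmax = tf.reduce_mean(top_k, axis=-1)        # average over k-max (P, M)
        h = tf.concat([h_max, h_avg, h_avg_kmax], axis=-1) # concatenation, shape (P, 3M)
        return tf.squeeze(dense(h), axis=-1)               # r_{u_i}: one score per passage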
     The aggregation network, as already mentioned, takes into consideration
the importance of each query term by using a gating mechanism, similar to
DRMM [6], over the aggregated set of passages, as described in Equation 5. That
is, each passage score is weighted by the importance of its associated query term,
following the intuition that in a query different terms carry different importance
with respect to the final information goal.
\[
\begin{aligned}
c_{u_i} &= \underset{1 \times E}{\vec{w}} \cdot \underset{E \times 1}{\vec{x}_{u_i}}, \\
a_{u_i} &= \frac{e^{c_{u_i}}}{\sum_{u_k \in Q} e^{c_{u_k}}}, \\
\underset{P \times 1}{\vec{s}_{u_i}} &= \underset{1 \times 1}{a_{u_i}} \times \underset{P \times 1}{\vec{r}_{u_i}}.
\end{aligned}
\tag{5}
\]

    Here, w is a trainable parameter and $\vec{x}_{u_i}$ corresponds to the embedding
vector of the query term u_i. The distribution of query term importance, a, is then
computed as a softmax and applied to the respective passage scores, $\vec{r}_{u_i}$.
    To produce the final document score, a scorable vector $\vec{s}$ is created by
summing $\vec{s}_{u_i}$ along the query-term dimension. Note that in this step we
could have explored other ways to produce this final vector; however, this approach
works well empirically. Finally, this scorable vector $\vec{s}$ is fed to a Multi-Layer
Perceptron (MLP) to produce the final ranking score, as summarized in Equation 6.
\[
score = MLP\Big(\sum_{u_i \in Q} \vec{s}_{u_i}\Big) \tag{6}
\]
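    Continuing the sketch above, the aggregation network of Equations 5 and 6 can be
written as follows; the gating variable and the small MLP sizes are assumptions for the
sketch, not the exact configuration.

    import tensorflow as tf

    gate_w = tf.Variable(tf.random.normal([200]))        # E = 200, one weight per embedding dim
    mlp = tf.keras.Sequential([tf.keras.layers.Dense(8, activation='selu'),
                               tf.keras.layers.Dense(1)])  # final document score

    def aggregate(qterm_embeddings, passage_scores):
        """qterm_embeddings: (Q, E); passage_scores: (Q, P), i.e. r_{u_i} for every
        query term. Implements the gating of Equation 5 and the score of Equation 6."""
        c = tf.linalg.matvec(qterm_embeddings, gate_w)   # c_{u_i} = w . x_{u_i}, shape (Q,)
        a = tf.nn.softmax(c)                             # query-term importance a_{u_i}
        s = a[:, tf.newaxis] * passage_scores            # s_{u_i}, shape (Q, P)
        s_vec = tf.reduce_sum(s, axis=0)                 # sum over query terms, shape (P,)
        return mlp(s_vec[tf.newaxis, :])                 # Equation 6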


3.3   Snippet Retrieval
As initially stated in [1], this architecture has an interesting property that enables
us to directly infer the relevance of each passage from the model's perspective, i.e.,
the passage scores that contribute most to the final document score. We can therefore
derive a final score for each passage as the score given by the interaction network
weighted by the query term importance, which was already computed and corresponds
to the vector $\vec{s}_{u_i}$.
    It is important to note that extracting the most relevant passages per docu-
ment is not the same as producing a ranked list of passage relevance as intended
in the BioASQ competition, which implies comparing passages between differ-
ent documents. In our case, however, passage scores are not directly comparable
since they are obtained with respect to their document, which involves different
distributions. So, similarly to [2], we assume that passages from documents with
higher document scores are more relevant than passages from documents with
a lower score, which seems intuitive. We therefore obtain the list of passages
by collecting, from the top ranked documents, all passages with a score above
a set threshold. However, a better approach could be explored in the future by
producing scores that take into consideration the passage itself and the score of
the respective document.
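    A possible implementation of this naive strategy is sketched below; the threshold
value and the data layout are assumptions used only to illustrate the ordering logic.

    def rank_snippets(ranked_doc_ids, passage_scores, threshold=0.5, max_snippets=10):
        """ranked_doc_ids: document ids ordered by the final document score.
        passage_scores: dict mapping doc_id -> list of (passage_text, s_ui score).
        Passages inherit the ordering of their documents; no cross-document rescoring."""
        snippets = []
        for doc_id in ranked_doc_ids:                      # best-scored documents first
            doc_passages = sorted(passage_scores.get(doc_id, []),
                                  key=lambda pair: pair[1], reverse=True)
            for passage, score in doc_passages:
                if score >= threshold:                     # keep only confident passages
                    snippets.append((doc_id, passage, score))
        return snippets[:max_snippets]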
    Furthermore, it is worth reinforcing that this strategy works in an un-
supervised manner, in the sense that the model does not take into consideration
the gold-standard of the passage relevance, but instead produces this relevance
based on what is important to increase the final score of a relevant document, ac-
cording to the document gold-standard. From another perspective, we can argue
that the model is pretrained on the document gold-standard and then applied
to the snippet retrieval task, making this a zero-shot learning setup since it was
never trained on the passage gold-standard.

3.4   Joint Training
Motivated by the interesting results, especially in terms of snippet performance,
reported in the previous BioASQ challenge [18], we also tried to implement a joint
training methodology that explores both the document and the snippet gold-standards,
instead of training only with the document gold-standard. More precisely, we compute
the binary cross entropy loss over the passage relevance from Equation 4, add the
average of this loss over the passages to the document pairwise loss, and train the
model over this combination of the two losses. Note that the architecture for document
scoring and snippet retrieval remained the same, since our main idea at this point was
to exploit the snippet gold-standard to force the model, through supervision, to
distinguish relevant from non-relevant passages. Furthermore, as will be addressed in
the following sections and discussed in Section 5, this idea empirically failed to
improve the model performance.
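    As a rough sketch of this combination, assuming the pairwise document loss of
Equation 7 (defined in Section 4.2) and hypothetical tensor arguments, the joint
objective can be expressed as follows; this is a simplified view, not the exact training code.

    import tensorflow as tf

    def joint_loss(pos_doc_score, neg_doc_score, passage_scores, passage_labels):
        """Simplified joint objective: the document pairwise loss of Equation 7
        plus the average binary cross entropy over the passage scores of Equation 4."""
        doc_loss = -tf.math.log_sigmoid(pos_doc_score - neg_doc_score)   # Equation 7
        passage_loss = tf.reduce_mean(
            tf.keras.losses.binary_crossentropy(passage_labels, passage_scores))
        return doc_loss + passage_loss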


4     Submission and Results
In this section, we start by detailing the data collection and some pre-processing
steps that are common to our official submissions for the 5 batches. Then we
independently show our results for each batch since we continuously refined
our base solution by better finetuning the hyperparameters and changing small
aspects of the architecture.

4.1   Collection and Pre-processing
In this edition of the BioASQ challenge, the document collection was the 2019
PubMed/MEDLINE annual baseline consisting of almost 30 million articles.
However, only roughly 66% of the articles had title and abstract, so following
previous observations [2], we decided to discard the remaining 34%, which were
rarely relevant according to the gold-standard. At this point, our collection had
approximately 20 million documents that were indexed (title and abstract) with
Elasticsearch using the english text analyzer, which automatically performs
tokenization, stemming and stopword filtering.
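    As an illustration of this indexing step, the snippet below creates such an index with
the official elasticsearch Python client (7.x-style calls); the index name, field layout and
the explicit BM25 similarity settings shown here are assumptions for the sketch.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Title and abstract use the built-in "english" analyzer (tokenization,
    # stemming and stopword filtering); BM25 parameters follow Table 1.
    es.indices.create(index="pubmed-2019", body={
        "settings": {"index": {"similarity": {"default": {"type": "BM25", "k1": 0.4, "b": 0.4}}}},
        "mappings": {"properties": {
            "title":    {"type": "text", "analyzer": "english"},
            "abstract": {"type": "text", "analyzer": "english"},
            "year":     {"type": "integer"}}}})

    # Each article would then be indexed by its PubMed identifier, e.g.:
    # es.index(index="pubmed-2019", id=pmid,
    #          body={"title": title, "abstract": abstract, "year": year})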
    We adopted a custom tokenizer that uses simple regular expressions to exclude
non-alphanumeric characters except the hyphen, since many words in the biomedical
domain, such as chemical substances, contain a hyphen. This way we keep these
words intact, which enhances the detection of important exact matches.
We also trained 200-dimensional word embeddings using the GenSim [21] imple-
mentation of word2vec [15], with the 20 million documents (title and abstract)
following the described tokenization, which produced a vocabulary of approx-
imately 4 million tokens. We used the default configuration of the word2vec
algorithm and fixed the embeddings matrix during the training of the neural
ranking model.
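    A compact sketch of these two steps, using an illustrative regular expression and the
gensim 4.x API; the documents iterable is a placeholder for the roughly 20 million
titles and abstracts, which in practice would be streamed from disk.

    import re
    from gensim.models import Word2Vec

    TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")   # keeps hyphenated terms intact

    def tokenize(text):
        """Lowercase and keep alphanumeric tokens, preserving internal hyphens."""
        return TOKEN_RE.findall(text.lower())

    # documents: placeholder iterable of dicts with "title" and "abstract" fields
    corpus = [tokenize(doc["title"] + " " + doc["abstract"]) for doc in documents]
    w2v = Word2Vec(sentences=corpus, vector_size=200, workers=4)  # otherwise default settings
    w2v.wv.save("pubmed_word2vec_200d.kv")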

4.2    Training Details and Hyperparameters
For training our neural ranking model, we used the gold-standard data from
the first seven editions of BioASQ, with the exception of one test batch of the
seventh edition that we used for validation. Contrary to our previous work [1], we
adopted the pairwise cross-entropy loss, as suggested by Hui et al. [8] and shown
in Equation 7.
\[
L(q, d^+, d^-) = -\log\left(\frac{e^{score(q, d^+)}}{e^{score(q, d^+)} + e^{score(q, d^-)}}\right) \tag{7}
\]
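    Noting that Equation 7 is equivalent to -log sigmoid(score(q, d+) - score(q, d-)), a
minimal pairwise training step in TensorFlow could look as follows; the model call
signature is an assumption of this sketch.

    import tensorflow as tf

    @tf.function
    def pairwise_train_step(model, optimizer, query, pos_doc, neg_doc):
        """One pairwise update with the loss of Equation 7."""
        with tf.GradientTape() as tape:
            pos_score = model([query, pos_doc], training=True)
            neg_score = model([query, neg_doc], training=True)
            loss = tf.reduce_mean(-tf.math.log_sigmoid(pos_score - neg_score))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss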
    Since the BioASQ data only provides a list of relevant (positive) documents
per query, we sampled the negative documents as the documents that were re-
trieved by the ES but did not appear in the gold-standard. Another important
note is that only the top 10 documents per submission are analyzed by experts
in terms of relevance, which may produce an incomplete gold-standard, i.e., posi-
tive documents may not be judged since they were not retrieved by participating
systems and hence are taken as negative documents during training. To exacer-
bate the problem, the gold-standard was built as a concatenation of the judged
relevance of documents from different years, which implies different snapshots
of the document collection. To alleviate this problem, we restrict the ES search
by year so that only the documents available at that time are available during
model training.
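    A sketch of this year-restricted candidate retrieval, from which negatives are
sampled, is shown below; the index and field names match the indexing sketch above
and are equally illustrative.

    def sample_negatives(es, question, year, gold_ids, top_n=250):
        """Retrieve BM25 candidates limited to documents available in the question's
        year and keep those absent from the gold-standard as negative examples."""
        resp = es.search(index="pubmed-2019", body={
            "size": top_n,
            "query": {"bool": {
                "must":   {"multi_match": {"query": question,
                                           "fields": ["title", "abstract"]}},
                "filter": {"range": {"year": {"lte": year}}}}}})
        hits = resp["hits"]["hits"]
        return [hit["_id"] for hit in hits if hit["_id"] not in gold_ids]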
    We placed a major emphasis on training/validation in order to gain a better
intuition of the model behavior and of the configuration to follow in each batch.
The neural ranking model was trained using the Adam [10] optimizer, together with
modern techniques such as a learning rate finder and cyclical learning rates [24].
The finetuning of this model was a rolling process that spanned the 5 batches. More
concretely, we searched over the kernel size of the convolution, the total number of
filters, the pooling operations, the activation functions, and other minor details that
ended up not influencing the overall performance. To summarize, Table 1 shows the
model configuration that seems to be the strongest, producing a model with only
620 trainable parameters.
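    For reference, a triangular cyclical learning rate in the spirit of [24] can be
implemented as a custom Keras schedule; the bounds and step size below are placeholders,
since in practice they were selected with a learning rate finder.

    import tensorflow as tf

    class TriangularCyclicalLR(tf.keras.optimizers.schedules.LearningRateSchedule):
        """Triangular cyclical learning rate schedule [24] (placeholder bounds)."""
        def __init__(self, base_lr=1e-4, max_lr=1e-2, step_size=2000):
            super().__init__()
            self.base_lr, self.max_lr, self.step_size = base_lr, max_lr, step_size

        def __call__(self, step):
            step = tf.cast(step, tf.float32)
            cycle = tf.floor(1.0 + step / (2.0 * self.step_size))
            x = tf.abs(step / self.step_size - 2.0 * cycle + 1.0)
            return self.base_lr + (self.max_lr - self.base_lr) * tf.maximum(0.0, 1.0 - x)

    optimizer = tf.keras.optimizers.Adam(learning_rate=TriangularCyclicalLR())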
    The model was implemented in TensorFlow5 and is available at https:
//github.com/bioinformatics-ua/BioASQ_CLEF. The entire training process
was conducted with the help of an in-house toolbox that implements pairwise
training in TensorFlow.6

4.3    BioASQ Evaluation
The BioASQ evaluation is divided into two stages. In the first stage, the submis-
sions are evaluated against a gold-standard annotated by biomedical experts. In
5
    https://www.tensorflow.org/
6
    The toolbox is open-sourced here: https://github.com/T-Almeida/mmnrm
Table 1. List of the hyperparameters and their respective values. In some cases, the
range of tested values is listed, with the best one highlighted in bold.

Hyperparameter                               Value
BM25 k1 and b                                k1 ∈ [0.1, ..., 0.4, ..., 1.25] and b ∈ [0.2, ..., 0.4, ..., 0.7]
ES top-N                                     250, 500, 1000
Maximum number of query tokens, Q            30
Maximum number of passage tokens, T          30
Maximum passages aggregated per q-term, P    5
Kernel size                                  3 by 3, multiples (2 by 2, 3 by 3) following [8]
Filters                                      16, 20, 32
Pooling operations                           {max}, {max and avg}, {max, avg and avg over k-max}
Activation function                          leakyReLU, selu [11], mish [16]
Embedding size                               200



a second stage, the biomedical experts will manually annotate the relevance of
the retrieved documents from each submission. At the time of writing, only the
results for the first stage are available, corresponding to the results presented in
this paper. In terms of numerical evaluation, the organizers automatically com-
pute five measures (Mean Precision, Recall, F-Measure (F1), MAP and GMAP)
over each submission given the current gold-standard. According to the chal-
lenge evaluation guidelines [19], the overall system rankings are based on the
MAP measure.
    Our group submitted five runs for each of the batches, which can be identi-
fied by the prefix “bioinfo” on the official results7 . In the following sections we
present a summary table of the results, comparing our five submissions to the
top competitor in each batch, i.e., with the top performing system excluding our
systems.

4.4     Submission and Results for Batch 1
For the first batch, the BioASQ organizers received a total of 21 submissions
from 8 teams8 . For this run, our main idea was to validate the performance of
our phase-I retrieval mechanism and to test if our phase-II reranking model was
indeed boosting the original ranking order. As a summary, we submitted one
run with the results coming from phase-I, i.e., the BM25 ranking order that
was finetuned on the validation set, while the remaining runs were produced by
reranking the phase-I results with our neural ranking model:
    • bioinfo-0: A finetuned BM25 run produced by ElasticSearch;
    • bioinfo-1 to 4: Neural reranking of the top-250 documents produced by the
      finetuned BM25.
    At the time of this submission, our ranking model was still in an initial phase
of development, which means that it did not completely follow the architecture
presented in Section 3.2. More concretely, the model only used the max-pooling
operator and a simple linear combination was used for producing the final doc-
ument score, instead of an MLP.
7
    http://participants-area.bioasq.org/results/8b/phaseA/
8
  This number is an estimate based on the names of the submissions.
Table 2. Summary of the results for the first batch of the BioASQ challenge. Our five
submissions are presented at the top of the table, starting with the prefix "bioinfo".
Additionally, we highlight, in bold, the best recorded values per metric.

                       Document Retrieval              Snippet Retrieval
     System        Rank Recall  F1    MAP   GMAP   Rank Recall  F1    MAP   GMAP
   bioinfo-0        8   44.62  16.27  30.67  0.63    -     -     -      -     -
   bioinfo-1        6   44.84  16.92  32.23  1.03    7   17.77  13.86  26.32  0.05
   bioinfo-2        2   48.64  17.59  33.83  1.20    5   17.15  15.00  29.53  0.06
   bioinfo-3        1   48.20  17.48  33.98  1.20    8   18.23  15.91  24.06  0.05
   bioinfo-4        4   47.79  17.47  33.59  1.03    6   17.91  15.01  26.53  0.07
 Top competitor     3   44.00  16.86  33.59  0.88    1   24.67  17.52  85.75  0.17



    Table 2 reflects the first stage of the BioASQ evaluation for our submis-
sions. In terms of document retrieval, the “bioinfo-3” submission achieved the
top score in terms of MAP, which means it was the best performing system in
this batch. Additionally, the “bioinfo-2” submission was the second-best per-
forming system and achieved the best result in terms of recall and GMAP. For
the snippet retrieval, our best performing system achieved fifth place and also
showed interesting results in terms of recall and F-measure when compared to
the top-performing system.


4.5    Submission and Results for Batch 2

The second batch received a total of 26 submissions from 9 teams. Our system
was built directly on the validation performed against the gold-standard of the
previous test batch and against the validation set. More precisely, we tested the
addition of more pooling operators (average and average over k-max) and the
addition of the MLP for scoring, which empirically proved to be beneficial.
Additionally, we also decided to pursue the joint training approach described in
Section 3.4. The following list presents a summary of each submitted system:

 • bioinfo-0: Neural reranking model with joint training (snippets and docu-
   ments);
 • bioinfo-1,3 and 4: Neural reranking described in Section 3.2;
 • bioinfo-2: Neural reranking using max and average pooling operator.

    Table 3 shows the performance of the submitted systems; overall, only the
system trained in a joint fashion performed poorly. For document retrieval, our
top-performing system was "bioinfo-3", which achieved third place in the overall
ranking and also the best score in terms of GMAP. For snippet retrieval, our best
performing system achieved fourth place in the overall ranking and, similarly to
the previous batch, showed interesting results in terms of recall and F-measure
when compared to other systems.
Table 3. Summary of the results for the second batch of the BioASQ challenge. Our
five submissions are presented at the top of the table, starting with the prefix "bioinfo".
Additionally, we highlight, in bold, the best recorded values per metric.

                       Document Retrieval              Snippet Retrieval
     System        Rank Recall  F1    MAP   GMAP   Rank Recall  F1    MAP   GMAP
   bioinfo-0        8   43.41  18.30  29.10  1.17   13   16.17  11.75  18.84  0.09
   bioinfo-1        4   47.55  19.94  31.49  1.86    5   21.03  14.61  27.21  0.16
   bioinfo-2        7   46.48  19.14  30.84  1.52    6   20.18  14.08  26.37  0.11
   bioinfo-3        3   48.80  20.27  31.68  2.23    7   20.04  14.08  26.37  0.11
   bioinfo-4        5   47.87  20.02  31.20  1.61    4   20.09  14.13  27.67  0.16
 Top competitor     1   45.01  23.00  33.04  1.85    1   25.31  17.73  68.21  0.15



4.6     Submission and Results for Batch 3, 4 and 5
In contrast to the previous sections, we now present the results for the third,
fourth and fifth batches together, since the submissions for the different batches
all follow the same description:

    • bioinfo-0: Ensemble of multiple Neural reranking models;
    • bioinfo-1 to 4: Neural reranking described in Section 3.2.

    The organizers received a total of 28 submissions from 9 teams for the third
batch, 26 submissions from 11 teams for the fourth batch, and 25 submissions
from 9 teams for the last batch. Given that the proposed joint training seemed to
deteriorate the overall performance, we decided to keep the focus on the current
solution and leave a reformulation of the joint training idea as future work. So,
we replaced the joint training submission with a submission that used a naive
ensemble of multiple neural reranking models that were trained during valida-
tion. Note that for the ensemble run we did not produce a ranked list of snippets
since the proposed snippet algorithm does not support multiple relevance values,
from different sources, per passage.
    Table 4 presents a summary of the results obtained for the last three batches.
Focusing now on the document retrieval task, “bioinfo-3” was our best perform-
ing system in the third batch achieving a fourth place in the overall ranking
and, additionally, “bioinfo-0” was the best system in terms of recall and GMAP.
Similarly, “bioinfo-3” was our best performing system in the fourth batch, with
a fourth place, and “bioinfo-0” achieved the best result in terms of recall. For
the fifth batch, “bioinfo-4” achieved the overall best performance, ranking first
place in both MAP and recall. We also achieved the top score in terms of GMAP
with the “bioinfo-1” submission. In terms of snippet retrieval, the best ranking
was a fifth place on the third and fifth batches.


5      Discussion
In this section we discuss the previously presented results, analyzing first the
overall performance on the document retrieval task followed by the results on
Table 4. Summary of the results for the third, fourth and fifth batches of the BioASQ
challenge. Our submissions are presented at the top of each batch, starting with the
prefix "bioinfo". Additionally, we highlight, in bold, the best recorded values per metric.

                                     Batch 3
                       Document Retrieval              Snippet Retrieval
     System        Rank Recall  F1    MAP   GMAP   Rank Recall  F1     MAP   GMAP
   bioinfo-0        7   54.15  18.73  43.50  2.07    -     -     -       -     -
   bioinfo-1       11   52.43  18.11  42.68  1.55    7   27.01  16.70   39.10  0.37
   bioinfo-2        8   53.20  18.08  43.03  1.86    6   28.25  17.29   40.85  0.36
   bioinfo-3        4   53.65  18.20  43.69  2.04    5   29.63  17.34   41.37  0.37
   bioinfo-4       10   54.08  18.83  42.84  2.02    8   26.79  16.61   37.76  0.32
 Top competitor     1   53.77  19.32  45.10  1.87    1   35.58  21.40  100.39  0.56
                                     Batch 4
                       Document Retrieval              Snippet Retrieval
     System        Rank Recall  F1    MAP   GMAP   Rank Recall  F1     MAP   GMAP
   bioinfo-0        7   55.60  19.95  39.77  1.92    -     -     -       -     -
   bioinfo-1        8   55.28  19.30  39.71  2.01   10   25.29  16.62   34.55  0.13
   bioinfo-2        6   55.53  19.77  40.06  2.10    7   25.84  17.23   36.59  0.15
   bioinfo-3        4   53.92  19.38  40.24  1.31    9   26.55  17.42   35.00  0.15
   bioinfo-4       10   54.44  19.75  38.69  1.54   12   25.29  16.62   34.55  0.13
 Top competitor     1   54.46  19.67  41.63  2.04    1   33.03  21.51  102.44  0.55
                                     Batch 5
                       Document Retrieval              Snippet Retrieval
     System        Rank Recall  F1    MAP   GMAP   Rank Recall  F1     MAP   GMAP
   bioinfo-0        4   62.08  19.95  47.47  3.20    -     -     -       -     -
   bioinfo-1        3   62.21  19.97  47.80  3.49    8   31.94  19.46   42.97  0.24
   bioinfo-2        7   59.98  19.09  46.45  2.40   10   31.94  19.47   42.06  0.24
   bioinfo-3        5   61.54  19.67  46.65  2.88    9   32.05  19.35   42.70  0.23
   bioinfo-4        1   62.63  19.78  48.42  3.30    5   32.14  19.60   43.79  0.29
 Top competitor     2   60.50  39.63  48.25  2.54    1   35.36  24.91  112.67  0.38



snippet retrieval. We complement this discussion with our considerations on
what was successful and what has failed.
    Addressing the results presented in Tables 2, 3 and 4, we consider that our
system had an extremely competitive performance, being in the top position for
the first and fifth batches and close to the top in the remaining batches.
Additionally, we note that at least one of our submissions achieved the best
performance in at least one metric in every batch. Furthermore, if we look at the
GMAP metric, it is observable that our systems achieved the best results in all
but the fourth batch.
    With respect to the neural reranking performance compared to the phase-I
ranking, we can see in Table 2 that every neural submission was able to improve
the original BM25 ranking order, which is in accordance with our expectations and
validation results. These results therefore also support our proposed idea of better
exploring the context where the exact match occurs to produce a more refined
judgment that contributes to the final score.
    As previously said, after the first batch and based on some validation tests,
we decided to change the model by adding more pooling operations and the MLP.
At first glance, according to the results on the second, third, and fourth batches,
it seems that these changes were not beneficial, since the system was not able to
achieve the top performance as it had in the first batch. However, we argue that
this discrepancy can also be a consequence of improvements to the competitors'
systems after the first batch. Additionally, we experimented with the updated
architecture on the first batch and were able to easily achieve a MAP score of
over 35%, surpassing the previous best.
    Finally, the only metric for which our system does not seem to be able to
achieve competitive results is F-measure. However, as noted previously [18], a
system that outputs confidence scores instead of ranking scores seems to be able
to achieve higher performance in terms of this metric. A possible explanation
lies in the BioASQ data and, more precisely, in the questions that have only
a few true positive documents9 in the entire collection. In this case, a system
based on confidence scores can easily create a ranked list with fewer than 10
documents (the maximum considered per question), since it selects the relevant
documents based on a threshold value over the confidence scores. So, for this
type of question, a system based on confidence is more likely to achieve higher
values of Precision and Recall (resulting in a higher F1 measure) when compared
to a ranking system that will obtain a higher Recall but lower Precision, since it
always outputs the top 10 documents.
    In terms of snippet retrieval, the submitted systems did not present competitive
results when compared to the top submissions, with the best performance being a
fourth place in the second batch. However, given that our method does not use
the snippet gold-standard for training and follows a naive ranking approach, we
consider these results encouraging, especially in terms of recall and F-measure,
and with the potential to be better explored in future work.
    Concerning the joint training approach, we consider that it empirically failed.
More precisely, it seems that our intuition of improving the passage relevance
with supervision may be more challenging to achieve. One problem is the notion
of passage relevance, since most of the time a relevant snippet in the gold-standard
encompasses multiple sentences, which the model sees and scores independently.
So, this supervision may be forcing the model to boost the relevance score of
sentences that in isolation carry weak matching signals, ultimately hindering the
overall matching signal extraction. Another problem lies in the naive implementation
of the snippet retrieval algorithm. A further idea is to directly use the snippet
gold-standard to produce ranking scores, more similar to the winning approach in [18].


6     Conclusion and Future Work

In this paper, we propose a two-stage retrieval pipeline to address the biomedical
retrieval problem. Our system first uses BM25 to select a pool of potentially
relevant candidates that are then reranked by a neural ranking model. Contrary
to the current NLP trend, we focused on building a lightweight interaction-based model,
9
    less than 10 documents
which yields a final model with only 620 trainable parameters. The proposed
architecture can also be used to produce relevance scores for each document
passage according to the model's perspective of relevance. This property enables
us to perform passage retrieval in a zero-shot learning setup.
    The proposed pipeline was evaluated on the eighth edition of BioASQ, where
it achieved competitive results for the document retrieval task, ranking at or close
to the top in all batches. In the snippet retrieval task, it showed interesting results
given that they were produced by a naive algorithm in a zero-shot learning setup.
    As future work, there are minor questions still open, especially on the
aggregation network configuration. Additionally, an interesting route is to compare
the current architecture with a direct, but parameter-greedy, extension that uses a
state-of-the-art transformer-based model, such as BERT, which is well suited to
our objective of better evaluating the passage context. This may be achieved by
replacing the word2vec embeddings with the context-aware embeddings produced
by these models or by completely replacing the interaction network.


References

 1. Almeida, T., Matos, S.: Calling attention to passages for biomedical question an-
    swering. In: Jose, J.M., Yilmaz, E., Magalhães, J., Castells, P., Ferro, N., Silva,
    M.J., Martins, F. (eds.) Advances in Information Retrieval. pp. 69–77. Springer
    International Publishing, Cham (2020). https://doi.org/10.1007/978-3-030-45442-5_9
 2. Brokos, G.I., Liosis, P., McDonald, R., Pappas, D., Androutsopoulos, I.: AUEB at
    BioASQ 6: Document and Snippet Retrieval (sep 2018), http://arxiv.org/abs/
    1809.06366
 3. Dai, Z., Callan, J.: Deeper text understanding for ir with contextual neural lan-
    guage modeling. In: Proceedings of the 42nd International ACM SIGIR Con-
    ference on Research and Development in Information Retrieval. p. 985–988. SI-
    GIR’19, Association for Computing Machinery, New York, NY, USA (2019).
    https://doi.org/10.1145/3331184.3331303
 4. Dai, Z., Callan, J.: Context-aware document term weighting for ad-hoc
    search. In: Proceedings of The Web Conference 2020. p. 1897–1907. WWW
    ’20, Association for Computing Machinery, New York, NY, USA (2020).
    https://doi.org/10.1145/3366423.3380258
 5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidi-
    rectional transformers for language understanding (2018)
 6. Guo, J., Fan, Y., Ai, Q., Croft, W.B.: A deep relevance matching
    model for ad-hoc retrieval. Proceedings of the 25th ACM International
    on Conference on Information and Knowledge Management (Oct 2016).
    https://doi.org/10.1145/2983323.2983769
 7. Huang, P.S., He, X., Gao, J., Deng, L., Acero, A., Heck, L.: Learning deep struc-
    tured semantic models for web search using clickthrough data. In: Proceedings of
    the 22nd ACM international conference on Conference on information & knowl-
    edge management - CIKM ’13. pp. 2333–2338. ACM Press, New York, New York,
    USA (2013). https://doi.org/10.1145/2505515.2505665
 8. Hui, K., Yates, A., Berberich, K., de Melo, G.: Co-pacrr: A context-
    aware neural ir model for ad-hoc retrieval. pp. 279–287 (02 2018).
    https://doi.org/10.1145/3159652.3159689
 9. Jin, Z.X., Zhang, B.W., Fang, F., Zhang, L.L., Yin, X.C.: A multi-strategy query
    processing approach for biomedical question answering: Ustb prir at bioasq 2017
    task 5b. In: BioNLP (2017)
10. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Bengio,
    Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations,
    ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings
    (2015), http://arxiv.org/abs/1412.6980
11. Klambauer, G., Unterthiner, T., Mayr, A., Hochreiter, S.: Self-normalizing neural
    networks (06 2017)
12. MacAvaney, S., Yates, A., Cohan, A., Goharian, N.: Cedr: Contextualized embed-
    dings for document ranking. In: Proceedings of the 42nd International ACM SIGIR
    Conference on Research and Development in Information Retrieval. p. 1101–1104.
    SIGIR’19, Association for Computing Machinery, New York, NY, USA (2019).
    https://doi.org/10.1145/3331184.3331317
13. Mateus, A., González, F., Montes, M.: Mindlab neural network approach at bioasq
    6b (11 2018). https://doi.org/10.18653/v1/W18-5305
14. McDonald, R., Brokos, G.I., Androutsopoulos, I.: Deep Relevance Ranking Us-
    ing Enhanced Document-Query Interactions (sep 2018), http://arxiv.org/abs/
    1809.01682
15. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representa-
    tions of words and phrases and their compositionality. In: Proceedings of the 26th
    International Conference on Neural Information Processing Systems - Volume 2.
    p. 3111–3119. NIPS’13, Curran Associates Inc., Red Hook, NY, USA (2013)
16. Misra, D.: Mish: A self regularized non-monotonic neural activation function (2019)
17. Pang, L., Lan, Y., Guo, J., Xu, J., Xu, J., Cheng, X.: Deeprank. Proceedings of
    the 2017 ACM on Conference on Information and Knowledge Management (Nov
    2017). https://doi.org/10.1145/3132847.3132914
18. Pappas, D., McDonald, R., Brokos, G.I., Androutsopoulos, I.: Aueb at bioasq 7:
    Document and snippet retrieval. In: Cellier, P., Driessens, K. (eds.) Machine Learn-
    ing and Knowledge Discovery in Databases. pp. 607–623. Springer International
    Publishing, Cham (2020)
19. Malakasiotis, P., Pavlopoulos, I., Androutsopoulos, I., Nentidis, A.: Evaluation mea-
    sures for task b, http://participants-area.bioasq.org/Tasks/b/eval_meas_
    2020/
20. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language un-
    derstanding by generative pre-training (2018)
21. Řehůřek, R., Sojka, P.: Software Framework for Topic Modelling with Large Cor-
    pora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP
    Frameworks. pp. 45–50. ELRA, Valletta, Malta (May 2010), http://is.muni.cz/
    publication/884893/en
22. Robertson, S., Zaragoza, H.: The probabilistic relevance framework:
    Bm25 and beyond. Found. Trends Inf. Retr. 3(4), 333–389 (Apr 2009).
    https://doi.org/10.1561/1500000019
23. Shen, Y., He, X., Gao, J., Deng, L., Mesnil, G.: A Latent Semantic Model with
    Convolutional-Pooling Structure for Information Retrieval. In: Proceedings of the
    23rd ACM International Conference on Conference on Information and Knowledge
    Management - CIKM ’14. pp. 101–110. ACM Press, New York, New York, USA
    (2014). https://doi.org/10.1145/2661829.2661935
24. Smith, L.N.: Cyclical learning rates for training neural networks (2015)
25. Tsatsaronis, G., Balikas, G., Malakasiotis, P., Partalas, I., Zschunke, M., Alvers,
    M., Weißenborn, D., Krithara, A., Petridis, S., Polychronopoulos, D., Almiran-
    tis, Y., Pavlopoulos, J., Baskiotis, N., Gallinari, P., Artieres, T., Ngonga Ngomo,
    A.C., Heino, N., Gaussier, E., Barrio-Alvers, L., Paliouras, G.: An overview of the
    bioasq large-scale biomedical semantic indexing and question answering competi-
    tion. BMC Bioinformatics 16, 138 (04 2015). https://doi.org/10.1186/s12859-015-
    0564-6