BioASQ Synergy: A strong and simple baseline
rooted in relevance feedback
Tiago Almeida1 , Sérgio Matos1
1
    University of Aveiro, IEETA


                                         Abstract
                                         This paper presents the participation of the University of Aveiro Biomedical Informatics and Technologies
                                         (BIT) group in the Synergy task of the ninth edition of the BioASQ challenge. Given the availability of
                                         feedback data between rounds, we explored a traditional relevance feedback approach. More precisely,
                                         we performed query expansion by selecting the highest tf-idf terms from snippets judged as relevant by
                                         experts. The revised query was then processed by our BioASQ-8b pipeline, consisting of BM25 followed by
                                         a lightweight neural reranking model. Our system achieved results above the median, which, given its
                                         simplicity, can be considered satisfactory. Furthermore, in two batches our best results were second only
                                         to the runs submitted by the top performing team. Code to reproduce our submissions is available at
                                         https://github.com/bioinformatics-ua/BioASQ9-Synergy.

                                         Keywords
                                         Relevance Feedback, BM25, Neural ranking, Covid-19, Document Retrieval, BioASQ Synergy




1. Introduction
In January 2020 the World Health Organization (W.H.O.) declared the outbreak of the 2019 novel
coronavirus a global health emergency. More than one year later, and even with the existence of
vaccines, the virus still affects populations across the world. Furthermore, studies are still being
conducted and new material about the virus is published every day. This causes a wave of
knowledge, first made available through scientific articles, which, without effective search tools,
ends up wasting precious research time. It therefore becomes imperative to improve access to
this type of unstructured information in order to foster further research about the novel
coronavirus.
   TREC was the first institution to launch a global challenge, TREC-Covid [1], to push the
research on search tools for dealing with the exponential growth of the literature about the
novel coronavirus. The BioASQ organization followed the same path and launched the Synergy
task, where the aim is to retrieve the most relevant answers to biomedical questions about this
coronavirus.
   This paper describes the participation of the Biomedical Informatics and Technologies (BIT)
group of the University of Aveiro in the Synergy challenge, which consisted of retrieving, from


CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
" tiagomeloalmeida (T. Almeida); aleixomatos@ua.pt (S. Matos)
~ https://t-almeida.github.io/online-cv/ (T. Almeida)
 0000-0002-4258-3350 (T. Almeida); 0000-0003-1941-3983 (S. Matos)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)
the CORD-19 [2] collection, documents and snippets that are relevant for a given biomedical
question related to the novel coronavirus.
   Our approach builds on the lessons learned from our participation in the TREC-Covid chal-
lenge [3]. In TREC-Covid, we observed that relevance feedback approaches benefit drastically
from the residual nature of the evaluation. We therefore decided to construct a strong baseline
based on relevance feedback techniques and then tried to rerank its output to achieve further
improvements.
   We achieved satisfactory results with our simple approach, losing only to the first place team.
In the remainder of the paper we describe in more detail our relevance feedback approach. We
then describe the submissions and the results obtained, followed by a general discussion.


2. Relevance Feedback
In this section, to make the paper self-contained, we first give a brief introduction to the
topic of relevance feedback, a well-known technique studied in the field of information retrieval
whose main idea is to directly include user feedback in the retrieval process. In
other words, the user refines the quality of the results by selecting positive examples from
the initial retrieval ranking. In more detail, the basic relevance feedback procedure can be
summarized by the following steps:

   1. A simple query, encoding the information need, is processed by the system.
   2. The results are returned to the user.
   3. From that initial list, the user selects some positive and negative examples.
   4. The system creates a new representation of the information need by using the query, the
      positive examples and the negative examples.
   5. The final retrieved documents are returned to the user.

   The main concern when implementing a relevance feedback algorithm is creating a new
representation of the information need from the original query and the positive and negative
examples. In the literature, the best-known method is the Rocchio [4] algorithm. This
algorithm operates in the vector space model, where documents and queries are represented as
vectors. The main idea is to produce a new query vector by combining the original query vector
with a weighted representation of the positive documents, minus a weighted representation of
the negative documents. Retrieval is then performed by projecting the new query into the vector
space and retrieving the closest documents by cosine similarity. The intuition is to modify the
original query in order to move it closer to the positive examples and farther away from the
negative examples.
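   For reference, the textbook form of the Rocchio update (which our system does not apply
directly, but whose intuition we follow) can be written as

\[
\vec{q}_{new} = \alpha\,\vec{q}_0
  + \frac{\beta}{|D_+|}\sum_{\vec{d}\in D_+}\vec{d}
  - \frac{\gamma}{|D_-|}\sum_{\vec{d}\in D_-}\vec{d},
\]

where \(D_+\) and \(D_-\) are the sets of positive and negative document vectors and
\(\alpha\), \(\beta\) and \(\gamma\) weight the contribution of each component.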


3. Methodology
In this section we describe our main solution, consisting of a combination of BM25 with relevance
feedback, and explain the intuition behind this approach. To better understand our rationale, we
first analyze the format of the Synergy task.
   The Synergy task appeared as an effort to help find answers to biomedical questions
about the 2019 novel coronavirus. Unlike the usual BioASQ format, the Synergy task presented
a fundamental change concerning its evaluation and flow. More precisely, the Synergy task
followed a residual type of evaluation, similar to TREC-Covid, where the test set is reused
across all the batches. Additionally, in between batches the golden feedback data, i.e., the
relevance information for each question, was made available to the participants. This ends up
changing the usual retrieval paradigm, in which one is expected to apply a retrieval system
to an unknown question. So, according to the literature, the Synergy task becomes suitable
for relevance feedback techniques, since some relevant examples were available for a majority
of the questions, which satisfies steps 2 and 3 of the relevance feedback procedure. This
observation is confirmed by the TREC-Covid challenge results, where relevance feedback
runs were able to achieve top-scoring positions, outscoring traditional, neural and transformer-
based retrieval approaches.
   Based on these observations and also inspired by our previous submission to the TREC-
Covid challenge [3], we adopted the traditional BM25 ranking function combined with a simple
relevance feedback method for constructing a strong baseline for this challenge. Then, we also
tried to employ our existing BioASQ 8b neural ranking model to further rerank our baseline.

3.1. Baseline - BM25 with Relevance Feedback
As previously mentioned, we adopted the BM25 ranking function as our retrieval function,
since it is known to produce close to state-of-the-art results when properly tuned. In order
to include relevance feedback in the BM25 algorithm, we follow an intuition similar to the
Rocchio algorithm's, adding a representation of the positive documents to the query. However,
since BM25 is a probabilistic model and not a vector space model, we employed a query expansion
technique based on the most representative terms of each positive document. This new query
is then processed by BM25, hopefully returning a new list of documents that are more
similar to the positive documents.
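   For reference, the standard BM25 scoring function, whose \(k_1\) and \(b\) parameters we
tune in Section 3.1.1, is

\[
\mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{IDF}(t)\,
  \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)},
\]

where \(f(t, d)\) is the frequency of term \(t\) in document \(d\), \(|d|\) is the document
length and \(\mathrm{avgdl}\) is the average document length in the collection.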
   The representative terms of each document were selected as the top-𝑘 terms with the highest
tf-idf scores. The intuition behind this choice is that the terms with the highest tf-idf scores
contribute the most to the final ranking score. Thus, by including them in the new query we are
boosting the documents that are most similar, in terms of tf-idf terms, to our positive examples.




Figure 1: Summary of the procedure to combine relevance feedback with the BM25 ranking function.
   After selecting the 𝑘 most important terms from the collection of positive examples, these are
added to the original query in a disjunctive form. Then the normal BM25 ranking function is
applied over this newly generated query. The overall relevance feedback procedure is illustrated
in Figure 1.
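   The following is a minimal sketch of this expansion step, using scikit-learn's TfidfVectorizer
as a stand-in for our tf-idf computation; the function, parameter names and example text are
illustrative, not the exact code from our repository.

# Minimal sketch of the tf-idf based query expansion (illustrative).
from sklearn.feature_extraction.text import TfidfVectorizer


def expand_query(query, positive_snippets, k=75, boost=2):
    """Build a disjunctive (OR) query: boosted original terms plus the top-k tf-idf terms."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(positive_snippets)
    scores = tfidf.sum(axis=0).A1                      # aggregate tf-idf scores over all snippets
    terms = vectorizer.get_feature_names_out()
    top_terms = [terms[i] for i in scores.argsort()[::-1][:k]]
    # Original query terms receive a boost so they still dominate the added terms.
    boosted = " OR ".join(f"{t}^{boost}" for t in query.lower().replace("?", "").split())
    return boosted + " OR " + " OR ".join(top_terms)   # Lucene-style query string


# Example with two snippets judged relevant in a previous round (made-up text).
print(expand_query(
    "What are the effects of remdesivir on COVID-19 patients?",
    ["Remdesivir shortened time to recovery in hospitalized adults with Covid-19.",
     "No significant mortality benefit was observed for remdesivir."],
    k=10,
))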

3.1.1. Impact of the Source of Relevance
Another important detail is the source of positive examples that we feed as feedback data. For
that there are two alternatives, the text from the list of positive documents (1) and the text from
the list of positive snippets (2). In order to chose the best candidate we performed an empirical
evaluation using the first round feedback data. More precisely, we performed a 60% random split
of the questions, resulting in 60 queries for validation and 41 queries for testing. The validation
set was used to finetune the relevance feedback and BM25 parameters, and the final results are
reported over the test set. The parameters that were finetuned were the 𝑘1 and 𝑏 parameters for
BM25, the number of terms to add to the query, 𝑘, the maximum number of positive samples
per question, 𝑆𝑚𝑎𝑥 , and minimum frequency for the query expansion, 𝐹𝑚𝑖𝑛 . Additionally, we
also finetuned a boost parameter that multiplies the contribution of the original query terms
with respect to the added terms. Table 1 shows the range and best value for the parameters.
For both experiments we first used random search over a large space of parameters and then
proceeded with grid search that best fit each experiment.

Table 1
Set of parameters that were finetuned. In bold we report the best values found for the second round of
the BioASQ-Synergy. The notation { X to Y, Z} means that we search between X and Y in Z increments.
    Type of search                      BM25                            RF: Query expansion                Boost
                                 𝑘1                𝑏                 𝑘            𝑆𝑚𝑎𝑥         𝐹𝑚𝑖𝑛
 Random Search (both)    {0.1 to 1.2, 0.1} {0.1 to 1, 0.1}     {15 to 80, 5}   {5 to 50, 5} {5 to 50, 5}   [1,2,4]
  Grid Search - Docs         [.4,.6,.8,1]      [.4,.6,.8]     [5,10,15,20,30] [15,30,40,50] [30,40,50]       [2,4]
 Grid Search - Snippet    [.6,.8,1,1.2,1.4 ]   [.4,.6,.8]        [70,75,80]     [30,40,50]       [1]         [2,4]
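   As a rough illustration of the random-search stage over the ranges in Table 1, the sketch below
samples parameter configurations and keeps the best one; evaluate_map10 is a hypothetical
placeholder for running BM25 with relevance feedback on the validation queries and computing
MAP@10, not a function from our repository.

# Illustrative random search over the Table 1 parameter ranges.
import random


def evaluate_map10(params):
    # Hypothetical placeholder: the real experiment runs BM25 + relevance feedback
    # on the validation queries and returns MAP@10; here a dummy value is returned.
    return random.random()


SPACE = {
    "k1":    [round(0.1 * i, 1) for i in range(1, 13)],  # 0.1 to 1.2, step 0.1
    "b":     [round(0.1 * i, 1) for i in range(1, 11)],  # 0.1 to 1.0, step 0.1
    "k":     list(range(15, 85, 5)),                     # number of expansion terms
    "s_max": list(range(5, 55, 5)),                      # max positive snippets per question
    "f_min": list(range(5, 55, 5)),                      # minimum term frequency
    "boost": [1, 2, 4],                                  # boost of the original query terms
}

best_score, best_params = -1.0, None
for _ in range(200):                                     # number of trials is arbitrary here
    params = {name: random.choice(values) for name, values in SPACE.items()}
    score = evaluate_map10(params)
    if score > best_score:
        best_score, best_params = score, params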

   In Table 2 we show the performance of the best and worst sets of parameters when using
documents and snippets as a source of positive examples. From the experiment, it is clear
that the list of snippets is a far better candidate than the list of documents for extracting the most
representative terms to expand the query. We believe that this discrepancy is related to the
scope hypothesis [5], which says that a document can address several topics. This results in
the extraction of terms unrelated to the question topic, hence causing query drift.
   Furthermore, the snippet hyper-parameter search is also more reliable, with a much smaller
difference between the best and worst parameters.

3.2. Neural Rerank
Since it is expected that neural reranking models will bring some improvements over traditional
baselines, we also included some runs in which we tried to rerank the baseline produced by the
previous approach.
  For this neural reranking, we relied on our neural architecture that was used in the BioASQ
8b challenge [6]. Following the lessons learned from TREC-Covid [3], we found that reranking
Table 2
Comparison between the sources of positive examples.
              Positive Examples       Validation Set              Test Set
                                    MAP@10 Recall@10       MAP@10 Recall@10
                  Docs𝑏𝑒𝑠𝑡 (1)       19.00      25.66       18.68        24.05
                 Docs𝑤𝑜𝑟𝑠𝑡 (1)       5.37        9.12        4.67          7.32
                 Snippet𝑏𝑒𝑠𝑡 (2)     46.77      46.71       46.30        47.34
                Snippet𝑤𝑜𝑟𝑠𝑡 (2)     41.69      44.47       44.73        46.10


over relevance feedback runs is more effective when the number of candidate documents is
small.


4. Submission
In this section, we start by describing the data collection and some pre-processing steps. Then
we detail each run that was submitted in each batch. Note that all the runs submitted and the
results presented are with respect to the document retrieval task.

4.1. Collection and Pre-processing
The Synergy task used the CORD-19 [2] collection, which is a open collection of scientific
articles about the 2019 novel coronavirus. Currently, it is updated on a weekly basis and has
more than 550 thousand articles gathered from peer-reviewed publications and open archives
such as bioRxiv and medRxiv. For the task, only documents that had pmid, abstract and title
were considering, meaning that roughly 60% of the articles were discarded.
  At each round we indexed the valid set of articles with Elasticsearch using the english
text analyzer, which automatically performs tokenization, stemming and stopword filtering.
Additionally, we also included an analyzer to perform expansion of Covid-19 related terms by
using a synonym expansion list.
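   A minimal sketch of such an index configuration is shown below; the index name, field names
and the synonym list are illustrative rather than our exact configuration.

# Illustrative Elasticsearch index: english-style analysis steps plus a COVID-19 synonym filter.
from elasticsearch import Elasticsearch

index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "covid_synonyms": {
                    "type": "synonym",
                    "synonyms": ["covid-19, covid19, sars-cov-2, 2019-ncov, novel coronavirus"],
                }
            },
            "analyzer": {
                "english_covid": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop", "covid_synonyms", "porter_stem"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "title":    {"type": "text", "analyzer": "english_covid"},
            "abstract": {"type": "text", "analyzer": "english_covid"},
        }
    },
}

es = Elasticsearch("http://localhost:9200")
es.indices.create(index="cord19", body=index_body)  # elasticsearch-py 7.x style call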
  Regarding the neural ranking model, we kept the same model architecture described in
the 2020 BioASQ 8b challenge [6]. Additionally, we trained 200-dimensional word embeddings
using the GenSim [7] implementation of word2vec [8], on the combination of PubMed and
CORD-19.
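   A minimal sketch of this training step is given below; apart from the 200 dimensions stated
above, the context window, minimum count and corpus handling are assumptions for illustration.

# Illustrative word2vec training with GenSim (4.x API, where the dimensionality is vector_size).
from gensim.models import Word2Vec

corpus = [
    "coronavirus infection causes respiratory illness".split(),
    "vaccines reduce severe covid-19 outcomes".split(),
]  # in practice, an iterable over tokenized PubMed + CORD-19 abstracts

model = Word2Vec(
    sentences=corpus,
    vector_size=200,   # 200-dimensional embeddings, as stated in the text
    window=5,          # assumed context window
    min_count=1,       # kept at 1 only so this toy corpus produces a vocabulary
    workers=4,
)
model.wv.save_word2vec_format("pubmed_cord19_200d.vec")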

4.2. Runs
The first version of the Synergy task had four rounds, with no feedback data available for the
first round. Therefore, we could not apply our relevance feedback baseline for the first round,
and used instead a BM25 baseline with neural reranking.
   Table 3 presents the summary of all the submissions, where RF stands for relevance feedback
and NN for reranking with a neural network trained on the feedback data, with NN
(TREC-Covid) meaning that the neural network was trained with the TREC-Covid data and NN
(BioASQ) meaning it was trained with the BioASQ data. Furthermore, BM25 was fine-tuned for
Table 3
Summary of the submitted runs for each round of the 2020 BioASQ Task Synergy. RF: relevance feed-
back; NN: neural network reranking.
         Run name                                Description
                              Round 1                 Round 2         Round 3 and 4
           bioinfo-0           BM25                  BM25 + RF          BM25 + RF
           bioinfo-1     BM25 + Synonyms             BM25 + RF        BM25 + RF + NN
           bioinfo-2   BM25 + NN (TREC-Covid)      BM25 + RF + NN     BM25 + RF + NN
           bioinfo-3   BM25 + NN (TREC-Covid)      BM25 + RF + NN     BM25 + RF + NN
           bioinfo-4    BM25 + NN (BioASQ)         BM25 + RF + NN     BM25 + RF + NN


each round and we set the parameter 𝑘 to 75, which means that a maximum of 75 new terms
were added to the revised query.


5. Results
The overall results are shown in Table 4, together with the median of all submissions and the
result of the top performing system in each batch. The results are ranked according to the Mean
Average Precision at ten (MAP@10), which was the measure adopted by the organizers to rank all
the received submissions. There were a total of 20, 21, 23 and 24 submissions for rounds 1 to 4,
respectively.

Table 4
Summary of the results obtained
          Run name       Round 1          Round 2         Round 3         Round 4
                       Rank MAP         Rank MAP        Rank MAP        Rank MAP
          bioinfo-0     13   22.28        7   31.93      15   18.08       6   23.13
          bioinfo-1     14   22.08        6   32.59       9   21.26      10   21.44
          bioinfo-2     16   18.60       13   27.58      16   18.05      15   20.09
          bioinfo-3     18   15.37       15   26.48      13   19.84      11   21.44
          bioinfo-4     12   22.52       14   26.58      10   20.80      12   20.91
           Median            27.35            28.45           21.26           23.13
          Top result         33.75            40.69           32.57           29.83

   When looking at the results presented in Table 4, it is important to notice that the main
method presented in this paper was only used in rounds 2, 3 and 4. Nonetheless, from the first
round results it is possible to observe that the runs that used the TREC-Covid data resulted in
the worst performance, below the plain baseline and the run trained with BioASQ Task b data.
This is an interesting behavior, since the model that was trained with domain data (TREC-Covid)
performed worse than the model that was trained on a more generic domain (BioASQ).
We theorize that this may be related to the differences between the TREC-Covid queries,
also known as topics, and the more human-like questions used in the Synergy task. Another
aspect is related to the differences in terms of feedback data. More precisely, TREC-Covid has a
very low number of questions but a higher number of feedback documents per question, while
BioASQ has a comparatively larger number of queries and fewer feedback documents
per question.
   Regarding rounds 2, 3 and 4, we achieved competitive performance considering the
simplicity of the approach. Notably, our best scores in rounds 2 and 4 correspond to submissions
that just used BM25 with relevance feedback, which means that the neural reranking in those
rounds lowered the overall performance. However, in round 3 our best performance was
achieved with a reranking strategy, making it inconclusive whether our reranking technique over
the relevance feedback baseline is beneficial or detrimental. In terms of team ranking positions, our
technique achieved two second places, in rounds 2 and 4, scoring below the strong submissions
of the first place team, as well as a third place in round 3.




Figure 2: MAP@10 difference between our best run and the median score at each round.

   To put the overall performance in context relative to all the submissions, we show
in Figure 2 the MAP@10 difference between our best submission and the median score
reported in the leaderboards. Notably, the relevance feedback approach performed as expected
and gave us a simple solution that managed to consistently achieve above-average results.


6. Conclusion
In this paper we presented a simple but strong baseline rooted in a relevance feedback technique.
More precisely, we combined the traditional BM25 ranking function with a tf-idf based query
expansion that injects the relevance feedback into the ranking function.
   From the results obtained, our relevance feedback approach managed to perform well above
average, supporting our initial idea that relevance feedback runs prevail in residual types of evaluation.


Acknowledgments
This work has received support from the EU/EFPIA Innovative Medicines Initiative 2 Joint
Undertaking under grant agreement No 806968 and from National Funds through the FCT -
Foundation for Science and Technology, in the context of the grant 2020.05784.BD.
References
[1] E. M. Voorhees, T. Alam, S. Bedrick, D. Demner-Fushman, W. R. Hersh, K. Lo, K. Roberts,
    I. Soboroff, L. L. Wang, TREC-COVID: Constructing a pandemic information retrieval test
    collection, ArXiv abs/2005.04474 (2020).
[2] L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Burdick, D. Eide, K. Funk, Y. Katsis,
    R. M. Kinney, Y. Li, Z. Liu, W. Merrill, P. Mooney, D. A. Murdick, D. Rishi, J. Sheehan, Z. Shen,
    B. Stilson, A. D. Wade, K. Wang, N. X. R. Wang, C. Wilhelm, B. Xie, D. M. Raymond, D. S. Weld,
    O. Etzioni, S. Kohlmeier, CORD-19: The COVID-19 open research dataset, in: Proceedings
    of the 1st Workshop on NLP for COVID-19 at ACL 2020, Association for Computational
    Linguistics, Online, 2020. URL: https://www.aclweb.org/anthology/2020.nlpcovid19-acl.1.
[3] T. Almeida, S. Matos, Frugal neural reranking: evaluation on the covid-19 literature, in:
    Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020, Association
    for Computational Linguistics, Online, 2020. URL: https://www.aclweb.org/anthology/2020.
    nlpcovid19-2.3. doi:10.18653/v1/2020.nlpcovid19-2.3.
[4] J. J. Rocchio, Relevance Feedback in Information Retrieval, Prentice Hall, Englewood, Cliffs,
    New Jersey, 1971. URL: http://www.is.informatik.uni-duisburg.de/bib/docs/Rocchio_71.
    html.
[5] S. E. Robertson, S. Walker, Some simple effective approximations to the 2-poisson model
    for probabilistic weighted retrieval, in: B. W. Croft, C. J. van Rijsbergen (Eds.), SIGIR ’94,
    Springer London, London, 1994, pp. 232–241.
[6] T. Almeida, S. Matos, BIT.UA at BioASQ 8: Lightweight neural document ranking with
    zero-shot snippet retrieval, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working
    Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece,
    September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020.
    URL: http://ceur-ws.org/Vol-2696/paper_161.pdf.
[7] R. Řehůřek, P. Sojka, Software Framework for Topic Modelling with Large Corpora, in:
    Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, ELRA,
    Valletta, Malta, 2010, pp. 45–50. http://is.muni.cz/publication/884893/en.
[8] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words
    and phrases and their compositionality, in: Proceedings of the 26th International Conference
    on Neural Information Processing Systems - Volume 2, NIPS’13, Curran Associates Inc.,
    Red Hook, NY, USA, 2013, p. 3111–3119.