Overview of BioASQ Tasks 9a, 9b and Synergy in CLEF2021

Anastasios Nentidis1,2, Georgios Katsimpras1, Eirini Vandorou1, Anastasia Krithara1 and Georgios Paliouras1
1 NCSR Demokritos, Athens, Greece
2 Aristotle University of Thessaloniki, Thessaloniki, Greece

Abstract
In this paper, we present an overview of tasks 9a and 9b of the ninth edition of the BioASQ challenge, together with a newly introduced task on question answering for developing problems, called Synergy. All these tasks ran as part of the BioASQ challenge lab in the Conference and Labs of the Evaluation Forum (CLEF) 2021. The main focus of BioASQ is to promote methodologies and systems for large-scale biomedical semantic indexing and question answering. This is achieved through the organization of yearly challenges that enable teams from around the world to develop and compare their methods on the same benchmark datasets. This year, 42 teams with more than 170 systems participated in the four tasks of the challenge: six of them focused on task 9a, 24 on task 9b and 15 on task Synergy. As in previous years, participation increased, indicating the established presence of the BioASQ challenge in the field.

Keywords
Biomedical knowledge, Semantic Indexing, Question Answering

1. Introduction

In this paper we give an overview of the shared tasks 9a and 9b of the ninth edition of the BioASQ challenge in 2021, as well as of the new BioASQ task called Synergy. In addition, we present in detail the datasets used in each task. In section 2, we provide an overview of the shared tasks 9a and 9b, which took place from February to May 2021, and of the newly introduced task Synergy, which took place from December 2020 to February 2021 and from May to June 2021, as well as the corresponding datasets developed for training and testing the participating systems. In section 3, we summarize the participation in these three tasks. Detailed descriptions of some of the systems will be available in the proceedings of the BioASQ lab. Our conclusions are presented in section 4, along with a brief discussion of the ninth version of the BioASQ tasks a, b and Synergy.

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
tasosnent@iit.demokritos.gr (A. Nentidis); gkatsibras@iit.demokritos.gr (G. Katsimpras); evandorou@iit.demokritos.gr (E. Vandorou); akrithara@iit.demokritos.gr (A. Krithara); paliourg@iit.demokritos.gr (G. Paliouras)

2. Overview of the Tasks

In total, four tasks were offered in the ninth version of the BioASQ challenge: (1) a large-scale biomedical semantic indexing task (task 9a), (2) a biomedical question answering task (task 9b), both considering documents in English, (3) a task on medical semantic indexing in Spanish (task MESINESP9), and (4) a new task on biomedical question answering (task Synergy). In this section, apart from providing a brief description of the two established tasks (9a and 9b), with a focus on differences from previous versions of the challenge [1], we also concisely outline the Synergy task. For tasks 9a and 9b, a detailed overview of the initial tasks can be found in [2], which also describes the general structure of BioASQ.
2.1. Large-scale semantic indexing - Task 9a

Task 9a focuses on classifying articles from the PubMed/MedLine (https://pubmed.ncbi.nlm.nih.gov/) digital library into concepts of the MeSH hierarchy. Specifically, the test sets for the evaluation of the competing systems consist of new PubMed articles that have not yet been annotated by the indexers at NLM. A more detailed view of each test set is given in Table 1. As in previous years, the task is divided into three independent batches of 5 weekly test sets each, and two scenarios are provided: i) on-line and ii) large-scale. Each test set is a collection of new articles, without any restriction on the journal in which they were published. To evaluate the participating systems we use standard flat information retrieval measures, as in previous versions of the task [3]; whenever the annotations from the NLM indexers become available, hierarchical measures are used as well. As before, for each test set, participants are required to submit their answers within 21 hours. A training dataset was also available for Task 9a, composed of 15,559,157 articles with 12.68 labels per article on average, covering 29,369 distinct MeSH labels in total.

Table 1
Statistics on test datasets for Task 9a.

Batch   Articles   Annotated Articles   Labels per Article
1          7967        7808                12.61
          10053        9987                12.40
           4870        4854                12.16
           5758        5735                12.34
           5770        5666                12.49
Total     34418       34050                12.42
2          6376        6374                12.39
           9101        6403                11.76
           7013        6590                12.15
           6070        5914                12.62
           6151        5904                12.63
Total     34711       31185                12.30
3          5890        5730                12.81
          10818        9910                13.03
           4022        3493                12.21
           5373        4005                12.62
           5325        2351                12.97
Total     31428       25489                12.71
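To make the flat evaluation scenario concrete, the sketch below computes a label-based micro-averaged F-measure over predicted versus gold MeSH annotations. This is a minimal sketch for exposition only, not the official BioASQ evaluation code, and the MeSH descriptor IDs in the usage example are hypothetical.

```python
# A minimal sketch (not the official BioASQ evaluation code): label-based
# micro-averaged F-measure over predicted vs. gold MeSH annotations.

def micro_f1(gold, predicted):
    """gold, predicted: one set of MeSH descriptor IDs per article."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but wrong
        fn += len(g - p)   # gold labels that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold and predicted annotations for two articles:
gold = [{"D003141", "D018352"}, {"D006801"}]
pred = [{"D003141"}, {"D006801", "D011634"}]
print(f"Micro-F1: {micro_f1(gold, pred):.3f}")  # Micro-F1: 0.667
```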
2.2. Biomedical semantic QA - Task 9b

The aim of Task 9b is to enable the competing teams to develop systems for all the stages of question answering in the biomedical domain, by introducing a large-scale question answering challenge. As in previous versions of the task, four types of questions are considered: "yes/no", "factoid", "list" and "summary" questions [3]. For this task, the available training dataset contains 3,743 questions, annotated with golden relevant elements and answers from previous versions of the task, which the participating teams use to develop their systems. The details of both the training and test sets are given in Table 2.

Table 2
Statistics on the training and test datasets of Task 9b. The numbers for the documents and snippets refer to averages per question.

Batch    Size    Yes/No   List   Factoid   Summary   Documents   Snippets
Train    3,743   1033     719    1092      899       9.43        12.32
Test 1   100     27       21     29        23        3.40        4.66
Test 2   100     22       20     34        24        3.43        4.88
Test 3   100     26       19     37        18        3.21        4.29
Test 4   100     25       19     28        28        3.10        4.01
Test 5   100     19       18     36        27        3.59        4.69
Total    4,243   1152     816    1256      1019      8.71        11.40

Task 9b is divided into two phases: (phase A) the retrieval of the required information and (phase B) answering the question. It is split into five independent bi-weekly batches, and the two phases for each batch run on two consecutive days. In each phase, the participants receive the corresponding test set and have 24 hours to submit the answers of their systems. More precisely, in phase A, a test set of 100 questions written in English is released and the participants are expected to identify and submit relevant elements from designated resources, including PubMed/MedLine articles, snippets extracted from these articles, concepts and RDF triples. In phase B, the manually selected relevant articles and snippets for these 100 questions are also released, and the participating systems are asked to respond with exact answers, that is entity names or short phrases, and ideal answers, that is natural language summaries of the requested information.
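To illustrate the distinction between exact and ideal answers across the four question types, the sketch below shows how a system might structure its phase B responses. The field names and answer contents are illustrative assumptions for exposition, not the official BioASQ submission schema.

```python
import json

# Hypothetical phase B responses, one per question type. Field names
# ("id", "exact_answer", "ideal_answer") are assumptions for illustration.
responses = [
    {   # yes/no: the exact answer is simply "yes" or "no"
        "id": "q1", "type": "yesno",
        "exact_answer": "yes",
        "ideal_answer": "Yes, according to the selected snippets...",
    },
    {   # factoid: a short entity name (or ranked candidates)
        "id": "q2", "type": "factoid",
        "exact_answer": ["ACE2"],
        "ideal_answer": "The virus uses the ACE2 receptor for cell entry.",
    },
    {   # list: a list of entity names
        "id": "q3", "type": "list",
        "exact_answer": ["fever", "cough", "fatigue"],
        "ideal_answer": "Common symptoms include fever, cough and fatigue.",
    },
    {   # summary: no exact answer, only a natural language summary
        "id": "q4", "type": "summary",
        "ideal_answer": "Current evidence suggests that...",
    },
]
print(json.dumps({"questions": responses}, indent=2))
```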
2.3. Synergy Task

The established BioASQ task B is structured in a sequence of phases: first comes the annotation phase; then, with a partial overlap, runs the challenge; and only when this is finished does the assessment phase start. This leads to minimal interaction between the experts and the participating systems, which is acceptable due to the nature of the questions generated, namely interesting research questions that have a clear, undisputed answer. This model is less suitable for developing biomedical research topics, such as COVID-19, where new issues appear every day and most of them remain open for some time. A more interactive approach is needed for such cases, aiming at a synergy between the biomedical experts and the automated question answering systems. We envision such an approach as a continuous dialog, in which experts issue open questions and the systems respond to them. The experts then assess the responses, and their assessment is fed back to the systems in order to help improve them. The process continues iteratively, with new feedback and new system predictions.

Figure 1: The iterative dialogue between the experts and the systems in the BioASQ Synergy task on question answering for COVID-19.

Fig. 1 sketches this vision, which motivates the new BioASQ Synergy task. This new task allows biomedical experts to pose unanswered questions on developing problems, such as COVID-19. Participating systems attempt to provide answers, together with supporting material (relevant documents and snippets), which in turn are assessed by the experts and fed back to the systems, together with new questions. At the same time, we are adapting the BioASQ infrastructure and expanding the community to address new developing public health issues in the future.

In this introductory year, task Synergy took place in two versions, each structured into four rounds of system responses and expert feedback on the same questions. However, new questions, or modified versions of existing questions, could be added to the test sets. The details of the datasets used in task Synergy are available in Table 3. Contrary to task B, this task was not structured into phases; relevant material and answers were submitted together. However, for new questions only relevant material (documents and snippets) is required, until the expert considers that enough material has been gathered during a previous round and marks the question as "ready to answer". When a question receives a satisfactory answer that is not expected to change, the expert can mark it as "closed", indicating that no more material and answers are needed for it.

Table 3
Statistics on the datasets of Task Synergy. "Answered" stands for questions marked as having enough relevant material from previous rounds to be answered.

Version   Round   Size   Yes/No   List   Factoid   Summary   Answered   Feedback
1         1       108    33       22     17        36        0          0
1         2       113    34       25     18        36        53         101
1         3       113    34       25     18        36        80         97
1         4       113    34       25     18        36        86         103
2         1       95     31       22     18        24        6          95
2         2       90     27       22     18        23        10         90
2         3       66     17       14     18        17        25         66
2         4       63     15       14     17        17        33         63

In each round of this task, we consider material from the current version of the COVID-19 Open Research Dataset (CORD-19) [4], to reflect the rapid developments in the field. As in task B, four types of questions are supported, namely yes/no, factoid, list and summary, and two types of answers, exact and ideal. The evaluation of the systems is based on the measures used in Task 9b. Nevertheless, for the information retrieval part we focus on new material; therefore, material already assessed in previous rounds, which is available in the expert feedback, should not be re-submitted. Overall, through this process, we aim to facilitate the incremental understanding of COVID-19 and to contribute to the discovery of new solutions.
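Since previously assessed material must not be re-submitted, a participating system needs to filter its retrieval results against the accumulated expert feedback before each round. The sketch below shows one way this could look; the data layout and document IDs are simplifying assumptions, not part of the official Synergy infrastructure.

```python
# A minimal sketch, assuming feedback is kept per question as the set of
# document IDs already assessed by the experts in earlier rounds (a
# simplification, not the official Synergy feedback format).

def filter_new_material(retrieved, assessed, top_k=10):
    """Keep only documents not yet assessed for this question.

    retrieved: ranked list of candidate document IDs for one question.
    assessed:  set of document IDs already assessed in previous rounds.
    """
    fresh = [doc_id for doc_id in retrieved if doc_id not in assessed]
    return fresh[:top_k]  # submit at most top_k new documents

# Usage with hypothetical CORD-19 paper IDs:
retrieved = ["cord-0001", "cord-0042", "cord-0007", "cord-0099"]
assessed = {"cord-0042", "cord-0007"}
print(filter_new_material(retrieved, assessed))  # ['cord-0001', 'cord-0099']
```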
3. Overview of participation

Overall, 37 teams from institutes around the world participated in tasks 9a, 9b and Synergy of the challenge, with more than 120 distinct systems. In particular, six of these teams submitted to task 9a, 24 to task 9b and 15 to task Synergy. Furthermore, as Fig. 2 shows, the teams participating in tasks 9a, 9b and Synergy come from various countries around the world, indicating the international interest in the challenge. The shift towards the more complex question answering task b, already observed in previous years of the challenge, is still apparent this year, as the number of participating teams has slightly increased (Fig. 3). Detailed descriptions of some of the systems will be available in the proceedings of the workshop.

Figure 2: The world-wide distribution of teams participating in the tasks 9a, 9b and Synergy (S), based on institution affiliations.

Figure 3: The evolution of participant teams in the BioASQ tasks a and b in the nine years of BioASQ.

3.1. Task 9a

In this year's Task 9a, 6 teams competed with a total of 21 different systems. Teams that have participated in previous versions of the task include the National Library of Medicine (NLM) team, which submitted predictions with five different systems, the Fudan University & Atypon team, which participated with 4 systems, and the team from the University of Vigo and the University of A Coruña, which participated with two systems. In addition, two new teams, from Roche and Atypon, competed for the first time, submitting results with five and three systems respectively.

3.2. Task 9b

This year, 90 different systems, developed by 24 teams, submitted predictions for Task 9b in total, across both phases A and B. In phase A, 9 teams participated, submitting results from 34 systems. In phase B, the numbers of participants and systems were 20 and 70 respectively. Only three teams engaged in both phases.

Figure 4: The distribution of participant teams in the BioASQ task 9b into phases.

3.3. Synergy Task

In the first two versions of the new task Synergy, introduced this year, 15 teams participated, submitting results from 39 distinct systems. Although significantly different from task b, this task is still about biomedical information retrieval and question answering, so the systems for the two tasks are expected to share some common ideas and techniques. Indeed, some teams participated in both tasks: in particular, 8 teams participated in both task 9b and Synergy, as shown in Fig. 5.

Figure 5: The overlap of participant teams in the BioASQ task 9b and Synergy.

4. Conclusions

This paper provides an overview of the ninth version of the BioASQ tasks a and b, along with the newly introduced task Synergy. Tasks 9a and 9b are already established through the previous eight years of the challenge and, together with the MESINESP9 task on semantic indexing of medical content in Spanish and the Synergy task, which ran for the first time, constituted the ninth edition of the BioASQ challenge. Overall, the BioASQ challenge has matured and established its presence over these years. Besides continuing the annual tasks a and b, this year we offered a new biomedical question answering task, Synergy. As in previous years, the participation of teams increased; we therefore consider that the challenge keeps meeting its goal of pushing the research frontier in biomedical semantic indexing and question answering.

Acknowledgments

Google was a proud sponsor of the BioASQ Challenge in 2020. The ninth edition of BioASQ is also sponsored by Atypon Systems, Inc. BioASQ is grateful to NLM for providing the baselines for task 9a and to the CMU team for providing the baselines for task 9b. The MESINESP task is sponsored by the Spanish Plan for advancement of Language Technologies (Plan TL) and the Secretaría de Estado para el Avance Digital (SEAD). BioASQ is also grateful to LILACS, SCIELO, Biblioteca virtual en salud and Instituto de salud Carlos III for providing data for the BioASQ MESINESP task.

References

[1] A. Nentidis, A. Krithara, K. Bougiatiotis, G. Paliouras, Overview of BioASQ 8a and 8b: Results of the eighth edition of the BioASQ tasks a and b (2020).
[2] G. Tsatsaronis, G. Balikas, P. Malakasiotis, I. Partalas, M. Zschunke, M. R. Alvers, D. Weissenborn, A. Krithara, S. Petridis, D. Polychronopoulos, Y. Almirantis, J. Pavlopoulos, N. Baskiotis, P. Gallinari, T. Artieres, A. Ngonga, N. Heino, E. Gaussier, L. Barrio-Alvers, M. Schroeder, I. Androutsopoulos, G. Paliouras, An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition, BMC Bioinformatics 16 (2015) 138. doi:10.1186/s12859-015-0564-6.
[3] G. Balikas, I. Partalas, A. Kosmopoulos, S. Petridis, P. Malakasiotis, I. Pavlopoulos, I. Androutsopoulos, N. Baskiotis, E. Gaussier, T. Artieres, P. Gallinari, Evaluation Framework Specifications, Project deliverable D4.1, UPMC, 2013.
[4] L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Eide, K. Funk, R. Kinney, Z. Liu, W. Merrill, et al., CORD-19: The COVID-19 Open Research Dataset, ArXiv (2020).