Overview of BioASQ Tasks 12b and Synergy12 in CLEF2024

Anastasios Nentidis1,2, Georgios Katsimpras1, Anastasia Krithara1 and Georgios Paliouras1
1 NCSR Demokritos, Athens, Greece
2 Aristotle University of Thessaloniki, Thessaloniki, Greece

Abstract
This paper presents an overview of the twelfth edition of the BioASQ challenge, which is part of the Conference and Labs of the Evaluation Forum (CLEF) 2024. BioASQ serves as a key platform for advancing large-scale biomedical information retrieval and question-answering (QA) systems and includes a variety of tasks. In this paper, we present an overview of the QA tasks b and Synergy of the BioASQ 12 challenge. Notably, BioASQ 12 introduces an additional phase (Phase A+) for task b, further expanding the challenge's scope. This year, 27 teams with more than 100 systems participated in the two tasks of the challenge, with 26 of them focusing on task 12b and 4 on task Synergy. While the total number of participating teams varies from year to year, the steady rate of new team participation, as observed in previous editions, highlights the impact of BioASQ in fostering robust biomedical QA solutions.

Keywords
Biomedical knowledge, Semantic Indexing, Question Answering

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
† These authors contributed equally.
tasosnent@iit.demokritos.gr (A. Nentidis); gkatsibras@iit.demokritos.gr (G. Katsimpras); akrithara@iit.demokritos.gr (A. Krithara); paliourg@iit.demokritos.gr (G. Paliouras)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction
This paper gives a brief overview of the twelfth edition of the BioASQ challenge (2024), focusing on shared tasks 12b and Synergy12. Furthermore, we describe the corresponding datasets used to train and evaluate the participating systems. Details of tasks 12b and Synergy12, which ran from March to May and January to February 2024, respectively, are provided in Section 2. Section 3 provides a brief overview of the participation in these two tasks. A comprehensive analysis of the methodologies employed by the participating systems is included in the BioASQ workshop proceedings [1]. We conclude the paper with a brief discussion and our key findings.

2. Overview of the Tasks
The twelfth edition of the BioASQ challenge consisted of four tasks: (1) a biomedical question answering task (task b); (2) a task on biomedical question answering for open developing issues (task Synergy), both tasks considering documents in English; (3) a new task on the automatic detection and normalization of mentions of four clinical entity types in cardiology clinical case documents in Spanish, English, and Italian (task MultiCardioNER); and (4) a new task on biomedical nested named entity recognition (NER) for the English and Russian languages (task BIONNE) [2]. In this paper, we describe the current versions of the first two established tasks, referring to them as Task 12b and Task Synergy12 within the context of the twelfth BioASQ edition. Detailed descriptions of the MultiCardioNER and BIONNE tasks can be found in [3] and [4], respectively. Additionally, a detailed introduction to the BioASQ challenge and its initial task structure is available in [5].

2.1. Biomedical semantic QA - Task 12b
Task 12b introduces a comprehensive question-answering challenge in the biomedical field. Participants are required to create systems that address all stages of question answering. Similar to previous editions, the task focuses on four question types: 'yes/no', 'factoid', 'list', and 'summary' questions [6].
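Since the task distinguishes exact answers (entity names or short phrases) from ideal answers (natural language summaries, described later in this section), the expected shape of an exact answer depends on the question type. The following minimal Python sketch illustrates this; the type labels and value shapes here are simplifying assumptions for illustration, not the official BioASQ submission schema:

```python
# Illustrative sketch of the exact-answer shape expected per question type.
# Labels and shapes are assumptions, not the official BioASQ schema.

def check_exact_answer(question_type, exact_answer):
    """Return True if exact_answer has a plausible shape for the question type."""
    if question_type == "yesno":
        # Yes/no questions expect a binary exact answer.
        return exact_answer in ("yes", "no")
    if question_type == "factoid":
        # Factoid questions expect a short entity name or phrase.
        return isinstance(exact_answer, str) and len(exact_answer) > 0
    if question_type == "list":
        # List questions expect a list of entity names.
        return (isinstance(exact_answer, list)
                and all(isinstance(e, str) for e in exact_answer))
    if question_type == "summary":
        # Summary questions have no exact answer, only an ideal answer.
        return exact_answer is None
    raise ValueError(f"unknown question type: {question_type}")

print(check_exact_answer("yesno", "yes"))              # True
print(check_exact_answer("list", ["BRCA1", "BRCA2"]))  # True
print(check_exact_answer("summary", None))             # True
```

A yes/no submission is thus a single token, while a list submission is a collection of entities, which is also why different evaluation measures apply per type.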
In the twelfth edition of the BioASQ challenge, participating teams were provided with a new version of the BioASQ QA training dataset, containing 5,049 questions annotated with the relevant golden elements and answers from previous versions of the task [7]. These questions served as the basis for the development of the participating systems. The details of both the training and test sets for task 12b are outlined in Table 1. These statistics reveal that the average number of documents and snippets per question is significantly larger in the training data than in the test batches. This can be attributed to two main factors. First, in the early years of BioASQ, the expert annotation with relevant documents and snippets was exhaustive, in an attempt to identify as many relevant items as possible in the corpus. These questions are part of the training datasets, raising the average number of relevant items per question. Currently, only a sufficient number of relevant items is required when the initial version of the data is developed. Still, when the participants submit their responses, the experts assess the submitted items and enrich the ground-truth data with any additional relevant items detected by the participants. The numbers of relevant items for the test sets in Table 1 are therefore preliminary, reported before the enrichment by the assessment process, which is still in progress. The final evaluation of the participants will be performed against the enriched relevant items, ensuring that all submitted items that are indeed relevant are treated as such.

Table 1
Statistics on the training and test datasets of Task 12b. The numbers for the documents and snippets refer to averages per question.
Batch    Size   Yes/No  List   Factoid  Summary  Documents  Snippets
Train    5,049  1,357   967    1,515    1,210    9.06       11.91
Test 1   85     25      21     21       18       3.20       4.36
Test 2   85     26      18     19       22       2.72       3.69
Test 3   85     24      19     26       16       2.45       3.36
Test 4   85     27      22     19       17       2.18       3.44
Total    5,389  1,459   1,047  1,600    1,283    8.65       11.40

Unlike previous editions, task 12b consisted of three phases. An additional phase (Phase A+) was introduced, in which answers (exact and/or ideal) are submitted before the golden documents and snippets become available, i.e. answers based only on the documents identified by the participant systems. The goal of this additional phase is to compare the performance of the competing systems with and without golden feedback. Task 12b was divided into four independent bi-weekly batches, and the three phases of each batch ran over two consecutive days: (phase A) the retrieval of the required information, (phase A+) answering the questions without golden feedback, and (phase B) answering the questions with golden feedback. In each phase, the participants receive the corresponding test set and have 24 hours to submit the answers of their systems. This year, the test sets comprised 85 questions each. For each test set, the respective questions, written in English, were released in phase A, and the participants were expected to identify and submit relevant elements from designated resources, including PubMed/MEDLINE articles and snippets extracted from these articles. These questions were then also released in phase A+, and the participating systems were asked to respond with exact answers, that is, entity names or short phrases, and ideal answers, that is, natural language summaries of the requested information. Finally, during phase B, manually selected relevant articles and snippets for these questions were also made available, and the participating systems were once again asked to provide exact and ideal answers.
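The per-question averages reported in Table 1 can be reproduced by simple aggregation over the annotated question records. A minimal sketch, assuming a simplified record layout in which each question carries its type and its lists of relevant documents and snippets (the official BioASQ JSON schema may differ in field names):

```python
# Compute Table-1-style statistics over a set of question records.
# The record layout below is illustrative; the official BioASQ training
# file is a JSON object holding a list of annotated questions.

from collections import Counter

def dataset_stats(questions):
    """Count questions per type and average relevant documents/snippets per question."""
    type_counts = Counter(q["type"] for q in questions)
    n = len(questions)
    avg_docs = sum(len(q["documents"]) for q in questions) / n
    avg_snippets = sum(len(q["snippets"]) for q in questions) / n
    return type_counts, avg_docs, avg_snippets

# Tiny made-up batch for illustration only (not real BioASQ data).
sample = [
    {"type": "yesno",   "documents": ["d1", "d2"],       "snippets": ["s1", "s2", "s3"]},
    {"type": "factoid", "documents": ["d3"],             "snippets": ["s4"]},
    {"type": "list",    "documents": ["d4", "d5", "d6"], "snippets": ["s5", "s6"]},
]

counts, avg_docs, avg_snips = dataset_stats(sample)
print(counts["yesno"], avg_docs, avg_snips)  # 1 2.0 2.0
```

Running the same aggregation over the training questions versus a test batch would surface the gap discussed above, since exhaustively annotated early questions inflate the training averages.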
2.2. Synergy12 Task
In the BioASQ challenge, the Synergy task was introduced in its ninth edition to foster collaboration between biomedical experts studying COVID-19 and the automated question-answering systems participating in BioASQ. The goal is to create a synergy in which the experts assess system responses, and this feedback is used to iteratively improve the systems. In the process depicted in Figure 1, competing systems provide their initial responses to open questions related to emerging problems. These responses, along with relevant documents and snippets, are evaluated by the experts. Subsequently, the experts provide feedback to the systems and address any new or pending questions.

Figure 1: The iterative dialogue between the experts and the systems in the BioASQ Synergy12 task on question answering for open developing problems.

This version of the Synergy task (Synergy12) involved a series of four rounds, with a two-week interval between them. The task focused on emerging issues, drawing from relevant documents in the current version of PubMed. As with earlier versions, the questions posed were open-ended, allowing for dynamic responses. In each round of the Synergy task, the system responses and the expert feedback address the same questions, unless those questions have already been closed by the experts after receiving a comprehensive and definite answer. Specifically, in Synergy12, a group of six biomedical experts contributed a total of 72 open biomedical questions. They evaluated the retrieved material (including documents and snippets) and the responses submitted by the participating systems in all four rounds. Table 2 shows the details of the datasets used in task Synergy12.

Table 2
Statistics on the datasets of Task Synergy12. "Answer ready" stands for questions marked as having enough relevant material to be answered, after the assessment of the material submitted by the systems in the respective round.
Round  Size  Yes/No  List  Factoid  Summary  Answer ready
1      72    11      29    17       15       33
2      72    11      29    18       14       46
3      64    10      24    16       14       50
4      64    10      24    17       13       57

Synergy12, similar to task 12b, considers four question types: yes/no, factoid, list, and summary, and two types of answers, exact and ideal. Moreover, the evaluation of the systems relies on the same measures used in task 12b. Upon completion of the Synergy12 task, relevant material had been identified for answering roughly 78% of the questions. Additionally, around 51% of the questions had at least one ideal answer submitted by the systems that was deemed satisfactory by the expert who posed the question.

3. Overview of participation
In this year's BioASQ challenge, a total of 27 teams participated in tasks 12b and Synergy12 with over 100 distinct systems. Specifically, 26 of these teams submitted to task 12b and 4 to task Synergy12. Furthermore, Figure 2 demonstrates the global interest in the challenge, with participating teams representing various countries worldwide.

Figure 2: The world-wide distribution of teams participating in tasks 12b and Synergy12, based on institution affiliations. A red circle indicates a newly registered team.

In line with previous years, task b attracted more participants than Synergy. Furthermore, Figure 3 illustrates a considerable increase in the total number of participating teams this year compared to last year. Additionally, the high percentage of teams joining the BioASQ challenge for the first time (indicated by red circles in Figure 2) indicates the enduring interest of the community in large-scale biomedical semantic indexing and question answering. Specifically, 16 new teams participated in this year's BioASQ tasks b and Synergy.

3.1. Task 12b
In task 12b, a total of 26 teams participated this year, contributing 89 different systems across all three phases A, A+, and B.
Specifically, 18 teams with 64 systems competed in phase A, 8 teams with 34 systems participated in phase A+, and phase B saw 16 participating teams with 54 systems. Notably, 8 teams were involved in all three phases, as depicted in Figure 4.

Figure 3: The evolution of participation in the BioASQ task b and Synergy over the twelve years of BioASQ.

Figure 4: The distribution of participating teams in the BioASQ task 12b into phases.

3.2. Synergy Task
In task Synergy12, 4 teams participated this year, contributing a total of 16 distinct systems. Since Synergy12 shares some common concepts with task 12b, a few teams participated in both tasks. Specifically, 3 teams engaged in both task 12b and Synergy12, as depicted in Figure 5. However, consistent with previous versions of the tasks, fewer teams participated in Synergy12 than in task 12b. This could be due to the particularities of the open questions in Synergy, such as the volatility of the answers and the evolving nature of the relevant knowledge, which pose greater challenges than traditional question answering.

Figure 5: The overlap of participating teams in the BioASQ task 12b and Synergy12.

4. Conclusions
In this paper, we introduced the twelfth version of the BioASQ challenge, focusing on tasks b and Synergy. These tasks have been well-established through previous versions of the challenge. Notably, team participation has grown and we observed a significant increase in newly registered teams.
As a result, we consider that the challenge, along with the associated datasets, has sparked greater interest within the research community and continues to advance the field of biomedical semantic indexing and question answering.

Acknowledgments
Google was a proud sponsor of the BioASQ Challenge in 2023. The twelfth edition of BioASQ is also sponsored by Ovid Technologies, Inc., Elsevier, and Atypon Systems Inc. The MEDLINE/PubMed data resources considered in this work were accessed courtesy of the U.S. National Library of Medicine.

References
[1] A. Nentidis, G. Katsimpras, A. Krithara, S. Lima-López, E. Farré-Maduell, M. Krallinger, N. Loukachevitch, V. Davydova, E. Tutubalina, G. Paliouras, Overview of BioASQ 2024: The twelfth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. Maria Di Nunzio, P. Galuščáková, A. García Seco de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.
[2] A. Nentidis, A. Krithara, G. Paliouras, M. Krallinger, L. G. Sanchez, S. Lima, E. Farre, N. Loukachevitch, V. Davydova, E. Tutubalina, BioASQ at CLEF2024: The Twelfth Edition of the Large-Scale Biomedical Semantic Indexing and Question Answering Challenge, in: European Conference on Information Retrieval, Springer, 2024, pp. 490–497.
[3] S. Lima-López, E. Farré-Maduell, J. Rodríguez-Miret, M. Rodríguez-Ortega, L. Lilli, J. Lenkowicz, G. Ceroni, J. Kossoff, A. Shah, A. Nentidis, A. Krithara, G. Katsimpras, G. Paliouras, M. Krallinger, Overview of MultiCardioNER task at BioASQ 2024 on Medical Speciality and Language Adaptation of Clinical NER Systems for Spanish, English and Italian, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), CLEF Working Notes, 2024.
[4] V. Davydova, N. Loukachevitch, E. Tutubalina, Overview of BioNNE Task on Biomedical Nested Named Entity Recognition at BioASQ 2024, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, 2024.
[5] G. Tsatsaronis, G. Balikas, P. Malakasiotis, I. Partalas, M. Zschunke, M. R. Alvers, D. Weissenborn, A. Krithara, S. Petridis, D. Polychronopoulos, Y. Almirantis, J. Pavlopoulos, N. Baskiotis, P. Gallinari, T. Artieres, A. Ngonga, N. Heino, E. Gaussier, L. Barrio-Alvers, M. Schroeder, I. Androutsopoulos, G. Paliouras, An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition, BMC Bioinformatics 16 (2015) 138. doi:10.1186/s12859-015-0564-6.
[6] G. Balikas, I. Partalas, A. Kosmopoulos, S. Petridis, P. Malakasiotis, I. Pavlopoulos, I. Androutsopoulos, N. Baskiotis, E. Gaussier, T. Artieres, P. Gallinari, Evaluation Framework Specifications, Project deliverable D4.1, UPMC, 2013.
[7] A. Krithara, A. Nentidis, K. Bougiatiotis, G. Paliouras, BioASQ-QA: A manually curated corpus for Biomedical Question Answering, Scientific Data 10 (2023) 170.