<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Can Language Models Critique Themselves? Investigating Self-Feedback for Retrieval Augmented Generation at BioASQ 2025</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Samy Ateia</string-name>
          <email>Samy.Ateia@stud.uni-regensburg.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Udo Kruschwitz</string-name>
          <email>udo.kruschwitz@ur.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Science, University of Regensburg</institution>
          ,
          <addr-line>Universitätsstraße 31, 93053, Regensburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Agentic Retrieval Augmented Generation (RAG) and 'deep research' systems aim to enable autonomous search processes where Large Language Models (LLMs) iteratively refine outputs. However, applying these systems to domain-specific professional search, such as biomedical research, presents challenges, as automated systems may reduce user involvement and misalign with expert information needs. Professional search tasks often demand high levels of user expertise and transparency. The BioASQ CLEF 2025 challenge, using expert-formulated questions, can serve as a platform to study these issues. We explored the performance of current reasoning and non-reasoning LLMs like Gemini-Flash 2.0, o3-mini, o4-mini and DeepSeek-R1. A key aspect of our methodology was a self-feedback mechanism where LLMs generated, evaluated, and then refined their outputs for query expansion and for multiple answer types (yes/no, factoid, list, ideal). We investigated whether this iterative self-correction improves performance and if reasoning models are more capable of generating useful feedback. Preliminary results indicate varied performance for the self-feedback strategy across models and tasks. This work offers insights into LLM self-correction and informs future work on comparing the effectiveness of LLM-generated feedback with direct human expert input in these search systems.</p>
      </abstract>
      <kwd-group>
        <kwd>Retrieval Augmented Generation</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Biomedical Question Answering</kwd>
        <kwd>Professional Search</kwd>
        <kwd>Self-Feedback Mechanisms</kwd>
        <kwd>Query Expansion</kwd>
        <kwd>BioASQ</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>1.1. BioASQ Challenge</title>
        <p>
          The BioASQ challenge provides a long-running platform for evaluating systems on large-scale
biomedical semantic indexing and question answering [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Participants are tasked with retrieving relevant
documents and snippets from biomedical literature (PubMed) and generating precise answers to
expert-formulated questions, which can be in yes/no, factoid, list, or ideal summary formats. The structured,
domain-specific nature of the BioASQ challenge makes it especially suitable for assessing advanced
RAG methods for expert information needs.
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Our Contribution</title>
        <p>
          Our team has participated in previous iterations of the BioASQ challenge, examining the performance
of various commercial and open-source LLMs, the impact of few-shot learning, and the effects of
additional context from knowledge bases [
          <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
          ]. In this year’s challenge (CLEF 2025), we continued
our participation across Task A (document and snippet retrieval), Task A+ (Q&amp;A with own retrieved
documents), and Task B (Q&amp;A with retrieved and gold documents). Our primary investigation centered
on the effectiveness of a self-feedback loop implemented with current LLMs, including Gemini-Flash
2.0, o3-mini, o4-mini, and DeepSeek Reasoner, to evaluate if models can improve their own generated
query expansions and answers through self-critique.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>This work builds upon recent advancements in Large Language Models (LLMs), few-shot and zero-shot
learning, Retrieval Augmented Generation (RAG), and their applications to professional search.</p>
      <sec id="sec-2-1">
        <title>2.1. Large Language Models</title>
        <p>
          The field of Natural Language Processing (NLP) has been significantly advanced by Large Language
Models, mostly based on the transformer architecture [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Early influential models like BERT
(Bidirectional Encoder Representations from Transformers) [12] demonstrated the power of pre-training on
large text corpora. Parallel developments led to autoregressive models such as the GPT (Generative
Pre-trained Transformer) series [13, 14]. The capabilities of these models were further improved through
techniques like Reinforcement Learning from Human Feedback (RLHF), which helps align LLM outputs
with human preferences and instructions, making them better at following prompts [15].
        </p>
        <p>Recent months have seen the emergence of numerous so-called reasoning models from various
developers, including Google’s Gemini 2.5, OpenAI’s o1 to o4-mini model series, and models like
DeepSeek R1 [16]. These models build on the idea of Chain of Thought (CoT) prompting [17], which
showed that models perform better when prompted to generate additional tokens in their output
that mimic reasoning or thinking steps. Fine-tuning models for this reasoning process, and thereby
enabling variable scaling of test-time compute [18], enabled further advances in model performance on
popular benchmarks. Reinforcement learning on math- and coding-related datasets, called Reinforcement
Learning with Verifiable Reward (RLVR) [19], seems to be a current approach to enable models to find
useful reasoning strategies.</p>
        <p>Our work uses several of these current reasoning and non-reasoning models to compare their
performance in a biomedical RAG setting and to see if these reasoning models are better at generating
self-feedback.</p>
        <p>Footnotes:
4: https://pubmed.ncbi.nlm.nih.gov/download/#annual-baseline
5: https://web.archive.org/web/20250518193243/https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/
6: https://web.archive.org/web/20250518101415/https://openai.com/index/learning-to-reason-with-llms/</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Few and Zero-Shot Learning</title>
        <p>A key characteristic of modern LLMs is their ability to perform tasks with minimal or no task-specific
training data, often referred to as In-Context Learning (ICL). Few-shot learning allows LLMs to learn
a new task by conditioning on a few input-output examples provided directly in the prompt. This
approach removes the need for extensive, curated training datasets, a concept popularized by models
like GPT-3 [14]. Zero-shot learning takes this further, enabling LLMs to perform tasks based solely
on a natural language description or a direct question, without any preceding examples.</p>
        <p>
          Our previous work has demonstrated the competitive performance of both zero-shot and few-shot
approaches in the BioASQ challenge [
          <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
          ]. These techniques are fundamental to the prompting
strategies used in our current experiments, forming the basis for initial query/answer generation before
any self-feedback loop.
        </p>
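        <p>To make this concrete, the following minimal Python sketch shows how a k-shot prompt can be assembled from input-output example pairs; the helper name, prompt wording, and example pairs are illustrative assumptions rather than the exact prompts used in our runs.</p>
        <p># Minimal sketch of few-shot (k-shot) prompt construction (illustrative only).

def build_few_shot_prompt(examples, new_question):
    """Concatenate input-output examples followed by the new question."""
    parts = ["Answer the biomedical question based on the following examples."]
    for question, answer in examples:
        parts.append(f"Question: {question}\nAnswer: {answer}")
    parts.append(f"Question: {new_question}\nAnswer:")
    return "\n\n".join(parts)

# Illustrative placeholder examples; the actual runs used 10 curated BioASQ examples.
examples = [
    ("Is metformin used to treat type 2 diabetes?", "yes"),
    ("Which gene is mutated in cystic fibrosis?", "CFTR"),
]

print(build_few_shot_prompt(examples, "Is aspirin an anticoagulant?"))</p>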
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Retrieval Augmented Generation (RAG)</title>
        <p>Retrieval Augmented Generation (RAG) combines the generative capabilities of LLMs with information
retrieved from external knowledge sources [20]. This approach aims to ground LLM responses in factual
data, thereby reducing the likelihood of hallucinations and improving the reliability and verifiability of
generated content [21]. A typical RAG pipeline involves a retriever that fetches relevant documents
or snippets, and a generator LLM that synthesizes an answer based on the prompt and the retrieved
context. The BioASQ challenge itself can be considered an example of a RAG setup in a specialized
domain.</p>
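        <p>As a minimal illustration of this retrieve-then-generate pattern, the sketch below wires a placeholder retriever to a placeholder generator; both function bodies are stand-ins for a real search index and a real LLM call, and only the overall control flow reflects the pipeline described above.</p>
        <p># Minimal retrieve-then-generate (RAG) control flow; retriever and generator are stubs.
from typing import List

def retrieve_snippets(question: str, top_k: int = 20) -> List[str]:
    # Stub retriever: a real pipeline would query a document index (e.g. PubMed).
    return ["Example snippet about the question topic."][:top_k]

def generate_answer(question: str, snippets: List[str]) -> str:
    # Stub generator: a real pipeline would send this grounded prompt to an LLM.
    context = "\n".join(f"- {s}" for s in snippets)
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    return f"[LLM answer grounded in {len(snippets)} snippet(s); prompt length {len(prompt)} chars]"

def rag_answer(question: str) -> str:
    # Retrieve first, then generate an answer conditioned on the retrieved context.
    return generate_answer(question, retrieve_snippets(question))

if __name__ == "__main__":
    print(rag_answer("Which gene is mutated in cystic fibrosis?"))</p>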
        <p>
          The RAG concept is evolving towards more dynamic and autonomous systems, sometimes termed
Agentic RAG or ’deep research’ systems [
          <xref ref-type="bibr" rid="ref12">22</xref>
          ]. One of the first such systems was WebGPT, a fine-tuned
version of GPT-3 published by a team at OpenAI in 2021 [
          <xref ref-type="bibr" rid="ref13">23</xref>
          ]. It took OpenAI another three years to roll
out a similar system to their ChatGPT user base. Their newest models, o3 and o4-mini, are trained via
reinforcement learning to decide autonomously when and for how long to search, alongside using other tools.
These advanced systems may involve LLM-powered agents performing multistep retrieval, reasoning
over the retrieved information, and iteratively refining their outputs or search strategies. The deep
research modes offered by both OpenAI and Google take these concepts even further and let the models
search for over 5 minutes through up to hundreds of websites before synthesizing a multipage report.
        </p>
        <p>Our test of a self-feedback mechanism, where an LLM critiques and revises its own generated queries
and answers, is intended to analyze the abilities of off-the-shelf LLMs on such tasks. In future work,
we plan to replace the LLM-generated feedback with feedback from human experts to compare the
effectiveness of human- and AI-guided search processes.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Professional Search</title>
        <p>
          Professional search refers to information seeking conducted in a work-related context, often by
specialists who require high precision, control, and the ability to formulate complex queries [
          <xref ref-type="bibr" rid="ref14 ref4">4, 24</xref>
          ]. Domains
such as biomedical research demand robust evidence-based answers, making transparency and the
ability to trace information back to source documents crucial [25]. LLMs are increasingly being explored
for professional search applications, offering potential benefits like advanced query understanding and
generation of evidence-based summaries [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. However, challenges such as LLM hallucinations and the
need to align with expert workflows remain significant.
        </p>
        <p>Our previous work, the BioRAGent system, has focused on making LLM-driven RAG accessible and
transparent for biomedical question answering, enabling users to review and customize generated
Boolean queries in the search process [26]. This study builds upon that work by exploring the impact
of generated critical feedback on query generation and answer generation, which will be compared
against human feedback in future work.</p>
        <p>Footnotes:
7: https://web.archive.org/web/20250516083609/https://openai.com/index/webgpt/
8: https://web.archive.org/web/20250511211101/https://openai.com/index/introducing-chatgpt-search/
9: https://web.archive.org/web/20250514114152/https://openai.com/index/introducing-o3-and-o4-mini/</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>We evaluated several Large Language Models (LLMs) in the context of the BioASQ CLEF 2025 Challenge,
specifically in Task 13 B, which is structured into Phase A (retrieval), Phase A+ (Q&amp;A based on retrieved
snippets), and Phase B (Q&amp;A based on additional gold-standard snippets).</p>
      <sec id="sec-3-1">
        <title>3.1. Models</title>
        <p>The models used were grouped into two categories:
• Non-reasoning models:
– Gemini Flash 2.0
– Gemini 2.5 Flash (used without explicit reasoning mode)
• Reasoning models:
– o3-mini
– o4-mini (introduced mid-challenge and used in later batches)
– DeepSeek Reasoner (initially used but replaced due to slow API)</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Task 13 B Experimental Setup</title>
        <p>We participated in all four batches of Task 13 B and submitted several systems under different
configurations. Each batch comprised five runs, covering combinations of baseline prompting,
feedback-augmented prompting, and few-shot learning (10-shot).
3.2.1. Phase A: Document and Snippet Retrieval
Each model configuration involved one of the following strategies:
• Baseline: Direct prompt-based query generation without iteration.
• Feedback (FB): Prompt refinement using self-generated feedback.</p>
        <p>• Few-shot: Prompting the model with 10 examples of successful queries.</p>
        <p>UR-IW-1 and UR-IW-3 are paired for comparison, both using the same non-reasoning model (Gemini)
without and with feedback, respectively. Similarly, UR-IW-2 and UR-IW-4 form a second pair, using a
reasoning model without and with feedback. UR-IW-5 is always configured as a non-reasoning
few-shot baseline.</p>
        <p>The following table summarizes the configurations for Phase A across all four batches:</p>
        <p>The non-feedback and few-shot approaches were mostly identical to our last year’s participation;
in Phase A, feedback was only used for the query generation and refinement step. The top 10 results
from the initial query were passed on to the feedback-generating model as additional context. No
feedback was used for snippet extraction and reranking.</p>
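        <p>The sketch below illustrates this query-expansion and refinement loop under stated assumptions: the OpenAI chat-completions client, the model name, and the prompt wording are placeholders for illustration; only the overall structure (initial expansion, feedback over the top-10 results, one refinement pass) follows the setup described above.</p>
        <p># Illustrative sketch of the Phase A query expansion + self-feedback refinement.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY; the Gemini/DeepSeek runs used their own APIs
MODEL = "o4-mini"  # placeholder model name

def ask(prompt):
    """Single chat-completion call; the prompt wording below is illustrative, not our exact prompts."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def expand_query(question):
    # Step 1: initial Elasticsearch query_string expansion.
    return ask(f"Write an Elasticsearch query_string query for this question:\n{question}")

def refine_query(question, query, top10_titles):
    # Step 2: the model critiques its own query, seeing the top-10 results as context.
    feedback = ask(
        f"Question: {question}\nQuery: {query}\nTop 10 results:\n" + "\n".join(top10_titles)
        + "\nEvaluate whether this query retrieves relevant documents and suggest improvements."
    )
    # Step 3: the self-generated feedback is injected into a refinement prompt.
    return ask(
        f"Question: {question}\nOriginal query: {query}\nExpert Feedback: {feedback}\n"
        "Revise and provide the final improved query strictly following the original instructions."
    )</p>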
        <p>3.2.2. Phase A+ and Phase B: Answer Generation
The system configurations used for Phase A+ and Phase B were similar to those of Phase A. However,
the source of the contextual snippets used to ground the answer generation differed:
• Phase A+: Used the top-20 snippets retrieved by the corresponding model in Phase A.
• Phase B: Used a merged set combining the top-20 retrieved snippets from Phase A and the
gold-standard snippets provided by the organizers.</p>
        <p>As in Phase A, UR-IW-1/3 and UR-IW-2/4 are grouped to compare feedback vs. non-feedback
performance for non-reasoning and reasoning models, respectively. UR-IW-5 serves as a consistent
few-shot baseline using non-reasoning models. For the feedback configurations, the model first generated
a draft answer, which was then critiqued using a question-type-specific feedback prompt:
• Yes/No questions: “Evaluate the draft answer (’yes’ or ’no’) against the provided snippets and the
question. Indicate explicitly if it should change, with brief reasoning.”
• Factoid questions: “Evaluate the draft JSON entity list answer against the provided snippets and
the question. Clearly suggest corrections, removals, or additions.”
• List questions: Same as factoid prompt.
• Ideal answer (summary): “Evaluate the provided summary answer for accuracy, clarity, and
completeness against the provided snippets and the question. Clearly suggest improvements.”
The generated feedback was then injected into a fixed refinement prompt to guide the model toward
a final improved answer:</p>
        <p>Expert Feedback: {feedback_response}
Revise and provide the final improved answer strictly following the
original instructions.</p>
        <p>This two-step feedback-refinement process aimed to simulate expert review and enforce more robust
quality control over generated answers.</p>
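        <p>The sketch below illustrates this two-step process for a yes/no question. The call_llm stub stands in for any of the model APIs used in our runs, and the draft-answer prompt is an assumption; the feedback prompt and the refinement prompt mirror the templates quoted above.</p>
        <p># Two-step feedback-refinement for a yes/no question (illustrative sketch).
def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with an OpenAI/Gemini/DeepSeek API call")

def answer_yesno_with_feedback(question: str, snippets: list[str]) -> str:
    context = "\n".join(snippets)
    # Step 1: draft answer grounded in the snippets (prompt wording assumed).
    draft = call_llm(f"Snippets:\n{context}\n\nQuestion: {question}\nAnswer 'yes' or 'no'.")
    # Step 2: question-type-specific feedback prompt (wording as quoted above).
    feedback = call_llm(
        f"Snippets:\n{context}\nQuestion: {question}\nDraft answer: {draft}\n"
        "Evaluate the draft answer ('yes' or 'no') against the provided snippets and the "
        "question. Indicate explicitly if it should change, with brief reasoning."
    )
    # Step 3: fixed refinement prompt with the injected feedback.
    return call_llm(
        f"Snippets:\n{context}\nQuestion: {question}\nDraft answer: {draft}\n"
        f"Expert Feedback: {feedback}\n"
        "Revise and provide the final improved answer strictly following the original instructions."
    )</p>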
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Technical Implementation</title>
        <p>All pipelines were implemented using Python notebooks and the OpenAI, Google and DeepSeek APIs.
Query expansion used the query_string syntax of Elasticsearch. The PubMed annual baseline of
2024 was indexed (title, abstract only) on an Elasticsearch index using the standard English analyzer.
Snippet extraction and reranking were performed via LLM prompts. Code and notebooks are publicly
available on GitHub to ensure full reproducibility.</p>
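        <p>A minimal sketch of this retrieval backend, assuming the elasticsearch Python client together with illustrative index and field names (pubmed-2024, title, abstract) and an example expanded query:</p>
        <p># Illustrative sketch of the PubMed title/abstract index queried via query_string syntax.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a (simplified) PubMed record; the real index was built from the 2024
# annual baseline with the standard English analyzer on title and abstract.
es.index(index="pubmed-2024", id="12345678", document={
    "title": "Metformin in type 2 diabetes",
    "abstract": "Metformin is a first-line treatment for type 2 diabetes ...",
})

# Run an LLM-expanded query using query_string syntax over both fields.
expanded_query = '(metformin OR "dimethylbiguanide") AND "type 2 diabetes"'
results = es.search(
    index="pubmed-2024",
    query={"query_string": {"query": expanded_query, "fields": ["title", "abstract"]}},
    size=10,
)
for hit in results["hits"]["hits"]:
    print(hit["_id"], hit["_source"]["title"])</p>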
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>As the final results of the BioASQ 2025 Challenge are still being rated by experts and won’t be released
before September, we can only report on the preliminary results published on the BioASQ website.
We participated in Task A (document and snippet retrieval), Task A+ (question answering with own
retrieved documents), and Task B (question answering with gold standard documents). The experiments
were designed to evaluate the efficacy of different large language models (LLMs) and the impact of
self-generated feedback. All results are preliminary and subject to change following manual expert
evaluation.</p>
      <sec id="sec-4-1">
        <title>4.1. Model Selection</title>
        <p>We tested multiple currently available models with different settings on a small subset of the
BioASQ training set [27] from last year, specifically the fourth batch of BioASQ 12 Task B Phase B.
These models included:
• deepseek-reasoner
• deepseek-chat
• gemini-2.5-pro-exp-02-05
• gemini-2.0-flash-thinking-exp-01-21
• gemini-2.0-pro-exp-02-05
• gemini-2.0-flash-lite
• gemini-2.0-flash
• claude-3-5-haiku-20241022
• claude-3-7-sonnet-20250219
• gpt-4.5-preview-2025-02-27
• o3-mini-2025-01-31
• gpt-4o-mini-2024-07-18</p>
        <sec id="sec-4-1-1">
          <title>Key observations from these preliminary tests include:</title>
          <p>gemini-2.0-flash: Demonstrated strong performance across multiple metrics, particularly in
yesno_macro_f1 (0.954) and factoid_mrr (0.684), while being competitively priced.</p>
          <p>deepseek-reasoner: Achieved high yesno_accuracy (0.962963) and yesno_macro_f1 (0.957075),
comparable to gemini-2.0-flash, though with slightly lower performance in factoid and list question types
in these preliminary tests.</p>
          <p>We decided to choose gemini-2.0-flash as the non-reasoning LLM and also for our 10-shot baseline, as
it was competitive, fast, and cheap. For the reasoning model we chose deepseek-reasoner because it
is an open-weight model, cheaper to use via the official API, and competitive with the other
reasoning models (o3-mini, gemini-2.0-flash-thinking).</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Task A: Document and Snippet Retrieval</title>
        <p>In Task A, systems were evaluated on their ability to retrieve relevant documents and snippets for given
biomedical questions. Our systems were compared against other participating systems, with the "Top
Competitor" representing the leading system in each batch.</p>
        <p>Detailed preliminary result tables are available in Appendix A.</p>
        <p>Document Retrieval: Across the four test batches, our systems demonstrated varied performance.
• Batch 1: UR-IW-5 (gemini flash 2.0 + 10-shot) was our top performer, ranking 22nd
with a MAP of 0.2865, compared to the Top Competitor’s MAP of 0.4246. Our other systems
followed, with UR-IW-4 (deepseek-reasoner + feedback) having the lowest MAP (0.1739)
among our submissions in this batch.
• Batch 2: UR-IW-5 again led our systems (25th, MAP 0.2634), with UR-IW-4 (o3-mini +
feedback) closely following (26th, MAP 0.2601). The Top Competitor achieved a MAP of
0.4425.
• Batch 3: UR-IW-5 (gemini-2.5-flash-preview + 10-shot) was our best system (24th,
MAP 0.1834). The Top Competitor’s MAP was 0.3236.
• Batch 4: UR-IW-5 (gemini-2.5-flash-preview + 10-shot) ranked 27th with a MAP of
0.0794, while the Top Competitor had a MAP of 0.1801.</p>
        <p>Snippet Retrieval: Similar trends were observed in snippet retrieval performance.
• Batch 1: UR-IW-5 (gemini flash 2.0 + 10-shot) performed best among our systems (8th,
MAP 0.2768), with the Top Competitor achieving a MAP of 0.4535.
• Batch 2: UR-IW-5 again led our entries (12th, MAP 0.3080). The Top Competitor’s MAP was
0.5522.
• Batch 3: UR-IW-1 (gemini flash 2.0) and UR-IW-5 (gemini-2.5-flash-preview +
10-shot) were our strongest performers, ranking 15th (MAP 0.1534) and 18th (MAP 0.1488)
respectively. The Top Competitor had a MAP of 0.4322.
• Batch 4: UR-IW-5 (gemini-2.5-flash-preview + 10-shot) was our top system (18th,
MAP 0.0511). The Top Competitor achieved a MAP of 0.1634.</p>
        <p>Generally, the 10-shot run with Gemini Flash 2.0 or 2.5 (UR-IW-5) tended to perform better in
document and snippet retrieval tasks compared to our other configurations. The impact of feedback on
retrieval tasks (UR-IW-3 and UR-IW-4) varied across batches and didn’t consistently outperform the
base models or the 10-shot variants in MAP scores.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Task A+: Question Answering (Own Retrieved Documents)</title>
        <p>Task A+ required systems to answer questions based on the documents and snippets they retrieved in
Phase A.</p>
        <p>Yes/No Questions:
• UR-IW-1 (gemini flash 2.0) and UR-IW-2 (o3-mini) often performed strongly. In Batch 1,
UR-IW-1, UR-IW-2 and UR-IW-4 (o3-mini + feedback) achieved perfect accuracy and Macro
F1 scores.
• UR-IW-5 (gemini flash 2.0 + 10-shot) achieved a perfect score in Batch 2.
• In Batch 4, UR-IW-2 (o4-mini) was our top performer (2nd, Macro F1 0.9097).
• The feedback mechanism (UR-IW-3, UR-IW-4) showed mixed results, sometimes improving (e.g.,
UR-IW-4 in Batch 3) and sometimes underperforming compared to non-feedback versions.</p>
        <p>Factoid Questions:
• In Batch 1, UR-IW-2 (o3-mini) and UR-IW-5 (gemini flash 2.0 + 10-shot) were our best
systems (7th and 8th, MRR 0.3782 and 0.3750 respectively).
• UR-IW-4 (o3-mini + feedback) performed well in Batch 2 (2nd, MRR 0.5370).
• UR-IW-5 (gemini flash 2.0 + 10-shot) took the top position in Batch 4 with an MRR of
0.5606.
• Feedback versions (UR-IW-3, UR-IW-4) had variable performance. For example, UR-IW-3 (gemini
flash 2.0 + feedback) ranked 7th in Batch 3 (MRR 0.3100).</p>
        <p>List Questions:
• Our systems achieved several top rankings in this category. In Batch 1, UR-IW-2 (o3-mini),
UR-IW-1 (gemini flash 2.0), UR-IW-5 (gemini flash 2.0 + 10-shot), and UR-IW-4
(o3-mini + feedback) secured the top 4 positions with F-Measures of 0.2567, 0.2411, 0.2395,
and 0.2357 respectively.
• In Batch 2, UR-IW-2 (o3-mini) was again a strong performer (2nd, F-Measure 0.3805).
• The effect of feedback and few-shot learning varied. For instance, in Batch 3, UR-IW-5 (gemini
flash 2.0 + 10-shot) ranked 13th (F-Measure 0.3618).</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Task B: Question Answering (Gold Standard Documents)</title>
        <p>Task B involved answering questions using additional gold standard documents and snippets.
Yes/No Questions:
• UR-IW-1 (gemini flash 2.0) and UR-IW-5 (gemini flash 2.0 + 10-shot) achieved
perfect scores in Batch 1.
• UR-IW-5 also achieved a perfect score in Batch 2.
• The feedback system UR-IW-4 (o3-mini + feedback or o4-mini + feedback) performed
well, often outperforming its non-feedback counterpart in later batches (e.g., UR-IW-4 in Batch 3
and Batch 4 with Macro F1 of 0.8706 and 0.9097 respectively).</p>
        <p>Factoid Questions:
• UR-IW-3 (gemini flash 2.0 + feedback) was our best system in Batch 1 (17th, MRR
0.4821).
• In Batch 2, UR-IW-1 (gemini flash 2.0) performed strongly (11th, MRR 0.5926).
• UR-IW-4 (o4-mini + feedback) was our top performer in Batch 4 (6th, MRR 0.5909).
• The systems with feedback often showed competitive MRR scores, but overall the results were
mixed.</p>
        <p>List Questions:
• UR-IW-4 (o3-mini + feedback or o4-mini + feedback) consistently performed well,
ranking 28th in Batch 1 (F-Measure 0.5069) and 28th in Batch 2 (F-Measure 0.5188).
• In Batch 3, UR-IW-5 (gemini flash 2.0 + 10-shot) and UR-IW-3 (gemini flash 2.0 +
feedback) were our leading systems.
• The results suggest that both few-shot prompting and feedback mechanisms can be beneficial,
though their relative effectiveness varied across batches.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Future Work</title>
      <p>Model Performance: Based on the initial model selection tests and the BioASQ task 13B results,
gemini-2.0-flash and its variants showed strong and consistent performance, particularly the
10-shot version (UR-IW-5) in retrieval tasks and Yes/No questions. o3-mini and o4-mini (UR-IW-2
and UR-IW-4 configurations) also proved to be competitive, especially in question answering tasks.
deepseek-reasoner was competitive, particularly in Task A, Batch 1. However, due to the slow API we
were unable to complete runs with it in later batches and therefore opted for a proprietary replacement
(o3-mini, o4-mini).</p>
      <p>Impact of Self-Generated Feedback: The motivation to explore self-generated feedback stemmed
from our ongoing research into comparing the impact of human expert feedback to LLM generated
feedback. In these BioASQ preliminary results, the impact of adding a feedback step (UR-IW-3 and
UR-IW-4 configurations) was mixed across all tasks and batches. For Task A (Retrieval), feedback
configurations did not consistently outperform the base models or 10-shot configurations in terms of MAP
scores. For Task A+ and Task B (Question Answering), feedback sometimes led to improvements.
For instance, in Task B Yes/No questions, UR-IW-4 (with feedback) often surpassed UR-IW-2 (without
feedback) in later batches. Similarly, in Task B Factoid questions, feedback systems showed competitive
MRR scores. However, there were also instances where feedback did not improve performance or even resulted
in worse performance compared to the base model or the few-shot model. The preliminary tests on
model selection also hinted that self-feedback might not always enhance performance for some base
models.</p>
      <p>Few-Shot Learning vs. Feedback: The UR-IW-5 configurations, typically employing gemini
flash 2.0 + 10-shot or gemini-2.5-flash-preview + 10-shot, frequently emerged as
strong performers, especially in retrieval (Task A) and some question-answering sub-tasks (e.g., Task
A+ Factoid Batch 4, Task B Yes/No Batch 1). This suggests that providing a few examples is still a
successful way to guide these LLMs. When comparing Gemini Flash 2.0 base (UR-IW-1) with its
feedback version (UR-IW-3) and its 10-shot version (UR-IW-5), the 10-shot approach often had an edge,
particularly in retrieval.</p>
      <p>Best Suited Models and Approaches:
• For retrieval tasks (Task A), gemini flash 2.0 + 10-shot (UR-IW-5) appeared to be the
most promising approach among our submissions.
• For Yes/No questions (Task A+ &amp; B), gemini flash 2.0 (base and 10-shot) and
o3-mini/o4-mini (with and without feedback) all showed the ability to achieve high or perfect
scores.
• For Factoid questions (Task A+ &amp; B), performance was more varied. o3-mini/o4-mini with
feedback (UR-IW-4) and gemini flash 2.0 + 10-shot (UR-IW-5) had good performances
in certain batches.
• For List questions (Task A+ &amp; B), o3-mini (UR-IW-2) had particularly strong showings in
Task A+, Batch 1 and 2. In Task B, o3-mini/o4-mini with feedback (UR-IW-4) also performed
well.</p>
      <p>The choice of "best" model and approach appears to be task-dependent. Few-shot learning with
gemini-2.0-flash seems broadly effective. The feedback mechanism shows potential but requires
further refinement to ensure consistent improvements across diverse tasks and models. The preliminary
test data indicated that gemini-2.0-flash had strong baseline factoid performance, which was
reflected in some of the task results.</p>
      <p>These are preliminary observations, and a more in-depth analysis will be conducted once the final,
manually evaluated results are available. Future work will involve a more granular analysis of the
generated answers and the types of errors made by different models and approaches to refine our
strategies for future BioASQ challenges. The example code used for feedback and few-shot prompting
can be found online at https://github.com/SamyAteia/bioasq2025.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Ethical Considerations</title>
      <p>Even if the accuracy and reliability of LLM generated answers in RAG improve, they still tend to make
subtle errors or hallucinate information that is not supported by the source documents. These errors
can be especially difficult to catch when expert information needs such as the questions posed in the
BioASQ challenge are answered. The output of these systems should therefore not be used to inform
clinical decision-making without thorough expert oversight.</p>
      <p>Another ethical issue is the environmental cost of complex multistep RAG systems. As each LLM call
is processed on GPU clusters, with state-of-the-art models having billions of parameters distributed over these
GPUs, every call produces considerably more CO2 than a simple TF-IDF-based search result ranking.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>Overall, our feedback-based approach returned mixed results. There was no clear improvement over
the zero-shot baselines with the same models. The few-shot approach from last year’s participation that
we reused as a baseline this year, was, according to the preliminary results, still the most competitive
approach from our runs. It was also interesting to see that in our model selection test, the presumably
cheaper and smaller distilled models (Gemini Flash) achieved better results than their pricier and
presumably bigger counterparts (Gemini Pro) or the reasoning models (o3-mini, DeepSeek R1).</p>
      <p>We will build on the introduced feedback approach in future work, comparing the impact of human
and LLM-generated feedback on overall task performance in professional search [28]. We believe this
will be a valuable contribution to assess the performance of systems that foster human engagement vs.
systems that promise full automation.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>We thank the organizers of the BioASQ challenge for their continued support and quick response time.
This work is supported by the German Research Foundation (DFG) as part of the NFDIxCS consortium
(Grant number: 501930651).</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>The authors used the following generative-AI tools while preparing this paper (CEUR-WS Generative AI policy: https://ceur-ws.org/GenAI/Policy.html):
• OpenAI ChatGPT (o3, 4o, 4.5 preview) (May 2025) - drafting content, LaTeX formatting, paraphrasing and rewording.
• Google Gemini 2.5 Pro (May 2025) - drafting content, LaTeX formatting, paraphrasing and rewording.
• LanguageTool - spellchecking, paraphrasing and rewording.</p>
      <p>All AI-generated material was critically reviewed, revised and verified by the human authors. The
authors accept full responsibility for the integrity and accuracy of the final manuscript.</p>
    </sec>
    <sec id="sec-10">
      <title>A. Detailed Preliminary Results</title>
      <p>System rankings from the preliminary result tables (position among all submitted systems per batch):
Test batch 1: Top Competitor (1 of 72), UR-IW-4 (28 of 72), UR-IW-2 (50 of 72), UR-IW-1 (52 of 72), UR-IW-3 (57 of 72), UR-IW-5 (60 of 72)
Test batch 2: Top Competitor (1 of 72), UR-IW-4 (28 of 72), UR-IW-3 (31 of 72), UR-IW-1 (41 of 72), UR-IW-5 (47 of 72), UR-IW-2 (49 of 72)
Test batch 3: Top Competitor (1 of 66), UR-IW-5 (44 of 66), UR-IW-3 (45 of 66), UR-IW-4 (46 of 66), UR-IW-1 (47 of 66), UR-IW-2 (54 of 66)
Test batch 4: Top Competitor (1 of 79), UR-IW-3 (41 of 79), UR-IW-4 (46 of 79), UR-IW-5 (47 of 79), UR-IW-1 (55 of 79), UR-IW-2 (56 of 79)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Suri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Counts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Safavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Neville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. W.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Andersen</surname>
          </string-name>
          , G. Buscher,
          <string-name>
            <given-names>S.</given-names>
            <surname>Manivannan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rangan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>The Use of Generative Search Engines for Knowledge Work and Complex Tasks</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2404.04268. arXiv:
          <volume>2404</volume>
          .
          <fpage>04268</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B. G.</given-names>
            <surname>Edelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ngwe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <article-title>Measuring the impact of AI on information worker productivity</article-title>
          ,
          <source>Available at SSRN</source>
          <volume>4648686</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Bron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Greijn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. M.</given-names>
            <surname>Coimbra</surname>
          </string-name>
          , R. van de Schoot, A. Bagheri,
          <article-title>Combining large language model classifications and active learning for improved technology-assisted review</article-title>
          ,
          <source>in: Proceedings of the International Workshop on Interactive Adaptive Learning (IAL@PKDD/ECML 2024)</source>
          , volume
          <volume>3770</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>77</fpage>
          -
          <lpage>95</lpage>
          . URL: https://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>3770</volume>
          /paper6.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Verberne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Kruschwitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wiggers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Larsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Russell-Rose</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. P.</surname>
          </string-name>
          de Vries, First International Workshop on Professional Search,
          <source>SIGIR Forum 52</source>
          (
          <year>2019</year>
          )
          <fpage>153</fpage>
          -
          <lpage>162</lpage>
          . URL: https: //doi.org/10.1145/3308774.3308799. doi:
          <volume>10</volume>
          .1145/3308774.3308799.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Verberne</surname>
          </string-name>
          , Professional Search, 1 ed.,
          <source>Association for Computing Machinery</source>
          , New York, NY, USA,
          <year>2024</year>
          , p.
          <fpage>501</fpage>
          -
          <lpage>514</lpage>
          . URL: https://doi.org/10.1145/3674127.3674141.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , P. Liang,
          <article-title>Evaluating verifiability in generative search engines</article-title>
          ,
          <source>arXiv preprint arXiv:2304.09848</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Spatharioti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rothschild</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Goldstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Hofman</surname>
          </string-name>
          ,
          <article-title>Efects of LLM-based Search on Decision Making: Speed, Accuracy, and Overreliance</article-title>
          ,
          <source>in: Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, CHI '25</source>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2025</year>
          . URL: https://doi.org/10.1145/3706598.3714082. doi:
          <volume>10</volume>
          .1145/ 3706598.3714082.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodríguez-Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodriguez-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Loukachevitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakhovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tutubalina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitriadis</surname>
          </string-name>
          , G. Tsoumakas,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannakoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bekiaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Samaras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Maria Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martinelli</surname>
          </string-name>
          , G. Silvello, G. Paliouras,
          <source>Overview of BioASQ</source>
          <year>2025</year>
          :
          <article-title>The thirteenth BioASQ challenge on large-scale biomedical semantic indexing and question answering</article-title>
          , in: J.
          <string-name>
            <surname>C. de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ateia</surname>
          </string-name>
          , U. Kruschwitz,
          <article-title>Is chatgpt a biomedical expert?</article-title>
          , in: M.
          <string-name>
            <surname>Aliannejadi</surname>
            , G. Faggioli,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ferro</surname>
          </string-name>
          , M. Vlachos (Eds.),
          <source>Working Notes of the Conference and Labs of the Evaluation Forum (CLEF</source>
          <year>2023</year>
          ), Thessaloniki, Greece,
          <source>September 18th to 21st</source>
          ,
          <year>2023</year>
          , volume
          <volume>3497</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>73</fpage>
          -
          <lpage>90</lpage>
          . URL: https://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>3497</volume>
          /paper-006.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ateia</surname>
          </string-name>
          , U. Kruschwitz,
          <article-title>Can open-source llms compete with commercial models? exploring the few-shot performance of current GPT models in biomedical tasks</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuscáková</surname>
          </string-name>
          , A. G. S. de Herrera (Eds.),
          <source>Working Notes of the Conference and Labs of the Evaluation Forum (CLEF</source>
          <year>2024</year>
          ), Grenoble, France,
          <fpage>9</fpage>
          -
          <issue>12</issue>
          <year>September</year>
          ,
          <year>2024</year>
          , volume
          <volume>3740</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>78</fpage>
          -
          <lpage>98</lpage>
          . URL: https://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>3740</volume>
          /paper-07. pdf.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          , Attention is All You Need,
          <source>in: Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS 2017)</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [22]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          , Deep Research System Card, https://cdn.openai.com/deep-research
          <article-title>-system-card</article-title>
          .pdf ,
          <year>2025</year>
          . System card,
          <source>accessed 8 Jul</source>
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nakano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Balaji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hesse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kosaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Saunders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cobbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Eloundou</surname>
          </string-name>
          , G. Krueger,
          <string-name>
            <given-names>K.</given-names>
            <surname>Button</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Knight</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          , Webgpt:
          <article-title>Browser-assisted question-answering with human feedback, 2022</article-title>
          . URL: https://arxiv.org/abs/2112.09332. arXiv:2112.09332.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>T.</given-names>
            <surname>Russell-Rose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gooch</surname>
          </string-name>
          , U. Kruschwitz,
          <article-title>Interactive query expansion for professional search applications</article-title>
          ,
          <source>Business Information Review</source>
          <volume>38</volume>
          (
          <year>2021</year>
          )
          <fpage>127</fpage>
          -
          <lpage>137</lpage>
          . URL: https://doi.org/10.1177/02663821211034079. doi:
          <volume>10</volume>
          .1177/02663821211034079.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] J. Higgins, Cochrane handbook for systematic reviews of interventions, Cochrane Collaboration and John Wiley &amp; Sons Ltd (2008).</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] S. Ateia, U. Kruschwitz, BioRAGent: A Retrieval-Augmented Generation System for Showcasing Generative Query Expansion and Domain-Specific Search for Scientific Q&amp;A, in: European Conference on Information Retrieval, 2025.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] A. Krithara, A. Nentidis, K. Bougiatiotis, G. Paliouras, BioASQ-QA: A manually curated corpus for Biomedical Question Answering, Scientific Data 10 (2023) 170.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] S. Ateia, From professional search to generative deep research systems: How can expert oversight improve search outcomes?, in: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (Doctoral Consortium), 2025. To appear.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>