<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Can Language Models Critique Themselves? Investigating Self-Feedback for Retrieval Augmented Generation at BioASQ 2025</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Samy Ateia</string-name>
          <email>Samy.Ateia@stud.uni-regensburg.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Udo Kruschwitz</string-name>
          <email>udo.kruschwitz@ur.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Science, University of Regensburg</institution>
          ,
          <addr-line>Universitätsstraße 31, 93053, Regensburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Agentic Retrieval Augmented Generation (RAG) and 'deep research' systems aim to enable autonomous search processes where Large Language Models (LLMs) iteratively refine outputs. However, applying these systems to domain-specific professional search, such as biomedical research, presents challenges, as automated systems may reduce user involvement and misalign with expert information needs. Professional search tasks often demand high levels of user expertise and transparency. The BioASQ CLEF 2025 challenge, using expert-formulated questions, can serve as a platform to study these issues. We explored the performance of current reasoning and non-reasoning LLMs like Gemini-Flash 2.0, o3-mini, o4-mini and DeepSeek-R1. A key aspect of our methodology was a self-feedback mechanism where LLMs generated, evaluated, and then refined their outputs for query expansion and for multiple answer types (yes/no, factoid, list, ideal). We investigated whether this iterative self-correction improves performance and if reasoning models are more capable of generating useful feedback. Preliminary results indicate varied performance for the self-feedback strategy across models and tasks. This work offers insights into LLM self-correction and informs future work on comparing the effectiveness of LLM-generated feedback with direct human expert input in these search systems.</p>
      </abstract>
      <kwd-group>
        <kwd>Retrieval Augmented Generation</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Biomedical Question Answering</kwd>
        <kwd>Professional Search</kwd>
        <kwd>Self-Feedback Mechanisms</kwd>
        <kwd>Query Expansion</kwd>
        <kwd>BioASQ</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>1.1. BioASQ Challenge</title>
        <p>
          The BioASQ challenge provides a long-running platform for evaluating systems on large-scale
biomedical semantic indexing and question answering [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Participants are tasked with retrieving relevant
documents and snippets from biomedical literature (PubMed) and generating precise answers to
expert-formulated questions, which can be in yes/no, factoid, list, or ideal summary formats. The structured,
domain-specific nature of the BioASQ challenge makes it especially suitable for assessing advanced
RAG methods for expert information needs.
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Our Contribution</title>
        <p>
          Our team has participated in previous iterations of the BioASQ challenge, examining the performance
of various commercial and open-source LLMs, the impact of few-shot learning, and the effects of
additional context from knowledge bases [
          <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
          ]. In this year’s challenge (CLEF 2025), we continued
our participation across Task A (document and snippet retrieval), Task A+ (Q&amp;A with own retrieved
documents), and Task B (Q&amp;A with retrieved and gold documents). Our primary investigation centered
on the effectiveness of a self-feedback loop implemented with current LLMs, including Gemini-Flash
2.0, o3-mini, o4-mini, and DeepSeek Reasoner, to evaluate if models can improve their own generated
query expansions and answers through self-critique.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>This work builds upon recent advancements in Large Language Models (LLMs), few-shot and zero-shot
learning, Retrieval Augmented Generation (RAG), and their applications to professional search.</p>
      <sec id="sec-2-1">
        <title>2.1. Large Language Models</title>
        <p>
          The field of Natural Language Processing (NLP) has been significantly advanced by Large Language
Models, mostly based on the transformer architecture [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Early influential models like BERT
(Bidirectional Encoder Representations from Transformers) [12] demonstrated the power of pre-training on
large text corpora. Parallel developments led to autoregressive models such as the GPT (Generative
Pre-trained Transformer) series [13, 14]. The capabilities of these models were further improved through
techniques like Reinforcement Learning from Human Feedback (RLHF), which helps align LLM outputs
with human preferences and instructions, making them better at following prompts [15].
        </p>
        <p>Recent months have seen the emergence of numerous so-called reasoning models from various
developers, including Google’s Gemini 2.5, OpenAI’s o1 to o4-mini model series, and models like
DeepSeek R1 [16]. These models build on the idea of Chain of Thought (CoT) prompting [17], which
showed that models perform better when prompted to generate additional tokens in their output
that mimic reasoning or thinking steps. Fine-tuning models for this reasoning process, and thereby
enabling variable scaling of test-time compute [18], enabled further advances in model performance on
popular benchmarks. Reinforcement learning on math- and coding-related datasets, called Reinforcement
Learning with Verifiable Reward (RLVR) [19], seems to be a current approach to enable models to find
useful reasoning strategies.</p>
        <p>Our work uses several of these current reasoning and non-reasoning models to compare their
performance in a biomedical RAG setting and to see if these reasoning models are better at generating
self-feedback.</p>
        <p>Footnotes:
4: https://pubmed.ncbi.nlm.nih.gov/download/#annual-baseline
5: https://web.archive.org/web/20250518193243/https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/
6: https://web.archive.org/web/20250518101415/https://openai.com/index/learning-to-reason-with-llms/</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Few and Zero-Shot Learning</title>
        <p>A key characteristic of modern LLMs is their ability to perform tasks with minimal or no task-specific
training data, often referred to as In-Context Learning (ICL). Few-shot learning allows LLMs to learn
a new task by conditioning on a few input-output examples provided directly in the prompt. This
approach removes the need for extensive, curated training datasets, a concept popularized by models
like GPT-3 [14]. Zero-shot learning takes this further, enabling LLMs to perform tasks based solely
on a natural language description or a direct question, without any preceding examples.</p>
        <p>
          Our previous work has demonstrated the competitive performance of both zero-shot and few-shot
approaches in the BioASQ challenge [
          <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
          ]. These techniques are fundamental to the prompting
strategies used in our current experiments, forming the basis for initial query/answer generation before
any self-feedback loop.
        </p>
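        <p>To make this concrete, the following minimal Python sketch shows how a k-shot prompt can be assembled from input-output example pairs; the helper name, prompt wording, and example pairs are illustrative assumptions rather than the exact prompts used in our runs.</p>
        <p># Minimal sketch of few-shot (k-shot) prompt construction (illustrative only).

def build_few_shot_prompt(examples, new_question):
    """Concatenate input-output examples followed by the new question."""
    parts = ["Answer the biomedical question based on the following examples."]
    for question, answer in examples:
        parts.append(f"Question: {question}\nAnswer: {answer}")
    parts.append(f"Question: {new_question}\nAnswer:")
    return "\n\n".join(parts)

# Illustrative placeholder examples; the actual runs used 10 curated BioASQ examples.
examples = [
    ("Is metformin used to treat type 2 diabetes?", "yes"),
    ("Which gene is mutated in cystic fibrosis?", "CFTR"),
]

print(build_few_shot_prompt(examples, "Is aspirin an anticoagulant?"))</p>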
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Retrieval Augmented Generation (RAG)</title>
        <p>Retrieval Augmented Generation (RAG) combines the generative capabilities of LLMs with information
retrieved from external knowledge sources [20]. This approach aims to ground LLM responses in factual
data, thereby reducing the likelihood of hallucinations and improving the reliability and verifiability of
generated content [21]. A typical RAG pipeline involves a retriever that fetches relevant documents
or snippets, and a generator LLM that synthesizes an answer based on the prompt and the retrieved
context. The BioASQ challenge itself can be considered an example of a RAG setup in a specialized
domain.</p>
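        <p>As a minimal illustration of this retrieve-then-generate pattern, the sketch below wires a placeholder retriever to a placeholder generator; both function bodies are stand-ins for a real search index and a real LLM call, and only the overall control flow reflects the pipeline described above.</p>
        <p># Minimal retrieve-then-generate (RAG) control flow; retriever and generator are stubs.
from typing import List

def retrieve_snippets(question: str, top_k: int = 20) -> List[str]:
    # Stub retriever: a real pipeline would query a document index (e.g. PubMed).
    return ["Example snippet about the question topic."][:top_k]

def generate_answer(question: str, snippets: List[str]) -> str:
    # Stub generator: a real pipeline would send this grounded prompt to an LLM.
    context = "\n".join(f"- {s}" for s in snippets)
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    return f"[LLM answer grounded in {len(snippets)} snippet(s); prompt length {len(prompt)} chars]"

def rag_answer(question: str) -> str:
    # Retrieve first, then generate an answer conditioned on the retrieved context.
    return generate_answer(question, retrieve_snippets(question))

if __name__ == "__main__":
    print(rag_answer("Which gene is mutated in cystic fibrosis?"))</p>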
        <p>
          The RAG concept is evolving towards more dynamic and autonomous systems, sometimes termed
Agentic RAG or ’deep research’ systems [
          <xref ref-type="bibr" rid="ref12">22</xref>
          ]. One of the first such systems was WebGPT, a fine-tuned
version of GPT-3 published by a team at OpenAI in 2021 [
          <xref ref-type="bibr" rid="ref13">23</xref>
          ]. It took OpenAI another three years to roll
out a similar system to their ChatGPT user base. Their newest models, o3 and o4-mini, are trained via
reinforcement learning to decide autonomously when and for how long to search, alongside using other tools.
These advanced systems may involve LLM-powered agents performing multistep retrieval, reasoning
over the retrieved information, and iteratively refining their outputs or search strategies. The deep
research modes offered by both OpenAI and Google take these concepts even further and let the models
search for over 5 minutes through up to hundreds of websites before synthesizing a multipage report.
        </p>
        <p>Our test of a self-feedback mechanism, where an LLM critiques and revises its own generated queries
and answers, is intended to analyze the abilities of off-the-shelf LLMs on such tasks. In future work,
we plan to replace the LLM-generated feedback with feedback from human experts to compare the
effectiveness of human- and AI-guided search processes.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Professional Search</title>
        <p>
          Professional search refers to information seeking conducted in a work-related context, often by
specialists who require high precision, control, and the ability to formulate complex queries [
          <xref ref-type="bibr" rid="ref14 ref4">4, 24</xref>
          ]. Domains
such as biomedical research demand robust evidence-based answers, making transparency and the
ability to trace information back to source documents crucial [25]. LLMs are increasingly being explored
for professional search applications, offering potential benefits like advanced query understanding and
generation of evidence-based summaries [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. However, challenges such as LLM hallucinations and the
need to align with expert workflows remain significant.
        </p>
        <p>Our previous work, the BioRAGent system, has focused on making LLM-driven RAG accessible and
transparent for biomedical question answering, enabling users to review and customize generated
Boolean queries in the search process [26]. This study builds upon that work by exploring the impact
of generated critical feedback on query generation and answer generation, which will be compared
against human feedback in future work.</p>
        <p>Footnotes:
7: https://web.archive.org/web/20250516083609/https://openai.com/index/webgpt/
8: https://web.archive.org/web/20250511211101/https://openai.com/index/introducing-chatgpt-search/
9: https://web.archive.org/web/20250514114152/https://openai.com/index/introducing-o3-and-o4-mini/</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>We evaluated several Large Language Models (LLMs) in the context of the BioASQ CLEF 2025 Challenge,
specifically in Task 13 B, which is structured into Phase A (retrieval), Phase A+ (Q&amp;A based on retrieved
snippets), and Phase B (Q&amp;A based on additional gold-standard snippets).</p>
      <sec id="sec-3-1">
        <title>3.1. Models</title>
        <p>The models used were grouped into two categories:
• Non-reasoning models:
– Gemini Flash 2.0
– Gemini 2.5 Flash (used without explicit reasoning mode)
• Reasoning models:
– o3-mini
– o4-mini (introduced mid-challenge and used in later batches)
– DeepSeek Reasoner (initially used but replaced due to slow API)</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Task 13 B Experimental Setup</title>
        <p>We participated in all four batches of Task 13 B and submitted several systems under different
configurations. Each batch comprised five runs, covering combinations of baseline prompting,
feedback-augmented prompting, and few-shot learning (10-shot).
3.2.1. Phase A: Document and Snippet Retrieval
Each model configuration involved one of the following strategies:
• Baseline: Direct prompt-based query generation without iteration.
• Feedback (FB): Prompt refinement using self-generated feedback.</p>
        <p>• Few-shot: Prompting the model with 10 examples of successful queries.</p>
        <p>UR-IW-1 and UR-IW-3 are paired for comparison, both using the same non-reasoning model (Gemini)
without and with feedback, respectively. Similarly, UR-IW-2 and UR-IW-4 form a second pair, using a
reasoning model without and with feedback. UR-IW-5 is always configured as a non-reasoning
few-shot baseline.</p>
        <p>The following table summarizes the configurations for Phase A across all four batches:</p>
        <p>The non-feedback and few-shot approaches were mostly identical to our last year’s participation;
in Phase A, feedback was only used for the query generation and refinement step. The top 10 results
from the initial query were passed on to the feedback-generating model as additional context. No
feedback was used for snippet extraction and reranking.</p>
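        <p>The sketch below illustrates this query-expansion and refinement loop under stated assumptions: the OpenAI chat-completions client, the model name, and the prompt wording are placeholders for illustration; only the overall structure (initial expansion, feedback over the top-10 results, one refinement pass) follows the setup described above.</p>
        <p># Illustrative sketch of the Phase A query expansion + self-feedback refinement.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY; the Gemini/DeepSeek runs used their own APIs
MODEL = "o4-mini"  # placeholder model name

def ask(prompt):
    """Single chat-completion call; the prompt wording below is illustrative, not our exact prompts."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def expand_query(question):
    # Step 1: initial Elasticsearch query_string expansion.
    return ask(f"Write an Elasticsearch query_string query for this question:\n{question}")

def refine_query(question, query, top10_titles):
    # Step 2: the model critiques its own query, seeing the top-10 results as context.
    feedback = ask(
        f"Question: {question}\nQuery: {query}\nTop 10 results:\n" + "\n".join(top10_titles)
        + "\nEvaluate whether this query retrieves relevant documents and suggest improvements."
    )
    # Step 3: the self-generated feedback is injected into a refinement prompt.
    return ask(
        f"Question: {question}\nOriginal query: {query}\nExpert Feedback: {feedback}\n"
        "Revise and provide the final improved query strictly following the original instructions."
    )</p>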
        <p>3.2.2. Phase A+ and Phase B: Answer Generation
The system configurations used for Phase A+ and Phase B were similar to those of Phase A. However,
the source of the contextual snippets used to ground the answer generation differed:
• Phase A+: Used the top-20 snippets retrieved by the corresponding model in Phase A.
• Phase B: Used a merged set combining the top-20 retrieved snippets from Phase A and the
gold-standard snippets provided by the organizers.</p>
        <p>As in Phase A, UR-IW-1/3 and UR-IW-2/4 are grouped to compare feedback vs. non-feedback
performance for non-reasoning and reasoning models, respectively. UR-IW-5 serves as a consistent
few-shot baseline using non-reasoning models. For the feedback configurations, the model first generated
a draft answer, which was then critiqued using a question-type-specific feedback prompt:
• Yes/No questions: “Evaluate the draft answer (’yes’ or ’no’) against the provided snippets and the
question. Indicate explicitly if it should change, with brief reasoning.”
• Factoid questions: “Evaluate the draft JSON entity list answer against the provided snippets and
the question. Clearly suggest corrections, removals, or additions.”
• List questions: Same as factoid prompt.
• Ideal answer (summary): “Evaluate the provided summary answer for accuracy, clarity, and
completeness against the provided snippets and the question. Clearly suggest improvements.”
The generated feedback was then injected into a fixed refinement prompt to guide the model toward
a final improved answer:</p>
        <p>Expert Feedback: {feedback_response}
Revise and provide the final improved answer strictly following the
original instructions.</p>
        <p>This two-step feedback-refinement process aimed to simulate expert review and enforce more robust
quality control over generated answers.</p>
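        <p>The sketch below illustrates this two-step process for a yes/no question. The call_llm stub stands in for any of the model APIs used in our runs, and the draft-answer prompt is an assumption; the feedback prompt and the refinement prompt mirror the templates quoted above.</p>
        <p># Two-step feedback-refinement for a yes/no question (illustrative sketch).
def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with an OpenAI/Gemini/DeepSeek API call")

def answer_yesno_with_feedback(question: str, snippets: list[str]) -> str:
    context = "\n".join(snippets)
    # Step 1: draft answer grounded in the snippets (prompt wording assumed).
    draft = call_llm(f"Snippets:\n{context}\n\nQuestion: {question}\nAnswer 'yes' or 'no'.")
    # Step 2: question-type-specific feedback prompt (wording as quoted above).
    feedback = call_llm(
        f"Snippets:\n{context}\nQuestion: {question}\nDraft answer: {draft}\n"
        "Evaluate the draft answer ('yes' or 'no') against the provided snippets and the "
        "question. Indicate explicitly if it should change, with brief reasoning."
    )
    # Step 3: fixed refinement prompt with the injected feedback.
    return call_llm(
        f"Snippets:\n{context}\nQuestion: {question}\nDraft answer: {draft}\n"
        f"Expert Feedback: {feedback}\n"
        "Revise and provide the final improved answer strictly following the original instructions."
    )</p>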
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Technical Implementation</title>
        <p>All pipelines were implemented using Python notebooks and the OpenAI, Google and DeepSeek APIs.
Query expansion used the query_string syntax of Elasticsearch. The PubMed annual baseline of
2024 was indexed (title, abstract only) on an Elasticsearch index using the standard English analyzer.
Snippet extraction and reranking were performed via LLM prompts. Code and notebooks are publicly
available on GitHub to ensure full reproducibility.</p>
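        <p>A minimal sketch of this retrieval backend, assuming the elasticsearch Python client together with illustrative index and field names (pubmed-2024, title, abstract) and an example expanded query:</p>
        <p># Illustrative sketch of the PubMed title/abstract index queried via query_string syntax.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a (simplified) PubMed record; the real index was built from the 2024
# annual baseline with the standard English analyzer on title and abstract.
es.index(index="pubmed-2024", id="12345678", document={
    "title": "Metformin in type 2 diabetes",
    "abstract": "Metformin is a first-line treatment for type 2 diabetes ...",
})

# Run an LLM-expanded query using query_string syntax over both fields.
expanded_query = '(metformin OR "dimethylbiguanide") AND "type 2 diabetes"'
results = es.search(
    index="pubmed-2024",
    query={"query_string": {"query": expanded_query, "fields": ["title", "abstract"]}},
    size=10,
)
for hit in results["hits"]["hits"]:
    print(hit["_id"], hit["_source"]["title"])</p>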
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>As the final results of the BioASQ 2025 Challenge are still being rated by experts and won’t be released
before September, we can only report on the preliminary results published on the BioASQ website.
We participated in Task A (document and snippet retrieval), Task A+ (question answering with own
retrieved documents), and Task B (question answering with gold standard documents). The experiments
were designed to evaluate the efficacy of different large language models (LLMs) and the impact of
self-generated feedback. All results are preliminary and subject to change following manual expert
evaluation.</p>
      <sec id="sec-4-1">
        <title>4.1. Model Selection</title>
        <p>We tested multiple currently available models with different settings on a small subset of the
BioASQ training set [27] from last year, specifically the fourth batch of BioASQ 12 Task B Phase B.
These models included:
• deepseek-reasoner
• deepseek-chat
• gemini-2.5-pro-exp-02-05
• gemini-2.0-flash-thinking-exp-01-21
• gemini-2.0-pro-exp-02-05
• gemini-2.0-flash-lite
• gemini-2.0-flash
• claude-3-5-haiku-20241022
• claude-3-7-sonnet-20250219
• gpt-4.5-preview-2025-02-27
• o3-mini-2025-01-31
• gpt-4o-mini-2024-07-18</p>
        <sec id="sec-4-1-1">
          <title>Key observations from these preliminary tests include:</title>
          <p>gemini-2.0-flash: Demonstrated strong performance across multiple metrics, particularly in
yesno_macro_f1 (0.954) and factoid_mrr (0.684), while being competitively priced.</p>
          <p>deepseek-reasoner: Achieved high yesno_accuracy (0.962963) and yesno_macro_f1 (0.957075),
comparable to gemini-2.0-flash, though with slightly lower performance in factoid and list question types
in these preliminary tests.</p>
          <p>We decided to choose gemini-2.0-flash as the non-reasoning LLM and also for our 10-shot baseline, as
it was competitive, fast, and cheap. For the reasoning model we chose deepseek-reasoner because it
is an open-weight model, cheaper to use via the official API, and competitive with the other
reasoning models (o3-mini, gemini-2.0-flash-thinking).</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Task A: Document and Snippet Retrieval</title>
        <p>In Task A, systems were evaluated on their ability to retrieve relevant documents and snippets for given
biomedical questions. Our systems were compared against other participating systems, with the "Top
Competitor" representing the leading system in each batch.</p>
        <p>Detailed preliminary result tables are available in Appendix A.</p>
        <p>Document Retrieval: Across the four test batches, our systems demonstrated varied performance.
• Batch 1: UR-IW-5 (gemini flash 2.0 + 10-shot) was our top performer, ranking 22nd
with a MAP of 0.2865, compared to the Top Competitor’s MAP of 0.4246. Our other systems
followed, with UR-IW-4 (deepseek-reasoner + feedback) having the lowest MAP (0.1739)
among our submissions in this batch.
• Batch 2: UR-IW-5 again led our systems (25th, MAP 0.2634), with UR-IW-4 (o3-mini +
feedback) closely following (26th, MAP 0.2601). The Top Competitor achieved a MAP of
0.4425.
• Batch 3: UR-IW-5 (gemini-2.5-flash-preview + 10-shot) was our best system (24th,
MAP 0.1834). The Top Competitor’s MAP was 0.3236.
• Batch 4: UR-IW-5 (gemini-2.5-flash-preview + 10-shot) ranked 27th with a MAP of
0.0794, while the Top Competitor had a MAP of 0.1801.</p>
        <p>Snippet Retrieval: Similar trends were observed in snippet retrieval performance.
• Batch 1: UR-IW-5 (gemini flash 2.0 + 10-shot) performed best among our systems (8th,
MAP 0.2768), with the Top Competitor achieving a MAP of 0.4535.
• Batch 2: UR-IW-5 again led our entries (12th, MAP 0.3080). The Top Competitor’s MAP was
0.5522.
• Batch 3: UR-IW-1 (gemini flash 2.0) and UR-IW-5 (gemini-2.5-flash-preview +
10-shot) were our strongest performers, ranking 15th (MAP 0.1534) and 18th (MAP 0.1488)
respectively. The Top Competitor had a MAP of 0.4322.
• Batch 4: UR-IW-5 (gemini-2.5-flash-preview + 10-shot) was our top system (18th,
MAP 0.0511). The Top Competitor achieved a MAP of 0.1634.</p>
        <p>Generally, the 10-shot run with Gemini Flash 2.0 or 2.5 (UR-IW-5) tended to perform better in
document and snippet retrieval tasks compared to our other configurations. The impact of feedback on
retrieval tasks (UR-IW-3 and UR-IW-4) varied across batches and didn’t consistently outperform the
base models or the 10-shot variants in MAP scores.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Task A+: Question Answering (Own Retrieved Documents)</title>
        <p>Task A+ required systems to answer questions based on the documents and snippets they retrieved in
Phase A.</p>
        <p>Yes/No Questions:
• UR-IW-1 (gemini flash 2.0) and UR-IW-2 (o3-mini) often performed strongly. In Batch 1,
UR-IW-1, UR-IW-2 and UR-IW-4 (o3-mini + feedback) achieved perfect accuracy and Macro
F1 scores.
• UR-IW-5 (gemini flash 2.0 + 10-shot) achieved a perfect score in Batch 2.
• In Batch 4, UR-IW-2 (o4-mini) was our top performer (2nd, Macro F1 0.9097).
• The feedback mechanism (UR-IW-3, UR-IW-4) showed mixed results, sometimes improving (e.g.,
UR-IW-4 in Batch 3) and sometimes underperforming compared to non-feedback versions.</p>
        <p>Factoid Questions:
• In Batch 1, UR-IW-2 (o3-mini) and UR-IW-5 (gemini flash 2.0 + 10-shot) were our best
systems (7th and 8th, MRR 0.3782 and 0.3750 respectively).
• UR-IW-4 (o3-mini + feedback) performed well in Batch 2 (2nd, MRR 0.5370).
• UR-IW-5 (gemini flash 2.0 + 10-shot) took the top position in Batch 4 with an MRR of
0.5606.
• Feedback versions (UR-IW-3, UR-IW-4) had variable performance. For example, UR-IW-3 (gemini
flash 2.0 + feedback) ranked 7th in Batch 3 (MRR 0.3100).</p>
        <p>List Questions:
• Our systems achieved several top rankings in this category. In Batch 1, UR-IW-2 (o3-mini),
UR-IW-1 (gemini flash 2.0), UR-IW-5 (gemini flash 2.0 + 10-shot), and UR-IW-4
(o3-mini + feedback) secured the top 4 positions with F-Measures of 0.2567, 0.2411, 0.2395,
and 0.2357 respectively.
• In Batch 2, UR-IW-2 (o3-mini) was again a strong performer (2nd, F-Measure 0.3805).
• The effect of feedback and few-shot learning varied. For instance, in Batch 3, UR-IW-5 (gemini
flash 2.0 + 10-shot) ranked 13th (F-Measure 0.3618).</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Task B: Question Answering (Gold Standard Documents)</title>
        <p>Task B involved answering questions using additional gold standard documents and snippets.
Yes/No Questions:
• UR-IW-1 (gemini flash 2.0) and UR-IW-5 (gemini flash 2.0 + 10-shot) achieved
perfect scores in Batch 1.
• UR-IW-5 also achieved a perfect score in Batch 2.
• The feedback system UR-IW-4 (o3-mini + feedback or o4-mini + feedback) performed
well, often outperforming its non-feedback counterpart in later batches (e.g., UR-IW-4 in Batch 3
and Batch 4 with Macro F1 of 0.8706 and 0.9097 respectively).</p>
        <p>Factoid Questions:
• UR-IW-3 (gemini flash 2.0 + feedback) was our best system in Batch 1 (17th, MRR
0.4821).
• In Batch 2, UR-IW-1 (gemini flash 2.0) performed strongly (11th, MRR 0.5926).
• UR-IW-4 (o4-mini + feedback) was our top performer in Batch 4 (6th, MRR 0.5909).
• The systems with feedback often showed competitive MRR scores, but overall the results were
mixed.</p>
        <p>List Questions:
• UR-IW-4 (o3-mini + feedback or o4-mini + feedback) consistently performed well,
ranking 28th in Batch 1 (F-Measure 0.5069) and 28th in Batch 2 (F-Measure 0.5188).
• In Batch 3, UR-IW-5 (gemini flash 2.0 + 10-shot) and UR-IW-3 (gemini flash 2.0 +
feedback) were our leading systems.
• The results suggest that both few-shot prompting and feedback mechanisms can be beneficial,
though their relative effectiveness varied across batches.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Future Work</title>
      <p>Model Performance: Based on the initial model selection tests and the BioASQ task 13B results,
gemini-2.0-flash and its variants showed strong and consistent performance, particularly the
10-shot version (UR-IW-5) in retrieval tasks and Yes/No questions. o3-mini and o4-mini (UR-IW-2
and UR-IW-4 configurations) also proved to be competitive, especially in question answering tasks.
deepseek-reasoner was competitive, particularly in Task A, Batch 1. However, due to the slow API we
were unable to complete runs with it in later batches and therefore opted for a proprietary replacement
(o3-mini, o4-mini).</p>
      <p>Impact of Self-Generated Feedback: The motivation to explore self-generated feedback stemmed
from our ongoing research into comparing the impact of human expert feedback to LLM generated
feedback. In these BioASQ preliminary results, the impact of adding a feedback step (UR-IW-3 and
UR-IW-4 configurations) was mixed across all tasks and batches. For Task A (Retrieval), feedback
configurations did not consistently outperform the base models or 10-shot configurations in terms of MAP
scores. For Task A+ and Task B (Question Answering), feedback sometimes led to improvements.
For instance, in Task B Yes/No questions, UR-IW-4 (with feedback) often surpassed UR-IW-2 (without
feedback) in later batches. Similarly, in Task B Factoid questions, feedback systems showed competitive
MRR scores. However, there were also instances where feedback did not improve performance or even resulted
in worse performance compared to the base model or the few-shot model. The preliminary tests on
model selection also hinted that self-feedback might not always enhance performance for some base
models.</p>
      <p>Few-Shot Learning vs. Feedback: The UR-IW-5 configurations, typically employing gemini
flash 2.0 + 10-shot or gemini-2.5-flash-preview + 10-shot, frequently emerged as
strong performers, especially in retrieval (Task A) and some question-answering sub-tasks (e.g., Task
A+ Factoid Batch 4, Task B Yes/No Batch 1). This suggests that providing a few examples is still a
successful way to guide these LLMs. When comparing Gemini Flash 2.0 base (UR-IW-1) with its
feedback version (UR-IW-3) and its 10-shot version (UR-IW-5), the 10-shot approach often had an edge,
particularly in retrieval.</p>
      <p>Best Suited Models and Approaches:
• For retrieval tasks (Task A), gemini flash 2.0 + 10-shot (UR-IW-5) appeared to be the
most promising approach among our submissions.
• For Yes/No questions (Task A+ &amp; B), gemini flash 2.0 (base and 10-shot) and
o3-mini/o4-mini (with and without feedback) all showed the ability to achieve high or perfect
scores.
• For Factoid questions (Task A+ &amp; B), performance was more varied. o3-mini/o4-mini with
feedback (UR-IW-4) and gemini flash 2.0 + 10-shot (UR-IW-5) had good performances
in certain batches.
• For List questions (Task A+ &amp; B), o3-mini (UR-IW-2) had particularly strong showings in
Task A+, Batch 1 and 2. In Task B, o3-mini/o4-mini with feedback (UR-IW-4) also performed
well.</p>
      <p>The choice of "best" model and approach appears to be task-dependent. Few-shot learning with
gemini-2.0-flash seems broadly effective. The feedback mechanism shows potential but requires
further refinement to ensure consistent improvements across diverse tasks and models. The preliminary
test data indicated that gemini-2.0-flash had strong baseline factoid performance, which was
reflected in some of the task results.</p>
      <p>These are preliminary observations, and a more in-depth analysis will be conducted once the final,
manually evaluated results are available. Future work will involve a more granular analysis of the
generated answers and the types of errors made by different models and approaches to refine our
strategies for future BioASQ challenges. The example code used for feedback and few-shot prompting
can be found online at https://github.com/SamyAteia/bioasq2025.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Ethical Considerations</title>
      <p>Even if the accuracy and reliability of LLM generated answers in RAG improve, they still tend to make
subtle errors or hallucinate information that is not supported by the source documents. These errors
can be especially difficult to catch when expert information needs such as the questions posed in the
BioASQ challenge are answered. The output of these systems should therefore not be used to inform
clinical decision-making without thorough expert oversight.</p>
      <p>Another ethical issue is the environmental cost of complex multistep RAG systems. As each LLM call
is processed on GPU clusters, with state-of-the-art models having billions of parameters distributed over these
GPUs, every call produces considerably more CO2 than a simple TF-IDF-based search result ranking.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>Overall, our feedback-based approach returned mixed results. There was no clear improvement over
the zero-shot baselines with the same models. The few-shot approach from last year’s participation that
we reused as a baseline this year, was, according to the preliminary results, still the most competitive
approach from our runs. It was also interesting to see that in our model selection test, the presumably
cheaper and smaller distilled models (Gemini Flash) achieved better results than their pricier and
presumably bigger counterparts (Gemini Pro) or the reasoning models (o3-mini, DeepSeek R1).</p>
      <p>We will build on the introduced feedback approach in future work, comparing the impact of human
and LLM-generated feedback on overall task performance in professional search [28]. We believe this
will be a valuable contribution to assess the performance of systems that foster human engagement vs.
systems that promise full automation.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>We thank the organizers of the BioASQ challenge for their continued support and quick response time.
This work is supported by the German Research Foundation (DFG) as part of the NFDIxCS consortium
(Grant number: 501930651).</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>The authors used the following generative-AI tools while preparing this paper (CEUR-WS Generative AI policy: https://ceur-ws.org/GenAI/Policy.html):
• OpenAI ChatGPT (o3, 4o, 4.5 preview) (May 2025) - drafting content, LaTeX formatting, paraphrasing and rewording.
• Google Gemini 2.5 Pro (May 2025) - drafting content, LaTeX formatting, paraphrasing and rewording.
• LanguageTool - spellchecking, paraphrasing and rewording.</p>
      <p>All AI-generated material was critically reviewed, revised and verified by the human authors. The
authors accept full responsibility for the integrity and accuracy of the final manuscript.</p>
    </sec>
    <sec id="sec-10">
      <title>A. Detailed Preliminary Results</title>
      <p>System rankings from the preliminary result tables (position among all submitted systems per batch):
Test batch 1: Top Competitor (1 of 72), UR-IW-4 (28 of 72), UR-IW-2 (50 of 72), UR-IW-1 (52 of 72), UR-IW-3 (57 of 72), UR-IW-5 (60 of 72)
Test batch 2: Top Competitor (1 of 72), UR-IW-4 (28 of 72), UR-IW-3 (31 of 72), UR-IW-1 (41 of 72), UR-IW-5 (47 of 72), UR-IW-2 (49 of 72)
Test batch 3: Top Competitor (1 of 66), UR-IW-5 (44 of 66), UR-IW-3 (45 of 66), UR-IW-4 (46 of 66), UR-IW-1 (47 of 66), UR-IW-2 (54 of 66)
Test batch 4: Top Competitor (1 of 79), UR-IW-3 (41 of 79), UR-IW-4 (46 of 79), UR-IW-5 (47 of 79), UR-IW-1 (55 of 79), UR-IW-2 (56 of 79)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Suri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Counts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Safavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Neville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. W.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Andersen</surname>
          </string-name>
          , G. Buscher,
          <string-name>
            <given-names>S.</given-names>
            <surname>Manivannan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rangan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>The Use of Generative Search Engines for Knowledge Work and Complex Tasks</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2404.04268. arXiv:
          <volume>2404</volume>
          .
          <fpage>04268</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B. G.</given-names>
            <surname>Edelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ngwe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <article-title>Measuring the impact of AI on information worker productivity</article-title>
          ,
          <source>Available at SSRN</source>
          <volume>4648686</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Bron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Greijn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. M.</given-names>
            <surname>Coimbra</surname>
          </string-name>
          , R. van de Schoot, A. Bagheri,
          <article-title>Combining large language model classifications and active learning for improved technology-assisted review</article-title>
          ,
          <source>in: Proceedings of the International Workshop on Interactive Adaptive Learning (IAL@PKDD/ECML 2024)</source>
          , volume
          <volume>3770</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>77</fpage>
          -
          <lpage>95</lpage>
          . URL: https://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>3770</volume>
          /paper6.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Verberne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Kruschwitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wiggers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Larsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Russell-Rose</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. P.</surname>
          </string-name>
          de Vries, First International Workshop on Professional Search,
          <source>SIGIR Forum 52</source>
          (
          <year>2019</year>
          )
          <fpage>153</fpage>
          -
          <lpage>162</lpage>
          . URL: https: //doi.org/10.1145/3308774.3308799. doi:
          <volume>10</volume>
          .1145/3308774.3308799.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Verberne</surname>
          </string-name>
          , Professional Search, 1 ed.,
          <source>Association for Computing Machinery</source>
          , New York, NY, USA,
          <year>2024</year>
          , p.
          <fpage>501</fpage>
          -
          <lpage>514</lpage>
          . URL: https://doi.org/10.1145/3674127.3674141.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , P. Liang,
          <article-title>Evaluating verifiability in generative search engines</article-title>
          ,
          <source>arXiv preprint arXiv:2304.09848</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Spatharioti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rothschild</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Goldstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Hofman</surname>
          </string-name>
          ,
          <article-title>Efects of LLM-based Search on Decision Making: Speed, Accuracy, and Overreliance</article-title>
          ,
          <source>in: Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, CHI '25</source>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2025</year>
          . URL: https://doi.org/10.1145/3706598.3714082. doi:
          <volume>10</volume>
          .1145/ 3706598.3714082.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodríguez-Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodriguez-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Loukachevitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakhovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tutubalina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitriadis</surname>
          </string-name>
          , G. Tsoumakas,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannakoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bekiaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Samaras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Maria Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martinelli</surname>
          </string-name>
          , G. Silvello, G. Paliouras,
          <source>Overview of BioASQ</source>
          <year>2025</year>
          :
          <article-title>The thirteenth BioASQ challenge on large-scale biomedical semantic indexing and question answering</article-title>
          , in: J.
          <string-name>
            <surname>C. de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ateia</surname>
          </string-name>
          , U. Kruschwitz,
          <article-title>Is chatgpt a biomedical expert?</article-title>
          , in: M.
          <string-name>
            <surname>Aliannejadi</surname>
            , G. Faggioli,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ferro</surname>
          </string-name>
          , M. Vlachos (Eds.),
          <source>Working Notes of the Conference and Labs of the Evaluation Forum (CLEF</source>
          <year>2023</year>
          ), Thessaloniki, Greece,
          <source>September 18th to 21st</source>
          ,
          <year>2023</year>
          , volume
          <volume>3497</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>73</fpage>
          -
          <lpage>90</lpage>
          . URL: https://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>3497</volume>
          /paper-006.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ateia</surname>
          </string-name>
          , U. Kruschwitz,
          <article-title>Can open-source llms compete with commercial models? exploring the few-shot performance of current GPT models in biomedical tasks</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuscáková</surname>
          </string-name>
          , A. G. S. de Herrera (Eds.),
          <source>Working Notes of the Conference and Labs of the Evaluation Forum (CLEF</source>
          <year>2024</year>
          ), Grenoble, France,
          <fpage>9</fpage>
          -
          <issue>12</issue>
          <year>September</year>
          ,
          <year>2024</year>
          , volume
          <volume>3740</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>78</fpage>
          -
          <lpage>98</lpage>
          . URL: https://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>3740</volume>
          /paper-07. pdf.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          , Attention is All You Need,
          <source>in: Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS 2017)</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [22]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          , Deep Research System Card, https://cdn.openai.com/deep-research
          <article-title>-system-card</article-title>
          .pdf ,
          <year>2025</year>
          . System card,
          <source>accessed 8 Jul</source>
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nakano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Balaji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hesse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kosaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Saunders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cobbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Eloundou</surname>
          </string-name>
          , G. Krueger,
          <string-name>
            <given-names>K.</given-names>
            <surname>Button</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Knight</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          , Webgpt:
          <article-title>Browser-assisted question-answering with human feedback, 2022</article-title>
          . URL: https://arxiv.org/abs/2112.09332. arXiv:2112.09332.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>T.</given-names>
            <surname>Russell-Rose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gooch</surname>
          </string-name>
          , U. Kruschwitz,
          <article-title>Interactive query expansion for professional search applications</article-title>
          ,
          <source>Business Information Review</source>
          <volume>38</volume>
          (
          <year>2021</year>
          )
          <fpage>127</fpage>
          -
          <lpage>137</lpage>
          . URL: https://doi.org/10.1177/02663821211034079. doi:
          <volume>10</volume>
          .1177/02663821211034079.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] J. Higgins, Cochrane handbook for systematic reviews of interventions, Cochrane Collaboration and John Wiley &amp; Sons Ltd (2008).</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] S. Ateia, U. Kruschwitz, BioRAGent: A Retrieval-Augmented Generation System for Showcasing Generative Query Expansion and Domain-Specific Search for Scientific Q&amp;A, in: European Conference on Information Retrieval, 2025.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] A. Krithara, A. Nentidis, K. Bougiatiotis, G. Paliouras, BioASQ-QA: A manually curated corpus for Biomedical Question Answering, Scientific Data 10 (2023) 170.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] S. Ateia, From professional search to generative deep research systems: How can expert oversight improve search outcomes?, in: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (Doctoral Consortium), 2025. To appear.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>