<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>lasigeBioTM: A Lean Biomedical QA System Empowered by Structured Knowledge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Notebook for the BioASQ Task</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paulo R. C. Lopes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sofia I. R. Conceição</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Fernandes</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francisco M. Couto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa</institution>
          ,
          <addr-line>1749-016 Lisboa</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen</institution>
          ,
          <addr-line>Copenhagen</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Biomedical Question Answering (BQA) presents unique challenges due to the vast and complex nature of biomedical literature and the highly specialized terminology involved. In the BioASQ13 challenge, we participated in all phases (A, A+, and B), proposing a hybrid system that integrates large language models (LLMs) with structured knowledge sources, namely ontologies and knowledge graphs. Our approach combines retrievalaugmented generation for document and snippet selection with graph-constrained decoding, which improves factual accuracy and supports reasoning grounded in biomedical knowledge. Its "lean" aspect arises from the use of a lightweight LLM and eficient techniques that minimize dependence on the extensive computational resources often associated with larger models. While our baseline system, lasigeBioTM, consistently performed well on exact answers across most batches, the integration of external information occasionally introduced bottlenecks or noise, highlighting how challenging it is to balance external knowledge and model fluency. Our results show that the system performance was closely tied to the quality of the retrieved snippets, with Phase B generally outperforming A+ due to better snippet quality. Our findings emphasize the value of high-quality retrieval and controlled knowledge integration for improving BQA performance. The code is publicly available at https://github.com/lasigeBioTM/BioASQ13_2025.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Constrained Decoding</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>LLMs</kwd>
        <kwd>Ontologies</kwd>
        <kwd>RAG</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>labeled datasets.</p>
      <p>
        More recently, Large language models (LLMs) have transformed natural language processing, ofering
unprecedented capabilities in understanding, generating, and synthesizing information. Despite these
eforts, these systems still encounter limitations in addressing complex, domain-specific questions,
such as hallucinations or the static nature of its intrinsic knowledge. These shortcomings can be
mitigated by integrating structured information representations such as ontologies and knowledge
graphs (KGs), that provide domain knowledge. Ontologies ofer a formal representation of knowledge,
providing a structured and standardized vocabulary for biomedical concepts and their relationships
(e.g., Gene Ontology (GO) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and Medical Subject Headings (MeSH)), while KGs represent information
as a network of entities and their relationships.
      </p>
      <p>
        Retrieval-augmented generation (RAG) is a technique that empowers LLMs to go beyond their
internal knowledge by incorporating up-to-date and domain-specific external information. This serves
as a dynamic basis for generating answers, especially for very specific or recently published details.
Combining RAG and LLMs ofers several key benefits: (i) improvement of factual accuracy, by retrieving
reliable external knowledge to support the answers and increase their verifiability; (ii) integration
of recently published information enabling updates with the latest knowledge; and (iii) handling of
specialized terminology by retrieving snippets from the literature or knowledge graphs, improving the
interpretation of complex terms [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        To address domain-specific challenges, we present lasigeBioTM a hybrid approach that focuses
on open-source resources, combining LLM strengths with structured knowledge integration of KGs
and ontologies. By integrating structured knowledge, lasigeBioTM aims to enhance factual accuracy,
semantic understanding, reasoning capabilities, and answer explainability. Nevertheless, integrating
ontologies and KGs with LLMs remains an ongoing research challenge [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>The main consideration for lasigeBioTM’s development is resource eficiency. Our system is developed
to run without computationally intensive fine-tuning, facilitating adoption by institutions with limited
resources. This "lean" design utilizes a lightweight LLM (Mistral-7B) with eficient 4-bit quantization and
employs RAG to selectively retrieve relevant information from the vast corpus of biomedical literature
found in PubMed. Participation in the BioASQ challenge provides a rigorous benchmark for evaluating
lasigeBioTM ’s performance.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. BioASQ challenge</title>
        <p>
          The BioASQ13 challenge provided the training dataset 13b, which contains around 5,380 biomedical
questions with corresponding answers ("exact" and "ideal") and supporting evidence (documents and
snippets). The testing dataset, similarly to previous years, contains around 300 new questions [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>In Phase A, participants address the challenge of handling the large volume and complexity of
Biomedical literature, in our case, PubMed. The aim in this task is to eficiently and accurately retrieve
the most relevant snippets from the literature to answer each given question. This becomes challenging
due to domain-specific language and the contextual interpretation required to assess relevance.</p>
        <p>In Phase B, the challenge is to combine contextual resources (i.e., information snippets) and
potentially domain-specific external knowledge to reason and produce a precise and appropriate answer.
This task requires the understanding of biomedical concepts, the reasoning over the provided support
information and the generation of a tailored answer in line with each question.</p>
        <p>The competition includes four types of questions: yes/no, list, factoid, and summary. For all types
except summary questions, the participants provide an exact and an ideal answer, where the second is a
more comprehensive, human-like answer providing support information for the given answer.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Biomedical QA systems</title>
        <p>
          BQA systems typically work as a multistep pipeline addressing two primary challenges: (i) retrieving
relevant information (documents or snippets) from vast literature, and (ii) providing precise answers
based on the retrieved context [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>Document and snippets retrieval: Traditional information retrieval models (e.g., BM25) rely on
lexical matching, term frequency, and inverse document frequency to rank the documents and snippets,
which are extremely eficient and efective for queries with clear keyword overlap. However, they are
highly afected by vocabulary mismatch (e.g., use of synonyms and diferent phrasing) [ 10]. On the
other hand, dense retrieval models (e.g., BERT-based models) encode questions and documents/snippets
into embeddings and then compute the relevance based on the similarity between the embeddings.
This approach better captures the semantic and contextual similarity but comes at the cost of increased
computational overhead and remains vulnerable to irrelevant or sparse vocabulary. Finally, hybrid
retrieval approaches combine both lexical and dense retrieval models, aiming to balance eficiency and
semantic understanding while addressing the individual limitations of each model. However, combining
them optimally is an open challenge.</p>
        <p>
          Answer generation: Previous works used transformers-based models (e.g., BioBERT [11]) with
ifne-tuning on labeled datasets with relevant document text. This provides a high-confident text
source and ensures factual correctness, which increases answer validation and reduces the risk of
model hallucinations. However, these models ofer limited information integration (i.e., restricted to
the information used for training and fine-tuning) and struggle with questions requiring information
aggregation and reasoning. More recently, encoder-decoder transformers and LLMs enabled the
integration of information from multiple sources and showed the ability to reason to provide precise
answers. Yet, those models are susceptible to hallucinations, require considerable computational power,
and ensuring factual accuracy and explainability remains challenging, particularly in domains with
very specific vocabulary, as is the case of biomedicine [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Integration of ontologies and knowledge graphs for biomedical QA</title>
        <p>Ontologies provide a structured representation of the information, and are widely used within the
biomedical field with the role of standardizing the highly specific vocabulary. In addition to the data
structure, ontologies also contain a rich context and relations networks that can be used as input for
ifne-tuning LLMs to expand their knowledge [12].</p>
        <p>
          We direct the reader to the summary provided by Pan et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] for a detailed description of the
individual strengths and limitations of LLMs and KGs, and how they are complementary. The main
strengths of LLMs are: (i) the general knowledge - with the ability to cover vast and diverse
knowledge; (ii) language processing - excelling at understanding and generating human language; and (iii)
generalizability - having the capacity to adapt to a wide range of tasks and diferent domains. On the
other hand, KGs provide: (i) structured knowledge - representing information in a structured, relational
(and sometimes hierarchical) format; (ii) accuracy - high precision and verifiability based on curated
information representations; (iii) contextual understanding; (iv) domain-specific and evolving
knowledge - handling specialized vocabulary and allowing incremental updates to integrate new knowledge.
Pan et al. conclude that combining both approaches enhances the overall representation and reasoning
capabilities of the systems.
        </p>
        <p>Diferent approaches have been proposed for the integration of LLMs and KGs. Graph-constrained
decoding (GCD) [13] has been proposed to enhance LLMs’ reasoning by integrating KGs and addressing
the hallucination issue. This is done by validating the reasoning paths with the KG’s factual structure.
This way, the reasoning follows the path most supported by the KGs.</p>
        <p>Despite not being well described in the BQA, there is potential to extend its application to this field,
given the availability of domain-specific KGs and the improvement of QA systems in other domains.
Li et al. [14] proposed an approach that takes advantage of the most relevant triplets from KGs and
uses them to provide extra context to the question and refine the provided answer. On the other hand,
Tian et al. [15] developed KG-Adapter to address the challenge of knowledge conflict and over-reliance
on powerful LLMs, where it leverages information about entities and their relationships in the KGs to
generate the final answers.</p>
        <p>In addition, KGs have also been shown to improve the information retrieval phase [16].</p>
        <p>Given the demonstrated potential of GCD in QA systems and the availability of open-source
biomedical KGs such as The Monarch Initiative [17], we integrated this framework into our system.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>An overview of the lasigeBioTM system is shown in Figure 1. In Phase A, for retrieving the candidate
documents, we combine BM25 and a re-ranking model.</p>
      <p>In the pipeline for Phase A+ and B, we perform Named-Entity Recognition (NER) and Linking (NEL)
to extract biomedical entities, which are aligned with a structured knowledge source. These entities
and their relations are then used to guide a constrained decoding process using an LLM to provide the
ifnal answer.</p>
      <p>The source of documents and snippets difers in Phase A + and Phase B, as in Phase A+, they come
from our Phase A retrieval, while in Phase B, they are provided by the challenge organizers.</p>
      <p>Phase A</p>
      <p>Query</p>
      <p>PubMed
Documents</p>
      <p>LLM</p>
      <p>Documents
and Snippets
3.1. Data</p>
      <sec id="sec-3-1">
        <title>3.1.1. BioASQ dataset</title>
        <p>The BioASQ dataset is a collection of annotated articles indexed for MEDLINE accessed from PubMed,
where the MeSH terms were assigned by PubMed experts. This is a publicly available dataset that was
Sparse Retrieval</p>
        <p>Documents
Re-ranking and</p>
        <p>Scoring
Phase B</p>
        <p>Query</p>
        <p>Snippets
(Phase B ≠ Phase A+)</p>
        <p>Graph constrained</p>
        <p>Decoding</p>
        <p>Answer
100 Documents
created with the actual information demands of biomedical specialists in mind and continues to expand
yearly with each new BioASQ edition [18].</p>
        <p>The training dataset for BioASQ Phase B (currently 13B) contains 5389 English questions of four
types: yes/no, list, factoid, and summary [19]. For each question the respective gold standard concepts,
related articles, snippets, RDF triples, "exact" and "ideal" answers in JSON format are provided, as
described in [18].</p>
        <p>The test set, similarly to previous years, was composed of 340 questions. The provided documents
focused only on article titles and abstracts , divided into four evaluation batches.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.1.2. External Knowledge - Knowledge Graph and Ontologies</title>
        <p>In addition to the datasets provided by the BioASQ organizers, our system leverages external structured
knowledge sources to enrich the information available for the question answering. Specifically we
utilized the Monarch KG and biomedical ontologies detailed in Table 1.</p>
        <p>The Monarch KG is a comprehensive resource that connects genotype, phenotype, disease, and other
biomedical information across species. This resource was designed to support translational research
and investigate human health and has two main components [17]. The first is the data ingestion and
integration pipeline, which aggregates and harmonizes knowledge from 33 heterogeneous sources,
including external databases such as the Panther Database [20], and holds the mappings between
diseases annotations and standardized terms like Human Phenotype Ontology terms [21]. The second
is the knowledge graph representation that captures the relationships among entities. This allows
advancing querying, reasoning, and inference across species and data types.</p>
        <p>The ontologies where employed for named-entity recognition (NER) and linking (NEL), to expand
the Decoding on Graphs approach and to provide more context about the extracted entities.</p>
        <p>To perform NER and NEL, we used BENT [22] (available at https://github.com/lasigeBioTM/BENT).
BENT is a biomedical entity annotator that also deals with NIL entities (i.e., entities that do not
link directly to any knowledge base) by associating them with the top-k relevant concepts within a
knowledge base. This tool provides several pre-trained models fine-tuned from PubMeDBERT [ 23]
with the respective knowledge organization system. Information regarding the knowledge bases and
ontologies used in our systems is provided in Table 1.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Knowledge Base</title>
        <p>
          Human Disease Ontology [24, 25]a
Chemical Entities of Biological
Interest [26]b
NCBI Gene
NCBI Taxon [27]c
UBERON [28, 29]d
Cellosaurus [30, 31, 32]e
Gene Ontology - Biological Process
and Cellular Component [
          <xref ref-type="bibr" rid="ref6">6, 33</xref>
          ]f
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>Entity Type</title>
        <p>Disease Entities
Drugs and Chemicals
Genes and Gene Products
Organisms
Anatomical Entities: body
parts, organs and tissues
Cell Line Information
Bioprocesses and Cellular
Components</p>
      </sec>
      <sec id="sec-3-5">
        <title>Entity Example</title>
        <p>Glioblastoma DOID:3068
Acyclovir CHEBI:2453
CSF1R NCBIGene:108306404
Influenza virus NCBITaxon:11552
Coronary artery UBERON:0001621
HT-29 cells CVCL:6833
Glycolisis GO:0006096
All data accessed on March 19, 2025.
a: http://purl.obolibrary.org/obo/doid.obo; b: https://purl.obolibrary.org/obo/chebi.owl; c: https:
//purl.obolibrary.org/obo/ncbitaxon.obo; d: http://purl.obolibrary.org/obo/uberon/uberon-full.obo; e:
https://ftp.expasy.org/databases/cellosaurus/cellosaurus.obo; f: https://purl.obolibrary.org/obo/go.obo</p>
        <sec id="sec-3-5-1">
          <title>3.2. Workflow Phase A</title>
          <p>To retrieve documents relevant to each question, we developed a two-stage hybrid approach. First, we
indexed the articles and performed sparse retrieval to extract a set of candidate documents. Then, we
applied an embedding model to score and rerank these candidates based on semantic similarity. Figure
1 A shows the pipeline used in this phase.</p>
        </sec>
      </sec>
      <sec id="sec-3-6">
        <title>3.2.1. Sparse Retrieval</title>
        <p>First, we indexed a subset comprising 23, 285, 041 articles from the PubMed corpus, retrieved on 13
February 2025, using Elasticsearch 8.17.1 [34]. Then, we performed sparse retrieval based on the BM25
algorithm, with 1 = 1.2 and  = 0.75, to obtain an initial set of 100 candidate documents per question.
Due to computational resource constraints, we were unable to index the full PubMed collection and
instead worked with a representative subset to balance coverage and eficiency.</p>
      </sec>
      <sec id="sec-3-7">
        <title>3.2.2. Dense Retrieval</title>
        <p>In the second stage, we employed an embedding model to score and re-rank the candidate documents,
capturing semantic relevance beyond lexical overlap.</p>
        <p>As a baseline for the first batch, we used the smaller model gte-reranker-modernbert-base 2
and selected the top 10 articles, scored directly by applying the model (cross-encoder). For all subsequent
batches, we employed the larger and more capable gte-Qwen2-1.5B-instruct 3 to improve semantic
matching performance, computing the score using the cosine similarity between the embedding vector
of a candidate document and that of the corresponding question (bi-encoder).</p>
        <p>A threshold of 70 was used for batch 2, while a more lenient threshold of 64.1 was adopted for
batches 3 and 4, based on observations from earlier batches where certain questions failed to retrieve
any documents under stricter filtering.</p>
      </sec>
      <sec id="sec-3-8">
        <title>3.2.3. Snippet Extraction</title>
        <p>For snippet extraction, a prompt was given to Mistral with the questions and the candidate document’s
title and abstract. In the initial batches (1 and 2) errors in this pipeline consisted of extracting irrelevant
passages and not extracting word-for-word the exact copy of the passage of the input document. The
model correctly extracted most of the time the beginning and the ending of the snippet, so we applied a
boundary-based snippet extraction function that mapped the snippet’s start and end positions in the
original document for batch 3.</p>
        <p>In batch 3, most of the previous errors were resolved, but the model began extracting substrings from
larger passages, producing multiple fragmented snippets that were essentially repetitions of the same
content, split into diferent segments, rather than distinct snippets.</p>
        <p>In batch 4, the previous errors were minimized, and on this round the most prevalent error was the
ones from the initial batches that the model failed to extract the text word-by-word from the original
document.</p>
        <sec id="sec-3-8-1">
          <title>3.3. Workflow Phase B/A+</title>
          <p>This section presents the components involved in the answering pipeline, including the
graphconstrained decoding mechanism and the systems deployed by our team during the BioASQ13 Task B
challenge.</p>
          <p>It is worth noting that the same pipeline was used for both Phase B and A+; however, the source of
the snippets difered: in Phase B, snippets were provided directly by the BioASQ organizers, whereas in
Phase A+, they were extracted by our own retrieval pipeline developed for Phase A.
2https://huggingface.co/Alibaba-NLP/gte-reranker-modernbert-base
3https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct</p>
          <p>The steps used in this pipeline are illustrated in Figure 1 B.</p>
        </sec>
      </sec>
      <sec id="sec-3-9">
        <title>3.3.1. Graph Constrained Decoding</title>
        <p>This section outlines the methods used to implement graph-constrained decoding in our question
answering pipeline. The implemented approach was based on the work Decoding on Graphs (DoG)
from Li et al. [35].</p>
        <p>To ensure that the model generates answers aligned with a structured knowledge source, we first
perform NER and NEL to identify biomedical entities mentioned in the question. For each linked entity,
we retrieve its corresponding subgraph from the knowledge graph, and then merge the subgraphs into
a unified representation.</p>
        <p>From this merged structure, we construct a prefix-constraining trie that encodes all valid reasoning
paths. During answer generation, this trie is used to manipulate the model’s output logits, restricting
the decoding process to only allow sequences consistent with the graph structure.</p>
      </sec>
      <sec id="sec-3-10">
        <title>3.3.2. Trie Data Structure</title>
        <p>A trie is a tree-based data structure that is used to eficiently store and retrieve sequences, such as
tokens, by organizing them based on common prefixes. In our case, each node represents a token, and
each path from the root to a leaf represents a plausible sequence. This structure allows fast prefix
matching and lookup, making it particularly useful for tasks like constrained decoding in language
models, where only valid token sequences should be generated.</p>
        <p>To enforce constrained decoding during answer generation, we used the trie implementation provided
by Facebook Research as part of the GENRE framework [36]. This implementation constructs a prefix
tree over valid token sequences, such as entity paths or answer candidates, and integrates with language
models to restrict token generation to paths consistent with the trie structure.</p>
        <p>This approach ensures that generated outputs remain faithful to a predefined set of valid completions,
which is particularly important when reasoning over structured knowledge sources4.</p>
      </sec>
      <sec id="sec-3-11">
        <title>3.3.3. Decoding</title>
        <p>For the decoding mechanism, the model was prompted to use the &lt;PATH&gt; and &lt;/PATH&gt; special tags to
mark the start and end of its reasoning path, or use &lt;T_BOS&gt; and &lt;T_EOS&gt; to delimit the start and end
of valid triplets.</p>
        <p>Upon entering one of these tags, constraints were enforced by modifying the output logits at each
decoding step: the valid tokens are retrieved at the th iteration of token generation from the trie, and
the logits of all other tokens are set to −∞ .</p>
        <p>After applying the softmax, the altered logits result in a probability distribution that assigns non-zero
probability only to the permitted tokens, ensuring that the generated sequence remains consistent with
the predefined graph structure.
3.3.4. Systems
This section describes the four systems implemented by our team in Phases A, A+ and B, which are
summarized in Table 2.</p>
        <p>We used Mistral-7B-Instruct-v0.35 as the baseline LLM model for all our systems. We chose
this model based on its balance of performance and resource eficiency, and since it also fits our purpose
of achieving a "lean" system.</p>
        <p>Due to varying GPU availability during the challenge, the inference was done either on an NVIDIA
A30 or on a Tesla T4 GPU. The models were loaded in 4-bit precision using the BitsAndBytes library6,
4Trie implementation utilized https://github.com/facebookresearch/GENRE/blob/main/genre/trie.py
5https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3
6https://huggingface.co/docs/bitsandbytes/main/en/index.
with nested double quantization and NF4 (Normal Float 4) as quantization type, and the computation
dtype was set to bfloat16.</p>
        <p>Our team deployed four systems:
• lasigeBioTM : this system consisted of using the LLM without any additional resources to answer
the questions using only the extracted (Phase A and A+) or the provided (Phase B) snippets.
• lasigeBioTM-onto-bl: until batch 3 this system consisted of using questions and snippets with
named entities tagged by BENT tool. Starting batch 3, this system consisted of adding additional
ontology information, namely definitions and synonyms, to help improve the answers.
• lasigeBioTM-onto-sm: until batch 3 this system consisted of adding additional ontology
information, namely definitions and synonyms, to help improve the answers. Starting batch 3, this
system consisted
• lasigeBioTM-onto-trie: this system [35]
• sp_lasigebiotm: this system is introduced on batch 3, and applies MarizaTrie</p>
        <p>System
lasigeBioTM
lasigeBioTM-onto-bl (before batch 3)
lasigeBioTM-onto-bl (from batch 3)
lasigeBioTM-onto-sm (before batch 3)
lasigeBioTM-onto-sm (from batch 3)
sp_lasigeBioTM (from batch 3)</p>
        <p>LLM
✓
✓
✓
✓
✓
✓</p>
        <p>BENT Annotations</p>
        <p>X
✓
✓
✓
✓
✓</p>
        <p>Ontology Info</p>
        <p>X
X
✓
✓
✓
X</p>
        <p>Knowledge Graph</p>
        <p>X
X
X
X
✓
✓</p>
        <p>The first system lasigeBioTM is our baseline, and involves using the LLM for inference without any
additional resources or text pre-processing to answer the questions, by using only the extracted (Phase
A+) or provided (Phase B) snippets.</p>
        <p>On the second system, lasigeBioTM-onto-bl, until the third batch, NER and NEL were applied to
the question and the snippets using the BENT tool. Then having the LLM perform the inference by
providing the annotated text as input. Starting the third batch, this system was improved by, besides
the annotated text, adding additional ontology information such as definitions and synonyms of the
annotated entities, to help the LLM perform inference.</p>
        <p>For the lasigeBioTM-onto-sm system, we applied BENT to extract entities, then ontology definitions
and synonyms of those entities were given as an input to the LLM prompt. Starting the third batch,
the system was adapted to apply the Decoding on Graphs framework [35]. This methodology explores
step-wise reasoning with the structure of knowledge graphs (KG), by forming a well-formed chain,
which connect facts grounded in the KG.</p>
        <p>Employing graph-aware constrained decoding, this approaches restrains the LLM’s output based
on the KG’s topology. This way the LLM can reason directly on the KG without the need of external
retrievers. We incorporated ontologies that matched those used in the named-entity task as additional
graph nodes, alongside Monarch. Furthermore, we used the gene2go file 7 provided by the NCBI that
maps gene IDs to GO terms, so we could map NCBI Genes to the Gene Ontology graph. For this system
as KG’s we used ontologies and a more broad KG, Monarch. Sections 3.3.1 and 3.1.2 contain more
details about the used methodology and KG, respectively. Additionally, in lasigeBioTM-onto-bl and
lasigeBioTM-onto-sm systems, the entities IDs were used to map the entities to the respective ontologies
and extract definitions and synonyms information.</p>
        <p>The sp_lasigebiotm system was first implemented during the third batch and implements the DoG
framework (see 3.3.1) without ontologies. The graph constrained mechanism leverages a prefix-lookup
data structure (trie) and the manipulation of the model logits to factually ground the LLM on the
Knowledge Graph.
7Data accessed on April 08, 2025, from https://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz</p>
        <sec id="sec-3-11-1">
          <title>3.4. Metrics</title>
          <p>The evaluation process is submission-based: we provide our system’s answers to the BioASQ organizers,
who then return the corresponding evaluation metrics for Phases A and B/A+. For Phase A, standard
information retrieval metrics are used, including Mean Precision, Recall, F-Measure, Mean Average
Precision (MAP), and Geometric Mean Average Precision (GMAP).</p>
          <p>While MAP reflects how well relevant documents are ranked near the top, GMAP is a stricter variant
that penalizes poor performance on individual queries, ofering a measure of robustness across questions.
For Exact Answers in Phase B and A+, accuracy is used for yes/no questions, and factoid questions are
evaluated using Mean Reciprocal Rank (MRR), strict accuracy, and lenient accuracy.</p>
          <p>The Ideal Answers of Phases B and A+ are evaluated based on the ROUGE metric, specifically
ROUGE2 and ROUGE-SU4. These variants measure the overlap of bigrams and skip-bigrams, respectively,
between the generated and reference answers—capturing both lexical similarity and fluency. This allows
for a qualitative evaluation of the generated text in terms of content relevance and coherence [37].</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>This section presents the results achieved by our systems in all phases. Phase A was divided in two
main tasks: first, extracting relevant documents from the designated article repositories and second,
extracting relevant snippets from those documents.</p>
      <p>Regarding the task of identifying the most relevant documents, our highest score was achieved in
batch 1 with an F-Measure of 0.1029 (Table 3). However, in the following batches, this score decreased,
dropping to 0.037.</p>
      <p>For the snippet extraction, the results are presented in Table 4, with our best F-Measure of 0.0629
achieved in the second batch.</p>
      <p>The results for Phase A+ regarding exact answers are presented in Table 5. For yes/no answer types,
our systems demonstrated a better ability to identify "yes" responses. The best performing system in
this category was lasigeBioTM, achieving a Macro F1 score of 1 in batch 1 and 0.8301 in batch 4.</p>
      <p>With a score of 0.3636 for both the lasigeBioTM and lasigeBioTM-onto-bl systems, batch 4 had the
highest mean reciprocal rank in the factoid category. The lasigeBioTM system achieved the highest mean
precision and F-Measure for list-type answers in batch 2, with scores of 0.3684 and 0.1598, respectively.
In later batches, however, the scores for this type of response declined.</p>
      <p>Table 6 provides specific results for ideal responses in Phase A +. Some of the systems do not appear
in this table due to an incorrect JSON format submission. Despite a slight decrease in the matching F1
scores, these results show an increase in R-2 and R-SU4 recall for the lasigeBioTM system from batch 1
to batch 4.
b1
b2
b2
b2
b3
b3
b3
b3
b4
b4
b4
b4</p>
      <p>The results for Phase B, exact answers, are summarized in Table 7. For yes/no answer types, the
highest score was a Macro F1 of 0.9328, achieved by the lasigeBioTM system in batch 1. In the factoid
category, the sp_lasigebiotm system attained the best mean reciprocal rank of 0.5 in batch 4. For list-type
answers, the top F-measure was 0.5592, also obtained by lasigeBioTM in batch 4.</p>
      <p>In batch 3, all systems showed a decrease in precision but an increase in recall for list-type answers.
Regarding ideal answers, the best R-2 and R-SU4 recall scores were 0.4309 and 0.4357, respectively, both
achieved by the lasigeBioTM system in batch 3. The corresponding F1 scores for lasigeBioTM in batch 2
were 0.2749 and 0.2560, respectively.</p>
      <p>These results are detailed in Table 8.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Conclusion</title>
      <p>Our team participated in all batches for Phases A, A+ and B and although some of the systems were
introduced in later batches, identified errors between each batch were systematically addressed.</p>
      <p>In the task for the most relevant articles in Phase A, was observed a decline in scores starting
at batch 3. This decline may be attributed to switching from the smaller Cross Encoder reranking
model to the LLM-based embedding model. The diference in efectiveness can be explained by the
underlying mechanisms of each approach: cross-encoders output a similarity score by jointly encoding
the query and document pairs, capturing deeper semantic interactions across tokens, in contrast to
embedding-based methods that rely on comparing separate vector representations. Additionally, a
threshold for document selection that was not flexible enough, potentially excluding relevant candidates,
may have contributed to the decline.</p>
      <p>A decrease in snippet performance was also noted starting from batch 3. Although the majority
of errors identified in previous batches were corrected, the reduction in snippet scores is unlikely to
be directly related to the snippets themselves. Instead, it is likely due to the extraction of irrelevant
documents, which subsequently causes irrelevant snippets to be extracted.</p>
      <p>Given that Phase A+ relied on the documents and snippets extracted during Phase A, the system’s
performance in Phase A+ was correspondingly impacted by the quality of snippets, with degraded
snippet quality resulting in reduced performance. Nonetheless, errors may occur independently of
snippet quality, such as the errors mentioned at 3.2.3. There was a decrease in performance of our
baseline system, lasigeBioTM, in batch 3, and the introduced sp_lasigebiotm model also exhibited
suboptimal performance.</p>
      <p>Notably, in batch 4, our models achieved the team’s best scores across all question types. lasigeBioTM
achieved the team’s best scores for all the exact answers in diferent categories. However, in batch 4,
lasigeBioTM-onto-bl performed equally well as lasigeBioTM for factoid answers.</p>
      <p>In Phase B "Ideal Answers", in batches 3 and 4, lasigeBioTM achieved the highest automatic scores
Rouge-R for R-2 (Rec) and R-SU4 (Rec). These results indicate that our system is capable of capturing
relevant content and key points. However, comparatively low F1 scores indicate generation of extra or
irrelevant information, reducing overall answer quality.</p>
      <p>Performance across exact and ideal answer types was more heterogeneous. Specifically, lasigeBioTM
dominated yes/no answers in the first three batches; however, in batch 4, sp_lasigebiotm outperformed
lasigeBioTM by 0.0374 in macro F1.</p>
      <p>A similar pattern was observed for factoid lists, with sp_lasigebiotm outperforming lasigeBioTM in
batch 4. Across list-type questions, the top-performing models were lasigeBioTM-onto-sm (batch
2), sp_lasigebiotm (batch 3), and lasigeBioTM-onto-bl (batch 4). The results indicate that systems
augmented with external knowledge succeed in list-type question answering.</p>
      <p>Comparison of the models across Phase A+ and Phase B, Tables 10 and 9, shows that improved
snippet quality impacts system performance, resulting in higher scores on Phase B answers.</p>
      <p>The lasigeBioTM-onto-sm system lags behind the other systems in the final batch, despite updates
applied to all systems to address prior issues. This system employs ontologies to enforce a step-wise
reasoning approach in the answering chain. The use of the OBO format, which only includes "is_a"
relationship, likely constrained the reasoning paths, leading to bottlenecks. Moreover, this system
occasionally generated reasoning paths rather than direct answers (e.g., "blinatumomab -&gt; targets -&gt;
B-cell acute lymphoblastic leukemia").</p>
      <p>Furthermore, the reliance on KG paths without suficient filtering for relevance may have introduced
noise, limiting answer quality, despite the structured knowledge integration. This highlights the inherent
challenge in aligning structured KG reasoning paths with the generative nature of LLMs, requiring
careful design to prevent hallucination or irrelevant path outputs.</p>
      <p>Overall, our systems demonstrate better performance on ideal answers compared to the more
structured exact answers. Moreover, when provided with high-quality snippets, as in Phase B, performance
improves, particularly benefiting from external knowledge in generating list-type answers. Despite
using a lightweight approach, the systems consistently delivered accurate outputs across various challenge
batches, highlighting their prominent performance and reliability, even with limited resources.</p>
      <p>One key takeaway from this work is the importance of carefully balancing the injection of structured
knowledge with generative LLM capabilities. Although the incorporation of external knowledge
provides explicit and verified context, its efectiveness depends on the quality of the entity recognition
and linking, and the relevance of retrieved paths. It was observed that limited relation types, such as
only "is_a" edges in OBO ontologies, may induce reasoning bottlenecks, thereby leading the model
down uninformative and noisy paths and reducing final answer quality.</p>
      <p>Another significant finding concerns the pipeline itself, as our results showed that errors in upstream
document retrieval or snippet extraction propagate downstream and degrade system performance.
Therefore, future iterations should prioritize designing pipelines where modules are loosely coupled
but efectively informed by each other. Additionaly, external knowledge can introduce noise if not
ifltered appropriately or formatted to align with LLM expectations. Our KG-constrained models, despite
their structured reasoning design, occasionally returned raw paths rather than direct answers. This
emphasizes the need for improved prompt engineering, decoding constraints, and integration strategies
to ensure that KG information enhances rather than overrides natural language generation capabilities.
Finally, this work demonstrates that lean, eficient architectures can achieve competitive results when
combined with targeted external knowledge and careful prompt design, suggesting promising directions
for deploying lightweight, guided QA systems in resource-constrained environments.</p>
      <p>Future work will focus on refining knowledge integration techniques, developing better error
mitigation strategies, and continue exploring models that combine structured reasoning with fluent,
accurate generation to support robust biomedical question answering.</p>
      <p>Specifically, improvements will be made to the document selection process by implementing a more
adaptive similarity threshold, informed by statistical measures such as the median or interquartile range.
Additionally, we intend to investigate model performance without quantization and to incorporate
beam search techniques to enhance generation quality.</p>
      <p>Further directions include experimenting with alternative models and knowledge sources, testing
diferent prompt designs, and investigating the sources of noise introduced by external knowledge in
order to mitigate their impact on the final output.</p>
      <p>Building on these experimental insights, we aim to leverage our system’s "lean" design to operate
eficiently in resource-constrained healthcare settings. This approach aims to support informed
decisionmaking and enable secure, locally deployed models customized to specific needs, while ensuring sensitive
patient data.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by FCT (Fundacão para a Ciência e a
Tecnologia) through funding of the PhD Scholarships with ref. UI/BD/153730/2022
and DOI identifier https://doi.org/10.54499/UI/BD/153730/2022 and DOI identifier
https://doi.org/10.54499/UI/BD/153730/2022 attributed to SIRC, and the LASIGE Research Unit,
ref. UID/00408/2025, and by a grant from the Novo Nordisk Foundation (NNF23SA0084103) through
The CBMR International PhD &amp; Postdoc Program.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4 and LanguageTool in order to: Grammar,
spelling check and paraphrase. Further, the authors used GPT-4 for Figure 1 in order to: Generate
icons. After using these tools, the authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.
question answering: A survey of approaches and challenges, ACM Comput. Surv. 55 (2022).
doi:10.1145/3490238.
[10] S. Robertson, H. Zaragoza, The probabilistic relevance framework: Bm25 and beyond, Foundations
and Trends in Information Retrieval 3 (2009) 333–389. doi:10.1561/1500000019.
[11] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, Biobert: a pre-trained biomedical
language representation model for biomedical text mining, Bioinformatics 36 (2020) 1234–1240.</p>
      <p>URL: https://doi.org/10.1093/bioinformatics/btz682. doi:10.1093/bioinformatics/btz682.
[12] X. Yao, A. Sannabhadti, H. Wiberg, K. S. Shehadeh, R. Padman, Can llms support medical knowledge
imputation? an evaluation-based perspective, 2025. arXiv:2503.22954.
[13] L. Luo, Z. Zhao, C. Gong, G. Hafari, S. Pan, Graph-constrained reasoning: Faithful reasoning on
knowledge graphs with large language models, 2024. arXiv:2410.13080.
[14] S. Li, Y. Gao, H. Jiang, Q. Yin, Z. Li, X. Yan, C. Zhang, B. Yin, Graph reasoning for question
answering with triplet retrieval, 2023. arXiv:2305.18742.
[15] S. Tian, Y. Luo, T. Xu, C. Yuan, H. Jiang, C. Wei, X. Wang, KG-adapter: Enabling knowledge
graph integration in large language models through parameter-eficient fine-tuning, in: L.-W.
Ku, A. Martins, V. Srikumar (Eds.), Findings of the Association for Computational Linguistics:
ACL 2024, Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 3813–3828.
doi:10.18653/v1/2024.findings-acl.229.
[16] K. Soman, P. W. Rose, J. H. Morris, R. E. Akbas, B. Smith, B. Peetoom, C. Villouta-Reyes, G. Cerono,
Y. Shi, A. Rizk-Jackson, S. Israni, C. A. Nelson, S. Huang, S. E. Baranzini, Biomedical knowledge
graph-optimized prompt generation for large language models, Bioinformatics 40 (2024) btae560.</p>
      <p>URL: https://doi.org/10.1093/bioinformatics/btae560. doi:10.1093/bioinformatics/btae560.
[17] T. E. Putman, K. Schaper, N. Matentzoglu, V. Rubinetti, F. Alquaddoomi, C. Cox, J. H. Caufield,
G. Elsarboukh, S. Gehrke, H. Hegde, J. Reese, I. Braun, R. Bruskiewich, L. Cappelletti, S. Carbon,
A. Caron, L. Chan, C. Chute, K. Cortes, V. De Souza, T. Fontana, N. Harris, E. Hartley, E. Hurwitz,
J. B. Jacobsen, M. Krishnamurthy, B. Laraway, J. McLaughlin, J. McMurry, S. T. Moxon, K. Mullen,
S. O’Neil, K. Shefchek, R. Stefancsik, S. Toro, N. Vasilevsky, R. Walls, P. Whetzel, D.
OsumiSutherland, D. Smedley, P. Robinson, C. Mungall, M. Haendel, M. Munoz-Torres, The monarch
initiative in 2024: an analytic platform integrating phenotypes, genes and diseases across species,
Nucleic Acids Research 52 (2023) D938–D949. doi:10.1093/nar/gkad1082.
[18] A. Krithara, A. Nentidis, K. Bougiatiotis, G. Paliouras, BioASQ-QA: A manually curated corpus
for biomedical question answering, Scientific Data 10 (2023) 170. URL: https://doi.org/10.1038/
s41597-023-02068-4. doi:10.1038/s41597-023-02068-4.
[19] G. Tsatsaronis, G. Balikas, P. Malakasiotis, I. Partalas, M. Zschunke, M. R. Alvers, D. Weissenborn,
A. Krithara, S. Petridis, D. Polychronopoulos, et al., An overview of the bioasq large-scale
biomedical semantic indexing and question answering competition, BMC bioinformatics 16
(2015) 1–28.
[20] P. D. Thomas, D. Ebert, A. Muruganujan, T. Mushayahama, L.-P. Albou, H. Mi, Panther: Making
genome-scale phylogenetics accessible to all, Protein Science 31 (2022) 8–22. doi:https://doi.
org/10.1002/pro.4218.
[21] S. Köhler, M. Gargano, N. Matentzoglu, L. C. Carmody, D. Lewis-Smith, N. A. Vasilevsky, D. Danis,
G. Balagura, G. Baynam, A. M. Brower, T. J. Callahan, C. G. Chute, J. L. Est, P. D. Galer, S. Ganesan,
M. Griese, M. Haimel, J. Pazmandi, M. Hanauer, N. L. Harris, M. Hartnett, M. Hastreiter, F. Hauck,
Y. He, T. Jeske, H. Kearney, G. Kindle, C. Klein, K. Knoflach, R. Krause, D. Lagorce, J. A. McMurry,
J. A. Miller, M. Munoz-Torres, R. L. Peters, C. K. Rapp, A. M. Rath, S. A. Rind, A. Rosenberg, M. M.
Segal, M. G. Seidel, D. Smedley, T. Talmy, Y. Thomas, S. A. Wiafe, J. Xian, Z. Yüksel, I. Helbig, C. J.
Mungall, M. A. Haendel, P. N. Robinson, The human phenotype ontology in 2021, Nucleic Acids
Research 49 (2020) D1207–D1217. doi:10.1093/nar/gkaa1043.
[22] P. Ruas, F. M. Couto, Nilinker: Attention-based approach to nil entity linking, Journal of Biomedical</p>
      <p>Informatics 132 (2022) 104137. doi:https://doi.org/10.1016/j.jbi.2022.104137.
[23] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, H. Poon,
Domain-specific language model pretraining for biomedical natural language processing, 2020.
arXiv:arXiv:2007.15779.
[24] W. A. Kibbe, C. Arze, V. Felix, E. Mitraka, E. Bolton, G. Fu, C. J. Mungall, J. X. Binder, J. Malone,
D. Vasant, H. Parkinson, L. M. Schriml, Disease ontology 2015 update: an expanded and updated
database of human diseases for linking biomedical knowledge through disease data, Nucleic Acids
Research 43 (2014) D1071–D1078. URL: https://doi.org/10.1093/nar/gku1011. doi:10.1093/nar/
gku1011.
[25] L. M. Schriml, J. B. Munro, M. Schor, D. Olley, C. McCracken, V. Felix, J. Baron, R. Jackson, S. Bello,
C. Bearer, R. Lichenstein, K. Bisordi, N. C. Dialo, M. Giglio, C. Greene, The human disease ontology
2022 update, Nucleic Acids Research 50 (2021) D1255–D1261. doi:10.1093/nar/gkab1063.
[26] J. Hastings, G. Owen, A. Dekker, M. Ennis, N. Kale, V. Muthukrishnan, S. Turner, N. Swainston,
P. Mendes, C. Steinbeck, Chebi in 2016: Improved services and an expanding collection of
metabolites, Nucleic Acids Research 44 (2015) D1214–D1219. doi:10.1093/nar/gkv1031.
[27] S. Federhen, The ncbi taxonomy database, Nucleic Acids Research 40 (2011) D136–D143. doi:10.</p>
      <p>1093/nar/gkr1178.
[28] C. J. Mungall, C. Torniai, G. V. Gkoutos, S. E. Lewis, M. A. Haendel, Uberon, an integrative
multi-species anatomy ontology, Genome biology 13 (2012) 1–20.
[29] M. A. Haendel, J. P. Balhof, F. B. Bastian, D. C. Blackburn, J. A. Blake, Y. Bradford, A. Comte, W. M.</p>
      <p>Dahdul, T. A. Dececchi, R. E. Druzinsky, et al., Unification of multi-species vertebrate anatomy
ontologies for comparative biology in uberon, Journal of biomedical semantics 5 (2014) 1–13.
[30] T. Robin, A. Capes-Davis, A. Bairoch, Clastr: the cellosaurus str similarity search tool-a precious
help for cell line authentication, International Journal of Cancer 146 (2020) 1299–1306.
[31] A. Bairoch, The cellosaurus, a cell-line knowledge resource, Journal of biomolecular techniques:</p>
      <p>JBT 29 (2018) 25.
[32] A. Bairoch, Cellosaurus micro-review 1: cellonauts, spacefaring cell lines (2019).
[33] T. G. O. Consortium, The gene ontology resource: enriching a gold mine, Nucleic Acids Research
49 (2020) D325–D334. URL: https://doi.org/10.1093/nar/gkaa1113. doi:10.1093/nar/gkaa1113.
[34] E. NV, Elasticsearch version 8.17.1, 2025. URL: https://www.elastic.co/guide/en/elasticsearch/
reference/8.17/release-notes-8.17.1.html, accessed: 2025-05-28.
[35] K. Li, T. Zhang, X. Wu, H. Luo, J. Glass, H. Meng, Decoding on graphs: Faithful and sound reasoning
on knowledge graphs through generation of well-formed chains, 2024. arXiv:2410.18415.
[36] N. D. Cao, G. Izacard, S. Riedel, F. Petroni, Autoregressive entity retrieval, CoRR abs/2010.00904
(2020). URL: https://arxiv.org/abs/2010.00904. arXiv:2010.00904.
[37] G. Balikas, I. Partalas, A. Kosmopoulos, S. Petridis, P. Malakasiotis, I. Pavlopoulos, I.
Androutsopoulos, N. Baskiotis, E. Gaussier, T. Artieres, P. Gallinari, Evaluation Framework
Specifications, Project deliverable D4.1, 2013. URL: sites/default/files/PublicDocuments/BioASQ_D4.
1-EvaluationFrameworkSpecification_final.pdf.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Prompts</title>
      <p>In this section, we present the prompts used for lasigeBioTM, lasigeBioTM-onto-sm, and
lasigeBioTMonto-bl systems submitted in diferent batches of Phase B.
You are a helpful assistant that can analyse the knowledge graphs in the contexts and then
answer the questions based on the knowledge graphs.</p>
      <p>The answers should give the grounded reasoning chains and think step by step, and the
reasoning chains should be logically complete but have as fewer steps as possible. Do not
include information irrelvant to the question.</p>
      <p>Guidelines:
- Provide two distinct answers on separate lines
{question type specific prompt} - The answer format MUST always be:
Exact Answer: ...</p>
      <p>Ideal Answer: ...
**Do NOT:**
- Include any additional text, commentary, or reasoning in the Exact Answer beyond the required
answer.
- Reference how you derived the answer or mention the context (snippets, ontology info) in the
Exact Answer line.</p>
      <p>Based on the following question and entities, perform the two tasks described but you will only
output the last task:
1. Extract and output unique subject-predicate-object triplets in the format:
&lt;T_BOS&gt; entity1 -&gt; relation -&gt; entity2 &lt;T_EOS&gt;
Use these triples to think about a possible answer.
2. Output two answers on separate lines:
Exact Answer: ...</p>
      <p>Ideal Answer: ...</p>
      <p>Question: {question}
Entities: {entities}
Guidelines:
- Do not include explanations, headers, or additional comments beyond the two answers.
- Do not include information irrelevant to the question.
You are an expert biomedical question answering helper. Your response must strictly adhere to
the specific instructions provided under q_type. Follow these rules exactly:
1. **Response Format:**
Your response MUST consist of EXACTLY two lines:
- Line 1: "Exact Answer:" followed by the answer in the required format.
- Line 2: "Ideal Answer:" followed by a detailed answer (maximum 200 words).
2. **Exact Answer Requirements by q_type:**
- For yes/no questions: The Exact Answer MUST be either "yes" or "no" (only that word, no extra
text).
- For factoid questions: The Exact Answer MUST be a comma-separated list of up to 5 entity
names (or similar short expressions), sorted by decreasing confidence. Do not add any extra
words or context.
- For list questions: The Exact Answer MUST be a comma-separated list of entity names (or
similar short expressions) that represent a unified answer. Do not include more than the allowed
number of entries.
- For summary questions: There is no Exact Answer; leave it empty.
3. **Ideal Answer Requirements:**
The Ideal Answer SHOULD be a single paragraph (up to 200 words) that provides a detailed
explanation or summary. It MUST NOT include any meta commentary about your process or
reference to the context (e.g., do not say “the snippets provided. . . ”).
4. **Do NOT:**
- Include any additional text, commentary, or reasoning in the Exact Answer beyond the required
answer.
- Reference how you derived the answer or mention the context (snippets, ontology info) in the
Exact Answer line.</p>
      <p>Question: {q_body}
Context: {snippets}
Ontology info: {entity_context}</p>
      <p>Additional Instructions for this type: {q_type}</p>
      <p>lasigeBioTM
lasigeBioTM-onto-bl
lasigeBioTM-onto-sm
lasigeBioTM
lasigeBioTM
sp_lasigebiotm
lasigeBioTM-onto-sm
lasigeBioTM-onto-sm</p>
      <p>MRR</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Loukachevitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakhovskiy</surname>
          </string-name>
          , E. Tutubalina, G. Tsoumakas,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannakoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bekiaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Samaras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Menotti</surname>
          </string-name>
          , G. Silvello, G. Paliouras, BioASQ at CLEF2025:
          <article-title>The thirteenth edition of the large-scale biomedical semantic indexing and question answering challenge</article-title>
          ,
          <source>in: Advances in Information Retrieval</source>
          , Springer Nature Switzerland, Springer Nature Switzerland, Cham,
          <year>2025</year>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>031</fpage>
          -88720-8_
          <fpage>61</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodríguez-Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodriguez-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Loukachevitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakhovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tutubalina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitriadis</surname>
          </string-name>
          , G. Tsoumakas,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannakoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bekiaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Samaras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. N. Maria</given-names>
            <surname>Di</surname>
          </string-name>
          <string-name>
            <surname>Nunzio</surname>
          </string-name>
          , Giorgio,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martinelli</surname>
          </string-name>
          , G. Silvello, G. Paliouras,
          <source>Overview of BioASQ</source>
          <year>2025</year>
          :
          <article-title>The thirteenth BioASQ challenge on large-scale biomedical semantic indexing and question answering</article-title>
          , in: L.
          <string-name>
            <surname>P. A. G. S. d. H. J. M. F. P. P. R. D. S. G. F. N. F. Jorge Carrillo-de Albornoz</surname>
          </string-name>
          , Julio Gonzalo (Ed.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Paliouras, Overview of BioASQ Tasks 13b and Synergy13 in CLEF2025</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>CLEF 2025 Working Notes</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rawat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          <string-name>
            <surname>Samant</surname>
          </string-name>
          ,
          <article-title>Comparative analysis of transformer based models for question answering</article-title>
          ,
          <source>in: 2022 2nd International Conference on Innovative Sustainable Computational Technologies (CISCT)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . doi:
          <volume>10</volume>
          .1109/CISCT55310.
          <year>2022</year>
          .
          <volume>10046525</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lamurias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Couto</surname>
          </string-name>
          , LasigeBioTM at MEDIQA 2019:
          <article-title>Biomedical question answering using bidirectional transformers and named entity recognition</article-title>
          , in: D.
          <string-name>
            <surname>Demner-Fushman</surname>
            ,
            <given-names>K. B.</given-names>
          </string-name>
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Ananiadou</surname>
          </string-name>
          , J. Tsujii (Eds.),
          <source>Proceedings of the 18th BioNLP Workshop</source>
          and Shared Task, Association for Computational Linguistics, Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>523</fpage>
          -
          <lpage>527</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>W19</fpage>
          -5057.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ashburner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Ball</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Blake</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Botstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Butler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Cherry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dolinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Dwight</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Eppig</surname>
          </string-name>
          , et al.,
          <article-title>Gene ontology: tool for the unification of biology</article-title>
          ,
          <source>Nature genetics 25</source>
          (
          <year>2000</year>
          )
          <fpage>25</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Retrieval-augmented generation for large language models: A survey</article-title>
          ,
          <year>2024</year>
          . arXiv:
          <volume>2312</volume>
          .
          <fpage>10997</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Unifying large language models and knowledge graphs: A roadmap</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>36</volume>
          (
          <year>2024</year>
          )
          <fpage>3580</fpage>
          -
          <lpage>3599</lpage>
          . doi:
          <volume>10</volume>
          .1109/tkde.
          <year>2024</year>
          .
          <volume>3352100</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ying</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yu</surname>
          </string-name>
          , Biomedical
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>