<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>iCARE: Ontology-Guided Intent Routing for Multi-Agent LLM-Based Dialogue Systems⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nirmalie Wiratunga</string-name>
          <email>n.wiratunga@rgu.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vihanga Ashinsana Wijayasekara</string-name>
          <email>v.wijayasekara@rgu.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ikechukwu Nkisi-Orji</string-name>
          <email>i.o.nkisi-orji@rgu.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pedram Salimi</string-name>
          <email>p.salimi@rgu.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kyle Martin</string-name>
          <email>k.martin3@rgu.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristina Bolaños</string-name>
          <email>Cristina.Bolanos@uclm.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Robert Gordon University</institution>
          ,
          <addr-line>Aberdeen</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad de Castilla - La Mancha</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <fpage>160</fpage>
      <lpage>171</lpage>
      <abstract>
<p>We present iCARE, a modular multi-agent reasoning framework that integrates iCARE-Onto with Case-Based Reasoning (CBR) to support ontology-grounded, LLM-based dialogue management. The framework addresses intent disambiguation and intent routing within an agentic architecture, where specialised agents handle domain-specific tasks. Agent outputs are combined to produce accurate responses through a composite strategy that couples few-shot prompting with Retrieval-Augmented Generation (RAG) and a CBR-based fallback mechanism. Conversational content is first grounded in ontological concepts that capture situational, personal, and dialogue knowledge. These are stored as short-term memory entries and subsequently organised into semantic, procedural, and episodic summaries that form long-term memory. This structured memory ensures factual accuracy, coherence, and contextual relevance beyond surface fluency in generated responses. While the framework encompasses multiple reasoning components, in this paper we focus on how the system reasons over user utterances to form clarified intents for routing to appropriate agents. Preliminary experiments on a synthetic dataset show a 20% relative improvement in routing accuracy for the LLM-as-Router when ontology-based disambiguation is applied. A comparative evaluation with a lazy learner, ProtoKNN, demonstrates its potential as a low-latency, offline alternative, suggesting opportunities for hybrid routing in dialogue systems.</p>
      </abstract>
      <kwd-group>
        <kwd>Dialogue with LLM</kwd>
        <kwd>Multi-agent Intent Routing</kwd>
        <kwd>Ontology driven RAG</kwd>
        <kwd>Healthcare AI</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Large language models (LLMs) have transformed natural–language interfaces from brittle rule engines
into fluid, mixed-initiative dialogue partners [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. Yet, in high-stakes domains like health, law,
and finance, unqualified LLM output is risky since hallucinated facts, hidden biases, and inconsistent
reasoning can harm users [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We argue that interface fluency must be paired with rigorous knowledge
grounding and transparent control. Ontologies provide the semantic scaffolding for such grounding,
while modular, specialist agents keep decision logic interpretable and auditable.
      </p>
      <p>Based on our experience developing intelligent healthcare systems, such as a conversational
framework for embodied care robots, we observed that even routine tasks (e.g., "remind me to take my
pills") often require cross-checking medication rules, user history, and sensor data, going well beyond
a single turn of question and answer. More recently, we are developing a digital-avatar health coach
to support adolescent and young adult cancer survivors in managing cardiovascular health through a
non-intrusive conversational interface embedded in their environment. Such scenarios demand dialogue
to be personalised, evidence-grounded, and safety-constrained. To meet these needs, we introduce
iCARE as a modular multi-agent reasoning framework with these properties:
1. An ontology (iCARE-Onto) to model structured domain and dialogue knowledge;
2. A semantic router that classifies intents for agent actioning;
3. Dialogue context control using retrieval-augmented generation (RAG), with fallback to
case-based retrieval when attribute-level matching is essential for accurate responses; and
4. A dual memory approach, combining short-term (dynamic) session memory and distilled
long-term memory, to support continual personalisation.</p>
      <p>We also present results from a comparative evaluation study on ontology-guided intent routing using
the iCARE framework (specifically evaluating properties 1 and 2).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Multi-Agent Dialogue Framework</title>
      <p>
        The iCARE framework couples fact-grounded dialogue with modular control (Figure 1). One or more
dialogue intents are handled by agents. Each agent interfaces with internal or external tools (e.g., a
risk prediction model) to execute its assigned tasks and can retrieve relevant information to validate
responses and incorporate educational and explanatory content. Contextual information is drawn
from shared long-term and short-term memory, organised by an ontology that represents concepts
relevant to the user’s personal context, necessary situational awareness, and ongoing dialogue
context (including current conversation goals) [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ].
      </p>
      <p>
        The planning pipeline begins with ontology-guided intent disambiguation. If a user utterance
is found to be under-specified (with unfilled slots) or maps to multiple ontology
classes/intents, iCARE asks one or two targeted clarifying questions. The user’s replies are then stored in triple
form within short-term memory, supporting the maintenance of dynamic conversational context. The
resulting (disambiguated) context is then mapped to canonical intent ID(s). If no suitable semantic
intent match is found, it falls back to the label OTHER and is flagged for human intervention. These
intents and resolved slots update short-term memory and are promoted to long-term memory to manage
conversational continuity. The mapping between short and long-term memory must consider not only
what to summarise from an extended dialogue but also how to do so appropriately [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
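<p>As a toy sketch of this mapping step (the bag-of-words similarity, intent list, and threshold below are our illustrative assumptions, not the iCARE implementation), mapping a clarified utterance to a canonical intent ID with an OTHER fallback might look like:</p>

```python
# Illustrative sketch: map a disambiguated utterance to a canonical intent ID,
# falling back to OTHER (flagged for human review) below a similarity threshold.
# The intent list, toy similarity, and threshold are our assumptions.

def _tokens(text):
    return [w.strip(".,?!") for w in text.lower().split()]

def jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa.intersection(sb)) / max(1, len(sa.union(sb)))

INTENT_IDS = {  # canonical intent ID -> representative phrasing
    "CHECK_BIOMARKER": "what is my last cholesterol reading",
    "ASSESS_RISK": "should i worry about my cardiovascular risk",
    "ASK_EXERCISE_PLAN": "can i start the recommended exercise plan",
}

def route_intent(utterance, threshold=0.2):
    scores = {iid: jaccard(_tokens(utterance), _tokens(proto))
              for iid, proto in INTENT_IDS.items()}
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        return best, False   # matched; no human intervention needed
    return "OTHER", True     # no suitable match; flag for human review
```

<p>In iCARE the matching would operate over ontology-grounded triples rather than raw token overlap; the fallback logic is the point illustrated here.</p>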
      <p>
        Given these fully clarified intents, the planner decomposes them into mini-plans consisting of intent
steps that can be executed by specialised agents. Determining which agent to invoke becomes a routing
task, which can draw on knowledge about the agents, such as capability and dependency information,
including agent descriptions, preconditions, and data dependencies. The planner also coordinates
sequential or parallel execution to maintain coherent dialogue flow [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. For example, when asked
“Would a run this afternoon be recommended?”, the planner may first gather required vitals, then fuse
outputs from Biomarker, Risk, and Exercise agents, perhaps suggesting a brisk walk instead. Finally,
the Ethics–Policy agent checks compliance and, if it blocks the request, returns a counterfactual
guided response, such as: "If your heart-rate threshold were lower, I could approve that run".
      </p>
      <p>
        In trivial plans, an intent may directly map to an agent; however, multiple intents may be handled by
a single agent (many-to-one), and a given intent may be relevant to different agents across plan steps
(many-to-many). Such routing structures have precedent in specialist-agent coordination frameworks [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
In the many-to-many case, a case library of prior plan steps (queried via case-based retrieval or few-shot
exemplars) can guide mapping when multiple handlers are available or when policy/availability matters
(e.g., cost tiers, privacy).
      </p>
      <sec id="sec-2-1">
        <title>2.1. Knowledge Representation</title>
        <p>An example of domain concepts (with some instantiations) and relationships modeled by iCARE-Onto
appears in Figure 2. Unique to this modeling are the memory type annotations (coloured tags in the
figure) that help structure short- and long-term memory, supporting coherence in dialogue management.
Specifically, three types can be identified: semantic, such as factual knowledge that should ideally be
represented as relational triples; episodic, where information is tied to specific events, timestamps,
or conversational workflows (e.g., an intent followed by a recommendation and later a user-reported
outcome); or procedural, where a known procedure needs to be carried out step by step by an agent
(see examples in Table 1).</p>
        <p>Accordingly, each ontological concept can be annotated with a preferred memory type, so that when
concrete instantiations arrive, the system knows where to store them. For example, Biomarker instances
such as “Total cholesterol = 3.9 mmol/L on 25–06–2025” are of episodic type memory; the class-level
fact “Dyslipidaemia increases cardiovascular risk” associated with Late Effect → Cardiovascular Risk
is semantic memory; and the multi-step Exercise PhyxioPlanA under Intervention is recognised as
procedural memory (as it will have a fixed set of exercises).</p>
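<p>The memory-type annotation can be sketched as a simple dispatch table; the concept names and payloads below are illustrative, following the examples above:</p>

```python
# Illustrative sketch (names are ours, not iCARE-Onto's): each ontology concept
# carries a preferred memory type, so incoming instantiations are stored in the
# matching memory partition (semantic, episodic, or procedural).

MEMORY_TYPE = {  # concept -> preferred memory type
    "Biomarker": "episodic",        # timestamped readings
    "LateEffectRisk": "semantic",   # class-level facts as relational triples
    "ExercisePlan": "procedural",   # fixed step-by-step interventions
}

memory = {"semantic": [], "episodic": [], "procedural": []}

def store(concept, payload):
    mtype = MEMORY_TYPE.get(concept, "episodic")  # default: treat as an event
    memory[mtype].append((concept, payload))
    return mtype

store("Biomarker", ("TotalCholesterol", "3.9 mmol/L", "2025-06-25"))
store("LateEffectRisk", ("Dyslipidaemia", "increases", "CardiovascularRisk"))
store("ExercisePlan", ("PhyxioPlanA", ["warm-up", "brisk walk", "cool-down"]))
```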
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Planning Pipeline: Disambiguation with iCARE-Onto</title>
        <p>
          Since utterances are mapped to intents for planning, ill-formed or underspecified inputs must first be
detected and disambiguated (via brief, ontology-guided clarifications) before routing [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. For this,
iCARE-Onto (see Figure 2) is used in the disambiguation phase to harness the structured relationships
among classes, properties, and instances to detect when essential information is missing or
underspecified. For example, if the user utterance provides a numeric value without specifying what it refers to
(e.g., "is 140 okay?"), the ontology can be used to expose that several properties of the relevant class
(e.g., Biomarkers - blood pressure, heart rate, glucose) could accept such a value. Additionally, the
range of possibilities can further be narrowed given the context of the user, such as previous recorded
measurements. When there is insufficient detail in the dialogue context to resolve the ambiguity, this
ontological knowledge allows the system to identify the precise property that is missing and to generate
a targeted clarification follow-up question (e.g., "What measurement does 140 refer to?" or "Did you
mean your last BP reading?”). Similarly, when an utterance maps to multiple possible classes or relations,
ontology constraints (domain, range, cardinality) are used to prune candidates or formulate the most
discriminative question to minimise the number of clarification turns. By systematically detecting
ambiguity patterns based on ontology structure (e.g., missing property type, missing relation
arguments, multiple class candidates), short clarifying turns can be used to refine or complete the semantic
representation of an utterance through follow-up questions. Figure 3 shows how brief ontology-guided
clarification turns, together with short and long-term memory, collect the minimal information needed
before it is handed over to the planner for agent assignment (here to the Triage agent).
        </p>
        <p>Listing 1 gives the RAG prompt skeleton over the disambiguated text; Listing 2 shows the worked example output.</p>
        <preformat>
Role: Task Planner in a multi-agent dialogue system.

Input (disambiguated utterance):
triples:
  User:u123 | hasBiomarker | TotalCholesterol
  Biomarker:TotalCholesterol | last_value | 6.7 mmol/L @2025-10-10
  Conversation:Goal | ask | risk_explanation
  Conversation:Goal | ask | exercise_plan_start?

Ontology intent labels (retrieval top-k):
  CHECK_BIOMARKER, ASSESS_RISK, ASK_EXERCISE_PLAN, SET_REMINDER, ...

Instructions:
1) Split the user request into plan steps.
2) Find best match intent id from list, else "OTHER".
3) Return JSON with fields: step_id, intent_id, span, args, depends_on, confidence.
        </preformat>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Planning Pipeline: Decomposing with Task Planner</title>
        <p>In iCARE, a retrieval-augmented LLM first embeds the disambiguated user utterance and retrieves the
top-k ontology intent labels. It segments the request and returns a JSON array of plan-step candidates
with fields (e.g., step_id, intent_id, args, depends_on, similarity). See Figure 4 and Listings 1–2 for an
example involving the utterance “Hi! What’s my last cholesterol reading? Should I worry? Would it be
OK to start the recommended exercise plan?” (The JSON shown is illustrative, not exhaustive). The
Task Planner then organises these candidate intents into a mini-plan, capturing dependencies, such as
needing the biomarker values for risk assessment, and marks any parallelisable steps or follow-ups.</p>
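<p>The mini-plan construction can be illustrated with a small scheduler; the field names mirror the plan-step JSON above (step_id, depends_on), but the topological ordering code is our sketch, not the iCARE Task Planner:</p>

```python
# Sketch of mini-plan ordering: steps are grouped into batches whose
# dependencies are already satisfied; steps in the same batch may run in
# parallel. Field names follow the plan-step JSON; the scheduler is ours.

def order_steps(steps):
    done, ordered = set(), []
    pending = list(steps)
    while pending:
        ready = [s for s in pending if set(s["depends_on"]).issubset(done)]
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        ordered.append([s["step_id"] for s in ready])  # one parallel batch
        done.update(s["step_id"] for s in ready)
        pending = [s for s in pending if s["step_id"] not in done]
    return ordered

plan = [
    {"step_id": "s1", "intent_id": "CHECK_BIOMARKER", "depends_on": []},
    {"step_id": "s2", "intent_id": "ASSESS_RISK", "depends_on": ["s1"]},
    {"step_id": "s3", "intent_id": "ASK_EXERCISE_PLAN", "depends_on": ["s2"]},
]
```

<p>For the worked example, the ordering reproduces the dependency chain CHECK_BIOMARKER → ASSESS_RISK → ASK_EXERCISE_PLAN, each in its own sequential batch.</p>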
        <p>Next, the Intent Router binds each step to an appropriate agent. It first matches the disambiguated
frame against intent schemas to identify candidate handlers. When a single intent in a step is uniquely
handled by an agent, the mapping is obtained directly from the ontology’s handledBy relation. When
multiple candidate agents exist, a learned router (few-shot LLM-as-Router or ProtoKNN) selects the best
match based on the step context. Compound steps containing multiple intents are similarly mapped to an
agent capable of performing all included actions. Low-confidence cases can trigger a human-in-the-loop
recommendation.</p>
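<p>A minimal sketch of this binding rule follows; the handledBy table, confidence threshold, and agent names are illustrative assumptions rather than iCARE's actual ontology content:</p>

```python
# Sketch of step-to-agent binding: a unique handledBy mapping is taken
# directly from the ontology; otherwise a learned router scores candidates,
# and low confidence triggers human-in-the-loop review.

HANDLED_BY = {  # intent -> candidate agents (from the ontology's handledBy)
    "SET_REMINDER": ["LoggerAgent"],
    "ASSESS_RISK": ["RiskPredictionAgent"],
    "ASK_EXERCISE_PLAN": ["ExerciseCoach", "PhysioAgent"],
}

def bind(intent, learned_router, tau=0.5):
    candidates = HANDLED_BY.get(intent, [])
    if len(candidates) == 1:
        return candidates[0], 1.0      # unique ontology mapping: no learning needed
    agent, conf = learned_router(intent, candidates)
    if tau > conf:
        return "HUMAN_REVIEW", conf    # low-confidence escalation
    return agent, conf

# Stand-in for an LLM-as-Router or ProtoKNN scorer:
toy_router = lambda intent, cands: (cands[0], 0.9)
```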
        <p>In the given example in Figure 4, the router maps the intents in the two steps to agents:
(CHECK_BIOMARKER, ASSESS_RISK) → RiskPredictionAgent, and ASK_EXERCISE_PLAN →
ExerciseCoach. The planner enforces the intent dependency order CHECK_BIOMARKER → ASSESS_RISK
→ ASK_EXERCISE_PLAN and fuses the agents’ outputs into a single coherent reply.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Planning Pipeline: Intent Router</title>
        <p>We explore several approaches to intent routing using both LLMs and more resource-efficient matching
methods. With the LLM-as-Router setup, a prompt (see Figure 5) with few-shot in-context exemplars
per agent is used to predict the agent given a disambiguated dialogue context frame. LLMs are strong
zero-/few-shot routers: a single prompt with a handful of exemplars per agent can yield competitive
routing without task-specific training. Because they encode broad world knowledge
(approximate-retriever behaviour from large-scale pretraining), they cope well with open phrasing, rare synonyms,
and long-tail constructions. This makes an LLM-as-Router attractive for rapid onboarding of new
intents and agents, early-stage pilots with scarce exemplars, and domains where ontology cues (units,
dates, history) can be injected post-disambiguation directly into the prompt.</p>
        <p>
          The alternative is the non-parametric ProtoKNN router, a lazy learner using a Class-Prototype
k-Nearest-Neighbour based routing strategy, inspired by Prototypical Networks [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. To construct
prototypes, for every agent a ∈ 𝒜 we keep a small support set (like the few-shots in the LLM-as-Router
method), S_a = {(z_i, a)}, i = 1, …, n_a. Here z_i is the embedding of the disambiguated intents (i.e., resolved
dialogue context plus ontology-derived cues) and a the agent label. The class prototype for agent a
is the mean of its support embeddings: p_a = (1/n_a) ∑_{(z,a)∈S_a} z. Using these prototypes, the model
selects the relevant agent based on pairwise comparisons of the intent (plus disambiguated context)
embeddings and prototype embeddings. Given one or more intent embeddings q from the planner’s
mini-plan step, we compute the distance to each prototype and assign the step to the agent with the nearest
prototype: a⋆ = arg min_{a∈𝒜} ‖q − p_a‖₂; optionally, a confidence margin Δ may be enforced. If matching
confidence is low or ambiguous, control may need to be handed over to a catch-all General agent for
clarification.
        </p>
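<p>Assuming generic vector embeddings (iCARE uses PubMedBERT), the prototype construction and nearest-prototype assignment above can be sketched as:</p>

```python
import numpy as np

# Sketch of the ProtoKNN router: prototypes are per-agent means of support
# embeddings; a query is assigned to the nearest prototype, with an optional
# confidence margin handing ambiguous cases to a catch-all General agent.
# The toy 2-D embeddings and agent names are illustrative.

def build_prototypes(support):
    """support: dict agent -> list of support embedding vectors z."""
    return {a: np.mean(np.stack(zs), axis=0) for a, zs in support.items()}

def route(q, prototypes, margin=0.0):
    dists = sorted((np.linalg.norm(q - p), a) for a, p in prototypes.items())
    best_d, best_a = dists[0]
    # If the runner-up is within the margin, the match is ambiguous.
    if len(dists) > 1 and margin > dists[1][0] - best_d:
        return "General"
    return best_a

support = {
    "ExerciseCoach": [np.array([1.0, 0.0]), np.array([0.9, 0.1])],
    "RiskPredictionAgent": [np.array([0.0, 1.0]), np.array([0.1, 0.9])],
}
protos = build_prototypes(support)
```

<p>Because only |𝒜| prototype comparisons are needed per query (rather than one per support example), routing stays cheap and deterministic.</p>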
        <p>
          ProtoKNN’s advantages are fourfold. First, it is few-shot friendly: centroids can be calculated with
as few as five to ten labelled examples per agent [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], whereas keyword rules [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] require exhaustive
vocabularies. Second, it is tolerant of noise: averaging the support embeddings smooths over minor
wording differences, so small lexical variations are unlikely to cause misclassification. Third, it remains
interpretable: at inference the router can reveal the k-nearest prototype sentences (for example, “matched
remind-me agent’s prototype with similarity 0.92"), thereby providing human-readable justification
for routing decisions. Fourth, comparison of fixed embeddings makes routing both cheaper and
deterministic, unlike LLM-as-Router methods that vary with wording. However, ProtoKNN does
require representative coverage across intents; performance may lag if the support set omits common
phrasings or context variants or the embedding model is not appropriate for the domain. This is unlike
the LLM-as-Router which does have access to broad world knowledge.
        </p>
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Response Generation in iCARE Framework</title>
        <p>
          In iCARE, agents produce outputs that must be presented coherently to the user. Some responses
are straightforward results, while others are explanatory, designed to educate the user about their
health, for example, through a question–answer dialogue. To ensure factual reliability, each response is
grounded in external knowledge rather than relying solely on the LLM’s parametric memory [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>
          A conventional, vector-based RAG module embeds the user’s disambiguated utterance frame, together
with its candidate intents, as a single vector and returns the closest passages from a knowledge store
that contains both general medical literature and domain-specific documents (e.g., ESC guidelines
on cardio-oncology) and any case knowledge (e.g., patient instances). This approach works well for
simple, self-contained questions such as “What is dyslipidaemia?” However, complex health queries can
bundle several clinically important attributes:
“I finished anthracycline chemotherapy last year, my cholesterol is still high, and I’d like an
exercise plan that minimises cardiac risk – any recommendations?”
Treating the whole disambiguated utterance as one vector returns generic exercise advice and ignores
treatment history. Instead, we plan to adopt CBR-RAG [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], which represents the structured intent
as attribute–value pairs (treatment = anthracycline, biomarker = cholesterol, goal = exercise, plus
persona data), and matches locally against stored cases. Because situational and personal facts live
in the ontology, CBR-RAG can retrieve the case where “Exercise PhyxioPlanA” was personalised for
the treatment (e.g., anthracycline), yielding a far more relevant recommendation. At run-time the
ontology-organised intents guide the retriever to use basic RAG when no fine-grained attributes are
present, and to fall back to CBR-RAG whenever attribute-level matching is required. The Retriever also
has access to iCARE’s short and long-term memories.
        </p>
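<p>The run-time choice between basic RAG and CBR-RAG can be sketched as a simple dispatch on the disambiguated frame; the attribute names below are illustrative, and in iCARE this decision would be guided by the ontology-organised intents:</p>

```python
# Illustrative dispatch rule (our sketch, not the iCARE retriever): fall back
# to CBR-RAG whenever the disambiguated frame carries fine-grained
# attribute-value pairs; otherwise a single-vector RAG lookup suffices.

def choose_retriever(frame):
    """frame: dict of attribute -> value extracted from the clarified intent."""
    fine_grained = {"treatment", "biomarker", "goal"}  # need local matching
    if fine_grained.intersection(frame):
        return "CBR-RAG"   # attribute-wise case retrieval over stored cases
    return "RAG"           # single embedded query over the knowledge store

simple = {"question": "What is dyslipidaemia?"}
complex_q = {"treatment": "anthracycline", "biomarker": "cholesterol",
             "goal": "exercise"}
```

<p>On the bundled query above, the frame's treatment, biomarker, and goal attributes trigger CBR-RAG, so matching happens per attribute rather than over one averaged vector.</p>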
      </sec>
      <sec id="sec-2-6">
        <title>2.6. Response Generation and Refinement</title>
        <p>The Response Generator combines the agents’ outputs into a single, fluent reply. This stage has three
layers (illustrated as three stacked boxes in Figure 1).</p>
        <p>
          Response Templates &amp; Prompts: For every dialogue-act/intent pair, our framework provides a
small set of pre-defined prompt templates such as Explain-Risk, Recommend-Intervention,
ConfirmRecommendation, and Explain-Change. The generator fills the template with slot variables such as
{evidence_snippet}, {biomarker_value}, {guideline_ref}, then passes the result to the LLM for surface
realisation. Drafting responses as slot-filled templates improves factual accuracy and ensures
clinically approved wording. Coupling the template with a refinement prompt yields responses that are more
natural and less ‘robotic’ than strict slot-filled output. Integrating knowledge of a template taxonomy
within the prompt then facilitates flexibility to resolve sensitive attributes [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
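<p>A minimal sketch of the slot-filling step follows; the template wording and slot values are our illustration (not clinically approved text), and the filled draft would then go to the LLM for surface realisation:</p>

```python
# Sketch of slot-filled template drafting. Template names follow the text
# (Explain-Risk etc.); the wording and fill mechanism are illustrative.

TEMPLATES = {
    "Explain-Risk": ("Your total cholesterol {biomarker_value} reading relates "
                     "to your risk: {evidence_snippet} (see {guideline_ref})."),
}

def draft(template_id, slots):
    # The slot-filled draft is then passed to the LLM for refinement.
    return TEMPLATES[template_id].format(**slots)

text = draft("Explain-Risk", {
    "biomarker_value": "6.7 mmol/L",
    "evidence_snippet": "dyslipidaemia increases cardiovascular risk",
    "guideline_ref": "ESC 2022 Cardio-Oncology Guideline",
})
```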
        <p>Personalisation: A light post-processor adapts the draft response to the user’s persona and the
current situational context, both supplied by the ontology. Some examples are to:
• convert units or language variants (“mg/dL” → “mmol/L”);
• reference recent episodic events (“Last week you completed Exercise Plan A.”);
• adjust reading level (≤ 8th-grade Flesch–Kincaid for cardiovascular health management of cancer
survivors);
• adapt wording to accommodate accessibility needs (replacing colour-based cues with shape or text
descriptors for colour-blind users and simplifying visual references for low-vision readers);
• format the output for multimodal delivery (automatic-speech-recognition (ASR) and text-to-speech
pipelines so the same content can be voiced naturally when the user prefers audio interaction).</p>
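<p>One personalisation step from the list above, unit conversion, can be made concrete; 38.67 is the standard conversion factor for cholesterol from mg/dL to mmol/L (other analytes use different factors), and the one-decimal rounding is our assumption:</p>

```python
# Converting a cholesterol value from mg/dL to mmol/L as part of the
# personalisation post-processor. The divisor 38.67 is the standard
# cholesterol conversion factor; glucose, for example, uses a different one.

def cholesterol_to_mmol_per_l(value_mg_dl):
    return round(value_mg_dl / 38.67, 1)
```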
        <p>
          Ethics &amp; Compliance: Personalised drafts are finally assessed for classes of violations, such as: 1)
Medical safety, where every recommendation is cross-checked against the 2022 ESC Cardio-Oncology
Guideline, with missing evidence downgrading the advice to a cautionary statement [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]; 2) Hallucination
filtering, to ensure all factual claims map to a retrieved snippet or semantic triples, otherwise the
sentence is removed; and 3) Privacy &amp; tone, where the draft is screened for Protected Health Information
leakage, rude or biased language, and lack of empathy (i.e., poor bedside manner). Responses
must maintain a supportive, respectful tone in line with national AI-Act risk categories and
patient-communication guidelines [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. If any rule fails, the message is either redacted or returned to the LLM
with a corrective instruction. Only text that passes the review stage is delivered to the user.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation</title>
      <p>Having introduced the iCARE framework (in Figure 1), we focus this workshop paper’s evaluation on a
single question: do ontology-derived features improve intent routing to specialised agents?</p>
      <p>For our experiments, we created a parallel corpus. Each example contains (i) a raw utterance spanning
a set of intents; and (ii) a disambiguated counterpart forming the fully clarified intent with some ontology
annotation cues (classes, relations, units, temporal tags). Typically, the fully formed intent utterances
are gathered through the disambiguation planning step, which would interactively elicit missing slots
(in reference to the ontology) via brief clarifying turns before passing on the clarified utterance for
routing. In this study, we simulate those turns offline by providing a disambiguated counterpart whose
triples encode exactly the information that short clarification turns would have supplied based on the
ontology. A few examples of the initial utterance (raw text), and its fully clarified intent (disambiguated
text) obtained from the turn simulation process appear in Table 2.</p>
      <p>The parallel corpus was created using a two-step prompting procedure. First, for each likely intent
(aligned with the skeletal ontology in Figure 2), we asked an LLM (GPT-5) to generate 2–3 raw utterances
that, owing to their generality, could span multiple intents and therefore correspond to multiple
:handledBy ontology relations. Next, we re-prompted with the same raw text, the candidate intents,
and the skeletal ontology (classes, relations) to produce a disambiguated counterpart containing minimal
clarifying cues and triple annotations mapped to a single handling agent. For this study, we focused on
9 such agent classes (Exercise, Biomarker, Risk, Dietitian, Logger, Triage, Medication, Appointments,
Privacy). The resulting parallel corpus was written to CSV with fields id, intent_id, agent, raw_text,
disambig_text, and annotations. For simplicity, we assume a one-intent one-step planner, where each
step is handled by a single agent. An intent router here classifies the given step with the utterance (raw
or disambiguated) to one of the 9 agents. Extending this to multiple-intent plan steps and agent routing
is left for future work. We assess and report dataset quality through inter-annotator agreement.</p>
      <p>We evaluate the intent router under two conditions: (a) Raw (raw_text only); and (b)
Ontology-enriched (disambig_text with ontology-derived cues). We compare three routing methods:
LLM-as-Router; ProtoKNN; and a KNN variant. In our KNN variant, instead of aggregating same-class shots
into a single prototype, the sampled shots themselves serve as candidate prototypes for similarity
comparison; effectively a constrained KNN that ranks similarities only within the sampled shots rather
than the full training set. This restricted variant is used primarily to study the potential of forming
prototypes from randomly sampled shots per class. Both ProtoKNN and KNN use PubMedBERT
embeddings to represent the raw_text and disambig_text for similarity computations. For the
LLM-as-Router, we evaluate GPT-4o (mini), Gemini 2.5 Flash, and LLaMA 8B Instruct. Evaluation results report
routing accuracy on the parallel corpus (raw vs. ontology-enriched) using 4-fold cross-validation. In
each fold, the held-out fold serves as the test set of utterances for classification, while the remaining
folds are used to sample exemplar shots per class. Because the dataset provides only limited variation
across classes (i.e., low class coverage), we restricted our experiments to 1-, 2-, and 3-shot settings. We
do not assess downstream agent execution or response quality.</p>
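<p>The fold and shot-sampling protocol can be sketched as follows; the data layout, seed, and helper name are illustrative assumptions:</p>

```python
import random

# Sketch of the evaluation protocol: 4-fold cross-validation where the
# held-out fold is the test set and up to k exemplar shots per class are
# sampled from the remaining folds.

def folds_and_shots(rows, n_folds=4, k=3, seed=0):
    rng = random.Random(seed)
    rows = rows[:]
    rng.shuffle(rows)
    folds = [rows[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        by_class = {}
        for r in train:
            by_class.setdefault(r["agent"], []).append(r)
        shots = {a: rng.sample(rs, min(k, len(rs))) for a, rs in by_class.items()}
        yield test, shots
```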
      <sec id="sec-3-1">
        <title>3.1. Analysing the quality of the parallel corpus</title>
        <p>We used three independent annotators (blinded to the data generation prompt) to assess each row of
the parallel corpus via two questions:</p>
        <p>Q1 Disambiguation correctness (Yes/No): “Is the value in Disambig_Text the correct
disambiguated version of the Raw utterance?”</p>
        <p>Q2 Ontology coverage (Yes/No + Ontology figure): “Are all the concepts mentioned in
Disambig_Text present in the ontology figure? If No, list the missing concepts.”</p>
        <p>Three assessors independently evaluated each binary (yes/no) item. Overall agreement, measured
using Fleiss’ κ, was 0.10, indicating only slight agreement beyond chance. Given the low reliability, an
independent meta-assessor reviewed all instances where at least one assessor disagreed (approximately
50% of the LLM-generated examples) and adjudicated the final fully clarified utterance with assistance
from an LLM (ChatGPT-5) used solely for normalisation.</p>
        <p>The LLM reformulated the contested fully clarified utterances to ensure they met four reproducibility
criteria: (1) no advice or resolutions were included, (2) clarifications were added only to make the
utterance interpretable, (3) the clarified utterance is mapped to a single target agent, and (4) the phrasing
followed the corpus style of user statements with additional minimal ontology tags. The meta-reviewer
then reviewed the LLM-normalised version and made the final accept directly, or with revision, decisions.
The adjudicated utterances, together with their ontology annotations (minimal tags), were then used as
the corrected gold standard in our experiments.</p>
        <p>For Q2, when items were marked “No”, annotators listed concepts present in the disambiguated text
but absent from the ontology figure. We normalised these strings (lowercase, punctuation removal, unit
standardisation), collapsed common variants, and tallied the canonicalised terms to produce a short
list of out-of-ontology concepts. We found all of these to be useful suggestions that should serve as
potential candidates for ontology expansion.</p>
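<p>The normalisation and tallying step can be sketched with a small canonicalisation map; the variant map below is illustrative:</p>

```python
from collections import Counter
import re

# Sketch of the Q2 post-processing: normalise annotator-listed
# out-of-ontology strings (lowercase, strip punctuation, standardise units)
# and tally canonicalised terms. The variant map is our illustration.

CANONICAL = {"bp": "blood pressure", "hr": "heart rate"}  # collapse variants

def normalise(term):
    t = re.sub(r"[^\w\s/]", "", term.lower()).strip()  # keep "/" for units
    t = t.replace("mg/dl", "mg/dL").replace("mmol/l", "mmol/L")
    return CANONICAL.get(t, t)

def tally(terms):
    return Counter(normalise(t) for t in terms)
```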
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <p>Across all results, ontology-clarified utterances consistently led to superior performance. Further,
increasing the number of support examples (i.e., the number of shots from 1 to 3) results in substantial
improvement across the board, with approximately 15-20% accuracy gains observed for the disambiguated
text results. For Raw texts, performance improvements with additional shots are modest and often
inconsistent, indicating that the limited semantic clarity of the original utterances (in the absence of
the ontology) constrains the models’ ability to form well-separated routing class representations.</p>
      <p>Comparing ProtoKNN with the KNN baseline (see Tables 4 and 3), we find the use of prototypes
provides comparable accuracy results. With regard to latency, ProtoKNN achieved lower latency than
KNN, with improvements growing from 0.3% in 1-shot to 6.1% in 3-shot settings. On average, ProtoKNN
reduced latency by 3.4%. Note that in our implementation both ProtoKNN and the KNN baseline use only the
randomly selected shots for local neighbourhood computation. With larger support sets and improved
class coverage, prototype-based methods will offer efficiency gains by reducing similarity computations
while maintaining or enhancing accuracy relative to KNN. Among the distance metrics, Mahalanobis
performs best, especially with ProtoKNN, likely because in large, non-normalised vector spaces it better
manages noisy feature directions (variance over prototype aggregation) and decorrelates dimensions.
With KNN both Mahalanobis and Euclidean are comparable.</p>
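      <p>A minimal sketch of prototype-based routing under a Mahalanobis metric, in the spirit of the comparison above; the toy 2-D embeddings, intent labels, and regularisation constant are assumptions for illustration, not the paper's implementation:</p>

```python
import numpy as np

def prototypes(X, y):
    """Collapse each class's support embeddings into a mean prototype."""
    classes = sorted(set(y))
    return classes, np.stack([X[np.array(y) == c].mean(axis=0) for c in classes])

def mahalanobis_route(query, X, y, eps=1e-3):
    """Route a query embedding to the nearest class prototype under a shared
    (regularised) Mahalanobis metric estimated from the support set."""
    classes, protos = prototypes(X, y)
    cov = np.cov(X, rowvar=False) + eps * np.eye(X.shape[1])  # decorrelate dimensions
    inv = np.linalg.inv(cov)
    diffs = protos - query
    d2 = np.einsum("ij,jk,ik->i", diffs, inv, diffs)  # squared Mahalanobis distances
    return classes[int(np.argmin(d2))]

# Toy support set: two intent classes in a 2-D embedding space.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = ["reminder", "reminder", "symptom", "symptom"]
print(mahalanobis_route(np.array([0.1, 0.0]), X, y))  # -> reminder
```

Because each class is reduced to a single prototype, the router computes one distance per class rather than one per support example, which is the source of the latency advantage noted above.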
      <p>
        Table 5 presents the LLM-as-Router results, which clearly show performance gains of over 20%
compared to ProtoKNN. Among the evaluated models, GPT-4o mini demonstrates the strongest performance,
achieving 92.2% accuracy with just 3 shots. The LLM-as-Router setup also allows us to explore
zero-shot prompting, where no exemplars from the dataset are provided, offering insight into the models’
in-built world knowledge relevant to the iCARE routing task. Notably, even in the zero-shot setting,
GPT-4o mini achieves 76.56% accuracy, rising to 92.2% with three shots. This clearly
shows how additional domain evidence (i.e., ontology-disambiguated, fully clarified intents) enhances
performance. Gemini 2.5 Flash’s results are also noteworthy, achieving only about 3% lower accuracy
than GPT-4o mini (with 3 shots), despite having no connection to the LLM family used for dataset
generation or validation (in this case, we had used ChatGPT 5). These results also suggest that the use
of an ensemble [
        <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
        ] of LLM-as-Routers could be an interesting direction to explore, provided that
latency costs are kept low. For instance, we found that, compared to ProtoKNN, the LLM variants had
latency increases of approximately 85–92%, with GPT-4o mini, Gemini 2.5 Flash, and LLaMA-3-8B all
showing substantially higher inference times across all shot settings. This demonstrates the efficiency
advantage of ProtoKNN for low-latency prediction. These results suggest the potential for a hierarchical
hybrid approach, where ProtoKNN routing is invoked in regions with good evidential coverage, while
the more computationally costly LLM router is employed otherwise.
      </p>
    </sec>
    <sec id="sec-conclusion">
      <title>5. Conclusion and Future Work</title>
      <p>iCARE integrates lightweight knowledge to precisely capture intents, enabling LLMs to keep dialogues
fluent yet factual within an agentic setup. The ProtoKNN approach offers a lightweight, easily deployable
intent router that can operate offline, an important advantage for in-home care environments with
limited connectivity. In contrast, the LLM-as-Router variants demonstrated higher accuracy in our
simulated experiments, despite the limited dataset size. This contrast highlights the complementary
strengths of symbolic and generative approaches and the potential for hybrids. A major challenge in
this research area remains the scarcity of domain-specific data. Our experience indicates that even
state-of-the-art LLMs do not produce fully reliable data without human oversight, although they serve
as a valuable foundation for dataset generation. In our simulated corpus, each intent maps to a single
agent handler, allowing a clear test of disambiguation effects on intent routing; extending this to
many-to-many mappings is left for future work. Effective task decomposition will be essential to ensure
that generative models follow accurate, well-structured routes. Future work could explore case-based
planning for managing multiple intents and hybrid routing strategies [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] that combine low-latency,
lazy-learning methods such as ProtoKNN with LLM-as-Router mechanisms when exemplar coverage is
sparse.
      </p>
    </sec>
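    <p>The hybrid routing direction discussed above could be sketched as a confidence-gated dispatcher; the coverage threshold, the two router callables, and the toy intents below are illustrative assumptions, not the paper's implementation:</p>

```python
def hybrid_route(query_emb, support, knn_router, llm_router, tau=0.8):
    """Use the cheap ProtoKNN-style router when the query falls in a region
    with good evidential coverage; fall back to the costly LLM router otherwise.

    `knn_router` returns (intent, confidence); `llm_router` returns an intent.
    `tau` is an assumed coverage/confidence threshold.
    """
    intent, confidence = knn_router(query_emb, support)
    if confidence >= tau:
        return intent, "protoknn"          # low-latency path
    return llm_router(query_emb), "llm"    # sparse-coverage fallback

# Toy routers: confidence here is a stand-in for exemplar coverage.
def toy_knn(q, support):
    return ("medication_reminder", 0.9 if q in support else 0.3)

def toy_llm(q):
    return "escalate_to_clinician"

print(hybrid_route("seen", {"seen"}, toy_knn, toy_llm))   # protoknn path
print(hybrid_route("novel", {"seen"}, toy_knn, toy_llm))  # llm fallback
```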
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was funded by the European Union’s Horizon Europe programme under grant agreement No.
101213323. The views and opinions expressed are those of the authors only and do not necessarily reflect
those of the European Health and Digital Executive Agency or the European Union. Neither the European
Union nor the granting authority can be held responsible for them.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The authors employed language technologies only for editorial assistance and dataset bootstrapping
under human oversight: (i) ChatGPT (GPT-5) was used to generate a synthetic parallel corpus that was
subsequently curated and evaluated via a human study; (ii) Grammarly in Overleaf for paraphrasing,
re-wording, and grammar/spelling checks; and (iii) ChatGPT for minor improvements to writing style
and clarity. No generative tool was used to create research content, analyses, figures, or citations.
All text was reviewed and approved by the authors who take full responsibility for the publication’s
content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] OpenAI, Introducing chatgpt,
          <year>2022</year>
          . URL: https://openai.com/index/chatgpt/, accessed: 2025-07-23.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chamola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hussain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guizani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Niyato</surname>
          </string-name>
          ,
          <article-title>Transforming conversations with ai: A comprehensive study of chatgpt</article-title>
          ,
          <source>Cognitive Computation 16</source>
          (
          <year>2024</year>
          )
          <fpage>2487</fpage>
          -
          <lpage>2510</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sahoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Meharia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chadha</surname>
          </string-name>
          ,
          <article-title>A comprehensive survey of hallucination in large language, image, video and audio foundation models</article-title>
          ,
          <source>in: EMNLP</source>
          <year>2024</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          ,
          <year>2024</year>
          , pp.
          <fpage>11709</fpage>
          -
          <lpage>11724</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Hatalis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Christou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Myers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lambert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Amos-Binks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dannenhauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dannenhauer</surname>
          </string-name>
          ,
          <article-title>Memory matters: The need to improve long-term memory in LLM-agents</article-title>
          ,
          <source>in: Proceedings of the AAAI Symposium Series</source>
          , volume
          <volume>2</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>277</fpage>
          -
          <lpage>280</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          , W.-t. Yih,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          , et al.,
          <article-title>Retrieval-augmented generation for knowledge-intensive nlp tasks</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>9459</fpage>
          -
          <lpage>9474</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Bach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bergmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Brand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Caro-Martínez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Eisenstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. W.</given-names>
            <surname>Floyd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jayawardena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Leake</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lenz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Malburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Ménager</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schack</surname>
          </string-name>
          , I. Watson,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wilkerson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Wiratunga</surname>
          </string-name>
          ,
          <source>Case-Based Reasoning Meets Large Language Models: A Research Manifesto For Open Challenges and Research Directions</source>
          ,
          <year>2025</year>
          . URL: https://hal.science/hal-05006761, working paper or preprint.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Zhuo</surname>
          </string-name>
          ,
          <article-title>Integrating ai planning with natural language processing: A combination of explicit and tacit knowledge</article-title>
          ,
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          <volume>16</volume>
          (
          <year>2025</year>
          )
          <fpage>1</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cheng</surname>
          </string-name>
          , W. Liu,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. T.</given-names>
            <surname>Leong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          , Cooper:
          <article-title>Coordinating specialized agents towards a complex dialogue goal</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>38</volume>
          ,
          <year>2024</year>
          , pp.
          <fpage>17853</fpage>
          -
          <lpage>17861</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Hengst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wolter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Altmeyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kaygan</surname>
          </string-name>
          ,
          <article-title>Conformal intent classification and clarification for fast and accurate intent recognition</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: NAACL</source>
          <year>2024</year>
          ,
          <year>2024</year>
          , pp.
          <fpage>2412</fpage>
          -
          <lpage>2432</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Snell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Swersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zemel</surname>
          </string-name>
          ,
          <article-title>Prototypical networks for few-shot learning</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ukai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hirakawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yamashita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fujiyoshi</surname>
          </string-name>
          ,
          <article-title>This looks like it rather than that: Protoknn for similarity-based classifiers</article-title>
          ,
          <source>in: The Eleventh International Conference on Learning Representations</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Manias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chouman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shami</surname>
          </string-name>
          ,
          <article-title>Semantic routing for enhanced performance of LLM-assisted intent-based 5g core network management and orchestration</article-title>
          ,
          <source>in: GLOBECOM 2024 - 2024 IEEE Global Communications Conference</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>2924</fpage>
          -
          <lpage>2929</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>N.</given-names>
            <surname>Wiratunga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Abeyratne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jayawardena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Massie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Nkisi-Orji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weerasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Fleisch</surname>
          </string-name>
          ,
          <article-title>CBR-RAG: case-based reasoning for retrieval augmented generation in LLMs for legal question answering</article-title>
          ,
          <source>in: International Conference on Case-Based Reasoning</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>445</fpage>
          -
          <lpage>460</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>P.</given-names>
            <surname>Salimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Wiratunga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Corsar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wijekoon</surname>
          </string-name>
          ,
          <article-title>Towards feasible counterfactual explanations: A taxonomy guided template-based nlg method</article-title>
          ,
          <source>in: ECAI</source>
          <year>2023</year>
          , IOS Press,
          <year>2023</year>
          , pp.
          <fpage>2057</fpage>
          -
          <lpage>2064</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Lyon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lopez-Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Couch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Asteggiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Aznar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bergler-Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Boriani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cardinale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cordoba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Cosyns</surname>
          </string-name>
          , et al.,
          <article-title>2022 esc guidelines on cardio-oncology developed in collaboration with the european hematology association (eha), the european society for therapeutic radiology and oncology (estro) and the international cardio-oncology society (ic-os) developed by the task force on cardio-oncology of the european society of cardiology (esc)</article-title>
          ,
          <source>European Heart Journal-Cardiovascular Imaging</source>
          <volume>23</volume>
          (
          <year>2022</year>
          )
          <fpage>e333</fpage>
          -
          <lpage>e465</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>K. G.</given-names>
            <surname>van Leeuwen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Doorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gelderblom</surname>
          </string-name>
          ,
          <article-title>The ai act: Responsibilities and obligations for healthcare professionals and organizations</article-title>
          ,
          <source>Diagnostic and Interventional Radiology</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          , W.-L. Chiang,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Stoica</surname>
          </string-name>
          ,
          <article-title>Judging LLM-as-a-judge with mt-bench and chatbot arena</article-title>
          ,
          <source>in: 37th NIPS</source>
          , NIPS '23, Curran Associates Inc., Red Hook, NY, USA,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Es</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>James</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Espinosa</given-names>
            <surname>Anke</surname>
          </string-name>
          , S. Schockaert,
          <article-title>RAGAs: Automated evaluation of retrieval augmented generation</article-title>
          , in: N.
          <string-name>
            <surname>Aletras</surname>
          </string-name>
          , O. De Clercq (Eds.),
          <source>18th EACL: System Demonstrations</source>
          ,
          ACL, St. Julians, Malta,
          <year>2024</year>
          , pp.
          <fpage>150</fpage>
          -
          <lpage>158</lpage>
          . URL: https://aclanthology.org/2024.eacl-demo.16/.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mallick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ruhle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. V.</given-names>
            <surname>Lakshmanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Awadallah</surname>
          </string-name>
          ,
          <article-title>Hybrid LLM: Cost-efficient and quality-aware query routing</article-title>
          ,
          <source>arXiv preprint arXiv:2404.14618</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>