<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Learning to Refine: An Agentic RL Approach for Iterative SPARQL Query Construction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Floris Vossebeld</string-name>
          <email>f.r.vossebeld@student.utwente.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shenghui Wang</string-name>
          <email>shenghui.wang@utwente.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Electrical Engineering</institution>
          ,
          <addr-line>Mathematics and Computer Science</addr-line>
          ,
          <institution>University of Twente</institution>
          ,
          <addr-line>Drienerlolaan 5, 7522 NB Enschede</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Microsoft Netherlands</institution>
          ,
          <addr-line>Evert van de Beekstraat 354, 1118 CZ Schiphol</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Generating complex, logically-sound SPARQL queries for multi-hop questions remains a critical bottleneck for Knowledge Graph Question Answering, as the brittle nature of one-shot generation by Large Language Models (LLMs) hinders reliable interaction with structured data. Current methods lack the adaptive policies needed to dynamically debug queries based on real-time execution feedback. This paper introduces a novel agentic framework where an LLM learns a resilient policy for the sequential process of iterative SPARQL construction. We show that a compact 3B-parameter model, trained exclusively via outcome-driven Reinforcement Learning (GRPO) without supervised fine-tuning, can learn effective policies for this task, discovering how to systematically recover from execution errors and refine its queries toward a correct answer. On a curated, executable single-answer subset of LC-QuAD 2.0, our agent achieves 49.7% accuracy post-entity-linking, a significant 17.5 percentage point improvement over the strongest iterative zero-shot baseline. Further analysis reveals that while the agent's capability is driven by RL, its performance is enhanced by an explicit deliberative reasoning step that acts as a cognitive scaffold to improve policy precision. This work presents a generalizable blueprint for teaching agents to master formal, symbolic tools through interaction, bridging the gap between probabilistic LLMs and the structured world of Knowledge Graphs.</p>
</abstract>
      <kwd-group>
        <kwd>Knowledge Graph Question Answering</kwd>
        <kwd>Agentic Language Models</kwd>
        <kwd>SPARQL Query Generation</kwd>
        <kwd>Reinforcement Learning</kwd>
        <kwd>Iterative Query Construction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Consider the question “Which actors starred in movies directed by the director of Inception?”
Answering this requires identifying the film, finding its director, retrieving the director’s other films, and
then the actors in those films, each step relying on correct schema navigation, relation selection, and
entity disambiguation. Translating this full path into a correct SPARQL query in one shot is error-prone.
An alternative is to incrementally build and test parts of the query, adapt based on results, and correct
mistakes mid-way, requiring interaction with the KG as a semantic environment.</p>
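      <p>For concreteness, the one-shot target for this question is a single multi-hop SPARQL query. A minimal sketch in Python against the public Wikidata endpoint (illustrative only, and not part of our experimental setup; it assumes the usual Wikidata identifiers wd:Q25188 for Inception, wdt:P57 for director, and wdt:P161 for cast member):</p>
      <preformat>
# Illustrative one-shot query for the Inception example (public endpoint;
# identifiers assumed: Q25188 = Inception, P57 = director, P161 = cast member).
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
SELECT DISTINCT ?actorLabel WHERE {
  wd:Q25188 wdt:P57 ?director .    # the director of Inception
  ?film wdt:P57 ?director .        # other films by that director
  ?film wdt:P161 ?actor .          # actors in those films
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

endpoint = SPARQLWrapper("https://query.wikidata.org/sparql")
endpoint.setQuery(QUERY)
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["actorLabel"]["value"])
      </preformat>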
      <p>While integrating LLMs with KGs is actively researched [15, 16, 14], methods often rely on fixed
interaction logic, prompting strategies applied to base models, or using KG information primarily as
retrieved context. They typically do not involve fine-tuning the model specifically to learn adaptive
policies for the iterative construction of structured SPARQL queries based on execution feedback, which
is crucial for handling the complexity and potential errors inherent in multi-step KG interactions.</p>
      <p>This paper addresses this gap by proposing an agentic framework in which a language model learns
a policy for iterative SPARQL query construction through interaction with a knowledge graph. The
agent operates in a think–act–observe loop: it reasons about the current state (&lt;think&gt;), generates
a SPARQL query or final answer ( &lt;query&gt;, &lt;answer&gt;), and receives execution feedback from the KG
(&lt;query_result&gt;). Rather than relying on static, one-shot generation, the agent adapts its strategy
based on results, including errors or empty outputs, progressively refining its queries. To enable
this behavior, we fine-tune a compact LLM using Group Relative Policy Optimization (GRPO), a
reinforcement learning algorithm designed for sparse, outcome-based rewards. The agent learns not
only to generate queries, but to interpret feedback and dynamically debug or explore, improving
robustness in complex multi-hop scenarios.</p>
      <p>This motivates the following research questions:</p>
      <sec id="sec-1-1">
        <title>Research questions</title>
<p>RQ1: How can an LLM learn to iteratively build and refine SPARQL queries using execution
feedback to answer complex multi-hop KG questions?
RQ2: Can reinforcement learning effectively train such an agent to produce accurate answers from
outcome signals alone?
RQ3: How does this iterative, RL-guided approach compare with static or prompt-only baselines
on the LC-QuAD 2.0 benchmark?</p>
<p>The remainder of this paper reviews related work (§2), presents our methodology (§3), details the
experiments and results (§4 and §5), and closes with a discussion and conclusions (§6).</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>Multi-hop KGQA requires combining structured reasoning, language understanding, and interaction.
We review prior work on traditional KGQA methods, agentic LLMs, and reinforcement learning,
highlighting how our approach integrates symbolic interaction, tool-based reasoning, and adaptive
query construction to address the unique challenges of multi-hop KGQA.</p>
      <p>Multi-hop KGQA and traditional approaches Knowledge Graph Question Answering (KGQA)
maps natural language questions to structured answers by reasoning over triples in a knowledge
graph (KG) [5, 10]. While single-hop questions can be resolved through direct relations, multi-hop
KGQA requires compositional reasoning across multiple entities and relations [17, 18]. This increases
complexity due to path explosion [6], KG incompleteness and noise [7], and semantic ambiguity [8].</p>
      <p>Traditional KGQA methods fall into three categories: semantic parsing, retrieval-based approaches,
and embedding-based reasoning. Semantic parsing methods aim to generate formal queries such
as SPARQL [9, 19], but are brittle to linguistic variation and require substantial supervision [20, 21].
Retrieval-based methods extract subgraphs for ranking [10, 6] but often struggle with complex logic.
Embedding-based approaches reason in vector space [11, 17], sacrificing interpretability and logical
precision. Critically, these approaches apply fixed computation and lack iterative refinement mechanisms
based on intermediate feedback, hindering performance on complex multi-hop tasks.</p>
      <sec id="sec-2-1">
<title>Agentic LLMs and symbolic interaction in KGQA</title>
        <p>Recent work leverages LLMs as agents that combine internal reasoning with external actions. Agentic frameworks such as ReAct [13] and MRKL
[22] allow LLMs to operate in a loop of &lt;think&gt; → &lt;act&gt; → &lt;observe&gt;, interacting with tools to solve
complex tasks. In KGQA, systems like StructGPT [14] and Think-on-Graph [23] extend this idea by
giving LLMs access to navigation tools (e.g., retrieving neighbors or relations). However, these tools
are often predefined, and reasoning policies are static or heuristic-driven, limiting adaptivity.</p>
        <p>Our work shifts the focus from tool-based navigation to formal query generation. Inspired by ARTIST
[24], we treat SPARQL construction as the agent’s primary action. The model alternates between
&lt;think&gt;, &lt;query&gt;, and &lt;answer&gt; tags, learning to refine its reasoning through symbolic interaction
with the KG. This reframes KGQA as a dynamic decision-making process grounded in executable
feedback.</p>
        <p>This strategy also relates to test-time compute scaling [25], where additional reasoning efort is
allocated adaptively. Some approaches use inference-time sampling or search [26, 27]; others explicitly train
models to optimize reasoning under compute constraints [28, 29]. Our work falls in the latter category,
focusing on training an agent to efectively use interaction cycles for symbolic query refinement.</p>
      </sec>
      <sec id="sec-2-2">
<title>Reinforcement learning for iterative query generation</title>
        <p>While supervised fine-tuning enables LLMs to imitate reasoning (e.g., Chain-of-Thought prompting [30]), it relies heavily on high-quality
demonstrations and struggles with long-horizon credit assignment. Reinforcement learning (RL) offers
a more flexible alternative, enabling agents to learn from outcome-based interaction.</p>
        <p>We build on Group Relative Policy Optimization (GRPO), a recent RL algorithm designed for sparse,
symbolic environments. GRPO has shown success in math problem solving [28], SQL generation [31],
and general tool use [24]. In our setting, GRPO allows a compact LLM to learn symbolic refinement
strategies from task-level rewards alone, recovering from syntax errors, adapting query structure,
and issuing exploratory probes. This enables robust multi-hop reasoning without requiring step-level
supervision or hand-coded recovery heuristics.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Our approach transforms multi-hop KGQA from a one-shot generation task into an iterative, sequential
decision-making problem. We developed an autonomous agent, powered by a Large Language Model
(LLM), that learns an optimal policy for constructing and refining SPARQL queries through live
interaction with a Knowledge Graph (KG). The agent operates within a Reinforcement Learning (RL)
framework, where its behavior is optimized to maximize a reward signal reflecting the accuracy and
validity of its actions.</p>
      <p>The agent’s core is an interaction loop, conceptually illustrated in Figure 1. In each turn, the agent:
1) analyzes the history of the task, including the initial question and all previous KG interactions;
2) reasons about the next best step within a &lt;think&gt; block; and 3) acts by generating either a new
SPARQL query (&lt;query&gt;) or a final answer ( &lt;answer&gt;). This cycle repeats until the agent confidently
terminates the process. This section details the formal problem definition, the mechanics of the
agent–environment interaction, and the RL-based training process used to learn the query refinement policy.</p>
      <sec id="sec-3-1">
        <title>3.1. Formalism: an agentic Markov Decision Process</title>
<p>We model the iterative query construction task as a finite-horizon Markov Decision Process (MDP),
defined by the tuple (𝒮, 𝒜, 𝒯, ℛ, γ). The state s_t ∈ 𝒮 is the full interaction history so far; after the
agent takes action a_t and receives observation o_{t+1}, the next state is the concatenation
s_{t+1} = s_t ∘ a_t ∘ o_{t+1}. The episode terminates if the agent produces an answer action, exceeds the
maximum number of turns, or generates a malformed action.</p>
<p>Reward function (ℛ) The reward R(τ) is a terminal, outcome-based reward assigned at the end of a
full trajectory τ. It is a composite signal designed to evaluate the success of the agent’s multi-turn
strategy, as detailed in Section 3.3.1.</p>
<p>Policy (π_θ) The agent’s policy is the LLM itself, parameterized by a set of trainable QLoRA adapter
weights θ. The policy π_θ(a_t ∣ s_t) maps the current state (history) to a probability distribution over
the action space. Our objective is to find the optimal weights θ* that maximize the expected
terminal reward.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Agent–environment loop</title>
<p>One interaction turn:
1. Think → Act: π_θ appends a &lt;think&gt; block plus either &lt;query&gt; (SPARQL) or &lt;answer&gt;.
2. Environment:
2.1. &lt;answer&gt; ⇒ the episode ends.
2.2. &lt;query&gt; ⇒ the KG executes the query; the reply comes back as &lt;query_result&gt;.
2.3. Malformed output ⇒ the episode aborts with an error flag.
3. Loop until success or the turn limit T_max is reached (see the sketch below).</p>
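        <p>A minimal sketch of this loop in Python (illustrative only; policy.generate, kg.execute, and the helper names are hypothetical stand-ins for the LLM and the endpoint, not our implementation):</p>
        <preformat>
# Sketch of one think-act-observe episode; policy.generate and kg.execute
# are hypothetical stand-ins for the LLM and the SPARQL endpoint.
import re

T_MAX = 10                      # turn limit used in our experiments
SYSTEM_PROMPT = "..."           # task instructions + few-shot examples

def extract(tag: str, text: str):
    """Return the content of the last &lt;tag&gt;...&lt;/tag&gt; block, or None."""
    hits = re.findall(rf"&lt;{tag}&gt;(.*?)&lt;/{tag}&gt;", text, re.DOTALL)
    return hits[-1].strip() if hits else None

def run_episode(question, policy, kg):
    history = SYSTEM_PROMPT + f"\nQuestion: {question}\n"
    for _ in range(T_MAX):
        step = policy.generate(history)      # &lt;think&gt; plus &lt;query&gt; or &lt;answer&gt;
        history += step
        answer = extract("answer", step)
        if answer is not None:               # 2.1: &lt;answer&gt; ends the episode
            return answer, history
        query = extract("query", step)
        if query is None:                    # 2.3: malformed action aborts
            return None, history
        result = kg.execute(query)           # 2.2: execution feedback
        history += f"&lt;query_result&gt;{result}&lt;/query_result&gt;\n"
    return None, history                     # turn limit reached
        </preformat>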
        <sec id="sec-3-2-1">
          <title>3.2.1. Knowledge-graph execution environment</title>
          <p>For RL we require a fast, quota-free SPARQL endpoint. We therefore deploy a containerised
qEndpoint (truthy Wikidata HDT) inside our Azure VNet. A lightweight aiohttp client issues queries
asynchronously, with an in-memory LRU cache, pre-flight syntax checks (rdflib), and automatic
retry/back-off. The result is a private, low-latency endpoint that sustains the thousands of queries
demanded by training.</p>
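          <p>A condensed sketch of such a client (simplified and illustrative; the actual client differs in detail):</p>
          <preformat>
# Simplified sketch of the asynchronous query client described above.
import asyncio
import aiohttp
from rdflib.plugins.sparql import prepareQuery

_CACHE: dict = {}               # in-memory memoisation (LRU in practice)

async def execute(query: str, endpoint: str, retries: int = 3) -> str:
    if query in _CACHE:
        return _CACHE[query]
    try:
        prepareQuery(query)     # pre-flight syntax check via rdflib
    except Exception as err:
        return f"SYNTAX ERROR: {err}"
    async with aiohttp.ClientSession() as session:
        for attempt in range(retries):
            try:
                async with session.get(
                    endpoint,
                    params={"query": query, "format": "json"},
                    timeout=aiohttp.ClientTimeout(total=3),   # 3 s timeout
                ) as resp:
                    _CACHE[query] = await resp.text()
                    return _CACHE[query]
            except (asyncio.TimeoutError, aiohttp.ClientError):
                await asyncio.sleep(2 ** attempt)             # exponential back-off
    return "EXECUTION ERROR: endpoint unreachable"
          </preformat>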
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Agent prompting and structured actions</title>
        <p>The agent’s policy is guided by a detailed system prompt, which provides task instructions, defines the
required interaction format, and includes few-shot examples of successful refinement trajectories.</p>
<p>A critical technique during RL training is loss masking. The policy’s parameters θ are updated
only based on the log-probabilities of tokens generated by the agent (i.e., within &lt;think&gt;, &lt;query&gt;,
and &lt;answer&gt;). Tokens from the environment (the initial prompt and all &lt;query_result&gt; blocks) are
masked out from the loss calculation. This follows best practices from frameworks like ARTIST [24]
and focuses the learning signal squarely on the agent’s decision-making policy, rather than wasting
capacity trying to predict deterministic environment outputs.</p>
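          <p>A minimal sketch of the masking computation, assuming per-token role labels are recorded during rollout (illustrative, not the training code itself):</p>
          <preformat>
# Sketch: restrict the policy-gradient loss to agent-generated tokens.
import torch

def masked_policy_logprob(logps: torch.Tensor, agent_mask: torch.Tensor) -> torch.Tensor:
    """logps: per-token log-probabilities of one trajectory.
    agent_mask: 1 for tokens inside &lt;think&gt;/&lt;query&gt;/&lt;answer&gt;,
    0 for the prompt and all &lt;query_result&gt; blocks."""
    return (logps * agent_mask.float()).sum()   # environment tokens carry no gradient
          </preformat>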
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Policy optimization via Group Relative Policy Optimization (GRPO)</title>
<p>We fine-tuned the agent’s policy using Group Relative Policy Optimization (GRPO) [28], a
reinforcement learning algorithm well-suited to sparse, outcome-based rewards. GRPO compares the terminal
reward of each trajectory against others in a group sampled from the same prompt, using relative
performance to compute an advantage signal. This enables the agent to learn complex, sequential
query refinement strategies from interactive experience alone, without requiring a learned value function.</p>
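        <p>Concretely, the group-relative advantage can be sketched as follows (standard mean/standard-deviation normalization over a group of G rewards; illustrative):</p>
        <preformat>
# Sketch: group-relative advantages over G rollouts of one question.
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (G,) terminal rewards; rollouts above the group
    average receive positive advantage, those below it negative."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)
        </preformat>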
        <p>[Figure 2: the GRPO training pipeline. Questions are sampled in batches, the current policy
generates rollouts, and a terminal reward is computed for each trajectory. GRPO uses these rewards to
calculate a policy gradient, which is then used to update the LoRA adapter weights.]</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Reward design for effective learning</title>
          <p>The terminal reward R(τ) is defined as:

R(τ) = −1 if the trajectory is not structurally valid;
R(τ) = 1 + r_ans(τ) − (0.1 · n_err + 0.02 · n_t) otherwise.

Structural validity requires correct tag format and termination with an &lt;answer&gt;. The answer term
r_ans(τ) is +0.5 if the judge deems the answer correct and −0.2 otherwise. The cost term
0.1 · n_err + 0.02 · n_t penalizes inefficiency, where n_err is the number of failed SPARQL executions and
n_t the number of agent turns.</p>
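          <p>Read as code, the reward reduces to a small function of trajectory statistics (a sketch of the definition above; the argument names are ours):</p>
          <preformat>
# Sketch of the terminal reward R(tau) defined above.
def terminal_reward(valid: bool, judge_correct: bool, n_err: int, n_turns: int) -> float:
    """valid: correct tag format and termination with &lt;answer&gt;."""
    if not valid:
        return -1.0                                  # structural failure
    r_ans = 0.5 if judge_correct else -0.2           # answer term
    cost = 0.1 * n_err + 0.02 * n_turns              # failed executions + turns
    return 1.0 + r_ans - cost
          </preformat>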
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Training protocol and implementation</title>
          <p>The policy π_θ was fine-tuned using QLoRA [32] with the Unsloth library’s optimizations for memory
and speed. The training, depicted in Figure 2, was executed on an Azure ML compute cluster with
NVIDIA H100 GPUs. The GRPO training loop, sketched in code after this list, proceeds as follows:
1. Rollout generation: Sample a batch of questions. For each question, generate G = 16 full
agentic trajectories using the current policy π_θ.
2. Reward calculation: Compute the composite reward R(τ) for each of the G trajectories.
3. Policy update: Use the GRPO objective to calculate the policy gradient, where trajectories with
a reward greater than their group’s average contribute positively.
4. Parameter update: Update the LoRA adapter weights θ via the AdamW optimizer, applying loss
masking and KL-divergence regularization to maintain stability.</p>
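          <p>Schematically, one such update step can be sketched as follows (pseudostructure reusing the reward, masking, and advantage sketches above; rollout, stats, and optimizer are hypothetical placeholders):</p>
          <preformat>
# Pseudostructure of one GRPO update step (reuses the sketches above;
# rollout, stats, and optimizer are hypothetical placeholders).
import torch

def grpo_step(policy, kg, questions, optimizer, G: int = 16):
    for question in questions:                       # batch of 128 questions
        group = [rollout(policy, kg, question) for _ in range(G)]            # 1.
        rewards = torch.tensor([terminal_reward(*stats(t)) for t in group])  # 2.
        advs = group_advantages(rewards)                                     # 3.
        for tau, adv in zip(group, advs):
            # 4. masked policy-gradient step on the LoRA adapters
            #    (KL-divergence regularization omitted for brevity)
            loss = -adv * masked_policy_logprob(tau.logps, tau.agent_mask)
            loss.backward()
    optimizer.step()
    optimizer.zero_grad()
          </preformat>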
          <p>This cycle repeats, progressively improving the agent’s policy.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Design choices and scope</title>
<p>Several key design choices shaped this research. First, we proceed directly to RL
fine-tuning, deliberately bypassing a supervised fine-tuning (SFT) phase on gold trajectories. This tests
a critical hypothesis for the field: can the combination of a powerful base model’s instruction-following
ability and a well-designed RL reward signal be sufficient to learn complex, tool-using behaviors, thereby
reducing the dependency on costly, expert-curated demonstration data? Second, our reward function’s
heavy penalty for structural and execution errors was a deliberate choice to force the agent to prioritize
generating valid and executable SPARQL above all else. Finally, we must acknowledge two critical
scope limitations that bound our claims: our system’s performance is evaluated on a curated subset of
single-answer questions, and it relies on pre-linked entities provided in the dataset. We did not address
the significant challenges of entity linking or multi-answer aggregation, which remain out of scope for
this work.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>This section outlines the experimental setup used to evaluate our agentic reinforcement learning
approach to multi-hop KGQA. We describe the dataset curation process, model configuration, and
knowledge graph environment, followed by training details and a report on compute and energy usage
for reproducibility. We then present the comparative baselines and the evaluation methodology used to
assess performance.</p>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
<p>Following §3.2.1, we re-execute every gold query of LC-QuAD 2.0 against the frozen 2023-12-21 Wikidata
HDT dump and keep a ⟨question, gold query, answer⟩ triple only if the query (i) succeeds, (ii) returns exactly one
row, and (iii) yields a valid RDF term. The resulting corpus preserves entity, literal, and boolean answers
while discarding noisy items (Table 2).</p>
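        <p>A sketch of this filter (illustrative; run_gold_query and the RDF-term predicates are hypothetical helpers):</p>
        <preformat>
# Sketch of the curation filter; run_gold_query and the RDF-term
# predicates are hypothetical helpers.
def keep_item(item: dict, endpoint: str) -> bool:
    try:
        rows = run_gold_query(item["sparql"], endpoint)  # (i) must execute
    except Exception:
        return False
    if len(rows) != 1:                                   # (ii) exactly one row
        return False
    term = rows[0]
    # (iii) entity, literal, and boolean answers are all preserved
    return is_entity(term) or is_literal(term) or is_boolean(term)
        </preformat>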
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Core models and KG environment</title>
<p>We use the unsloth/Qwen2.5-3B-Instruct-bnb-4bit model<sup>1</sup>, selected for its instruction-following
quality and compatibility with 4-bit QLoRA fine-tuning. Parameter-efficient training is performed using
Unsloth’s QLoRA implementation with commonly used hyperparameters: rank 64, α = 16, dropout
0.05, learning rate 5 × 10−6, group size G = 16, KL coefficient β = 0.04, and batch size 128. These values
were selected through light, manual trial-and-error only; no systematic hyperparameter tuning was
conducted.</p>
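        <p>For reference, the stated adapter hyperparameters correspond roughly to the following standard PEFT configuration (a sketch; the experiments used Unsloth’s QLoRA implementation, whose interface differs):</p>
        <preformat>
# The reported adapter hyperparameters, expressed as a peft LoraConfig
# (illustrative; training used Unsloth's QLoRA implementation).
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                 # adapter rank
    lora_alpha=16,        # alpha = 16
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
# RL-side settings as reported: learning rate 5e-6, group size G = 16,
# KL coefficient beta = 0.04, batch size 128.
        </preformat>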
<p>The agent interacts with a private SPARQL endpoint (qEndpoint v2.5.2) loaded with the 2023-12-21
“truthy” HDT dump of Wikidata. All SPARQL queries are executed asynchronously with memoization,
exponential back-off, and a 3-second timeout.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Training and compute setup</title>
<p>We fine-tune the agent using Group Relative Policy Optimization (GRPO), a reinforcement learning
algorithm well-suited for sparse, outcome-based rewards. At each update step, G = 16 full roll-outs
are sampled for each of 128 training questions. Terminal rewards are computed based on final answer
correctness (see Section 3.3.1), and the LoRA weights are optimized using AdamW with a learning rate
of 5 × 10−6 and KL coefficient β = 0.04. Each interaction episode is capped at ten &lt;think&gt;–&lt;query&gt;
cycles, and individual SPARQL executions are limited to a 3-second timeout.</p>
        <p>Training is performed over a single epoch, converging in 11.5 hours on an NVIDIA H100 GPU (94
GB). This process consumed approximately 4.6 kWh of energy, which corresponds to an estimated 1.7
kg CO2e under the 2024 Dutch grid emission factor (0.37 kg CO2e/kWh). While the model is relatively
compact, reinforcement learning remains computationally intensive, and further work is needed to
evaluate the scalability and energy efficiency of this approach at larger scales or across multiple domains.</p>
        <p><sup>1</sup> Model commit ID 2672b58 on the HuggingFace Hub.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Comparative baselines</title>
        <p>We compare our agentic RL model to three baselines:</p>
          <p>B1: Direct QA (Zero-Shot CoT) The base model answers from its parametric knowledge only,
prompted with two chain-of-thought exemplars.</p>
          <p>B2: One-Shot SPARQL A single-turn prompt instructs the model to emit a full SPARQL query;
decoding uses temperature 0.2 and top-p 0.95.</p>
          <p>B3: Prompt-Guided Iterative Agent Our think–query loop without RL; identical prompt to the
RL-tuned agent and greedy decoding.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Evaluation protocol and metrics</title>
        <p>We evaluate KG-based agents using three key end-to-end metrics: answer accuracy, query executability,
and interaction length.</p>
        <p>Accuracy Correctness is determined by a frozen LLM-based evaluator (GPT-4o-nano) shared across
all systems, including Direct QA. The evaluator receives the question, gold scalar binding, and the
model’s &lt;answer&gt; response, and returns a Boolean verdict along with a justification. This allows
for semantically equivalent but non-identical answers (e.g., paraphrasing, formatting differences,
or unit conversions) to be marked as correct, unlike exact string matching.</p>
        <p>Executability rate The proportion of all SPARQL queries generated by a system that are syntactically
valid and execute successfully against the KG—computed as total successful executions divided
by the total number of queries issued across the test set.</p>
        <p>Average turns The mean number of agent interaction steps per question. While not a performance
metric, it serves as a diagnostic indicator of the agent’s reasoning depth and adaptivity.</p>
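        <p>A compact sketch of how these three metrics aggregate over a test set (illustrative; per-episode statistics are assumed to have been logged during evaluation):</p>
        <preformat>
# Sketch: aggregate the three evaluation metrics over logged episodes.
def aggregate_metrics(episodes: list) -> dict:
    """episodes: dicts with keys 'correct' (LLM-judge verdict), 'queries'
    (queries issued), 'successes' (executed without error), 'turns'."""
    n = len(episodes)
    return {
        "accuracy": sum(e["correct"] for e in episodes) / n,
        "executability": sum(e["successes"] for e in episodes)
                         / max(1, sum(e["queries"] for e in episodes)),
        "avg_turns": sum(e["turns"] for e in episodes) / n,
    }
        </preformat>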
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>5.1. Quantitative performance</title>
        <p>Our primary experiment evaluates the RL-Tuned Iterative Agent against increasingly capable
baselines. As shown in Table 3, performance steadily improves with greater interactivity. Relying on
parametric knowledge (B1) yields 16.3% accuracy. A single SPARQL query (B2) improves this to
19.7%, though hampered by a low 47.7% executability rate. The prompt-guided iterative agent (B3)
demonstrates the value of a refinement loop, reaching 32.2% accuracy.</p>
        <p>Our RL-Tuned Agent marks a transformative leap, achieving a final accuracy of 49.7%, an absolute
improvement of 17.5 percentage points over the strongest baseline. This gain is driven by a learned policy
for interaction, evidenced by the executability rate soaring to 81.0%. The improvement is statistically
significant, confirmed by McNemar’s test on the discordant pairs (n_RL-correct, baseline-wrong = 354 vs.
n_RL-wrong, baseline-correct = 130), yielding χ²(1) = 102.75, p ≪ .001.</p>
        <p>[Table 3 compares the parametric baseline (B1: Direct QA with CoT), the zero-shot SPARQL baselines
(B2: One-Shot SPARQL; B3: Prompt-Guided Agent), and our RL-Tuned Agent on these metrics.]</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Ablation: deconstructing agent performance</title>
<p>To isolate the sources of this gain, we trained a purely Reactive agent (no &lt;think&gt; block) with the same
RL process. Table 4 shows this agent still achieved 48.1% accuracy, confirming that outcome-driven RL
is the primary engine of performance, capable of learning effective strategies from interaction alone.
However, our main Deliberative agent performed best (49.7%), suggesting the &lt;think&gt; block acts as a
powerful cognitive scaffold. By prompting the model to externalize its plan, the structure regularizes
the learning process, leading to a more precise final policy.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Analysis of learning dynamics</title>
        <p>[Figure 3 tracks the learning dynamics over training: (a) smoothed average reward, (b) in-batch accuracy (%), (c) in-batch executability rate (%), and (d) average agent turns.]</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Qualitative and error analysis</title>
        <p>5.4.1. Error analysis: a shift to higher-quality failures.</p>
        <p>Reinforcement learning induces a crucial shift from syntactic incompetence to semantic reasoning
errors (Figure 4). The zero-shot baselines were plagued by fundamental failures: 57% of the One-Shot
agent’s failures were due to execution errors or refusing to generate a query at all. In stark contrast,
our RL-tuned agent nearly eliminated these issues, with such errors accounting for a negligible fraction
of failures. Its primary failure mode became Incorrect Logic (72.5% of its own failures). The baselines
fail because they cannot “speak SPARQL” correctly; our agent has mastered the tool’s language and
now fails on the much harder problem of reasoning correctly with it.</p>
        <p>5.4.2. Case study: learned resilience and strategic decomposition.</p>
<p>To illustrate the learned policy, we analyzed behavior on the complex question: “Name the Han dynasty
capital city with a twin town called Plovdiv.” A direct query fails. Our RL-Tuned agent correctly
diagnosed this, pivoted to an exploratory query to find all cities twinned with Plovdiv, and then used a
second verification query on the candidates to find the correct answer. This dynamic decomposition, a
direct result of RL training, contrasts sharply with the baseline, which became trapped in syntax and
logic errors.</p>
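        <p>The pivot can be illustrated with two probes of the kind described (a plausible reconstruction, assuming wd:Q459 for Plovdiv, wdt:P190 for twinned administrative bodies, and wdt:P1376 for “capital of”; the agent’s actual queries may differ):</p>
        <preformat>
# Plausible reconstruction of the agent's pivot (identifiers assumed:
# Q459 = Plovdiv, P190 = twinned administrative body, P1376 = capital of).
EXPLORE = """
SELECT ?city ?cityLabel WHERE {
  ?city wdt:P190 wd:Q459 .     # all cities twinned with Plovdiv
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""
# A follow-up verification probe per candidate (CANDIDATE is a placeholder):
VERIFY = """
ASK {
  wd:CANDIDATE wdt:P1376 ?state .   # was the candidate a capital of some state?
  # further constraints tying ?state to the Han dynasty would follow
}
"""
        </preformat>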
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion and Conclusion</title>
      <p>This work demonstrates that outcome-driven reinforcement learning (RL) enables compact language
models to learn robust, multi-hop reasoning strategies over knowledge graphs. Our agent, trained
using GRPO, significantly outperforms static and zero-shot baselines—closing the gap between symbolic
structure and neural flexibility. Beyond raw accuracy, we observe emergent behaviors such as adaptive
“compute scaling” and strategic query decomposition, showing that the agent learns to allocate effort
based on task complexity.</p>
      <p>[Figure 4: breakdown of failure modes by model (B2: One-Shot, B3: Prompt-Guided, RL-Agent Reactive,
RL-Agent Deliberative), with failure categories: execution failure, refused to query, and incorrect logic.]</p>
      <p>These findings suggest that even small LLMs, when trained with structured feedback, can learn
to navigate symbolic environments through interaction. While preliminary, this work highlights a
promising direction for combining language models and formal reasoning in a more adaptive and
interpretable way.</p>
<p>Our evaluation was limited to a curated subset of LC-QuAD 2.0 with gold entity links and
single-answer queries. We did not address open-domain entity linking, incomplete or noisy KGs, or more
complex answer types such as lists or aggregations. Additionally, while our approach is lightweight in
model size, training with reinforcement learning remains computationally demanding. We conducted
experiments on a single H100 GPU, which limits our ability to assess scalability. Questions around
energy efficiency, training cost, and feasibility for broader deployment remain open and deserve closer
attention in future work.</p>
      <p>Future work includes combining supervised fine-tuning with RL to reduce sample complexity;
extending to end-to-end KGQA by integrating a learned entity linker; adapting the framework to other
structured domains such as NL2SQL; and studying how the agent’s policy complexity scales with model
size and query difficulty.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used GPT-4o in order to: (i) check grammar and
spelling, (ii) assist with LaTeX formatting, and (iii) provide sparring and critical feedback on sections.
After using this tool, the author(s) reviewed and edited the content as needed and take full responsibility
for the publication’s content.</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>[2] D. Vrandečić, M. Krötzsch, Wikidata: a free collaborative knowledgebase, Communications of the</p>
      <p>ACM 57 (2014) 78–85. URL: https://dl.acm.org/doi/10.1145/2629489. doi:10.1145/2629489.
[3] A. Hogan, E. Blomqvist, M. Cochez, C. D’amato, G. D. Melo, C. Gutierrez, S. Kirrane, J. E. L. Gayo,
R. Navigli, S. Neumaier, A.-C. N. Ngomo, A. Polleres, S. M. Rashid, A. Rula, L. Schmelzeisen,
J. Sequeda, S. Staab, A. Zimmermann, Knowledge Graphs, ACM Computing Surveys 54 (2022)
1–37. URL: https://dl.acm.org/doi/10.1145/3447772. doi:10.1145/3447772.
[4] N. Noy, Y. Gao, A. Jain, A. Narayanan, A. Patterson, J. Taylor, Industry-scale knowledge graphs:
lessons and challenges, Communications of the ACM 62 (2019) 36–43. URL: https://dl.acm.org/
doi/10.1145/3331166. doi:10.1145/3331166.
[5] Y. Zhang, H. Dai, Z. Kozareva, A. J. Smola, L. Song, Variational Reasoning for Question Answering
with Knowledge Graph, 2017. URL: http://arxiv.org/abs/1709.04071. doi:10.48550/arXiv.1709.
04071, arXiv:1709.04071 [cs].
[6] H. Sun, T. Bedrax-Weiss, W. Cohen, PullNet: Open Domain Question Answering with Iterative
Retrieval on Knowledge Bases and Text, in: Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the 9th International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong,
China, 2019, pp. 2380–2390. URL: https://www.aclweb.org/anthology/D19-1242. doi:10.18653/
v1/D19-1242.
[7] H. Ren, H. Dai, B. Dai, X. Chen, M. Yasunaga, H. Sun, D. Schuurmans, J. Leskovec, D. Zhou, LEGO:
Latent Execution-Guided Reasoning for Multi-Hop Question Answering on Knowledge Graphs,
in: M. Meila, T. Zhang (Eds.), Proceedings of the 38th International Conference on Machine
Learning, volume 139 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 8959–8970.</p>
      <p>URL: https://proceedings.mlr.press/v139/ren21a.html.
[8] B. Y. Lin, X. Chen, J. Chen, X. Ren, KagNet: Knowledge-Aware Graph Networks for
Commonsense Reasoning, 2019. URL: http://arxiv.org/abs/1909.02151. doi:10.48550/arXiv.1909.02151,
arXiv:1909.02151 [cs].
[9] J. Berant, A. Chou, R. Frostig, P. Liang, Semantic Parsing on Freebase from Question-Answer
Pairs, in: D. Yarowsky, T. Baldwin, A. Korhonen, K. Livescu, S. Bethard (Eds.), Proceedings
of the 2013 Conference on Empirical Methods in Natural Language Processing, Association
for Computational Linguistics, Seattle, Washington, USA, 2013, pp. 1533–1544. URL: https://
aclanthology.org/D13-1160/.
[10] H. Sun, B. Dhingra, M. Zaheer, K. Mazaitis, R. Salakhutdinov, W. Cohen, Open Domain Question
Answering Using Early Fusion of Knowledge Bases and Text, in: Proceedings of the 2018
Conference on Empirical Methods in Natural Language Processing, Association for Computational
Linguistics, Brussels, Belgium, 2018, pp. 4231–4242. URL: http://aclweb.org/anthology/D18-1455.
doi:10.18653/v1/D18-1455.
[11] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, O. Yakhnenko, Translating Embeddings for</p>
      <p>Modeling Multi-relational Data., in: Neural Information Processing Systems, 2013.
[12] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, D. Zhou,
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, 2023. URL: http://arxiv.org/abs/
2201.11903. doi:10.48550/arXiv.2201.11903, arXiv:2201.11903 [cs].
[13] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, Y. Cao, ReAct: Synergizing Reasoning
and Acting in Language Models, 2023. URL: http://arxiv.org/abs/2210.03629. doi:10.48550/arXiv.
2210.03629, arXiv:2210.03629 [cs].
[14] J. Jiang, K. Zhou, Z. Dong, K. Ye, W. X. Zhao, J.-R. Wen, StructGPT: A General Framework for
Large Language Model to Reason over Structured Data, 2023. URL: http://arxiv.org/abs/2305.09645.
doi:10.48550/arXiv.2305.09645, arXiv:2305.09645 [cs].
[15] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, X. Wu, Unifying Large Language Models and Knowledge
Graphs: A Roadmap, IEEE Transactions on Knowledge and Data Engineering 36 (2024) 3580–3599.</p>
      <p>URL: http://arxiv.org/abs/2306.08302. doi:10.1109/TKDE.2024.3352100, arXiv:2306.08302 [cs].
[16] A. Chakraborty, Multi-hop Question Answering over Knowledge Graphs using Large
Language Models, 2024. URL: http://arxiv.org/abs/2404.19234. doi:10.48550/arXiv.2404.19234,
arXiv:2404.19234 [cs].
[17] A. Saxena, A. Tripathi, P. Talukdar, Improving Multi-hop Question Answering over Knowledge
Graphs using Knowledge Base Embeddings, in: Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics, Association for Computational Linguistics, Online,
2020, pp. 4498–4507. URL: https://www.aclweb.org/anthology/2020.acl-main.412. doi:10.18653/
v1/2020.acl-main.412.
[18] Y. Gu, S. Kase, M. Vanni, B. Sadler, P. Liang, X. Yan, Y. Su, Beyond I.I.D.: Three Levels of
Generalization for Question Answering on Knowledge Bases, in: Proceedings of the Web Conference
2021, 2021, pp. 3477–3488. URL: http://arxiv.org/abs/2011.07743. doi:10.1145/3442381.3449992,
arXiv:2011.07743 [cs].
[19] W.-t. Yih, M.-W. Chang, X. He, J. Gao, Semantic Parsing via Staged Query Graph Generation:
Question Answering with Knowledge Base, in: Proceedings of the 53rd Annual Meeting of
the Association for Computational Linguistics and the 7th International Joint Conference on
Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics,
Beijing, China, 2015, pp. 1321–1331. URL: http://aclweb.org/anthology/P15-1128. doi:10.3115/
v1/P15-1128.
[20] P. Liang, M. I. Jordan, D. Klein, Learning Dependency-Based Compositional Semantics, 2011. URL:
http://arxiv.org/abs/1109.6841. doi:10.48550/arXiv.1109.6841, arXiv:1109.6841 [cs].
[21] W.-t. Yih, M. Richardson, C. Meek, M.-W. Chang, J. Suh, The Value of Semantic Parse Labeling
for Knowledge Base Question Answering, in: Proceedings of the 54th Annual Meeting of the
Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational
Linguistics, Berlin, Germany, 2016, pp. 201–206. URL: http://aclweb.org/anthology/P16-2033.
doi:10.18653/v1/P16-2033.
[22] E. Karpas, O. Abend, Y. Belinkov, B. Lenz, O. Lieber, N. Ratner, Y. Shoham, H. Bata, Y. Levine,
K. Leyton-Brown, D. Muhlgay, N. Rozen, E. Schwartz, G. Shachaf, S. Shalev-Shwartz, A. Shashua,
M. Tenenholtz, MRKL Systems: A modular, neuro-symbolic architecture that combines large
language models, external knowledge sources and discrete reasoning, 2022. URL: http://arxiv.org/
abs/2205.00445. doi:10.48550/arXiv.2205.00445, arXiv:2205.00445 [cs].
[23] J. Sun, C. Xu, L. Tang, S. Wang, C. Lin, Y. Gong, L. M. Ni, H.-Y. Shum, J. Guo, Think-on-Graph:
Deep and Responsible Reasoning of Large Language Model on Knowledge Graph, 2024. URL:
http://arxiv.org/abs/2307.07697. doi:10.48550/arXiv.2307.07697, arXiv:2307.07697 [cs].
[24] J. Singh, R. Magazine, Y. Pandya, A. Nambi, Agentic Reasoning and Tool Integration for LLMs
via Reinforcement Learning, 2025. URL: https://www.microsoft.com/en-us/research/publication/
agentic-reasoning-and-tool-integration-for-llms-via-reinforcement-learning/.
[25] C. Snell, J. Lee, K. Xu, A. Kumar, Scaling LLM Test-Time Compute Optimally can be More Effective
than Scaling Model Parameters, 2024. URL: http://arxiv.org/abs/2408.03314. doi:10.48550/arXiv.
2408.03314, arXiv:2408.03314 [cs].
[26] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, D. Zhou, Self-Consistency
Improves Chain of Thought Reasoning in Language Models, 2023. URL: http://arxiv.org/abs/2203.
11171. doi:10.48550/arXiv.2203.11171, arXiv:2203.11171 [cs].
[27] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, K. Narasimhan, Tree of Thoughts: Deliberate
Problem Solving with Large Language Models, 2023. URL: http://arxiv.org/abs/2305.10601. doi:10.
48550/arXiv.2305.10601, arXiv:2305.10601 [cs].
[28] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, D. Guo,
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, 2024.</p>
      <p>URL: http://arxiv.org/abs/2402.03300. doi:10.48550/arXiv.2402.03300, arXiv:2402.03300 [cs].
[29] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever,
K. Cobbe, Let’s Verify Step by Step, 2023. URL: http://arxiv.org/abs/2305.20050. doi:10.48550/
arXiv.2305.20050, arXiv:2305.20050 [cs].
[30] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani,
S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, A. Castro-Ros, M. Pellat,
K. Robinson, D. Valter, S. Narang, G. Mishra, A. Yu, V. Zhao, Y. Huang, A. Dai, H. Yu, S. Petrov,
E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, J. Wei, Scaling Instruction-Finetuned
Language Models, 2022. URL: http://arxiv.org/abs/2210.11416. doi:10.48550/arXiv.2210.11416,
arXiv:2210.11416 [cs].
[31] P. Ma, X. Zhuang, C. Xu, X. Jiang, R. Chen, J. Guo, SQL-R1: Training Natural Language to
SQL Reasoning Model By Reinforcement Learning, 2025. URL: http://arxiv.org/abs/2504.08600.
doi:10.48550/arXiv.2504.08600, arXiv:2504.08600 [cs].
[32] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, QLoRA: Efficient Finetuning of
Quantized LLMs, 2023. URL: http://arxiv.org/abs/2305.14314. doi:10.48550/arXiv.2305.14314,
arXiv:2305.14314 [cs].</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
<mixed-citation>[1] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. Van Kleef, S. Auer, C. Bizer, DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia, Semantic Web 6 (2015) 167-195. URL: https://journals.sagepub.com/doi/full/10.3233/SW-140134. doi:10.3233/SW-140134.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>