<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Text-to-SPARQL Goes Beyond English: Multilingual Question Answering Over Knowledge Graphs through Human-Inspired Reasoning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aleksandr Perevalov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Both</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DICE Research Group, University of Paderborn</institution>
          ,
          <addr-line>Warburger Str. 100, 33098, Paderborn</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>WSE Research Group, Leipzig University of Applied Sciences</institution>
          ,
          <addr-line>Karl-Liebknecht-Straße 132, 04277, Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Accessing knowledge via multilingual natural-language interfaces is one of the emerging challenges in information retrieval and related fields. Structured knowledge stored in knowledge graphs can be queried via a dedicated query language (e.g., SPARQL). Therefore, one needs to transform natural-language input into a query to fulfill an information need. Prior approaches mostly focused on combining components (e.g., rule-based or neural-based) that solve downstream tasks and come up with an answer at the end. We introduce mKGQAgent, a human-inspired framework that breaks down the task of converting natural language questions into SPARQL queries into modular, interpretable subtasks. By leveraging a coordinated LLM agent workflow for planning, entity linking, and query refinement (guided by an experience pool for in-context learning), mKGQAgent efficiently handles multilingual KGQA. Evaluated on the DBpedia- and Corporate-based KGQA benchmarks within the Text2SPARQL challenge 2025, our approach took first place among the participants. This work opens new avenues for developing human-like reasoning systems in multilingual semantic parsing.</p>
      </abstract>
      <kwd-group>
        <kwd>LLM Agents</kwd>
        <kwd>Text2SPARQL</kwd>
        <kwd>Knowledge Graph Question Answering</kwd>
        <kwd>Semantic Parsing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Previous approaches to multilingual knowledge graph question answering (KGQA), like Diefenbach et al.
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Turganbay et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], have employed both rule-based and neural methods to address downstream
tasks (e.g., named entity recognition, relation detection, query template classification) necessary for
constructing structured queries (e.g., SPARQL queries). More recent methods (e.g., Srivastava et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ])
leverage Large Language Models (LLMs) to generate such structured queries directly from non-English
input. The application of newly introduced LLM agents (or augmented language models) to KGQA
has demonstrated significantly improved performance compared to LLMs that rely solely on standard
prompting techniques (e.g., Jiang et al. [4], Huang et al. [5]). However, the multilingual aspect of these
systems remains largely unexplored within the research community. To the best of our knowledge,
there are no studies investigating LLM agent architectures for KGQA in multilingual settings.
      </p>
      <p>One of the key advantages of LLMs is that they enable developers and researchers to model
human-like reasoning processes via agentic workflows (cf. Li et al. [6]). When solving complex problems,
humans typically break them down into a series of simpler subtasks (cf. Diefenbach et al. [7], Correa
et al. [8]), effectively creating a step-by-step plan to arrive at a solution. When generating a SPARQL
query, this decomposition is essential: one needs not only to break down the task, but also to look
up query language syntax, identify relevant entity identifiers in the target knowledge graph (KG), and
analyze feedback (e.g., from executing the SPARQL query candidate on the triplestore). To replicate this
human-like process, we introduce mKGQAgent, an LLM-based agent framework designed as a KGQA
system that follows a semantic-parsing approach. Specifically, given a user question (multiple languages
are supported), it generates a SPARQL query to fulfill the information need. Accordingly, this paper
aims to answer the following research questions:</p>
      <p>RQ1 How do different LLM agent steps (e.g., plan, action, tool calling, feedback) impact the
generation of SPARQL queries from natural language?</p>
      <p>RQ2 How efficient are these LLM agent steps in terms of computation time and the number of additional
calls required?</p>
      <p>RQ3 How does the quality of SPARQL query generation vary when prompting LLM agents in
non-English languages (especially low-resource ones)?</p>
      <p>RQ4 How does translating non-English questions into English affect the quality of KGQA?</p>
      <p>We conducted preliminary experiments on the widely used KGQA benchmark QALD-9-plus
(introduced in Perevalov et al. [9]) with multilingual support. We evaluate 10 languages, including
two classified as endangered. The experimental results on both proprietary and open-source LLMs
demonstrate the effectiveness of mKGQAgent’s architecture, achieving superior performance even
in non-English settings. During the final evaluation on the DBpedia- and Corporate-based KGQA
benchmarks within the Text2SPARQL challenge 2025, our approach took first place among the
participants. The source code and the evaluation results are available in our GitHub repository<sup>1</sup>.</p>
      <p>The paper is organized as follows. In the next section, an overview of the related work is presented.
The mKGQAgent architecture is described in Section 3. Section 4 is dedicated to the experimental setup.
The results are shown in Section 5 and discussed in Section 6. Section 7 concludes our paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Recent KGQA research has included classical, rule-based, and neural approaches [10, 11]. Diefenbach
et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] (QAnswer) and Punjani et al. [12] used query templates and rule indexes without language
models. Pellissier Tanon et al. [13] applied grammar rules for SPARQL query transformation. DeepPavlov
2023 employs a fine-tuned language model pipeline for query generation, cf. Turganbay et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Omar
et al. [14] proposed KGQAN, which integrates answer type prediction and triple pattern generation.
      </p>
      <p>Multilingual KGQA solutions include Zhou et al. [15], who fine-tune multilingual transformers
and leverage bilingual lexicon induction. Zhang et al. [16] address cross-lingual semantic parsing over
multiple meaning representations, including SPARQL, in XSemPLR. Tan et al. [17] improve cross-lingual
reasoning with their CLRN approach, enhancing the Entity Alignment model performance in English,
Chinese, and French.</p>
      <p>
        Zong et al. [18] employ the multi-role LLM agent architecture Triad for SPARQL query construction.
MST5 (Srivastava et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) fine-tunes mT5-XL for generating structured queries. Lehmann et al. [19]
enhance LLMs with external tools to mimic human-like reasoning. Jiang et al. [4] integrate a KG-based
executor (KG-Agent) and fine-tune Llama2-7B for improved tool usage. QueryAgent (Huang et al.
[5]) mitigates hallucinations with ERASER-based self-correction, excelling on GrailQA and GraphQ.
Interactive-KBQA (Xiong et al. [20]) iteratively refines LLM outputs via direct KB interactions.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. The mKGQAgent Architecture</title>
      <p>The mKGQAgent workflow consists of several key steps (see Figure 1 for an overview). Our approach
follows the terminology established in recent survey articles on LLM agents, cf. Mialon et al. [21], Wang
et al. [22]. The framework operates in two main phases: the offline phase and the evaluation (online)
phase. The offline phase is essential for preparing the experience pool (see Section 3.1.4). During the
offline phase, we employ a simple agent to gather intermediate processing steps for the
experience pool (see Figure 2). The simple agent uses the plan step (cf. Section 3.1.2) to generate a
structured step-by-step plan and the action step, which calls either the LLM or the named entity linking
(NEL) tool (cf. Section 3.1.1), ultimately leading to the SPARQL query generation. In the evaluation
(online) phase, the mKGQAgent uses the plan step and the action step with the experience pool and the
NEL tool, as well as the feedback step, which has access to the triplestore.</p>
      <p><sup>1</sup> https://github.com/WSE-research/text2sparql-agent</p>
      <p>The important feature of our framework is that it does not require supervised fine-tuning, which
significantly reduces the computation costs and preserves the generalizability of the original LLMs (cf.
catastrophic forgetting); see Luo et al. [23].</p>
      <sec id="sec-3-1">
        <title>3.1. Offline Phase of the mKGQAgent</title>
        <sec id="sec-3-1-1">
          <title>3.1.1. Named Entity Linking (NEL) Tool</title>
          <p>Just as humans look up a resource identifier in a KG, the NEL step interacts with the environment
(i.e., the KG) and retrieves resource labels from there. If an LLM is not given the
URI-label mappings of a particular resource, SPARQL query generation is not possible.
Importantly, while introducing the NEL tool, we do not propose a novel NEL algorithm. Instead, we
demonstrate how to utilize an existing NEL service in the LLM agent workflow (see Algorithm 1).</p>
          <p>Algorithm 1 NEL Tool</p>
          <p>The entity and relation candidates are proposed by the backbone LLM within the tool calling process at
the action step (see Sections 3.1.3 and 3.2.2). Entity and relation linking is crucial for the text-to-SPARQL
process since the URIs representing resources in a KG may consist of random identifiers<sup>2</sup>.</p>
          <p>3.1.2. Plan step</p>
          <p>The plan step leverages the backbone LLM to generate a step-by-step list of tasks to come up with a
SPARQL query given a question. The intuition behind the plan step is that it simplifies the task for
the model such that it does not need to handle the whole complexity at once. Example subtasks are
entity recognition and linking as well as query refinement. Thus, following human-like behavior (cf.
Huys et al. [25], Correa et al. [8]), the plan step breaks down the complex task of writing a
SPARQL query into a combination of simpler subtasks. Hence, at any point in time the action step deals
with a single simple subtask while having the results of the previous steps in its conversation history.
For details regarding the plan step, see Algorithm 2.</p>
          <p>Algorithm 2 Plan step without experience pool
Require: Natural language question q, system prompt p<sub>plan</sub>, model ℒℒℳ
Ensure: Step-by-step plan P ◁ List of textual tasks
1: P ← ℒℒℳ(p<sub>plan</sub>, q) ◁ Query LLM with system prompt and question
2: return P</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.3. Action Step without Experience Pool</title>
          <p>Once the plan is generated, the action step executes each of the plan tasks sequentially, leveraging the
NEL Tool for the entity linking (see Algorithm 3). This approach ensures that the agent follows the
structured plan, interacting with the necessary tools to refine and complete the SPARQL query generation
process.</p>
          <p><sup>2</sup> E.g., in Wikidata [24], Q567 (https://www.wikidata.org/wiki/Q567) for “Angela Merkel”.</p>
          <p>Algorithm 3 Action Step without Experience Pool
◁ LLM may call tool or just itself</p>
          <p>Algorithm 4 Add Example to the Experience Pool
Require: Training set example x ∈ D<sub>train</sub>, step-by-step plan P, chat history ℋ, experience pool ℰ,
text embedding model ℰℳℬ
Ensure: Updated experience pool ℰ′
1: q, s ← x ◁ Unpack training example (question and ground truth SPARQL)
2: ŝ ← lastElementOf(ℋ) ◁ Get the SPARQL generated by the simple agent
3: F1 ← F1score(s, ŝ) ◁ Compute F1 score
4: v ← ℰℳℬ(q) ◁ Convert question to a vector
5: ℰ′ ← ℰ + {q, v, s, ŝ, P, ℋ, F1}
6: return ℰ′</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>3.1.4. Experience Pool Construction</title>
          <p>During the offline phase, we utilize the simple agent to collect the experience pool. This involves evaluating
the correctness of the generated SPARQL queries (based on the ground truth data) and storing them
together with the intermediate steps (i.e., plan, chat history) in a vector database (see Algorithm 4).
Therefore, each natural language question from the train subset is converted into a vector representation
that is associated with metadata, including the corresponding plan, intermediate steps of the action
step, and the final results. The experience pool is a non-parametric memory of our agent that contains
both successful (F1 score = 1.0) and unsuccessful (F1 score &lt; 1.0) SPARQL query generation attempts
based on a ground truth.</p>
          <p>Therefore, the experience pool holds the information about the quality of the generated SPARQL
queries (F1), the step-by-step plan that was used to generate this particular query, and other
metadata (e.g., ground truth SPARQL query).</p>
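          <p>The construction and lookup of the experience pool can be sketched as follows. This is a minimal in-memory illustration, not the implementation used in this work: the class and function names are hypothetical, <italic>toy_embed</italic> is a letter-count stand-in for a real text embedding model, a plain list replaces the vector database, and the F1 score is passed in rather than computed from the query results.</p>
          <preformat>
```python
import math
from dataclasses import dataclass, field


@dataclass
class Experience:
    question: str
    vector: list            # embedding of the question
    gold_sparql: str        # ground truth SPARQL query
    generated_sparql: str   # query produced by the agent
    plan: list              # step-by-step plan (list of textual tasks)
    history: list           # chat history of the action step
    f1: float               # quality of the generated query


def toy_embed(text: str) -> list:
    """Hypothetical stand-in for a text embedding model: ASCII letter counts."""
    counts = [0.0] * 26
    for ch in text.lower():
        if ch.isascii() and ch.isalpha():
            counts[ord(ch) - ord("a")] += 1.0
    return counts


def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class ExperiencePool:
    entries: list = field(default_factory=list)

    def add(self, question, gold_sparql, plan, history, f1):
        # The last chat history element is taken as the generated query.
        generated = history[-1]
        self.entries.append(
            Experience(question, toy_embed(question), gold_sparql,
                       generated, plan, history, f1)
        )

    def top_n(self, question, n=3, only_successful=True):
        # Rank stored examples by similarity to the new question;
        # by default only attempts with F1 = 1.0 are considered.
        v = toy_embed(question)
        candidates = [e for e in self.entries
                      if e.f1 == 1.0 or not only_successful]
        ranked = sorted(candidates, key=lambda e: cosine(v, e.vector), reverse=True)
        return ranked[:n]
```
          </preformat>
          <p>In this sketch, calling <italic>top_n</italic> with the default arguments returns only successful attempts, which mirrors the retrieval of plans with F1 = 1.0 in the evaluation phase; setting <italic>only_successful=False</italic> also exposes the unsuccessful attempts stored in the pool.</p>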
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Evaluation Phase of the mKGQAgent</title>
        <sec id="sec-3-2-1">
          <title>3.2.1. Plan step with the Experience Pool</title>
          <p>In the evaluation phase, the plan step leverages the experience pool to find relevant plan examples for
better planning quality. The plan examples are included in the system prompt p<sub>plan</sub> (see Algorithm 5).</p>
          <p>Hence, the plan step benefits from the prior successful planning examples while using them in the
system prompt for in-context learning.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Action step with the Experience Pool</title>
          <p>Once the plan is generated, the action step executes each of the plan tasks sequentially, leveraging the
NEL Tool for the entity linking (see Algorithm 6). The usage of the experience pool ensures that the
LLM benefits from the in-context SPARQL query examples from the training subset. It is important
to note that the plan can be populated with the result of the feedback step (in case the feedback is
triggered).</p>
          <p>Algorithm 5 Plan step with Experience Pool
Require: Natural language question q, system prompt p<sub>plan</sub>, model ℒℒℳ, experience pool ℰ, text
embedding model ℰℳℬ
Ensure: Step-by-step plan P ◁ List of textual tasks
1: v ← ℰℳℬ(q)
2: X ← topNSimilar(ℰ, v) ◁ Finds top-N similar plans with F1 = 1.0
3: p<sub>plan</sub><sup>experience</sup> ← p<sub>plan</sub> + X ◁ The plans are included to the prompt
4: P ← ℒℒℳ(p<sub>plan</sub><sup>experience</sup>, q) ◁ Query LLM with system prompt and question
5: return P</p>
          <p>Algorithm 6 Action Step with the Experience Pool
Require: Step-by-step plan P, model ℒℒℳ, tool T, system prompt p<sub>action</sub> (see appendix),
experience pool ℰ, text embedding model ℰℳℬ
Ensure: Generated SPARQL query ŝ
1: Bind T to ℒℒℳ
2: Initialize empty chat history ℋ
3: v ← ℰℳℬ(q)
4: X ← topNSimilar(ℰ, v) ◁ Finds top-N similar SPARQL queries
5: p<sub>action</sub> ← p<sub>action</sub> + X ◁ The queries are included to the prompt
6: for each step t ∈ P do
7: ℎ ← ℒℒℳ(p<sub>action</sub>, t) ◁ LLM may call tool or just itself
8: Append ℎ to ℋ
9: end for
10: ŝ ← lastElementOf(ℋ)
11: return ŝ</p>
          <p>Algorithm 7 Feedback Step
Require: Intermediate query ŝ, prompt template p<sub>feedback</sub> (see appendix), triplestore G
Ensure: Feedback prompt p′<sub>feedback</sub>
1: r ← G(ŝ) ◁ Query the triplestore and get the response
2: p′<sub>feedback</sub> ← p<sub>feedback</sub> + r ◁ Populate prompt with the response
3: return p′<sub>feedback</sub></p>
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.3. Feedback Step</title>
          <p>The feedback step executes the generated SPARQL query on a triplestore, collects the response, and
integrates it into a pre-defined prompt template for the action step. Once the first version of a SPARQL
query is generated (i.e., the result of the last planning step executed at the action step), it is redirected
to the feedback step. The feedback is formulated only once per input question, i.e., the feedback step is
not repeated, which avoids infinite loops. The detailed feedback step workflow is defined in
Algorithm 7. After that, the feedback prompt p′<sub>feedback</sub> is redirected to the action step. The action step
processes the feedback to refine the SPARQL query and delivers the final query as the result.</p>
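          <p>A minimal sketch of the feedback step is given below, under stated assumptions: the template is abbreviated from the feedback prompt used in this work, <italic>run_query</italic> is a hypothetical callable that executes a query on a triplestore, and turning execution errors into feedback text is an assumption of this sketch rather than documented behavior.</p>
          <preformat>
```python
# Abbreviated feedback prompt template; placeholder names follow the paper's prompt.
FEEDBACK_TEMPLATE = (
    "This is feedback to your generated SPARQL query produced by "
    "executing it on a triplestore.\n"
    "Please rework your query if necessary.\n"
    "Initial question: {USER_QUESTION}\n"
    "Your query: {GENERATED_SPARQL}\n"
    "--- Start triplestore response ---\n"
    "{FEEDBACK}\n"
    "--- End triplestore response ---\n"
)


def feedback_step(question, query, run_query, template=FEEDBACK_TEMPLATE):
    """Execute the intermediate query once and populate the feedback prompt.

    run_query is a hypothetical callable (query string in, response out).
    """
    try:
        response = str(run_query(query))
    except Exception as error:
        # Assumption of this sketch: an execution error is reported as feedback.
        response = "ERROR: " + str(error)
    return template.format(
        USER_QUESTION=question,
        GENERATED_SPARQL=query,
        FEEDBACK=response,
    )
```
          </preformat>
          <p>Because the function is called exactly once per question and returns a prompt rather than looping, the single-pass behavior described above is preserved by construction.</p>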
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>We conduct our experiments on the commonly used KGQA benchmark: QALD-9-plus (Perevalov et al.
[9]). QALD-9-plus contains 558 questions in multiple languages and queries over DBpedia [26] and
Wikidata [27]. We consider all available languages from QALD-9-plus; in addition, we include the
Spanish questions, which were contributed to this dataset later (Soruco et al. [28]). The structure of
QALD-9-plus includes question texts and the corresponding ground truth SPARQL queries that return
the expected answer to a question. For the evaluation of KGQA quality, we use the Macro F1 score [29].</p>
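      <p>For illustration, a macro-averaged F1 score over answer sets can be computed as sketched below. The function names are hypothetical, and treating two empty answer sets as a perfect match is an assumption of this sketch; the exact conventions of the metric are defined in [29].</p>
      <preformat>
```python
def f1_score(gold: set, predicted: set) -> float:
    """F1 between the expected and the returned answer set of one question."""
    if not gold and not predicted:
        return 1.0  # assumption: two empty answer sets count as a match
    true_positives = len(gold.intersection(predicted))
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def macro_f1(pairs) -> float:
    """Average the per-question F1 scores over (gold, predicted) pairs."""
    scores = [f1_score(gold, predicted) for gold, predicted in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```
      </preformat>
      <p>The per-question score is computed first and averaged afterwards, so every question contributes equally regardless of the size of its answer set.</p>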
      <sec id="sec-4-1">
        <title>4.1. Large Language Models and Text Embedding Models</title>
        <p>In this work, we use both open-source and proprietary LLMs. The proprietary ones are provided by
OpenAI<sup>3</sup>, namely, GPT-3.5 (gpt-3.5-turbo-0125) and GPT-4o (gpt-4o-2024-05-13). The models
are accessed via the official Python SDK<sup>4</sup> with temperature=0; the other hyperparameters are set to
default.</p>
        <p>The open-source LLMs are Qwen2.5 72B Instruct<sup>5</sup> and Meta Llama 3.1 70B Instruct<sup>6</sup>. Both models
were used with the AWQ (Lin et al. [30]) quantization (4-bit) to fit into the memory. The models were
deployed via the vLLM framework (Kwon et al. [31]). The maximal context size of the models was set
to 16384 tokens to avoid out-of-memory exceptions. The other hyperparameters were set to default.
For the open-source LLMs, we used a server with two Nvidia L40S GPUs (each 48GB VRAM).</p>
        <p>For creating text embeddings for the experience pool, we used a model trained specifically for producing
high-quality text embeddings for multilingual input: multilingual e5 large<sup>7</sup> (introduced by
Wang et al. [32]). According to the MTEB leaderboard<sup>8</sup> introduced by Muennighoff et al. [33], the model
is listed among the top-3 in different languages (we considered embedding models with a size smaller
than 1 billion parameters).</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Implementation of mKGQAgent</title>
        <p>The mKGQAgent architecture is implemented within the LangChain framework<sup>9</sup> in Python. This
framework facilitates the integration of various components required for the agent’s functionality.</p>
        <p>The entity linking within the NEL tool is implemented via Wikidata’s official public entity lookup
endpoint<sup>10</sup>. This endpoint is capable of handling input in multiple languages. The NEL tool also uses
an external relation linker, Falcon 2.0 (Sakor et al. [34]), for enhanced linking capabilities.</p>
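        <p>As an illustration of how such an entity lookup can be wired into a tool, the sketch below builds a request URL for the <italic>wbsearchentities</italic> action of the Wikidata API and extracts identifier and label pairs from a JSON response. The helper names and default values are hypothetical, no network call is made, and the actual NEL tool of the mKGQAgent may differ.</p>
        <preformat>
```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_lookup_url(label: str, language: str = "en",
                     kind: str = "item", limit: int = 5) -> str:
    """Build a wbsearchentities request URL for a surface form."""
    params = {
        "action": "wbsearchentities",
        "search": label,
        "language": language,  # language of the search string
        "uselang": language,   # language of the returned labels
        "type": kind,          # "item" for entities, "property" for relations
        "limit": limit,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


def extract_candidates(response: dict) -> list:
    """Return (identifier, label) pairs from a decoded JSON response."""
    return [(hit["id"], hit.get("label", "")) for hit in response.get("search", [])]
```
        </preformat>
        <p>Passing the question language as the <italic>language</italic> parameter is what allows the same helper to serve non-English input; the decoded response can then be handed back to the LLM as the tool result.</p>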
        <p>The routing between the plan, action, and feedback steps is implemented within the LangGraph
framework<sup>11</sup>, which is part of the LangChain ecosystem.</p>
        <p>The prompts used within the mKGQAgent are written in different languages such that they match the
input question language. The prompts in English, German, and Russian were written by native speakers;
the other prompts were obtained via machine translation followed by structure validation. We list the
prompts in Figure 3.</p>
        <p>The SPARQL queries generated by the mKGQAgent are executed on the official Wikidata SPARQL
endpoint<sup>12</sup>.</p>
        <p><sup>3</sup> https://platform.openai.com/docs/models</p>
        <p><sup>4</sup> https://github.com/openai/openai-python</p>
        <p><sup>5</sup> https://huggingface.co/Qwen/Qwen2-72B-Instruct-AWQ</p>
        <p><sup>6</sup> https://huggingface.co/hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4</p>
        <p><sup>7</sup> https://huggingface.co/intfloat/multilingual-e5-large</p>
        <p><sup>8</sup> https://huggingface.co/spaces/mteb/leaderboard</p>
        <p><sup>9</sup> https://python.langchain.com</p>
        <p><sup>10</sup> https://www.wikidata.org/w/api.php?action=wbsearchentities</p>
        <p><sup>11</sup> https://langchain-ai.github.io/langgraph/</p>
        <p><sup>12</sup> https://query.wikidata.org/bigdata/namespace/wdq/sparql</p>
        <p>You are an intelligent Knowledge Graph-based Question Answering system.
You can use the tools to help yourself only if you \
DON’t have this information in chat history:
- ’wikidata_el’ for named entity linking \
(e.g. "Person name" -&gt; "URI" or "is child of" -&gt; "URI")
to determine URIs in the Wikidata KG
{QUESTION_QUERY_EXAMPLE} # comes from the experience pool</p>
        <p>(a) The system prompt (p<sub>action</sub><sup>experience</sup>) for the action step with the usage of the experience pool ℰ</p>
        <p>For the given objective, come up with a simple step by step plan to write a SPARQL query.
This plan should involve individual tasks (e.g., named entity linking, relation linking,
expected answer type classification), that if executed correctly \
will yield the correct SPARQL.
Do not add any superfluous steps.
The result of the final step should be the final SPARQL query.
Don’t propose to execute the query.
At the end step you MUST output exactly **ONE** SPARQL query string \
**without extra text or markdown**.
{USER_QUESTION}
{PLAN_EXPERIENCE_EXAMPLE} # comes from the experience pool</p>
        <p>(b) The planning prompt (p<sub>plan</sub><sup>experience</sup>) for the plan step with the usage of the experience pool ℰ</p>
        <p>This is feedback to your generated SPARQL query produced by executing it on a triplestore.
Please rework your query if neccessary.
Initial question: {USER_QUESTION}
Your query: {GENERATED_SPARQL} # intermediate SPARQL query
--- Start triplestore response ---
{FEEDBACK} # comes from the query execution on a triplestore
--- End triplestore response ---
Make sure that the query is formatted correctly.
No extra text. No markdown. Just plain SPARQL.
Determine whether to output a URI (SELECT ?uri), number (COUNT), date, \
boolean (ASK), string (SELECT ?label).
DON’T USE "SERVICE wikibase:label"</p>
        <p>(c) The feedback prompt (p<sub>feedback</sub>) for the feedback step</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Baselines</title>
        <p>To compare the performance of the mKGQAgent, we select both “pre-LLM era” KGQA systems and
the ones that use different prompting techniques with LLMs. In addition, the baselines were selected such
that they can generate SPARQL queries over Wikidata. In particular, the following approaches are
selected for comparison with ours: QAnswer, Platypus, DeepPavlov 2023, KGQAN, Triad, MST5, and
HQA (cf. Section 2).</p>
        <p>The selection of the baselines was also influenced by the results reported in the KGQA leaderboard by
Perevalov et al. [10]. We reuse the reported results in our paper for comparison with our mKGQAgent
approach.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Machine Translation of the Input</title>
        <p>Following our research agenda [35, 36], we evaluate how well machine translation to English with the
OPUS MT models (cf. Tiedemann and Thottingal [37]) serves as an alternative to processing
non-English questions natively.</p>
        <p>[Figure: bar chart; vertical axis “F1 score, %”; system label “DeepPavlov2023”]</p>
        <p>Our machine translation experiments are complementary to the main contribution and, therefore,
are limited to the German, Russian, and Spanish languages. We selected these languages as they
represent different language branches: Germanic, Slavic, and Romance, respectively.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation and Analysis</title>
      <sec id="sec-5-1">
        <title>5.1. English-only Comparison with the Baselines</title>
        <p>The results presented in Figure 4 illustrate a comparative analysis between our mKGQAgent approach
(highlighted in teal) and various baseline methods (depicted in grey) on the English questions from the
QALD-9-plus benchmark. Our mKGQAgent (GPT-4o) achieves the highest F1 score of 54.83%, surpassing
all baselines, including HQA (GPT-4), which attains 50.00%. This demonstrates the effectiveness of our
approach in leveraging structured planning and retrieval mechanisms to enhance semantic parsing
performance.</p>
        <p>Among the baselines, QAnswer (44.59%) and KGQAN (44.07%) show competitive results but still
fall short of our top-performing model. Interestingly, HQA (GPT-3.5) achieves 43.00%, indicating that
the transition to GPT-4 has significantly improved query generation capabilities. The performance
of mKGQAgent (Qwen 2.5 72B) (41.87%) and Triad (GPT-4) (41.77%) suggests that large models, even
with structured workflows, benefit from additional fine-tuning and experience pooling. Notably, our
mKGQAgent (GPT-3.5) variant scores just 37.16%, still outperforming several baselines but trailing
behind its GPT-4o counterpart.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Multilingual Comparison with the Baselines</title>
        <p>In Table 1, we present the evaluation results of our approach in comparison to the selected baselines
that support multiple languages (see Section 4.3).</p>
        <p>The experimental results demonstrate mKGQAgent’s robust performance across multiple languages
on the QALD-9-plus benchmark. When implemented with GPT-4o, the system achieves state-of-the-art
results with the F1 scores of 54.83%, 43.08%, 38.28%, 31.56%, and 40.48%, respectively, for English, German,
Spanish, Belarusian, and Bashkir. The languages using Cyrillic-based scripts (Russian, Belarusian,
Ukrainian, and Bashkir) generally yield poorer results in comparison to the languages using Latin-based
scripts.</p>
        <p>While comparing mKGQAgent to the other baselines, we see that QAnswer outperforms mKGQAgent
(GPT-3.5) on French; however, the difference is not substantial (23.00% vs 22.87%). The MST5 system
significantly outperforms mKGQAgent (GPT-4o) on Russian (37.61% vs 31.67%), Ukrainian (34.67% vs
28.54%), and Lithuanian (25.54% vs 31.15%).</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Machine Translation for non-English Questions</title>
        <p>The evaluation compares the performance of the models on native-language questions and those
translated into English using machine translation (see Table 2). Across all models and languages, the
performance of mKGQAgent is generally higher for translated questions than for native-language
questions. This suggests that translating non-English questions into English before processing yields
better results.</p>
        <p>The mKGQAgent based on GPT-4o achieves the highest performance across all settings, demonstrating
the superior comprehension and reasoning capabilities of this model. Qwen 2.5 72B exhibits strong
performance in translation-based settings but falls behind GPT-4o. The variance in performance between
languages suggests that translation quality and linguistic characteristics play a role in how effectively
mKGQAgent can process and answer questions. In general, this comparison demonstrates that the
translation of non-English questions into English consistently improves the quality and underlines the
unequal quality distribution among the languages.</p>
        <p>Table 3 presents a comparative evaluation of the performance of MT against questions in their native
language within the KGQA task. The results demonstrate that, for most models and languages, MT
yields improvements over native-language question answering. This effect is particularly pronounced
in Russian and Spanish, where MT provides significant gains. GPT-4o, despite its strong overall
performance, exhibits slight performance degradation when using MT for German (-17.22%), suggesting
that this model may already be well optimized for handling German-language queries natively. Overall,
these findings highlight the advantages of translations in multilingual KGQA, even when objectively
strong LLMs are used.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. LLM Calls and Costs</title>
        <p>An important aspect of using LLM agent frameworks is the number of model calls within one task
solution, i.e., in our case, we report the number of calls per generated SPARQL query for an input
question. In addition, we report the estimated number of tokens per question and the underlying costs
of the LLMs’ usage.</p>
        <sec id="sec-5-4-1">
          <title>5.4.1. Costs calculation for the OpenAI models</title>
          <p>According to our calculations, mKGQAgent requires 13.03 LLM calls on average to generate a SPARQL
query for an input question (cf. Table 4). Every LLM call consumes 144 input and 199
output tokens on average. This includes the chat history, which gradually grows during agent execution. The
pricing strategy of OpenAI is based on token consumption. Therefore, we calculate the token-based
price (TBP) as in Equation 1.</p>
          <p><disp-formula id="eq-1"><label>(1)</label><tex-math>\mathrm{TBP} = \left[ (n_{\mathrm{in}} \times p_{\mathrm{in}}) + (n_{\mathrm{out}} \times p_{\mathrm{out}}) \right] \times c \times N</tex-math></disp-formula></p>
          <p>Where: <italic>n</italic><sub>in</sub> represents the number of input tokens, <italic>p</italic><sub>in</sub> represents the price per input token, <italic>n</italic><sub>out</sub> represents
the number of output tokens, <italic>p</italic><sub>out</sub> represents the price per output token, <italic>c</italic> represents the number of LLM
calls per question, and <italic>N</italic> represents the number of questions. This results in USD 0.48 per 100 questions
for GPT-3.5 and USD 3.06 per 100 questions for GPT-4o, respectively (prices as of March 01,
2025). For the costs of the different agent configurations, see Table 4.</p>
          <p><sup>13</sup> https://www.runpod.io/pricing</p>
          <p><sup>14</sup> Qwen: https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html,
Llama: https://artificialanalysis.ai/models/llama-3-1-instruct-70b</p>
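          <p>The token-based price can be checked with a short calculation. In the sketch below, the token counts and the number of calls are taken from the text, whereas the per-token prices are assumptions (USD 0.50 and 1.50 per million input and output tokens for GPT-3.5, USD 2.50 and 10.00 for GPT-4o); they are not stated in this paper, although they are consistent with the reported totals.</p>
          <preformat>
```python
def token_based_price(n_in, p_in, n_out, p_out, calls, questions):
    """Cost of a run: per-call token cost times calls per question times questions."""
    return (n_in * p_in + n_out * p_out) * calls * questions


# Assumed prices in USD per token as (input, output); not given in the text.
ASSUMED_PRICES = {
    "GPT-3.5": (0.50e-6, 1.50e-6),
    "GPT-4o": (2.50e-6, 10.00e-6),
}

# Averages reported in the text: 144 input and 199 output tokens per call,
# 13.03 LLM calls per question, evaluated for 100 questions.
costs = {
    model: token_based_price(144, p_in, 199, p_out, 13.03, 100)
    for model, (p_in, p_out) in ASSUMED_PRICES.items()
}
```
          </preformat>
          <p>Rounded to two decimals, the resulting values in <italic>costs</italic> are 0.48 for GPT-3.5 and 3.06 for GPT-4o, matching the figures above under the assumed prices.</p>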
        </sec>
        <sec id="sec-5-4-2">
          <title>5.4.2. Costs calculation for the open source models</title>
          <p>For the open-source LLMs, we use the same values regarding the average number of LLM calls to
generate a SPARQL query (13.03) and the average number of tokens per call–144 input tokens and 199
output tokens. The pricing of open source models relies on the GPU hours of cloud providers and the
model eficiency measured in tokens per second (tok/sec). Therefore, we calculate the GPU hours-based
price (GBP) as in Equation 2.</p>
          <p>$$\mathrm{GBP} = \frac{T \times N_{calls} \times N_{q}}{\text{tok/sec}} \times \text{GPU/sec} \qquad (2)$$
Where: $T$ represents the number of tokens per LLM call, $N_{calls}$ the number of LLM calls per question, $N_{q}$ the number of questions,
tok/sec the model efficiency rate (tokens per second), and GPU/sec the price per GPU-second. We
estimated the market prices of our GPU experimental setup (2x Nvidia L40S GPUs) according to one of
the well-known cloud providers13. The model performance (tok/sec) was retrieved from the official
documentation of the respective models14, taking into account the usage of the vLLM framework for
deployment and the size of the context window (16384 tokens). Hence, for processing 100 questions,
mKGQAgent requires 1.96 GPU hours when using the Qwen model and 0.97 GPU hours when using
the Llama model. Therefore, the prices per 100 questions are USD 4.05 and USD 2.01, respectively. For the
costs of the different configurations, see Table 4.</p>
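          <p>The GPU hours-based calculation can be sketched in the same way. The following Python example is only an illustration: the function names are hypothetical, the throughput (50 tok/sec) and the price (USD 2.00 per hour) are placeholder values rather than the figures used in this paper, and counting only the 199 output tokens per call is an assumption of this sketch.</p>
          <preformat>
```python
SECONDS_PER_HOUR = 3600


def gpu_hours(tokens_per_call, n_calls, n_questions, tok_per_sec):
    """Processing time in hours: total tokens divided by the throughput."""
    total_tokens = tokens_per_call * n_calls * n_questions
    return total_tokens / tok_per_sec / SECONDS_PER_HOUR


def gpu_based_price(tokens_per_call, n_calls, n_questions, tok_per_sec, price_per_gpu_sec):
    """GPU hours-based price: processing time in seconds times the price per second."""
    total_tokens = tokens_per_call * n_calls * n_questions
    return total_tokens / tok_per_sec * price_per_gpu_sec


# Averages reported in the paper.
N_CALLS, N_QUESTIONS = 13.03, 100
# ASSUMPTION: only the 199 output tokens per call are counted here.
TOKENS_PER_CALL = 199
# HYPOTHETICAL placeholders, not the values used in the paper.
TOK_PER_SEC = 50.0
PRICE_PER_GPU_SEC = 2.00 / SECONDS_PER_HOUR  # i.e. USD 2.00 per hour

hours = gpu_hours(TOKENS_PER_CALL, N_CALLS, N_QUESTIONS, TOK_PER_SEC)
cost = gpu_based_price(TOKENS_PER_CALL, N_CALLS, N_QUESTIONS, TOK_PER_SEC,
                       PRICE_PER_GPU_SEC)

print(f"{hours:.2f} GPU hours, USD {cost:.2f} per {N_QUESTIONS} questions")
```
          </preformat>
          <p>Separating the processing time from the price makes it easy to substitute the throughput of a different model or the hourly rate of a different cloud provider.</p>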
        </sec>
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Impact of Individual Components on the Quality (Ablation)</title>
        <p>To understand the contribution of each architectural component to the overall system performance, we
conducted an ablation study using the English questions from the QALD-9-plus dataset. As our baseline
system, we consider the  with plan step and NEL tool components.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion, Research Questions and Limitations</title>
      <p>RQ1: The analysis reveals that the proposed agent architecture enables more accurate SPARQL query
generation. In particular, mKGQAgent achieved state-of-the-art results (54.83% F1 score) on
English and demonstrated superior quality on German, Spanish, Belarusian, and Bashkir.</p>
      <p>RQ2: The evaluation indicates that the full mKGQAgent setup with all components achieves
substantially better quality but requires additional computational resources. For example, the  requires
8.87 LLM calls on average to achieve the end goal, while the final mKGQAgent requires
13.03 LLM calls on average.</p>
      <p>RQ3: Our work indicates that multilingual SPARQL generation presents significant challenges even to
state-of-the-art LLMs. In particular, even among European languages, the quality of SPARQL
query generation may degrade by more than a factor of three.</p>
      <p>RQ4: Our results indicate that machine translation generally leads to higher KGQA performance
than processing questions in their native languages. However, the effectiveness of
translation-based approaches varies by language and model.</p>
      <p>We acknowledge several limitations of our work. Since our evaluation relies on Wikidata-based
datasets, it may not fully capture the ability of LLMs to generalize or compose SPARQL queries for
previously unseen data. The issue of data memorization was not the primary focus of this study;
however, we address it in a separate publication [38].</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>This paper introduces mKGQAgent, a novel LLM agent framework for the multilingual
Text-to-SPARQL task. Our experiments show that each step of the mKGQAgent workflow
contributes positively to the quality of the results. mKGQAgent substantially outperforms previous
systems on the English, German, Spanish, Belarusian, and Bashkir questions of the QALD-9-plus
dataset.</p>
      <p>We highlighted significant challenges that arise when LLMs deal with non-English languages, especially
low-resource ones. As our experiments demonstrate, these challenges can be partially addressed with MT
techniques. However, the use of different translation techniques requires further
systematic study to identify the settings in which each of them performs best. Nevertheless, the mKGQAgent
framework demonstrates a promising approach to KGQA by adopting the LLM agent paradigm. While it
works across multiple languages with reasonably good quality, we also demonstrated
the trade-off between quality and the computational costs that increase with the adoption of the agent
paradigm.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work has been partially supported by grants for the ITZBund15-funded research project “QA4CB—
Entwicklung von Question-Answering-Komponenten zur Erweiterung des Chatbot-Frameworks” at the
Leipzig University of Applied Sciences in Leipzig (Germany).</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT by OpenAI for grammar and
spelling checks. After using this service, the authors reviewed and edited the content as needed and
take full responsibility for the publication’s content.</p>
      <p>[4] J. Jiang, K. Zhou, W. X. Zhao, Y. Song, C. Zhu, H. Zhu, J.-R. Wen, KG-Agent: An efficient
autonomous agent framework for complex reasoning over knowledge graph, arXiv preprint
arXiv:2402.11163 (2024).
[5] X. Huang, S. Cheng, S. Huang, J. Shen, Y. Xu, C. Zhang, Y. Qu, QueryAgent: A reliable and
efficient reasoning framework with environmental feedback based self-correction, arXiv preprint
arXiv:2403.11886 (2024).
[6] Y. Li, Y. Zhang, L. Sun, MetaAgents: Simulating interactions of human behaviors for LLM-based
task-oriented coordination via collaborative generative agents, CoRR abs/2310.06500 (2023).
doi:10.48550/ARXIV.2310.06500. arXiv:2310.06500.
[7] D. Diefenbach, K. Singh, A. Both, D. Cherix, C. Lange, S. Auer, The Qanary ecosystem: Getting
new insights by composing question answering pipelines, in: J. Cabot, R. De Virgilio, R. Torlone
(Eds.), Web Engineering, Springer International Publishing, Cham, 2017, pp. 171–189.
[8] C. G. Correa, M. K. Ho, F. Callaway, T. L. Griffiths, Resource-rational task decomposition to
minimize planning costs, in: Proceedings of the 42th Annual Meeting of the Cognitive Science
Society - Developing a Mind: Learning in Humans, Animals, and Machines, CogSci 2020, virtual,
July 29 - August 1, 2020, cognitivesciencesociety.org, 2020. URL: https://cogsci.mindmodeling.org/
2020/papers/0746/index.html.
[9] A. Perevalov, D. Diefenbach, R. Usbeck, A. Both, QALD-9-plus: A multilingual dataset for
question answering over DBpedia and Wikidata translated by native speakers, in: 2022 IEEE 16th
International Conference on Semantic Computing (ICSC), IEEE, 2022, pp. 229–234.
[10] A. Perevalov, X. Yan, L. Kovriguina, L. Jiang, A. Both, R. Usbeck, Knowledge graph question
answering leaderboard: A community resource to prevent a replication crisis, in: Proceedings of
the Language Resources and Evaluation Conference, European Language Resources Association,
Marseille, France, 2022, pp. 2998–3007. URL: https://aclanthology.org/2022.lrec-1.321.
[11] A. Perevalov, A. Both, A.-C. Ngonga Ngomo, Multilingual question answering systems for
knowledge graphs—a survey, Semantic Web 15 (2024) 2089–2124.
[12] D. Punjani, K. Singh, A. Both, M. Koubarakis, I. Angelidis, K. Bereta, T. Beris, D. Bilidas, T. Ioannidis,
N. Karalis, C. Lange, D. Pantazi, C. Papaloukas, G. Stamoulis, Template-based question answering
over linked geospatial data, in: Proceedings of the 12th Workshop on Geographic Information
Retrieval, GIR’18, Association for Computing Machinery, New York, NY, USA, 2018. URL: https:
//doi.org/10.1145/3281354.3281362. doi:10.1145/3281354.3281362.
[13] T. Pellissier Tanon, M. D. de Assunção, E. Caron, F. M. Suchanek, Demoing Platypus – a multilingual
question answering platform for Wikidata, in: A. Gangemi, A. L. Gentile, A. G. Nuzzolese,
S. Rudolph, M. Maleshkova, H. Paulheim, J. Z. Pan, M. Alam (Eds.), The Semantic Web: ESWC
2018 Satellite Events, Springer International Publishing, Cham, 2018, pp. 111–116.
[14] R. Omar, I. Dhall, P. Kalnis, E. Mansour, A universal question-answering platform for knowledge
graphs, Proceedings of the ACM on Management of Data 1 (2023) 1–25.
[15] Y. Zhou, X. Geng, T. Shen, W. Zhang, D. Jiang, Improving zero-shot cross-lingual transfer
for multilingual question answering over knowledge graph, in: K. Toutanova, A. Rumshisky,
L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, Y. Zhou
(Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, Association for Computational
Linguistics, Online, 2021, pp. 5822–5834. URL: https://aclanthology.org/2021.naacl-main.465/.
doi:10.18653/v1/2021.naacl-main.465.
[16] Y. Zhang, J. Wang, Z. Wang, R. Zhang, XSemPLR: Cross-lingual semantic parsing in multiple
natural languages and meaning representations, in: Proceedings of the 61st Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational
Linguistics, Toronto, Canada, 2023, pp. 15918–15947. URL: https://aclanthology.org/2023.acl-long.
887.
[17] Y. Tan, X. Zhang, Y. Chen, Z. Ali, Y. Hua, G. Qi, CLRN: A reasoning network for multi-relation
question answering over cross-lingual knowledge graphs, Expert Systems with Applications
231 (2023) 120721. URL: https://www.sciencedirect.com/science/article/pii/S095741742301223X.
doi:10.1016/j.eswa.2023.120721.
[18] C. Zong, Y. Yan, W. Lu, J. Shao, Y. Huang, H. Chang, Y. Zhuang, Triad: A framework leveraging a
multi-role LLM-based agent to solve knowledge base question answering, in: Proceedings of the
2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 1698–1710.
[19] J. Lehmann, D. Bhandiwad, P. Gattogi, S. Vahdati, Beyond boundaries: A human-like approach for
question answering over structured and unstructured information sources, Transactions of the
Association for Computational Linguistics 12 (2024) 786–802.
[20] G. Xiong, J. Bao, W. Zhao, Interactive-KBQA: Multi-turn interactions for knowledge base question
answering with large language models, CoRR abs/2402.15131 (2024). doi:10.48550/ARXIV.2402.
15131.
[21] G. Mialon, R. Dessi, M. Lomeli, C. Nalmpantis, R. Pasunuru, R. Raileanu, B. Roziere, T. Schick,
J. Dwivedi-Yu, A. Celikyilmaz, et al., Augmented language models: a survey, Transactions on
Machine Learning Research (2023).
[22] L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin, et al., A
survey on large language model based autonomous agents, Frontiers of Computer Science 18
(2024) 186345. doi:10.1007/s11704-024-40231-1.
[23] Y. Luo, Z. Yang, F. Meng, Y. Li, J. Zhou, Y. Zhang, An empirical study of catastrophic forgetting in
large language models during continual fine-tuning, arXiv preprint arXiv:2308.08747 (2023).
[24] D. Vrandečić, M. Krötzsch, Wikidata: a free collaborative knowledgebase, Communications of the
ACM 57 (2014) 78–85. doi:10.1145/2629489.
[25] Q. J. Huys, N. Lally, P. Faulkner, N. Eshel, E. Seifritz, S. J. Gershman, P. Dayan, J. P. Roiser, Interplay
of approximate planning strategies, Proceedings of the National Academy of Sciences 112 (2015)
3098–3103. doi:10.1073/pnas.1414219112.
[26] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, Z. Ives, DBpedia: A nucleus for a web of
open data, in: The semantic web, Springer, 2007, pp. 722–735.
[27] D. Vrandečić, M. Krötzsch, Wikidata: A free collaborative knowledgebase, Commun. ACM 57
(2014) 78–85. doi:10.1145/2629489.
[28] J. Soruco, D. Collarana, A. Both, R. Usbeck, QALD-9-ES: A Spanish dataset for question answering
systems, in: Knowledge Graphs: Semantics, Machine Learning, and Languages, IOS Press, 2023,
pp. 38–52.
[29] R. Usbeck, M. Röder, M. Hoffmann, F. Conrads, J. Huthmann, A.-C. Ngonga-Ngomo, C. Demmler,
C. Unger, Benchmarking question answering systems, Semantic Web 10 (2019) 293–304.
[30] J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, S. Han,
AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration,
in: P. Gibbons, G. Pekhimenko, C. D. Sa (Eds.), Proceedings of Machine Learning and
Systems, volume 6, 2024, pp. 87–100. URL: https://proceedings.mlsys.org/paper_files/paper/2024/
file/42a452cbafa9dd64e9ba4aa95cc1ef21-Paper-Conference.pdf.
[31] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, I. Stoica, Efficient
memory management for large language model serving with pagedattention, in: Proceedings of
the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
[32] L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder, F. Wei, Multilingual E5 text embeddings: A
technical report, arXiv preprint arXiv:2402.05672 (2024).
[33] N. Muennighoff, N. Tazi, L. Magne, N. Reimers, MTEB: Massive text embedding benchmark, arXiv
preprint arXiv:2210.07316 (2022).
[34] A. Sakor, K. Singh, A. Patel, M.-E. Vidal, Falcon 2.0: An entity and relation linking tool over
Wikidata, in: Proceedings of the 29th ACM International Conference on Information &amp; Knowledge
Management, CIKM ’20, Association for Computing Machinery, New York, NY, USA, 2020, p.
3141–3148. doi:10.1145/3340531.3412777.
[35] A. Perevalov, A. Both, D. Diefenbach, A.-C. Ngonga Ngomo, Can machine translation be a
reasonable alternative for multilingual question answering systems over knowledge graphs?, in:
Proceedings of the ACM Web Conference 2022, WWW ’22, Association for Computing Machinery,
New York, NY, USA, 2022, p. 977–986. URL: https://doi.org/10.1145/3485447.3511940. doi:10.1145/
3485447.3511940.
[36] N. Srivastava, A. Perevalov, D. Kuchelev, D. Moussallem, A.-C. Ngonga Ngomo, A. Both, Lingua
franca – entity-aware machine translation approach for question answering over knowledge
graphs, in: Proceedings of the 12th Knowledge Capture Conference 2023, K-CAP ’23, Association
for Computing Machinery, New York, NY, USA, 2023, p. 122–130. URL: https://doi.org/10.1145/
3587259.3627567. doi:10.1145/3587259.3627567.
[37] J. Tiedemann, S. Thottingal, OPUS-MT — Building open translation services for the World, in:
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation
(EAMT), Lisbon, Portugal, 2020.
[38] A. Gashkov, A. Perevalov, M. Eltsova, A. Both, SPARQL query generation with LLMs: Measuring the
impact of training data memorization and knowledge injection, 2025. URL: https://arxiv.org/abs/
2507.13859. arXiv:2507.13859.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Diefenbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. H.</given-names>
            <surname>Migliatti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Qawasmeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lully</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Maret</surname>
          </string-name>
          ,
          <article-title>QAnswer: A question answering prototype bridging the gap between a considerable part of the lod cloud and end-users</article-title>
          ,
          <source>in: The World Wide Web Conference</source>
          , WWW '19, Association for Computing Machinery, New York, NY, USA,
          <year>2019</year>
          , p.
          <fpage>3507</fpage>
          -
          <lpage>3510</lpage>
          . doi:10.1145/3308558.3314124.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Turganbay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Surkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Evseev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Drobyshevskiy</surname>
          </string-name>
          ,
          <article-title>Generative question answering systems over knowledge graphs and text</article-title>
          , volume
          <volume>22</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>1112</fpage>
          -
          <lpage>1126</lpage>
          . doi:10.28995/2075-7182-2023-22-1112-1126.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vollmers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zahera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moussallem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <article-title>MST5 - multilingual question answering over knowledge graphs</article-title>
          ,
          <source>arXiv preprint arXiv:2407.06041</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>