
Graf von Data: A Knowledge Graph Question Answering Agent for Organisational Usage

Rene Dorsch (Fraunhofer IIS), rene.dorsch@iis.fraunhofer.de

Daniel Henselmann (Fraunhofer IIS), daniel.henselmann@iis.fraunhofer.de

Andreas Harth (Fraunhofer IIS; Friedrich-Alexander-Universität Erlangen-Nürnberg), andreas.harth@iis.fraunhofer.de

Fraunhofer IIS, Nordostpark 84, 90411 Nuremberg, Germany

Friedrich-Alexander-Universität Erlangen-Nürnberg, Lange Gasse 20, 90403 Nuremberg, Germany

2025

Large Language Model-based (LLM) systems have advanced Knowledge Graph-based Question Answering (KGQA) but overlook organisational demands for transparency, control, and flexible deployment. We present Graf von Data (GvD), an LLM-augmented agent for KGQA, designed to meet key organisational demands. GvD translates natural language questions into SPARQL queries by iteratively exploring the knowledge graph (KG) using the ReAct framework and a set of three actions - search, describe, and query - requiring only a SPARQL endpoint with text search capabilities. Unlike other approaches, GvD is graph-agnostic and uses an open-weight LLM. We evaluate GvD in the Text2SPARQL challenge, demonstrating near state-of-the-art performance on the Corporate KG benchmark. In addition to quantitative results, we conduct a qualitative comparison against the challenge competitors, showing that GvD thrives under the organisational demands.

Keywords: Large Language Model, RDF, Knowledge Graph, Question Answering, Knowledge Graph Question Answering

1. Introduction

Connecting data across various sources is a crucial task for many organisations, enabling process optimisation and informed decision-making. Industrial organisations [ 1, 2, 3 ], technology organisations [ 4 ], and government institutions [ 5 ] employ knowledge graphs (KGs) due to their flexible structure and schema to connect data, thereby simplifying data integration.

However, accessing and querying KGs remains challenging for non-technical users in organisations [ 6 ] due to the specialised expertise required for crafting SPARQL queries. For non-technical users, Knowledge Graph Question Answering (KGQA) approaches have been developed to offer an interface for natural language questions. Current approaches rely on Large Language Models (LLMs) to interact with the user and retrieve answers from the KG.

Nevertheless, existing LLM-based approaches (see Section 2) for KGQA are not developed with the demands of organisations in mind. These demands are transparent answer generation, organisational control, and flexible deployment [ 7, 8 ].

To satisfy the additional demands of organisations, this paper introduces Graf von Data (GvD), an LLM-augmented agent. Users ask questions in natural language; GvD explores the KG, retrieves the relevant information, and provides an answer. Users receive not only natural language answers but also the corresponding executable SPARQL queries, supporting transparency. GvD’s agent-based approach enables iterative gathering of and reasoning over KG information, which is stored within a trace that is available to users, thereby also enhancing the transparency of query generation. Furthermore, GvD employs a KG-agnostic approach, requiring only a SPARQL endpoint and no prior assumptions about the KG. Thus, it supports flexible deployments. By prioritising medium-sized, open-weight models, GvD supports deployment within trusted environments (on-premise or in the cloud), reduces operational costs, and avoids vendor lock-in.

GvD achieved the highest F1 score (44.4%) for the Corporate KG benchmark in the quantitative evaluation of the Text2SPARQL challenge¹. A qualitative comparison with the competitors also shows that GvD is the only competitor addressing all organisational demands.

The remainder of this paper is structured as follows. Section 2 presents related work on LLM-augmented KGQA. Section 3 introduces GvD. Section 4 presents the experimental setup to evaluate GvD within the Text2SPARQL challenge. Section 5 introduces the competitors of the Text2SPARQL challenge. Section 6 discusses the results from the experiments and highlights differences between our approach and the other participants of the Text2SPARQL challenge. Section 7 concludes the paper with limitations and directions for future research.

2. Related Work

The emergence of LLMs significantly impacted KG question answering as LLM-augmented approaches outperformed the results of prior approaches (e.g., [ 5 ]), becoming state-of-the-art. Multiple strategies for LLM-augmented systems for KG question answering exist, which we distinguish in the following. We consider approaches [ 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22 ] that generate a SPARQL query for a natural language question and find the question’s answer by executing the query, resulting in comprehensible answers based on facts.

Fine-tuning trains an LLM on a specific KG, inherently adding its content to the LLM’s knowledge. Approaches using fine-tuning [ 14, 19 ] require expensive resources to adjust the weights, are optimized for individual KGs, and require fine-tuning after each update of the KG.

Approaches [ 14, 16, 17, 21, 22 ] implementing few-shot learning [ 23 ] provide example question-query pairs to the LLM.

Retrieval-Augmented Generation (RAG) [ 24 ] allows an LLM to consider (up-to-date) information that is not inherent to the model by retrieving and considering additional resources after a question is asked. Actions, often referred to as tools, can be used to incorporate various resources. An action usually corresponds to an API call, providing information based on the question and receiving context information from the KG. Actions that implement RAG retrieve context information from an external knowledge source out of a given, finite set of information items. RAG actions in related work are entity/relation linking (retrieving IDs out of all IDs in the KG) [ 9, 10, 11, 12, 13, 14, 15, 16, 20, 21, 22 ], subgraph retrieval for an entity (out of the entire KG) [ 10, 11, 12, 13, 20, 21 ], subgraph retrieval specifically for a property (out of the entire KG) [ 10, 20 ], path discovery (out of all paths in the KG) [ 13 ], and SPARQL query execution (retrieving a combination of entities/relations from the KG) [ 10, 11, 12 ].

Other actions return context information that is not retrieved from an external knowledge source but instead generated based on the question’s information. Such actions in related work are entity/relation extraction from the question [ 9, 14, 15, 16 ], shape generation [ 18 ], SPARQL query generation [ 9 ], and SPARQL query validation [ 15, 16, 17, 22 ].

Besides actions, calling the LLM is a potential step in the question-answering process. In addition to the task of generating the SPARQL query, intermediate tasks to translate the question [ 18, 21 ], extract entities/relations [ 18 ], identify relevant classes and properties from the KG [ 15 ], filter retrieved subgraphs [ 10 ], and select the best SPARQL query among several candidates [ 9, 11 ] have been used.

¹ See https://text2sparql.aksw.org/

Action and LLM calls may be combined in different ways. Figure 1 shows four strategies. Approaches [ 19 ] that find an answer directly execute a single LLM call (see Figure 1a) without considering up-to-date context information from the KG. Workflow approaches [ 9, 11, 14, 15, 16, 17, 18, 21 ] execute a predefined sequence of action and LLM calls (see Figure 1b). Plan-and-solve [ 25 ] approaches [ 22 ] initially call the LLM to draft a plan, that is, a sequence of action and LLM calls that is then executed (see Figure 1c). ReAct [ 26 ] approaches [ 10, 12, 13, 20 ] iteratively alternate between calling the LLM to select the next action and executing the selected action, until the LLM decides to terminate, usually because the SPARQL query was found (see Figure 1d).

[Figure 1: Four strategies for combining action and LLM calls. (a) Direct answer. (b) Workflow. (c) Plan-and-solve [ 25 ]. (d) ReAct [ 26 ].]

The current state-of-the-art for question answering on Wikidata is held by SPINACH [ 10 ] using a ReAct approach with actions for entity/relation linking, subgraph retrieval for an entity (enhanced with an LLM call to filter retrieved subgraphs), subgraph retrieval specifically for a property, and SPARQL query execution. SPINACH was developed explicitly for Wikidata, using three APIs provided by Wikidata and profiting from the LLM’s inherent knowledge on Wikidata entities from its training. In contrast, GvD requires only a SPARQL endpoint (with search extension) as API, supporting KG-agnostic operation.

3. Graf von Data Architecture

We provide a high-level overview of the GvD system and a description of its behaviour based on an example in Section 3.1. Afterwards, we provide a detailed description of the actions, prompts, and LLM selection for the agent in Section 3.2.

3.1. High Level Overview

GvD operates within a structured environment composed of two elements (see Figure 2): the agent and the environment. These components interact iteratively to process natural language questions and produce answers based on knowledge graph data.

[Figure 2: Overview of GvD. The agent comprises the prompt, the LLM, the available actions (search, describe, query, success, fail), a sensor, and an actuator. The environment comprises the user and the SPARQL endpoint (triple store with keyword search plugin). The user receives the answer (SPARQL query and natural language answer).]

• Agent: The agent orchestrates the question-answering process. It consists of three subcomponents: (i) an LLM, which processes input to select actions; (ii) a sensor, which parses and appends input data to the prompt; and (iii) an actuator, which executes the selected actions.
• Environment: The environment includes a user and a SPARQL endpoint. The SPARQL endpoint serves as the interface between the agent and the KG. The endpoint provides a standardised access point to the triple store and the keyword search, enabling GvD to work with different KGs without modification or additional infrastructure.

In the example trace, in which the user asks how many suppliers can deliver alternative compatible products for the K367 Strain Encoder, the agent proceeds as follows:

1. Initial Search (Step 1): The agent identifies the key term “K367 Strain Encoder” from the user question. Afterwards, the agent calls the search action for the term to identify related URIs from the KG.
2. Entity Expansion (Steps 2-3): Once the relevant URIs are located, the agent retrieves their RDF description to identify compatible products and associated relationships (ns1:compatibleProduct).
3. Intermediate Steps (Step ...): The agent extends the search using additional keywords and/or explores connected entities to discover a solution path.
4. Final Query Step (Step n−1): After gathering all relevant data (compatible products and their suppliers), the agent composes and executes a SPARQL query to count the number of unique suppliers.
5. Completion Step (Step n): The agent receives the final result (6 suppliers), returns both the SPARQL query and the natural language answer to the user, and terminates.

3.2. Agent Implementation

The agent implements the ReAct framework to iteratively interact with its environment. Within each interaction loop, the agent can execute multiple actions in parallel, allowing efficient and flexible graph exploration. The agent consists of three components: the sensor, the actuator, and the LLM.
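The interplay of LLM, sensor, and actuator can be illustrated with a minimal sketch. It is not GvD's implementation: the `Agent` class, the `Policy` type (a stand-in for the LLM call), and the trace format are hypothetical, and the default of eight loops merely mirrors the limit reported in the experimental setup.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical types: the paper does not specify GvD's internal interfaces.
ActionCall = Tuple[str, str]  # (action name, argument)
Policy = Callable[[str], Tuple[str, List[ActionCall]]]  # prompt -> (thought, calls)


@dataclass
class Agent:
    policy: Policy                            # stands in for the LLM
    actions: Dict[str, Callable[[str], str]]  # e.g. search / describe / query
    max_loops: int = 8                        # assumed default, see lead-in
    trace: List[str] = field(default_factory=list)

    def sense(self, text: str) -> None:
        """Sensor: append incoming information to the prompt history."""
        self.trace.append(text)

    def run(self, question: str) -> str:
        self.sense(f"User: {question}")
        for _ in range(self.max_loops):
            # Think: the policy sees the whole history and selects actions.
            thought, calls = self.policy("\n".join(self.trace))
            self.sense(f"Think: {thought}")
            for name, arg in calls:
                self.sense(f'Act: {name}("{arg}")')
                if name in ("success", "fail"):
                    return arg  # termination action ends the loop
                # Actuator: execute the action and feed the result back.
                handler = self.actions.get(name)
                result = handler(arg) if handler else f"Unknown action: {name}"
                self.sense(f"Observation: {result}")
        return "fail: loop limit reached"


def scripted_policy(prompt: str) -> Tuple[str, List[ActionCall]]:
    """Toy replacement for the LLM: search once, then terminate."""
    if "Observation:" not in prompt:
        return "Look up the key term first.", [("search", "K367 Strain Encoder")]
    return "The search returned a match.", [("success", "done")]


agent = Agent(policy=scripted_policy,
              actions={"search": lambda kw: f"1 match for {kw}"})
answer = agent.run("How many suppliers ...?")
```

Keeping the full history in `trace` is what makes the reasoning inspectable afterwards: the same list that serves as the prompt also serves as the user-visible trace.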

The sensor is responsible for perceiving and parsing all incoming information, either from the user’s initial question or as feedback from the environment (search results, RDF descriptions, query outputs, or error messages). The sensor transforms this information into structured text and appends it to the prompt, extending the history. The updated prompt serves as the internal state for the LLM’s decision-making.

[Figure: Example trace and answer of GvD.]

User: How many suppliers can deliver alternative compatible products for the K367 Strain Encoder?

Graf von Data (Trace)

Step 1
Think: The user is asking for the number of suppliers that can deliver alternative compatible products for a specific item, the K367 Strain Encoder. The first step is to search for the K367 Strain Encoder.
Act: search("K367 Strain Encoder")
Observation: The search for K367 Strain Encoder returned the following matches.
http://ld.company.org/prod-instances/hw-K367-1320550 5.6152534
http://ld.company.org/prod-instances/bom-part-9-K367-1320550 3.1478124
...

Step 2
Think: The search results include several URIs that seem related to the K367 Strain Encoder. The first URI, `http://ld.company.org/prod-instances/hw-K367-1320550`, appears to be the most relevant as it directly mentions the K367 Strain Encoder.
Observation: The description of http://ld.company.org/prod-instances/hw-K367-1320550 is as follows:
@prefix ns1: <http://ld.company.org/prod-vocab/> .
@prefix ns2: <http://ld.company.org/prod-instances/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
ns2:hw-K367-1320550 ns1:compatibleProduct ns2:hw-C119-5354812, ns2:hw-K845-4116844, ns2:hw-Q248-7597886, ns2:hw-R658-8902629, ns2:hw-S271-9518696, ns2:hw-T808-9704137 ;
    ns1:depth_mm 16 ;
    ns1:hasCategory ns2:prod-cat-Encoder, ns2:prod-cat-Strain ;
    ns1:hasProductManager ns2:empl-Dietlinde.Boehme%2540company.org ;
    ns1:hasSupplier ns2:suppl-7c4b3772-ee27-4915-8166-359d068db153 ;
    ns1:height_mm 62 ;
    ns1:id "K367-1320550" ;
    ns1:name "Strain Encoder" ;
    ns1:price ns2:price-hw-K367-1320550-EUR ;
    ns1:reliabilityIndex 0.929 ;
    ns1:weight_g 8 ;
    ns1:width_mm 27 ;
    rdf:type ns1:Hardware ;
    rdfs:label "K367-1320550 - Strain Encoder" .

Step 3
Think: The description of the K367 Strain Encoder includes a list of compatible products. To find out how many suppliers can deliver these alternative compatible products, I need to describe each of the compatible products to find their suppliers.
Act: describe("http://ld.company.org/prod-instances/hw-C119-5354812") | describe("http://ld.company.org/prod-instances/hw-K845-4116844")
Observation: ...

Step ...
Think: ...
Act: ...
Observation: ...

Step n-1
Think: Now that I have the descriptions of all the compatible products for the K367 Strain Encoder, I can see that each product has a `hasSupplier` property that points to a specific supplier URI.
Act: query("PREFIX ns1: <http://ld.company.org/prod-vocab/> SELECT (COUNT(DISTINCT ?supplier) AS ?numSuppliers) WHERE { <http://ld.company.org/prod-instances/hw-K367-1320550> ns1:compatibleProduct ?compatibleProduct . ?compatibleProduct ns1:hasSupplier ?supplier . }")
Observation: The query "PREFIX ns1: <http://ld.company.org/prod-vocab/> SELECT (COUNT(DISTINCT ?supplier) AS ?supplierCount) WHERE { <http://ld.company.org/prod-instances/hw-K367-1320550> ns1:compatibleProduct ?compatibleProduct . ?compatibleProduct ns1:hasSupplier ?supplier . }" returned the following results: [{'supplierCount': '"6"^^http://www.w3.org/2001/XMLSchema#integer'}]

Step n
Think: The query has confirmed that there are 6 unique suppliers that can deliver alternative compatible products for the K367 Strain Encoder. This matches the information I gathered from the descriptions of the compatible products. I have now successfully determined the number of suppliers that can deliver alternative compatible products for the K367 Strain Encoder.
Act: success("There are 6 suppliers that can deliver alternative compatible products for the K367 Strain Encoder.")

Graf von Data (Answer)

User Response: For the question "How many suppliers can deliver alternative compatible products for the K367 Strain Encoder?" we used the following SPARQL query:
PREFIX ns1: <http://ld.company.org/prod-vocab/>
SELECT (COUNT(DISTINCT ?supplier) AS ?supplierCount)
WHERE {
    <http://ld.company.org/prod-instances/hw-K367-1320550> ns1:compatibleProduct ?compatibleProduct .
    ?compatibleProduct ns1:hasSupplier ?supplier .
}
and found that: There are 6 suppliers that can deliver alternative compatible products for the K367 Strain Encoder.

The prompt is the primary mechanism to direct the LLM towards robust and efficient query generation. GvD implements a zero-shot prompt that only includes high-level instructions. The initial prompt consists of the following five sections:

1. Role Definition: This section instructs the model to act as an autonomous agent tasked with answering user questions by exploring the KG and generating appropriate SPARQL queries.
2. Loop Definition: The loop definition outlines the strategy based on the ReAct framework, allowing the agent to alternate between processing environmental feedback and taking actions, facilitating controlled and explainable behaviour. The loop is composed of three components: (i) Think, to formulate the decision-making of the agent, (ii) Act, in which the agent selects and executes the next action, and (iii) Observation, to perceive data from the environment.
3. Actions: Each action is described in terms of its input requirements, its intended purpose, and the expected output format. The actions are keyword search, entity description, SPARQL query execution, and termination (success or failure). By explicitly defining these actions, the prompt provides the agent with a clear operational vocabulary for interacting with the KG.
4. Stage Description: The stage description suggests an "explore and expand" approach, in which the agent begins with a coarse-grained search and incrementally builds upon the results of previous actions.
5. General Guidelines: The guidelines provide recommendations that support correct and efficient system behaviour. For example, the agent is advised to avoid relying on known namespaces (such as DBpedia), to always use fully expanded URIs when composing SPARQL queries, and to be aware of the maximum number of interaction loops. These heuristics help the agent to remain compatible with different KGs and prevent common failure modes.

The agent can interact with its environment through the actuator. The actuator parses the textual action call and executes the action on the environment. The actuator has five actions available. Two actions specify termination conditions (success and failure) to enable the agent to stop the query generation. The success condition allows the agent to signal that it found a query based on the information from the graph. The fail condition indicates that the agent found either no query or an unsatisfactory one. The other three actions enable the agent to interact with any KG:

1. Search action: An entity linking action that matches natural language keywords to entity URIs by the similarity score between the keyword and the entities’ textual descriptions. The action uses keyword search plugins, which are widely available for RDF triple stores², to enable search based on text similarity within the KG through a SPARQL SELECT query. The similarity score can be specified on any literal. The search extension creates an index, such as a Lucene index, over the literals to facilitate faster retrieval.
2. Describe action: A subgraph retrieval method that returns a subgraph for a provided entity URI according to defined entity boundaries. The boundaries vary based on the entity’s type. The specific boundaries for class entities [ 28 ] are shown in Figure 5 and expressed by a SPARQL query with the union of the graph patterns in Listings 1, 3, 4, and 5 from Appendix A. The specific boundaries for object property entities [ 28 ] are shown in Figure 6 and expressed by a SPARQL query with the union of the graph patterns in Listings 1, 7, 8, and 6 from Appendix A. The entity boundaries for all other entities are shown in Figure 4 and expressed by a SPARQL query with the union of the graph patterns in Listings 1 and 2 from Appendix A. For all outgoing (see Listing 1) and incoming (see Listing 2) triples, we prune properties that occur more than ten times (i.e., we remove the additional triples with that property) to limit the total number of triples in the subgraph.
3. Query action: Executes a SPARQL query specified by the LLM. The results are returned as a JSON result set.

² See, e.g., for Apache Jena: http://loopasam.github.io/jena-doc/documentation/query/text-query.html, GraphDB: https://graphdb.ontotext.com/documentation/11.0/full-text-search.html, Blazegraph: https://github.com/blazegraph/database/wiki/FullTextSearch
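The pruning rule of the describe action can be sketched as follows. This is an illustration under assumptions: the function name `prune_by_property` is hypothetical, and the text does not state which triples are kept once a property exceeds the threshold, so the sketch simply keeps the first ten in input order.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object) as plain strings


def prune_by_property(triples: Iterable[Triple], limit: int = 10) -> List[Triple]:
    """Keep at most `limit` triples per property, preserving input order.

    Assumption: the first `limit` occurrences are kept; the paper only says
    that additional triples with the same property are removed.
    """
    seen: Counter = Counter()
    kept: List[Triple] = []
    for s, p, o in triples:
        if seen[p] < limit:
            kept.append((s, p, o))
            seen[p] += 1
    return kept


# A hub entity with 25 parts and one label shrinks to 10 + 1 triples.
example = [("ex:bom", "ex:hasPart", f"ex:part-{i}") for i in range(25)]
example.append(("ex:bom", "rdfs:label", "Bill of materials"))
pruned = prune_by_property(example)
```

Capping per property rather than per entity keeps rare properties such as labels visible even when a single high-fan-out property would otherwise fill the prompt.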

The actuator executes actions (including termination conditions) selected by the LLM. A previous study [ 29 ] found that selected actions should be evaluated before execution to prevent failures, unnecessary repetitions, and hallucinations. Thus, the actuator includes a function to validate each action before execution and provides, in the case of a failure, feedback to the sensor. The control function ensures that only syntactically valid and non-repeated actions are executed against the KG.
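A minimal sketch of such a control function is given below. The call syntax (`name("argument")`, parallel calls separated by ` | `) follows the example trace; the function name `validate_actions`, the feedback wording, and the regular expression are assumptions rather than GvD's actual code.

```python
import re
from typing import List, Set, Tuple

ALLOWED = {"search", "describe", "query", "success", "fail"}
# Matches name("argument"); DOTALL lets multi-line SPARQL queries pass.
CALL = re.compile(r'^(\w+)\("(.*)"\)$', re.DOTALL)


def validate_actions(
    act_line: str, history: Set[Tuple[str, str]]
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split a line of parallel calls into accepted calls and feedback.

    Simplification: splitting on " | " would also split a SPARQL query
    that contains that exact character sequence.
    """
    accepted: List[Tuple[str, str]] = []
    feedback: List[str] = []
    for raw in act_line.split(" | "):
        raw = raw.strip()
        match = CALL.match(raw)
        if match is None or match.group(1) not in ALLOWED:
            feedback.append(f"Invalid action call: {raw}")
            continue
        call = (match.group(1), match.group(2))
        if call in history:
            feedback.append(f"Repeated action call skipped: {raw}")
            continue
        history.add(call)
        accepted.append(call)
    return accepted, feedback


history: Set[Tuple[str, str]] = set()
accepted, feedback = validate_actions(
    'describe("http://ex.org/a") | describe("http://ex.org/a") | jump("x")',
    history,
)
```

The feedback messages would be handed to the sensor, so that the LLM sees in the next loop why a call was not executed.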

For the LLM selection, we prioritised LLMs according to the organisational demands. Thus, we evaluated different open-weight models considering three key metrics: performance, robustness, and model size. We employed SupplyBench³ for the evaluation of LLMs. SupplyBench is a benchmark for enterprise-focused KGQA systems, featuring a KG with question/query pairs of varying complexity. Our experimental findings revealed distinct performance patterns across model sizes. Small LLMs with fewer than 10 billion parameters showed unstable agent loop behaviour. Large LLMs with more than 100 billion parameters showed higher performance but required prohibitively high computational resources. Medium-sized LLMs with approximately 70 billion parameters achieved the best balance between performance and efficiency. Specifically, current state-of-the-art medium-sized models, including Llama 3.3 70B Instruct and Qwen 2.5 72B Instruct, successfully managed the robustness requirements for effective agent loops and parallel action planning. LLM-KG-Bench-3 [ 30 ] confirmed our findings, as Llama 3.3 70B and Qwen 2.5 72B demonstrated strong performance across various KG-specific tasks.

4. Experimental Setup

We participated in the Text2SPARQL challenge for the evaluation of GvD. For the Text2SPARQL challenge, GvD and the associated Text2SPARQL APIs were self-hosted alongside two Apache Jena Fuseki⁴ servers extended by Jena Full Text Search⁵. The two APIs provided the corporate and DBpedia SPARQL endpoints to GvD. Jena Full Text Search used Lucene indexes based on rdfs:label values in the respective KGs and the default analyser and tokeniser.
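With this setup, a search action could be issued as a SPARQL SELECT query using the `text:query` property function of Jena Full Text Search. The sketch below only builds the query string; the helper name `build_search_query`, the variable names, and the result limit of 10 are assumptions, and Lucene query syntax characters in the keyword are passed through unescaped.

```python
def build_search_query(keyword: str, limit: int = 10) -> str:
    """Build a keyword search query against a Lucene index on rdfs:label."""
    # Escape backslashes and double quotes so the keyword remains a valid
    # SPARQL string literal.
    escaped = keyword.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX text: <http://jena.apache.org/text#>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?uri ?score WHERE {\n"
        f'  (?uri ?score) text:query (rdfs:label "{escaped}" {limit}) .\n'
        "}\n"
        "ORDER BY DESC(?score)"
    )


search_query = build_search_query("K367 Strain Encoder")
```

Returning the score together with the URI matches the observation format in the example trace, where each match is listed with its similarity score.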

The configuration of GvD limited the number of interaction loops to eight and the response times for the search action to 10 seconds, describe action to 20 seconds, and query action to 30 seconds.

We submitted two versions of GvD to the Text2SPARQL challenge, which differed only in the LLM used. One used Llama 3.3 70B Instruct⁶ and the other Qwen 2.5 72B Instruct⁷. We used both models hosted at Chat AI⁸ [ 31 ]. Regarding model parameters, the temperature (influencing the output creativity) was set to 0 (lowest creativity), and the top_p (influencing the output variety) was set to 1 (highest variety).

All experiments were conducted with one run by the Text2SPARQL organizers. The organizers evaluated GvD with the metrics precision, recall, F1, and, for SPARQL queries with sorted results, NDCG (Normalised Discounted Cumulative Gain). The organizers did not differentiate between types of queries (e.g. ASK, SELECT). We were not able to reproduce the results from the organizers with our experimental setup⁹. This may be caused by changed or unknown configuration parameters of the organizers or Chat AI.
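For orientation, the set-based metrics can be computed per question as sketched below. This is the textbook definition over result sets; how the organizers treat empty results or aggregate over questions is not described here, so those details (and the function name) are assumptions.

```python
from typing import Hashable, Iterable, Tuple


def precision_recall_f1(
    predicted: Iterable[Hashable], gold: Iterable[Hashable]
) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 for one question.

    Assumption: an empty predicted or gold set yields 0.0 for the
    corresponding metric instead of raising a division error.
    """
    pred, ref = set(predicted), set(gold)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Two of three returned answers are correct; two of four gold answers found.
p, r, f1 = precision_recall_f1({"a", "b", "c"}, {"b", "c", "d", "e"})
```

In this example, precision is 2/3, recall is 1/2, and F1 is 4/7, the harmonic mean of the two.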

5. Competitor Profiles

Nine groups participated in the Text2SPARQL challenge with twelve competing implementations. Three competitors withdrew from the competition after the results were revealed. Therefore, we excluded them from the evaluation. Table 1 lists the remaining participants with their KGQA strategy.

³ See https://github.com/wintechis/supplybench
⁴ See https://jena.apache.org/documentation/fuseki2/index.html
⁵ See https://jena.apache.org/documentation/query/text-query.html
⁶ See https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct
⁷ See https://huggingface.co/Qwen/Qwen2.5-72B-Instruct
⁸ See https://docs.hpc.gwdg.de/services/chat-ai/index.html
⁹ https://github.com/AKSW/text2sparql.aksw.org/issues/101

Table 1: Remaining participants of the Text2SPARQL challenge and their KGQA approach.

Organisation | KGQA Approach
Institute of Applied Informatics and Formal Description Methods (AIFB) [ 18 ] | Workflow
Artificial Intelligence Research Institute (AIRI) [ 21 ] | Workflow
DBpedia Group (DBpedia) [ 19 ] | Direct
Fraunhofer IIS (IIS) | ReAct
ETi @ Institute for Applied Informatics (InfAI) [ 20 ] | ReAct
WSE Research Group @ Leipzig University of Applied Sciences (WSE) [ 22 ] | Plan-and-Solve

The Institute of Applied Informatics and Formal Description Methods (AIFB) of the Karlsruhe Institute of Technology developed a four-step workflow approach [ 18 ] involving language translation, entity extraction, ShEx shape generation via shexer [ 32 ], and finally SPARQL generation.

The Artificial Intelligence Research Institute (AIRI) proposed Text-to-RDF Executor (TREx) [ 21 ], a workflow-based approach that begins with language translation but then diverges into KG-specific processing pipelines. The DBpedia KG pipeline includes additional steps for query rewriting, entity linking through additional dependencies, and query refinement through question-query examples from the LCQuad¹⁰ and QALD-9¹¹ datasets to improve query generation performance.

The DBpedia Group’s approaches [ 19 ] fine-tuned three open-weight models, CodeGen 2.5 350M [ 33 ] (DBpedia-CG), StarCoder 1B [ 34 ] (DBpedia-SC), and CodeLlama 7b [ 35 ] (DBpedia-CL), using the NSpM data set¹², which contains 8 million natural language question-SPARQL pairs.

The Institute for Applied Informatics (INFAI) developed a ReAct agent approach [ 20 ]. While sharing GvD’s loop definition, the system relies on a vector database as an additional infrastructure. The approach has access to six actions. It has three actions - entity search, property search and class search - for entity and relation linking, two actions - property example and knowledge graph entry - for subgraph retrieval and a query action.

The Web & Software Engineering (WSE) Research Group of the Leipzig University of Applied Sciences developed the mKGQAgent [ 22 ]. The mKGQAgent is a plan-and-solve approach comprising both offline and online phases. The offline phase creates an experience pool from training data to generate successful sequences of actions, while the online phase uses this pool to guide question analysis and step-by-step action execution with GPT-4o. The experience pool is stored within a vector database, using the multilingual e5 text embeddings [ 36 ] to store successful plans.

6. Results and Discussion

This section evaluates GvD’s performance in three parts: a quantitative evaluation based on the Text2SPARQL challenge (Sec. 6.1), a qualitative comparison with Text2SPARQL challenge competitors regarding usability criteria in an organisational context (Sec. 6.2), and a qualitative investigation of system errors of GvD (Sec. 6.3).

6.1. Performance Evaluation

We evaluate GvD’s effectiveness through a quantitative analysis of the Text2SPARQL challenge results across the Corporate KG and DBpedia benchmarks, assessing system performance in SPARQL query generation from natural language questions.

Table 2 shows that the ReAct agent approaches achieve the best performance on the Corporate KG benchmark. GvD with the Qwen 2.5 72b Instruct model (IIS-Q) achieved the highest recall (0.452) and F1-score (0.444). The approach from INFAI achieved the highest performance in terms of NDCG score (1.0) and precision (0.542).

¹⁰ https://github.com/AskNowQA/LC-QuAD
¹¹ https://github.com/ag-sc/QALD/tree/master
¹² https://github.com/LiberAI/NSpM/tree/master

GvD with the Llama 3.3 70b Instruct model (IIS-L) ranked behind these two competitors. While individual metric leadership varies across approaches, the consistent dominance of the ReAct agent approaches over the other approaches suggests that flexible exploration capabilities provide advantages for KGQA applications without prior knowledge.

6.2. Comparison to Organisational Demands

To evaluate GvD’s suitability for organisational usage, we compare the Text2SPARQL challenge competitors across three demands derived from organisational needs: transparent answer generation, organisational control, and flexible deployment. Table 5 summarises the evaluation of GvD and the competitor systems.

The AIFB team employed a modular four-step workflow that supports generalisation across KGs. However, their dependence on the closed-weight GPT-4o model from OpenAI limits organisational control and may introduce privacy concerns. Additionally, their use of ShEx shapes to provide schema information about the KG during query construction may obscure how answers are generated, reducing transparency and making error analysis difficult for end users.

The AIRI group created specialised pipelines for each KG. This approach limits the flexibility to apply the system easily in new domains and increases configuration overhead. Additionally, their reliance on the closed-weight GPT-4 model further restricts organisational control and prevents on-premise usage.

The DBpedia Group uses fine-tuned, open-weight models, which support organisational control by enabling local hosting and avoiding third-party dependencies. However, the model operates in a static inference mode solely based on training data rather than considering up-to-date information, which limits the transparency of the answer generation. The reliance on small language models lowers deployment demands, but enforces continuous retraining if the KG changes.

The mKGQAgent from the WSE group relies on a precomputed experience pool to support plan generation. This offline component helps with generating more transparent answers, as plans are composed of verifiable steps. However, the need for training data that resembles the target domain reduces the system’s ability to generalise, thereby increasing deployment overhead. The usage of GPT-4o again limits organisational control due to the reliance on a closed-weight LLM.

The INFAI system adopts a ReAct agent approach similar to GvD, but relies on a vector database for resolving entity identifiers. This dependency increases the deployment effort, particularly for organisations aiming to reduce infrastructure complexity. The usage of OpenAI’s GPT-4.1 mini model reduces organisational control, introduces external dependencies, and may raise data privacy concerns. Trustworthiness was not explicitly reported, but the reasoning traces may be available, showing how the agent came up with a result.

In contrast, GvD is explicitly designed to address all three organisational demands. It ensures transparent answers by generating structured reasoning traces that show how a query was derived. It supports full organisational control by using open-weight LLMs that can be deployed locally, avoiding external dependencies and supporting data privacy. Furthermore, GvD offers high deployment flexibility, as it is neither optimised for a specific KG nor requires infrastructure beyond a SPARQL endpoint with a keyword search plugin.

6.3. Error Analysis

We conducted an error analysis to identify current system limitations and opportunities for improvement. We created a dataset of 40 problematic cases by examining the execution logs of questions that generated at least three errors. The error distribution revealed language and benchmark dependencies. Only 3 cases (7.5%) came from the Corporate KG benchmark compared to 37 (92.5%) from the DBpedia benchmark. Furthermore, 23 of the 37 cases (62%) from the DBpedia benchmark were caused by Spanish questions.

In the analysis of the errors, we made three observations:

• Data Parsing Failures: Unsupported formatting in SPARQL queries (through the query action) and RDF responses (through the describe action) occurred in 95% (38/40) of cases. These failures stem from missing namespace definitions for standard vocabularies (e.g., missing rdfs: prefix) and inadequate URI escaping (e.g., dbr:New_York_(City) vs. dbr:New_York_\(City\)).
• Function Repetition Errors: The LLM repeatedly tried to execute identical functions with identical arguments in 77.5% of cases (31/40), suggesting that the current feedback mechanisms are insufficient to prevent unproductive exploration loops.
• Stable Agentic Loop: We observed no case with a loss of the agentic loop control or failure to execute a proper action call, indicating that ReAct agent approaches may be adequate for KGQA.
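The URI escaping problem can be made concrete with a small sketch. SPARQL 1.1 allows characters such as parentheses in the local part of a prefixed name only when they are backslash-escaped; writing the full IRI in angle brackets avoids the issue entirely. Both helper names are hypothetical, and the escaping helper ignores edge cases such as leading or trailing dots.

```python
# Characters that must be backslash-escaped in the local part of a SPARQL
# prefixed name (subset of PN_LOCAL_ESC; '_', '.', and '-' are left as-is).
RESERVED = set("~!$&'()*+,;=/?#@%")


def escape_local_name(local: str) -> str:
    """Backslash-escape reserved characters in a prefixed-name local part."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in local)


def expand(namespace: str, local: str) -> str:
    """Alternative: emit the fully expanded IRI, which needs no escaping."""
    return f"<{namespace}{local}>"


escaped = "dbr:" + escape_local_name("New_York_(City)")
full_iri = expand("http://dbpedia.org/resource/", "New_York_(City)")
print(escaped)   # dbr:New_York_\(City\)
print(full_iri)  # <http://dbpedia.org/resource/New_York_(City)>
```

The second variant corresponds to the prompt guideline of always using fully expanded URIs when composing SPARQL queries.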

7. Conclusion

This paper tackles the accessibility and query challenges of KGs within organisations for non-technical users. We introduced Graf von Data (GvD), an LLM-augmented agent that bridges this accessibility gap by translating natural language questions into executable SPARQL queries while fulfilling organisational usage demands.

GvD’s innovation lies in its knowledge graph-agnostic approach, which operates exclusively through a SPARQL endpoint with a keyword search plugin, enabling seamless integration across diverse KG environments without further customisation. This approach, combined with our use of open-weight models, distinguishes GvD among the Text2SPARQL challenge competitors. Despite this focus on organisational demands, GvD achieved the highest F1-score on the Corporate KG benchmark and competitive results in the DBpedia benchmark.

An evaluation of error cases presents opportunities for future enhancement. Future research directions may include expanding the actions of GvD through additional SPARQL interface mappings, integrating specialised entity recognition capabilities, and exploring hybrid approaches combining keyword search with embedding-based indexing for improved entity linking.

Acknowledgments

This work was partially funded by the German Federal Ministry for Economic Affairs and Climate Action (BMWK) through the Antrieb 4.0 project (Grant No. 13IK015B).

The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU). The hardware is funded by the German Research Foundation (DFG).

Declaration on Generative AI

During the preparation of this work, the author(s) used Grammarly to: Grammar and spelling check. After using these tool(s)/service(s), the authors reviewed and edited the content as needed and take full responsibility for the publication’s content.

A. Graph Patterns for Entity Boundaries

Listing 1: Graph pattern: Outgoing triples of the focal entity.

<{uri}> ?p ?o .

Listing 2: Graph pattern: Incoming triples of the focal entity.

?s ?p <{uri}> .

Listing 3: Graph pattern: Subclasses of the focal class.

?subclass rdfs:subClassOf <{uri}> .
OPTIONAL { ?subclass rdf:type ?type . }
OPTIONAL { ?subclass rdfs:label ?label . }

Listing 4: Graph pattern: Parent classes of the focal class.

<{uri}> rdfs:subClassOf ?parentclass .
OPTIONAL { ?parentclass rdf:type ?type . }
OPTIONAL { ?parentclass rdfs:label ?label . }

Listing 5: Graph pattern: Properties having the focal class as domain or range plus classes in domain or range of such properties.

?property ?domainrange <{uri}> .
FILTER ( ?domainrange IN ( rdfs:domain , rdfs:range ))
OPTIONAL { ?property rdf:type ?p_type . }
OPTIONAL { ?property rdfs:label ?p_label . }
OPTIONAL {
  ?property ?rangedomain ?class .
  FILTER ( ?rangedomain IN ( rdfs:domain , rdfs:range ))
  OPTIONAL { ?class rdfs:label ?c_label . }
  OPTIONAL { ?class rdf:type ?c_type . }
}

Listing 6: Graph pattern: Classes in domain or range of the focal object property.

<{uri}> ?domainrange ?class .
FILTER ( ?domainrange IN ( rdfs:domain , rdfs:range ))
OPTIONAL { ?class rdfs:label ?label . }
OPTIONAL { ?class rdf:type ?type . }

Listing 7: Graph pattern: Subproperties of the focal object property plus classes in domain or range of subproperties.

?subproperty rdfs:subPropertyOf <{uri}> .
OPTIONAL { ?subproperty rdf:type ?p_type . }
OPTIONAL { ?subproperty rdfs:label ?p_label . }
OPTIONAL {
  ?subproperty ?rangedomain ?class .
  FILTER ( ?rangedomain IN ( rdfs:domain , rdfs:range ))
  OPTIONAL { ?class rdf:type ?c_type . }
  OPTIONAL { ?class rdfs:label ?c_label . }
}

Listing 8: Graph pattern: Parent properties of the focal object property plus classes in domain or range of parent properties.

<{uri}> rdfs:subPropertyOf ?parentproperty .
OPTIONAL { ?parentproperty rdf:type ?p_type . }
OPTIONAL { ?parentproperty rdfs:label ?p_label . }
OPTIONAL {
  ?parentproperty ?rangedomain ?class .
  FILTER ( ?rangedomain IN ( rdfs:domain , rdfs:range ))
  OPTIONAL { ?class rdf:type ?c_type . }
  OPTIONAL { ?class rdfs:label ?c_label . }
}
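The graph patterns in this appendix are templates with a `{uri}` placeholder. As a rough illustration of how such a template could be turned into an executable query, the sketch below substitutes the focal IRI and wraps the pattern in a SELECT query. The function name `build_query`, the use of SELECT rather than CONSTRUCT, the LIMIT clause, and the IRI check are all assumptions made for illustration; the paper does not specify how GvD assembles these queries.

```python
# Prefix declarations required by the patterns (rdf: and rdfs: only).
PREFIXES = (
    "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
)

# Two of the patterns from this appendix (outgoing triples, subclasses).
PATTERNS = {
    "outgoing": "<{uri}> ?p ?o .",
    "subclasses": (
        "?subclass rdfs:subClassOf <{uri}> .\n"
        "OPTIONAL { ?subclass rdf:type ?type . }\n"
        "OPTIONAL { ?subclass rdfs:label ?label . }"
    ),
}

# Characters that must not appear inside a SPARQL IRI reference (<...>).
_FORBIDDEN_IN_IRI = set('<>"{}|^`\\ ')


def build_query(pattern_name: str, uri: str, limit: int = 100) -> str:
    """Fill the {uri} placeholder and wrap the pattern in a SELECT query."""
    if any(ch in _FORBIDDEN_IN_IRI for ch in uri):
        raise ValueError(f"not a valid IRI for a SPARQL query: {uri!r}")
    # str.replace is used instead of str.format because the patterns
    # contain literal braces in their OPTIONAL blocks.
    body = PATTERNS[pattern_name].replace("{uri}", uri)
    return f"{PREFIXES}SELECT * WHERE {{\n{body}\n}} LIMIT {limit}"


if __name__ == "__main__":
    print(build_query("subclasses", "http://example.org/Product"))
```

Rejecting IRIs with forbidden characters before substitution avoids producing queries that fail to parse, which relates to the data parsing failures reported in the error analysis.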
