<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jaycent G. Ongris</string-name>
          <email>jaycent.gunawan@ui.ac.id</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eduardus Tjitrahardja</string-name>
          <email>eduardus.tjitrahardja@ui.ac.id</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fariz Darari</string-name>
          <email>fariz@ui.ac.id</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fajar J. Ekaputra</string-name>
          <email>fajar.ekaputra@wu.ac.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>SPARQL queries.</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Workshop</string-name>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Large Language Models, Knowledge Graphs, Retrieval-Augmented Generation, GraphRAG</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Computer Science</institution>
          ,
          <addr-line>Universitas Indonesia, Depok 16424</addr-line>
          ,
          <country country="ID">Indonesia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Data</institution>
          ,
          <addr-line>Process, and Knowledge Management, WU, Vienna 1020</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>The rise of large language models (LLMs) has advanced information retrieval, yet issues like limited knowledge updating, lack of transparency and interpretability, as well as hallucinations persist. Retrieval-augmented generation (RAG) addresses these problems, though it still lacks interpretability due to reliance on opaque vector-based representations. Our work presents a RAG framework using a knowledge graph (KG) as the primary knowledge base to address this problem, relying solely on open-source components to enable user customization. Our pipeline comprises multiple stages: (i) a translation module for multilingual support, (ii) entity linking, (iii) knowledge retrieval through verbalized triples or SPARQL query generation, and (iv) answer generation, which incorporates ontology (properties and classes) retrieval. We evaluate our system on Wikidata, DBpedia, and a domain-specific KG. With the optimal configuration determined through an ablation study, the system achieves Jaccard similarity scores of 0.458, 0.517, and 0.976 for each respective KG. The ablation study further reveals that ontology retrieval is the most crucial component in providing context to the LLM in generating SPARQL queries.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Retrieval-Augmented Generation</kwd>
        <kwd>GraphRAG</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Recent advancements in large language models (LLMs) have transformed information retrieval (IR) [1],
powering services like ChatGPT, Gemini, and Copilot for tasks such as question-answering (QA) and
text summarization. LLMs excel in IR due to their ability to semantically understand and generate
natural language by leveraging knowledge internalized in their parameters [2]. While fine-tuning and
in-context learning can further enhance their adaptability, LLMs face challenges like knowledge update
difficulties, lack of transparency and interpretability, as well as hallucination [3], raising concerns about
their reliability in critical domains.</p>
      <p>To overcome these problems while still maintaining the excellent performance of LLMs,
retrieval-augmented generation (RAG) was introduced [3]. RAG incorporates external non-parametric memory in
the form of a retriever module that fetches relevant information. The retriever module allows expansion
or update of knowledge as the knowledge is not embedded in the parameters. Moreover, this module
also allows direct inspection of the retrieved data, partially addressing the black-box problem. While
RAG does not guarantee that hallucinations will disappear entirely, it reduces them by not depending
solely on the knowledge embedded within the LLMs.</p>
      <p>The original naïve RAG approach, as proposed in [3], employs Dense Passage Retriever (DPR) as
its non-parametric retrieval module [4]. This method encodes both queries and documents into dense
embeddings, such as those generated using BERTBASE, as implemented in [3]. The maximum inner
product search (MIPS) algorithm is then used to retrieve the top-k most relevant documents for a given
query. However, despite outperforming traditional scoring algorithms like BM25 [4], DPR still suffers
from the black-box problem due to the inherent opacity of dense embeddings, which are generally
less interpretable. Additionally, splitting large documents into smaller segments for encoding (i.e.,
chunking) can lead to the loss of contextual information and disrupt the coherence of the original
content, especially when related information is spread across different chunks [5].</p>
      <p>In order to address these issues, an alternative approach is to use a different knowledge base for
retrieval instead of relying on a corpus of text chunks. One of the most versatile options is the
knowledge graph (KG), which organizes information in a structured graph format. Leveraging a KG as
the knowledge base offers several advantages: its entities are contextualized by their connections; its
graph structure supports graph theory algorithms (e.g., community detection) for extracting insights;
and it enables reasoning over retrieval results through graph walks that trace the connections between
entities [6].</p>
      <p>Given the interpretability and transparency challenges faced by most LLMs and naïve RAG, we
introduce FrOG: Framework of Open GraphRAG1. FrOG is a RAG system that utilizes a KG as the
primary knowledge base and relies entirely on open-source components. This approach enables direct
tracing of retrieved knowledge, enhancing the system’s reasoning process with greater transparency and
interpretability. Moreover, we leverage off-the-shelf open-source LLMs, allowing users to implement
the system directly without fine-tuning, which would otherwise require substantial computational
resources. By using open-source components, we enhance accessibility and flexibility, making it easier
to customize and improve the system for domain-specific applications. To narrow our focus, we limit
our research to factoid questions and define our research questions (RQs) as follows.
RQ1 How can RAG be implemented using a KG as the primary knowledge base (that is, GraphRAG)
while relying only on open-source components?
RQ2 Which open-source LLM serves as the best foundation for the GraphRAG architecture and achieves
the highest answer accuracy?
RQ3 Which specific component within the architecture plays the most critical role in ensuring accurate
answer generation?</p>
      <p>Our main contribution is the design and development of a GraphRAG pipeline built entirely with
open-source components. We evaluate our pipeline across multiple KGs, including Wikidata, DBpedia,
and a domain-specific KG, using various open-source LLMs to identify the best-performing model. Our
results indicate that Qwen2.5 7B (instruction-fine-tuned) achieves the highest accuracy in the optimal
configuration, with Jaccard similarity scores of 0.458, 0.517, and 0.976 for Wikidata, DBpedia, and the
domain-specific KG, respectively. A demonstration of the system’s capabilities in answering various
questions from the respective KG can be viewed here2. Additionally, our findings highlight the retrieval
of relevant KG resources—specifically entities, properties, and classes—as the most crucial component,
as it provides essential context to the LLM and significantly enhances answer quality.</p>
      <p>The rest of the paper is organized as follows: Section 2 reviews related work, Section 3 presents the
system architecture, Section 4 details the experimental setup, Section 5 discusses the evaluation results,
and Section 6 provides the conclusions and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>GraphRAG is a retrieval-augmented generation (RAG) system that leverages KG as the knowledge base.
The implementation involves retrieving graph elements that contain knowledge pertinent to the given
query from a graph database [7]. GraphRAG provides several advantages over naïve RAG, including
the ability to capture connections between entities for a more comprehensive retrieval of relational
information, represent knowledge in a structured and efficient format rather than lengthy passages,
and understand a broader context, enabling effective query-focused summarization (QFS) tasks [7].</p>
      <p>The key diference between naïve RAG and GraphRAG lies in their knowledge bases and retrievers.
Generally, naïve RAG relies on textual data as the knowledge base and utilizes DPR as the retriever. In
contrast, GraphRAG employs various types of retrievers to extract information from a KG. Specifically,</p>
      <sec id="sec-2-1">
        <title>1https://github.com/Framework-of-Open-GraphRAG/FROG 2Demo: https://drive.google.com/drive/folders/1UPVF-2rXrrToSsWB84h8L18A_YrMn1g6?usp=drive_link</title>
        <p>Peng et al. [7] classify graph retrievers into three main categories: non-parametric retrievers, language
model (LM)-based retrievers, and graph neural network (GNN)-based retrievers. Non-parametric retrievers
apply heuristic rules and traditional graph search algorithms, LM-based retrievers leverage LMs (e.g.,
RoBERTa [8]) for natural language understanding, and GNN-based retrievers use GNNs to effectively
capture and interpret graph structures.</p>
        <p>Several recent end-to-end RAG approaches have incorporated KGs into their pipelines. For instance,
Microsoft GraphRAG [9] automatically constructs a KG from source documents by extracting entities,
relationships, and claims using an LLM. It then performs hierarchical community detection to organize
the graph into thematic subgraphs, with community-level summaries used during query time to generate
globally coherent answers. Another example by Cao et al. [10], LEGO-GraphRAG, incorporates an
extensive retrieval phase consisting of two modules: the optional subgraph-extraction module to narrow
the graph search space and the essential path-retrieval module to identify reasoning paths connecting
relevant entities to answer the query.</p>
        <p>Beyond the aforementioned approaches which directly work on the KG level, another method is to
perform retrieval by querying the graph directly, i.e., query-based GraphRAG, where the query language
can be either SPARQL for RDF data model or Cypher for Neo4j. In this approach, query generation is the
most critical task and can be performed manually, semi-automatically, using schema-based methods, or
through deep learning (DL) [11]. Recent advancements in DL and LLMs have increasingly favored the
latter. For instance, [11] introduced SGPT, which employs a stack of Transformer encoders to encode
both linguistic features of a question and its subgraph information before feeding them into GPT-2
[12] to generate the appropriate query. Other methods [13, 14] require no training, relying solely on
in-context learning. A similar approach to ours is SPINACH [15], except that it utilizes a proprietary
LLM to generate SPARQL queries.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Pipeline Architecture</title>
      <p>We aim to develop a pipeline for question-answering over KGs. Given a question as input, the pipeline
retrieves relevant information from the KG and generates an answer as output. Our approach is built
entirely on open-source components, including all tools and LLMs, ensuring transparency, reproducibility,
and adaptability.</p>
      <p>Our pipeline extends our prior work [16] by introducing new features, including multilingual support
and the integration of an external vector database for retrieving relevant properties and classes needed
for generating SPARQL queries. By incorporating the latter feature, we eliminate the need to explicitly
include the entire set of resources in the prompt, which could otherwise degrade the LLM
performance [17]. We also address the flexibility and scalability limitations identified in [16] by supporting
different types of KGs, including both open and enterprise KGs. Last but not least, to further enhance
system accuracy, particularly for simple questions (one-hop questions), we integrate text-based retrieval
using verbalized triples. The architecture is illustrated in Figure 1, with each component’s role (cf. blue
boxes) explained below.</p>
      <sec id="sec-3-1">
        <title>3.1. Translation</title>
        <p>Since the target KGs are in English, we incorporate a translation component to convert user questions
from their original language into English, ensuring compatibility. The translation process begins with
language detection to check if the input query is already in English. If not, the query is automatically
translated before advancing to the next stages. For implementation, we use the googletrans3 library
in Python, an open-source and lightweight text translation library that utilizes the Google Translate
API.
</p>
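        <p>As a hedged illustration, the sketch below shows how this step could be wired up with googletrans, assuming its synchronous 3.x interface (newer 4.x releases are asynchronous); the helper name to_english is ours, not part of the library.</p>
        <preformat>
# Minimal sketch of the translation stage: detect the input language and
# translate to English only when needed.
from googletrans import Translator

translator = Translator()

def to_english(question: str):
    detected = translator.detect(question)
    if detected.lang == "en":
        return question, "en"
    translated = translator.translate(question, dest="en")
    return translated.text, detected.lang

# Example with an Indonesian question (cf. Section 5.1):
text, lang = to_english("Tunjukkan semua pabrik bir di Australia.")
# text holds the English translation, lang == "id"
        </preformat>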
        <p>[Figure 1: The FrOG pipeline architecture. A user question is translated, passed to entity linking (ranked entities retrieved via API or vector search), then routed either to verbalization-based retrieval (top-1 n-gram similarity over verbalized triples) or to SPARQL query generation backed by ontology (property and class) retrieval from a vector database, before answer generation by the LM.]</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Entity Linking</title>
        <p>This stage combines entity extraction, KG-specific entity retrieval, and entity disambiguation to map
entities detected in the question to relevant KG-specific entities. The process begins by extracting
entities from the question using an LLM with few-shot prompting. We instruct the LLM to rank the
extracted entities by importance, prioritizing the most relevant one for the subsequent stage. Next,
for each extracted entity, the top-k KG-specific entities and their URIs are retrieved either via an API
(e.g., Wikidata API4 or DBpedia Lookup5) for open KGs or through semantic search (where entities are
indexed in a vector database) for enterprise KGs. Finally, entity disambiguation is performed to identify
the most relevant KG-specific entity needed to answer the question. If available, we also include the
description of each KG-specific entity to provide more context to the LLM.</p>
        <p>For example, consider the question “How many children does Lionel Messi have?”,
answered using the DBpedia KG with k = 4. The entity extraction stage is expected to detect
Lionel Messi as the key entity. The top four KG-specific entities are then retrieved using
DBpedia Lookup via keyword search, returning the following entities: dbr:Lionel_Messi,
dbr:Messi–Ronaldo_rivalry, dbr:List_of_international_goals_scored_by_Lionel_Messi,
and dbr:List_of_career_achievements_by_Lionel_Messi. Finally, the entity disambiguation stage
should correctly identify dbr:Lionel_Messi as the most relevant KG-specific entity, as it represents
“Lionel Messi”, the football player who is expected to have children.</p>
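        <p>For illustration, a minimal sketch of the KG-specific entity retrieval step against DBpedia Lookup (footnote 5) follows; the endpoint and query parameters follow the public API, while the response parsing and the helper name dbpedia_lookup are our assumptions.</p>
        <preformat>
# Sketch: retrieve the top-k KG-specific entity candidates from DBpedia Lookup.
import requests

def dbpedia_lookup(entity: str, k: int = 4):
    resp = requests.get(
        "https://lookup.dbpedia.org/api/search",
        params={"query": entity, "maxResults": k, "format": "json"},
        headers={"Accept": "application/json"},
        timeout=10,
    )
    resp.raise_for_status()
    docs = resp.json().get("docs", [])
    # Each doc exposes list-valued fields; take the first URI and label.
    return [(d["resource"][0], d.get("label", [""])[0]) for d in docs]

candidates = dbpedia_lookup("Lionel Messi")
# Expected to include dbr:Lionel_Messi among the top four candidates.
        </preformat>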
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Retrieval</title>
        <p>We distinguish between two types of questions, each requiring a different retrieval method. The first
type, simple questions, can be answered using one-hop information, e.g., “When was Lionel Messi
born?” and “What is the birthplace of Lionel Messi?” The second type, complex questions, may require
multi-hop retrieval or aggregate queries, such as “How many children does Lionel Messi have?” and
“When was the father of Lionel Messi’s wife born?”</p>
        <sec id="sec-3-3-1">
          <title>4https://www.wikidata.org/w/api.php 5https://lookup.dbpedia.org/api/search</title>
          <p>[Figure 2: Verbalization-based retrieval, illustrated on the question “What is the nationality of Lionel Messi?”. All triples centered around :LionelMessi are verbalized by the template {s}'s {p} is {o} (one representative sentence per property), encoded together with the user's question by the embedding model jina-embeddings-v3, and matched via MIPS top-1 similarity search; if the top score falls below the threshold, the question is routed to complex question handling.]</p>
          <p>3.3.1. Simple Questions</p>
          <p>For simple questions, the system enhances efficiency by avoiding unnecessary SPARQL query generation
and instead employs a text retrieval approach. This method is particularly effective for queries that can
be answered using a single triple pattern from the KG.</p>
          <p>The process begins by identifying the most relevant KG-specific entity from the detected entities
in the question, as described in Section 3.2. Next, a single-hop breadth-first traversal is performed to
retrieve the immediate neighboring entities of the identified KG-specific entity. The output consists of a
set of triples representing the relationships between the entity and its neighbors. From this set of triples,
template-based verbalization is applied to convert each triple (s, p, o) into a sentence using the rule:
{s}'s {p} is {o}. To improve efficiency and reduce ambiguity, we ensure that each verbalized triple
maintains a unique property when the given entity appears as both subject and object. If multiple triples
share the same property but different objects, we generate a single representative sentence rather than
multiple redundant ones. For instance, consider the KG in Figure 2, where the entity :LionelMessi has
connections to three different entities through the property :hasChild. In this case, a single sentence
is generated using the first matching triple, such as Lionel Messi's has child is Thiago Messi.
Note that even if the generated sentence contains grammatical inaccuracies, the embeddings can still
effectively capture the intended semantics.</p>
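          <p>The following sketch illustrates the verbalization and deduplication logic under stated assumptions: cosine similarity via sentence-transformers stands in for the MIPS search, and the 0.6 threshold mirrors the value reported in Section 5.1.</p>
          <preformat>
# Sketch of verbalization-based retrieval for simple questions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)

def verbalize(triples):
    """Convert (s, p, o) label triples into sentences, one per unique property."""
    seen, sentences = set(), []
    for s, p, o in triples:
        if p in seen:      # keep a single representative sentence per property
            continue
        seen.add(p)
        sentences.append(f"{s}'s {p} is {o}")
    return sentences

def retrieve_simple_answer(question, triples, threshold=0.6):
    sentences = verbalize(triples)
    q_emb = model.encode(question, convert_to_tensor=True)
    vt_embs = model.encode(sentences, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, vt_embs)[0]
    best = int(scores.argmax())
    if float(scores[best]) &gt;= threshold:
        return sentences[best]
    return None   # below threshold: fall back to SPARQL query generation
          </preformat>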
          <p>The user’s question and all verbalized triples are then encoded into embeddings to perform a
semantic search with the QA objective, i.e., determining which generated sentence is the most probable
answer to the user’s question. The answer is returned only if the similarity score exceeds a predefined
threshold; otherwise, the system proceeds with SPARQL query generation. A complete illustration of
the verbalization-based retrieval mechanism is shown in Figure 2.</p>
          <p>3.3.2. Complex Questions</p>
          <p>For complex questions, we focus on generating accurate SPARQL queries using LLMs to retrieve answers.
Since queries from complex questions often involve multiple entities, properties, and classes, the system
must first retrieve the relevant resources and explicitly include them in the prompt to provide context
for generating SPARQL queries. Given that KG-specific entities have already been identified (cf. Section
3.2), the next step is to retrieve the most relevant properties and classes needed for query construction.
To achieve this, we employ ontology retrieval, i.e., semantic search (implemented as Weaviate’s hybrid
search6) over sets of property and class labels available in the KG. However, directly comparing a
long-form question like “What is the birth date of Tom Cruise?” to its corresponding short property
label “birth date” is not effective. To address this, we split the question into smaller n-gram chunks.
Additionally, we leverage an LLM to generate multiple property candidates based on the given question,
as not all properties are explicitly mentioned. For instance, the system may fail to match the property
“starred in” with any n-gram chunk from the question “What films does Tom Cruise play in?”</p>
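          <p>A possible implementation of this chunk-and-match step is sketched below; the n-gram splitter is generic, while the Weaviate collection name "Property" and its fields are assumptions about how the labels were indexed.</p>
          <preformat>
# Sketch: match question n-grams against property labels via hybrid search.
def ngrams(question: str, max_n: int = 3):
    words = [w.strip("?.,") for w in question.lower().split()]
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

# "What is the birth date of Tom Cruise?" yields chunks such as "birth",
# "date", and "birth date", which can match the short label "birth date".

def retrieve_properties(client, question: str, top_k: int = 3):
    collection = client.collections.get("Property")   # Weaviate v4 client
    candidates = {}
    for chunk in ngrams(question):
        res = collection.query.hybrid(query=chunk, limit=top_k)
        for obj in res.objects:
            candidates[obj.properties["uri"]] = obj.properties["label"]
    return candidates
          </preformat>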
          <p>Once the relevant resources are retrieved, SPARQL query generation is performed using an LLM with
few-shot prompting and chain-of-thought prompting to improve the model’s reasoning capabilities.
Finally, the generated SPARQL query is executed over the KG to obtain the necessary information
needed for answering the question.</p>
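          <p>To make the prompting strategy concrete, a hedged sketch of the prompt assembly is given below; the wording, the few-shot example, and the DBpedia resources are illustrative, not the authors' exact prompt.</p>
          <preformat>
# Sketch: few-shot, chain-of-thought prompt for SPARQL query generation.
PROMPT_TEMPLATE = """You are an expert in SPARQL. Think step by step, then output only the final query.

Relevant entities: {entities}
Relevant properties: {properties}
Relevant classes: {classes}

{few_shot_examples}

Question: {question}
Reasoning:"""

FEW_SHOT_EXAMPLE = """Question: How many children does Lionel Messi have?
Reasoning: Count the objects of dbo:child for dbr:Lionel_Messi.
Query: SELECT (COUNT(?child) AS ?count) WHERE { dbr:Lionel_Messi dbo:child ?child . }"""

prompt = PROMPT_TEMPLATE.format(
    entities="dbr:Garry_Marshall, dbr:Julia_Roberts",
    properties="dbo:director, dbo:starring",
    classes="dbo:Film",
    few_shot_examples=FEW_SHOT_EXAMPLE,
    question="In which films directed by Garry Marshall was Julia Roberts starring?",
)
          </preformat>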
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Answer Generation</title>
        <p>The answer generation stage transforms the retrieved KG data into natural language responses. Since
the retrieved data is often represented as URIs, the system first resolves these URIs into human-readable
labels using SPARQL queries. Next, a generative LLM is employed to generate a corresponding natural
language response, with the original user query and retrieved data serving as context. For multilingual
support, the prompt explicitly includes the instruction "Answer in {user’s language} language".
This ensures that responses are generated in the same language as the user’s question.</p>
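        <p>A minimal sketch of the URI-to-label resolution step with SPARQLWrapper follows; the endpoint, the English-language filter, and the helper name are assumptions.</p>
        <preformat>
# Sketch: resolve a URI into a human-readable label via a SPARQL query.
from SPARQLWrapper import SPARQLWrapper, JSON

def resolve_label(uri, endpoint="https://dbpedia.org/sparql"):
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(
        'SELECT ?label WHERE { &lt;%s&gt; rdfs:label ?label . '
        'FILTER (lang(?label) = "en") } LIMIT 1' % uri
    )
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return rows[0]["label"]["value"] if rows else uri  # fall back to the raw URI

print(resolve_label("http://dbpedia.org/resource/C_(programming_language)"))
# Expected: "C (programming language)"
        </preformat>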
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>In this section, we present details about the knowledge bases and datasets used to evaluate the system,
along with the evaluation metrics.</p>
      <sec id="sec-4-1">
        <title>4.1. Knowledge Base</title>
        <p>We evaluate our framework using two open KGs and one enterprise (local) KG. For the open KGs,
we use Wikidata and DBpedia, while for the enterprise KG, we use the curriculum KG. In addition to
the explanation we provide in this subsection, we also present the general comparison of each KG’s
characteristics in Table 1.</p>
        <p>Wikidata. Wikidata [18] is an openly accessible, collaborative knowledge base developed by the
Wikimedia Foundation. Wikidata stores information in a structured format similar to RDF triples [19],
which can be retrieved via SPARQL queries. To utilize this knowledge base, we first obtain all properties
(URIs and their human-readable labels) through SPARQL queries and store them in an offline vector
database. These properties are later retrieved and used as context for SPARQL query generation (cf.
Section 3.3.2). For other types of resources, i.e., entities and classes, we use the Wikidata API4 to search
for related resources.</p>
        <p>DBpedia. DBpedia [20] is an open RDF KG constructed from Wikipedia7. Similar to Wikidata,
DBpedia supports SPARQL querying for data retrieval. To extract properties and classes (URIs and their
human-readable labels), we use DBpedia’s T-Box Ontology8. However, a key limitation is that it only
provides resources with the prefix http://dbpedia.org/ontology/, which also restricts our system’s
coverage for this KG. Additionally, we utilize DBpedia Lookup5 to search for relevant entities within the KG.</p>
        <sec id="sec-4-1-1">
          <p>Footnotes: 6. https://weaviate.io/blog/hybrid-search-explained; 7. https://www.wikipedia.org/; 8. https://www.dbpedia.org/resources/ontology/</p>
          <table-wrap id="tab1">
            <label>Table 1</label>
            <caption><p>General comparison of each KG’s characteristics.</p></caption>
            <table>
              <thead>
                <tr><th>Characteristics</th><th>Wikidata</th><th>DBpedia</th><th>Curriculum KG</th></tr>
              </thead>
              <tbody>
                <tr><td>Triples Count</td><td>1.67B10</td><td>1.15B10</td><td>874</td></tr>
                <tr><td>Category</td><td>Open KG</td><td>Open KG</td><td>Enterprise KG</td></tr>
                <tr><td>Domain</td><td>General</td><td>General</td><td>Education</td></tr>
                <tr><td>URI Representation</td><td>Cryptic</td><td>Human-readable</td><td>Human-readable</td></tr>
                <tr><td>Host</td><td>Remote</td><td>Remote</td><td>Local (using rdflib)</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>Curriculum KG. We use Curriculum KG [21] as a representative of a local/enterprise KG. This KG
contains information about the Computer Science curriculum at Fasilkom UI (the Faculty of Computer
Science, Universitas Indonesia). The KG is extracted from the publicly available curriculum guidebook9
using GLiNER [22]. The original dataset is exported as a text file, with each row containing a triple
representing information about a specific course, written in Indonesian. We apply additional
preprocessing steps to translate the dataset into English and convert it into RDF triples. All resources (i.e.,
properties, classes, and entities) are stored in offline vector databases to facilitate semantic search.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Dataset</title>
        <p>4.2.1. QALD-9-Plus</p>
        <p>For open KGs (Wikidata and DBpedia), we utilize the QALD-9-Plus dataset [23], while for local KGs,
we generate a custom KG-based QA dataset using our own dataset generator.</p>
        <p>We evaluate our framework for Wikidata and DBpedia using QALD-9-Plus [23]. QALD-9-Plus is
a multilingual knowledge graph question-answering (KGQA) benchmark dataset for Wikidata and
DBpedia. It covers 10 languages (incl. English). The dataset consists of two splits: train split (408 unique
questions) and test split (150 unique questions).</p>
        <p>We use this dataset for two purposes: evaluation and benchmarking. For evaluation, we use the
train split only with the following refinements. First, entries are filtered to exclude queries using
resources outside the dbo: prefix or those returning only a single resource or numerical value. Next,
each DBpedia entry is matched with an equivalent Wikidata entry to ensure consistency. In order to
streamline evaluation, we downsample the dataset into 14 entries for in-context learning in SPARQL
generation (cf. Section 3.3.2) and 65 entries for system evaluation.</p>
        <p>To obtain a representative subset, we apply a cosine similarity downsampling technique, selecting
the least similar entries to ensure diversity across question clusters. The process follows these steps (a sketch in code follows below):
1. Encode all sentences using a sentence embedding model (all-mpnet-base-v211), which is based
on the pre-trained MPNet [24] model.
2. Compute pairwise cosine similarities for all sentences.
3. Calculate the average cosine similarity for each sentence.
4. Sort sentences in ascending order based on their average cosine similarity scores.
5. Select the top-k questions with the lowest similarity scores, ensuring that they are the least similar
to the rest of the dataset.</p>
        <p>Footnotes: 9. https://scele.cs.ui.ac.id/pluginfile.php/1279/block_html/content/buku_panduan_kur_2024_ilmu_komputer.pdf; 10. As of April 26th, 2025; 11. https://huggingface.co/sentence-transformers/all-mpnet-base-v2</p>
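        <p>The sketch below implements the five steps, assuming sentence-transformers with the all-mpnet-base-v2 model (footnote 11); the function name is illustrative.</p>
        <preformat>
# Sketch of cosine-similarity downsampling: keep the k least similar questions.
from sentence_transformers import SentenceTransformer, util

def downsample(questions, k):
    model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
    emb = model.encode(questions, convert_to_tensor=True)    # step 1
    sim = util.cos_sim(emb, emb)                             # step 2: pairwise cosine
    avg = (sim.sum(dim=1) - 1.0) / (len(questions) - 1)      # step 3: drop self-similarity
    order = avg.argsort()                                    # step 4: ascending order
    return [questions[int(i)] for i in order[:k]]            # step 5: k least similar
        </preformat>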
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption><p>Question categories and their SPARQL triple patterns.</p></caption>
          <table>
            <thead>
              <tr><th>Category</th><th>Triple Pattern</th></tr>
            </thead>
            <tbody>
              <tr><td>Simple 1</td><td>{ s p ?o . }</td></tr>
              <tr><td>Simple 2</td><td>{ ?s p o . }</td></tr>
              <tr><td>Complex 1</td><td>{ ?s p1 o1; p2 o2 . }</td></tr>
              <tr><td>Complex 2</td><td>{ ?s p1 ?o1 . ?o1 p2 o2 . }</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>For benchmarking against other graph-based RAG approaches, we use the full test split to compare
system performance. However, due to DBpedia’s constraints in our system (which only support queries
within the dbo: prefix), we conduct benchmarking exclusively on the Wikidata version of the dataset
to ensure comprehensive evaluation.</p>
        <p>4.2.2. KG-based Dataset Generator</p>
        <p>Given the KG-agnostic nature of this framework, we provide a semi-automatic dataset generator that
produces data for training and testing, particularly for local KGs. This system supports the generation
of four types of questions, as described in Table 2, and also allows for the creation of queries with the
COUNT clause, leveraging the same triple patterns. The output of this system is a dataset consisting of
three columns: the question in natural language, the corresponding SPARQL query, and the category of
the question. Below are the five main steps required to generate a question (steps 1 and 2 are sketched in code after the list):
1. Entity selection. Randomly selects an entity (i.e., an instance of a class) from the given KG.
2. Random walk. Performs a random walk on the KG based on the question category, starting
from the previously selected node. Users can optionally exclude specific schematic properties
like rdfs:domain and rdfs:range to focus more on the concrete data graph.
3. Resource label resolver. Resolves natural language labels for properties and entities found in
the triples.
4. SPARQL query formation. Wraps the retrieved triples into a SPARQL query using a pre-defined
template based on the question category.
5. Question generation. Generates a natural language question from the generated SPARQL
query and resource labels using an LLM. This process employs a zero-shot prompt template [25],
leveraging an off-the-shelf instruction-fine-tuned LLM.
6. (Optional) Manual refinement . Allows for manual review and adjustments of generated
questions if they are deemed ambiguous.</p>
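        <p>A hedged sketch of the first two steps with rdflib is shown below; excluding rdfs:domain and rdfs:range follows the text, while the function names are illustrative.</p>
        <preformat>
# Sketch of steps 1-2: random entity selection and a one-hop random walk.
import random
from rdflib import Graph, RDF, RDFS

def random_entity(g: Graph):
    # Step 1: pick a random instance of some class from the KG.
    entities = list({s for s, _, _ in g.triples((None, RDF.type, None))})
    return random.choice(entities)

def one_hop_walk(g: Graph, start, exclude=(RDFS.domain, RDFS.range)):
    # Step 2: walk one hop from the selected node, skipping schematic properties.
    triples = [(s, p, o) for s, p, o in g.triples((start, None, None))
               if p not in exclude]
    return random.choice(triples) if triples else None
        </preformat>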
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluation Metrics</title>
        <p>We evaluate our system performance by comparing the execution results of the generated SPARQL
queries against the ground truth query execution results. The primary evaluation metric used is Jaccard
Similarity, defined as in Equation 1 for two sets A and B.</p>
        <p>Jaccard(A, B) = |A ∩ B| / |A ∪ B| (1)</p>
        <p>In our implementation, the original SPARQL query execution results preserve row ordering,
particularly when queries include the ORDER BY clause. To ensure fair comparison, we transform both the
predicted and ground truth execution results into sets before applying Jaccard Similarity. This
transformation ensures that row ordering does not affect the evaluation. Additionally, each row is treated as a
tuple, where elements are ordered alphabetically to maintain consistency. Thus, in Equation 1, A and B
represent sets of tuples rather than ordered lists. An illustration of this evaluation schema is shown in
Figure 3.</p>
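        <p>The preprocessing and metric can be summarized in a few lines of code; the handling of two empty result sets below is our assumption.</p>
        <preformat>
# Sketch of Equation 1 over preprocessed query results: each row becomes an
# alphabetically ordered tuple, and both result lists become sets.
def to_set(rows):
    return {tuple(sorted(row)) for row in rows}

def jaccard(pred_rows, gold_rows):
    a, b = to_set(pred_rows), to_set(gold_rows)
    if not a and not b:
        return 1.0   # both empty: treated as a perfect match (assumption)
    return len(a &amp; b) / len(a | b)

print(jaccard([["Argentina"]], [["Argentina"]]))  # 1.0
        </preformat>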
        <p>Additionally, for benchmarking purposes, following [23], we also use F1 Macro as an evaluation
metric. Using the same preprocessing steps as illustrated in Figure 3, we first compute precision and
recall, as defined in Equation 2 and Equation 3, respectively, for a question q. The F1 score for q is then
calculated using Equation 4. Finally, for a set of questions Q = {q1, q2, ..., qn}, F1 Macro is computed
based on Equation 5.</p>
        <p>precision(q) = number of correct system answers for q / number of system answers for q (2)</p>
        <p>recall(q) = number of correct system answers for q / number of gold standard answers for q (3)</p>
        <p>F1(q) = 2 × precision(q) × recall(q) / (precision(q) + recall(q)) (4)</p>
        <p>F1 Macro(Q) = (1/n) × Σ F1(q_i), for i = 1..n (5)</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Analysis</title>
      <p>In this section, we present the experimental results obtained using the three aforementioned knowledge
bases. We also provide benchmarking results comparing our system with several existing approaches
and conduct an ablation study to identify the most crucial component within the pipeline. All LLMs
referenced in this section correspond to their instruction-fine-tuned versions.</p>
      <sec id="sec-5-1">
        <title>5.1. Experiment Results</title>
        <p>We evaluated four open-source LLMs—Mistral NeMo 12B, LLaMA 3.1 8B, Qwen2.5 Coder 7B [28], and Qwen2.5 7B [29]—on datasets tailored to specific knowledge bases: the downsampled
QALD-9-Plus for Wikidata and DBpedia, and the Curriculum KG for local KG experiments. All models
were hosted locally. The experiments were conducted using the pipeline’s default configuration, i.e.,
all features tested in the ablation analysis (cf. Section 5.3), such as verbalization, chain-of-thought
prompting, and few-shot prompting were activated. Further error analysis related to the experiment is
presented in our technical report [30].</p>
        <table-wrap id="tab3">
          <label>Table 3</label>
          <caption><p>Experiment results (Jaccard similarity) per model and knowledge base under the default configuration.</p></caption>
          <table>
            <thead>
              <tr><th>Configuration</th><th>Knowledge Base</th><th>Jaccard Similarity</th></tr>
            </thead>
            <tbody>
              <tr><td>Mistral NeMo 12B</td><td>Wikidata</td><td>–</td></tr>
              <tr><td>LLaMA 3.1 8B</td><td>Wikidata</td><td>–</td></tr>
              <tr><td>Qwen2.5 Coder 7B</td><td>Wikidata</td><td>–</td></tr>
              <tr><td>Qwen2.5 7B</td><td>Wikidata</td><td>0.458</td></tr>
              <tr><td>Mistral NeMo 12B</td><td>DBpedia</td><td>–</td></tr>
              <tr><td>LLaMA 3.1 8B</td><td>DBpedia</td><td>–</td></tr>
              <tr><td>Qwen2.5 Coder 7B</td><td>DBpedia</td><td>–</td></tr>
              <tr><td>Qwen2.5 7B</td><td>DBpedia</td><td>0.517</td></tr>
              <tr><td>Mistral NeMo 12B</td><td>Curriculum KG</td><td>0.805</td></tr>
              <tr><td>LLaMA 3.1 8B</td><td>Curriculum KG</td><td>0.778</td></tr>
              <tr><td>Qwen2.5 Coder 7B</td><td>Curriculum KG</td><td>0.778</td></tr>
              <tr><td>Qwen2.5 7B</td><td>Curriculum KG</td><td>0.805</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Wikidata. In our evaluation, Qwen2.5 7B outperformed other models, achieving a Jaccard similarity of
0.458. It successfully handled entity-rich queries, such as “In which films directed by Garry Marshall was
Julia Roberts starring?”, where the model correctly mapped entities and properties, yielding an accurate
SPARQL query. It also demonstrated strong performance on aggregation queries, such as “How many
programming languages are there?”, correctly forming a count-based SPARQL query. Furthermore,
it effectively generated meaningful verbalizations, as seen in “In which programming language is
GIMP written?”, which allowed the system to correctly identify the relevant entity, wd:Q15777 (C
programming language).</p>
        <p>We also observed several failure cases, which reveal the model’s limitations. In “Give me all Australian
metalcore bands,” incorrect entity retrieval led to an incomplete query due to the model’s lack of semantic
understanding of implicit properties. Similarly, in “Show a list of soccer clubs that play in the Bundesliga,”
while relevant entities were retrieved, the model mistakenly used an incorrect property (wdt:P6399),
resulting in an erroneous query. These findings emphasize the importance of precise entity-property
retrieval and LLM prioritization, which is further analyzed in Subsection 5.3.4.</p>
        <p>Moreover, we also tested the multilingual capabilities of Qwen2.5 7B on the Indonesian-translated
Wikidata QALD-9-Plus, where its Jaccard similarity dropped to 0.362. This suggests that while the model
can handle multilingual queries, its accuracy declined due to translation-induced paraphrasing. For
example, the English question “Show me all the breweries in Australia.” was translated into “Tunjukkan
semua pabrik bir di Australia.” and then back-translated as “Show all beer factories in Australia.”, causing
the system to fail to retrieve wd:Q131734 (“brewery”). Despite these limitations, the results confirm the
system’s multilingual capability, particularly in Indonesian.</p>
        <p>DBpedia. Qwen2.5 7B achieved the highest Jaccard similarity (0.517), outperforming other models
on the DBpedia dataset. Its improved performance compared to Wikidata is likely due to DBpedia’s
human-readable URIs (e.g., dbr:Australia vs. wd:Q408), which facilitated entity retrieval. The model
successfully generated accurate SPARQL queries, such as “Show a list of soccer clubs that play in the
Bundesliga,” and “How many programming languages are there?”, correctly extracting entities, properties,
and classes. Verbalization also performed well, as demonstrated in “In which programming language
is GIMP written?”, where the verbalized output, “GIMP’s programming language is C (programming
language),” achieved a similarity score of 0.74, accurately identifying dbr:C_(programming_language).</p>
        <p>Despite these improvements, challenges persisted, particularly in entity-property extraction. In “Give
me all Australian metalcore bands,” the model retrieved dbr:Australia and dbr:Metalcore but failed
to extract dbo:hometown and dbo:country as it lacked semantic understanding of implicit relationships,
leading to an incorrect query. Similarly, in “List all boardgames by GMT,” the pipeline misinterpreted
“GMT,” returning entities like “Greenwich Mean Time” instead of dbr:GMT_Games, causing incorrect
query generation. These findings underscore DBpedia’s advantages while also emphasizing the need
for more precise entity-property alignment.</p>
        <p>Curriculum KG. In this evaluation, Qwen2.5 7B and Mistral NeMo achieved the highest Jaccard
similarity (0.805), with LLaMA 3.1 8B and Qwen2.5 Coder 7B following closely behind (0.778). The slight
difference in performance was due to Qwen2.5 Coder 7B failing to answer “What courses have Research
Methodology and Scientific Writing as prerequisites with a report as an evaluation method?” and
LLaMA 3.1 8B struggling with “How many prerequisite courses does ’Algorithm Design and Analysis’
have?”, whereas Mistral NeMo and Qwen2.5 7B answered both correctly.</p>
        <p>A closer analysis reveals that a recurring issue across all models was linked to the verbalization
component. For example, in “How many evaluation methods of ’Internet of Things’?”, the verbalized
output “Internet of Things’s has evaluation method is Task” had a similarity score of 0.636, exceeding
the similarity threshold of 0.6 and incorrectly returning http://example.org/group_project and
http://example.org/task instead of correctly aggregating them into the correct answer (2). This
highlights a limitation in the verbalization process, where overly simplistic or mismatched
interpretations lead to errors. Given these limitations, further evaluation is necessary, particularly in the
verbalization ablation study (cf. Subsection 5.3.1), to better assess model performance and mitigate
these challenges.</p>
        <p>5.1.1. General Analysis</p>
        <p>This subsection provides an analysis of the experimental results from the perspectives of both the LLMs
and the knowledge bases.</p>
        <p>LLMs. The experimental results indicate that Qwen2.5 7B consistently outperforms other models
across datasets, achieving the highest Jaccard similarity scores on Wikidata (0.458), DBpedia (0.517),
and the curriculum KG (0.805), where it performed on par with Mistral NeMo. This strong performance
highlights Qwen2.5 7B’s versatility in processing natural language instructions and generating diverse
outputs, including free text, structured data, and code.</p>
        <p>One likely reason for this performance is Qwen2.5 7B’s extensive pre-training on a high-quality
corpus comprising 18 trillion tokens, compared to only 15 trillion tokens used for LLaMA 3.1 8B. The
Qwen2.5 corpus was curated through advanced data filtering and mixture strategies, balancing domain
representation by downsampling content from overrepresented or low-quality domains and upsampling
high-value sources such as scientific, technical, and academic texts. Additionally, the pre-training data
was enriched with domain-specific code and mathematics datasets as used in the training for Qwen2.5
Coder and Qwen2.5 Math, alongside high-quality synthetic data generated and filtered using other
large Qwen models. This diverse and expansive pre-training corpus has exposed Qwen2.5 7B to a
broader range of linguistic patterns, factual knowledge, and contextual nuances, leading to improved
accuracy. Furthermore, Qwen Team [29] reported that the instruction-fine-tuned 72B variant of Qwen2.5
significantly outperforms LLaMA 3.1 70B on multiple benchmarks, further reinforcing its superior
ability to follow instructions and generate precise responses.</p>
        <p>Interestingly, despite initial expectations, the Coder variant of Qwen2.5 7B underperforms compared
to its general-purpose counterpart. Qwen2.5 Coder 7B was trained on a dataset where 70% of the data
is code-related, covering 5.2 trillion tokens, while only 20% of its corpus consists of general text. While
this specialization enhances its coding proficiency, it appears to hinder its broader understanding of
natural language, which is crucial for SPARQL query generation. The generation process not only
requires code synthesis but also a deep comprehension of user queries and their KG context, an area
where the general-purpose Qwen2.5 7B model excels due to its more diverse training data.</p>
        <p>For Mistral NeMo, drawing definitive conclusions is challenging due to the lack of detailed
documentation on its training data. However, its relatively strong performance may be attributed to its
larger parameter count (12B), compared to the other models (7B–8B). Larger models generally capture
more complex patterns and relationships, which could contribute to improved performance. Without
further information on its dataset, this remains speculative, but model size likely plays a key role in its
effectiveness.</p>
        <p>Knowledge Bases. We identified two key factors affecting system performance from the perspective
of knowledge bases:
1. URI Representation: The way entities are represented within a KG plays a crucial role in model
performance. When URIs follow natural language conventions, as in DBpedia and the curriculum
KG, models tend to perform better. Since LLMs are trained on natural language, human-readable
URIs enhance entity linking and property retrieval. Conversely, cryptic or non-intuitive URIs
make interpretation more difficult, reducing model accuracy.
2. Size and Homogeneity: The scale and complexity of a KG impact system performance. Larger
and more diverse KGs (e.g., Wikidata and DBpedia) introduce a wider variety of entities,
relationships, and question types. While this enhances versatility, it also increases query complexity,
making SPARQL generation more challenging. In contrast, smaller and more homogeneous
KGs like the curriculum KG limit the scope of possible queries, reducing query complexity and
improving performance.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Comparison to Other Graph-based RAG Approaches</title>
        <p>We evaluate the performance of our system on the QALD-9-Plus test set [23], comparing it against
several established approaches. Specifically, we include SPINACH [15], QAnswer [31], Platypus [32],
and DeepPavlov [33] in our comparison. Among these, we consider SPINACH as a representative of
generative LLM-based systems, while the others are categorized as non-generative LLM-based systems.</p>
        <p>Due to the limitation in handling resources outside the dbo: prefix for DBpedia, we perform the
comparative evaluation using only the Wikidata version of the dataset. The evaluation is conducted
using our best-performing LLM, Qwen2.5 7B, and the F1 Macro metric (Equation 5). As shown in Table
4, our system’s performance is lower than SPINACH [15] and QAnswer [31]. The primary advantage
of SPINACH comes from its use of a state-of-the-art proprietary LLM, GPT-4o. Meanwhile, QAnswer
relies on pre-defined SPARQL templates, which help minimize syntactic or logical errors. However, this
template-based approach lacks the flexibility to handle more complex or out-of-distribution queries.
In contrast, our system generates SPARQL queries dynamically using an LLM, allowing for greater
generalization across a wider range of questions and making it more adaptable to open-domain scenarios.
The trade-off, however, is that generative methods are inherently more prone to hallucinations and
incorrect query formulations, which can negatively impact overall performance.</p>
        <table-wrap id="tab4">
          <label>Table 4</label>
          <caption><p>F1 Macro comparison on the QALD-9-Plus Wikidata test split (systems listed in ranked order).</p></caption>
          <table>
            <thead>
              <tr><th>System</th><th>F1 Macro</th></tr>
            </thead>
            <tbody>
              <tr><td>SPINACH GPT-4o [15]</td><td>–</td></tr>
              <tr><td>QAnswer [31]</td><td>–</td></tr>
              <tr><td>FrOG with Qwen2.5 7B Instruct (ours)</td><td>–</td></tr>
              <tr><td>Platypus [32]</td><td>–</td></tr>
              <tr><td>DeepPavlov [33]</td><td>–</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Ablation Study</title>
        <p>We evaluate the impact of key components in our pipeline—verbalization, chain-of-thought (CoT),
few-shot examples, and vector-based ontology retrieval—by systematically removing each component
to measure its contribution to overall performance. The experiments are conducted using the
best-performing model for each knowledge base, Qwen2.5 7B. A more detailed analysis of the ablation study
can be found in our technical report [30].</p>
        <p>5.3.1. Verbalization</p>
        <table-wrap id="tab5">
          <label>Table 5</label>
          <caption><p>Verbalization ablation results (Jaccard similarity).</p></caption>
          <table>
            <thead>
              <tr><th>Configuration</th><th>Knowledge Base</th><th>Jaccard Similarity</th></tr>
            </thead>
            <tbody>
              <tr><td>Qwen2.5 7B</td><td>Wikidata</td><td>0.458</td></tr>
              <tr><td>Qwen2.5 7B w/o Verbalization</td><td>Wikidata</td><td>0.334</td></tr>
              <tr><td>Qwen2.5 7B</td><td>DBpedia</td><td>0.517</td></tr>
              <tr><td>Qwen2.5 7B w/o Verbalization</td><td>DBpedia</td><td>0.516</td></tr>
              <tr><td>Qwen2.5 7B</td><td>Curriculum KG</td><td>0.805</td></tr>
              <tr><td>Qwen2.5 7B w/o Verbalization</td><td>Curriculum KG</td><td>0.949</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Table 5 shows the impact of verbalization on system performance. Its removal led to a significant
performance drop on Wikidata (0.458 → 0.334) and a minor decrease on DBpedia (0.517 → 0.516).
While the drop on DBpedia is minor, verbalization proves essential for complex KGs, aiding LLMs in
interpreting simple natural language queries. Without it, models struggle with single-hop queries. By
converting extracted triples into readable sentences, verbalization ensures better semantic alignment
for accurate query generation.</p>
        <p>Interestingly, omitting verbalization improved accuracy on the curriculum KG (0.805 → 0.949). This
suggests that verbalization acted as a bottleneck, limiting aggregation-based queries and failing to
retrieve effective templates. For simpler, structured KGs, direct SPARQL query generation proved more
effective.</p>
        <p>These findings indicate that verbalization should be selectively applied, benefiting complex KGs but
introducing redundancy for domain-specific datasets with straightforward queries.</p>
        <p>5.3.2. Chain-of-Thought (CoT)</p>
        <table-wrap id="tab6">
          <label>Table 6</label>
          <caption><p>Chain-of-thought (CoT) ablation results (Jaccard similarity).</p></caption>
          <table>
            <thead>
              <tr><th>Configuration</th><th>Knowledge Base</th><th>Jaccard Similarity</th></tr>
            </thead>
            <tbody>
              <tr><td>Qwen2.5 7B</td><td>Wikidata</td><td>0.458</td></tr>
              <tr><td>Qwen2.5 7B w/o CoT</td><td>Wikidata</td><td>0.436</td></tr>
              <tr><td>Qwen2.5 7B</td><td>DBpedia</td><td>0.517</td></tr>
              <tr><td>Qwen2.5 7B w/o CoT</td><td>DBpedia</td><td>0.455</td></tr>
              <tr><td>Qwen2.5 7B</td><td>Curriculum KG</td><td>0.805</td></tr>
              <tr><td>Qwen2.5 7B w/o Verbalization</td><td>Curriculum KG</td><td>0.949</td></tr>
              <tr><td>Qwen2.5 7B w/o CoT</td><td>Curriculum KG</td><td>0.808</td></tr>
              <tr><td>Qwen2.5 7B w/o Verbalization &amp; CoT</td><td>Curriculum KG</td><td>0.976</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Table 6 shows the impact of removing CoT across datasets. For Wikidata and DBpedia, excluding
CoT resulted in a performance drop (0.458 → 0.436 and 0.517 → 0.455, respectively), indicating its role
in improving SPARQL query generation. While helpful, CoT is not essential, mainly benefiting complex
or open-ended queries.</p>
        <p>In contrast, for the Curriculum KG, removing CoT had minimal impact or even improved results (with
verbalization: 0.805 → 0.808, without verbalization: 0.949 → 0.976). This suggests CoT introduces
unnecessary complexity in simpler datasets with direct queries, where intermediate reasoning is
redundant.</p>
        <p>These findings emphasize that CoT’s effectiveness is dataset-dependent. For structured KGs like
Wikidata and DBpedia, CoT enhances reasoning and query formulation, while for simpler datasets like
the Curriculum KG, omitting CoT can streamline query generation and improve eficiency.</p>
        <p>5.3.3. Few Shots</p>
        <p>Table 7 demonstrates the significant impact of few-shot learning on SPARQL query generation.
Removing few shots caused notable performance drops across all datasets: Wikidata (0.458 → 0.342),
DBpedia (0.517 → 0.410), and Curriculum KG (0.805 → 0.724). The effect was even more pronounced
in Curriculum KG when verbalization was also removed (0.949 → 0.651).</p>
        <p>Few shots guide the model in structuring SPARQL queries by providing dataset-specific examples.
They help align model outputs with dataset constraints, such as ensuring URIs are returned in
QALD-9-Plus or handling custom entity relationships in Curriculum KG. This is especially crucial for
domain-specific datasets, where pre-trained knowledge alone is insufficient to generate accurate queries.</p>
        <p>For general datasets like Wikidata and DBpedia, few shots refine query accuracy. For custom datasets
like Curriculum KG, few shots are essential for adapting to unique ontologies. The results underscore the
necessity of few-shot learning in bridging the gap between generic LLM capabilities and dataset-specific
query requirements.</p>
        <p>5.3.4. Ontology Retrieval</p>
        <table-wrap id="tab8">
          <label>Table 8</label>
          <caption><p>Ontology retrieval ablation results (Jaccard similarity).</p></caption>
          <table>
            <thead>
              <tr><th>Configuration</th><th>Knowledge Base</th><th>Jaccard Similarity</th></tr>
            </thead>
            <tbody>
              <tr><td>Qwen2.5 7B</td><td>Wikidata</td><td>0.458</td></tr>
              <tr><td>Qwen2.5 7B w/o Ontology Retrieval</td><td>Wikidata</td><td>0.377</td></tr>
              <tr><td>Qwen2.5 7B</td><td>DBpedia</td><td>0.517</td></tr>
              <tr><td>Qwen2.5 7B w/o Ontology Retrieval</td><td>DBpedia</td><td>0.381</td></tr>
              <tr><td>Qwen2.5 7B</td><td>Curriculum KG</td><td>0.805</td></tr>
              <tr><td>Qwen2.5 7B w/o Verbalization</td><td>Curriculum KG</td><td>0.949</td></tr>
              <tr><td>Qwen2.5 7B w/o Ontology Retrieval</td><td>Curriculum KG</td><td>0.183</td></tr>
              <tr><td>Qwen2.5 7B w/o Verbalization &amp; Ontology Retrieval</td><td>Curriculum KG</td><td>0.000</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Table 8 highlights the crucial role of ontology (class and property) retrieval in SPARQL query
generation. Removing this component led to significant performance declines across all datasets:
Wikidata (0.458 → 0.377), DBpedia (0.517 → 0.381), and Curriculum KG (0.805 → 0.183). Notably,
the Curriculum KG score plummeted to 0.000 when both verbalization and ontology retrieval were
removed, indicating the model’s complete inability to generate meaningful queries.</p>
        <p>Ontology retrieval is essential for identifying relevant classes and properties, ensuring accurate
query construction. The drastic impact on Curriculum KG underscores its necessity for domain-specific
datasets, where the LLM lacks prior exposure to the underlying ontology. In contrast, Wikidata and
DBpedia, which the model may have partially encountered during training, showed smaller but still
significant declines, emphasizing the retrieval step’s importance in refining ontological alignment.</p>
        <p>Interestingly, the Curriculum KG score remained at 0.183 when verbalization was retained, suggesting
that verbalization can partially mitigate retrieval loss by inferring class and property relationships in
natural language. However, this compensation is limited, as verbalization alone cannot handle complex
queries that require precise class and property relationships.</p>
        <p>5.3.5. Ablation Study Summary</p>
        <p>Ontology retrieval is the most critical component for query accuracy, particularly for domain-specific
datasets. Without it, the model struggles to identify relevant ontology elements, leading to severe
performance degradation. Even for open KGs like DBpedia and Wikidata, where the LLM has some
prior knowledge, retrieval remains indispensable for precise query generation.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Inference Time Evaluation</title>
        <p>We evaluated the inference time of the proposed pipeline on 5 queries per knowledge base from the
Indonesian test set, using the default configuration. Table 9 reports the average times across components
in seconds.</p>
        <table-wrap id="tab9">
          <label>Table 9</label>
          <caption><p>Average inference time per component (in seconds) for each knowledge base.</p></caption>
          <table>
            <thead>
              <tr><th>Component</th><th>Wikidata</th><th>DBpedia</th><th>Curriculum KG</th></tr>
            </thead>
            <tbody>
              <tr><td>Translation</td><td>–</td><td>–</td><td>–</td></tr>
              <tr><td>Entity Linking</td><td>–</td><td>–</td><td>–</td></tr>
              <tr><td>Verbalization</td><td>–</td><td>–</td><td>–</td></tr>
              <tr><td>Ontology Retrieval</td><td>–</td><td>–</td><td>–</td></tr>
              <tr><td>SPARQL Generation</td><td>–</td><td>–</td><td>–</td></tr>
              <tr><td>Answer Generation</td><td>–</td><td>–</td><td>–</td></tr>
              <tr><td>Total Avg. Inference Time</td><td>–</td><td>–</td><td>–</td></tr>
              <tr><td>Avg. Time (Queries Requiring SPARQL Generation)</td><td>–</td><td>–</td><td>–</td></tr>
              <tr><td>Avg. Time (Verbalization-Only Queries)</td><td>–</td><td>–</td><td>–</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Overall, Curriculum KG achieved the fastest inference times due to its smaller and simpler ontology.
Wikidata and DBpedia produced comparable results, although DBpedia required more time mainly due
to slower ontology retrieval.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <p>In this research, we proposed FrOG, a framework of open GraphRAG. The framework consists of several
key components: translation, entity linking, retrieval, and answer generation. In the retrieval stage, we
diferentiate between simple (single-hop) and complex questions. For simple questions, we leverage
verbalization-based text retrieval to obtain relevant information, whereas for complex questions, we
integrate SPARQL query generation using an LLM.</p>
      <p>Our experiments demonstrate that Qwen2.5 7B achieves the best performance across all datasets
using the optimal configuration identified through ablation analysis: Wikidata (0.458), DBpedia (0.517),
and the Curriculum KG (0.976). Notably, simpler ontologies, such as the Curriculum KG, yield higher
performance. Furthermore, the multilingual capability of the system was tested with Indonesian queries,
achieving an accuracy of 0.362. Although lower than its English counterpart, this result highlights the
system’s ability to process multilingual queries through translation and retrieval mechanisms.</p>
      <p>Despite moderate performance on Wikidata and DBpedia, we believe potential accuracy could
be further improved, given the presence of inaccuracies within the dataset (see Appendix A). The
ablation study highlights ontology (class and property) retrieval as the most critical component, with
its removal causing substantial accuracy drops, particularly reducing curriculum KG performance to
zero. Interestingly, for simpler KGs, verbalization negatively impacts performance due to question
misinterpretation and difficulties handling aggregation and complex operations.</p>
      <p>Finally, this research successfully addresses the research questions by:
1. Developing FrOG, a RAG system that utilizes KGs as the primary knowledge base, with its pipeline
architecture detailed in Section 3.
2. Identifying Qwen2.5 7B as the best-performing LLM, as demonstrated in Section 5.1.1.
3. Determining the most influential component within the architecture (i.e., “class and property
retrieval” component) through the ablation study (cf. Section 5.3).</p>
      <p>Future Work. Future work in this research aims to address several limitations and lead to significant
improvements. One promising direction is to fine-tune LLMs for SPARQL query generation, which
would enhance their ability to understand query structures, entity relationships, and natural language
mappings, resulting in more precise query generation. Another avenue is to develop a language-native
pipeline, particularly for Indonesian, as the current system solely relies on translation, which may
introduce inaccuracies. A dedicated pipeline for Indonesian would eliminate translation errors and
better handle linguistic nuances, further improving overall performance.</p>
      <p>We also aim to explore the use of larger LLMs to boost query generation performance, with a focus
on investigating how model size interacts with task complexity or the type of knowledge base. A
deeper analysis of how parameter count influences performance will help in identifying the optimal
model size for a particular use case while addressing computational eficiency and resource constraints.
Another potential improvement involves introducing a classifier for verbalization applicability, such as
intent classification, to determine whether a question is best handled using verbalization. This would
optimize the query generation process by applying alternative strategies for complex queries. We further
realize that scaling the system to larger knowledge graphs is essential, as the current implementation is
limited by rdflib’s in-memory storage. Related to this, future work could explore several persistent
storage solutions, such as Apache Jena, GraphDB, or Virtuoso, to enhance scalability and efficiency in
supporting larger datasets. Additionally, we believe that extending this work beyond text-based KGs
would be an interesting direction, as many real-world KGs now incorporate multimodal information.
Finally, expanding the evaluation to include user studies and latency benchmarks will provide deeper
insights into the practical usability of this system.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This research was supported by the Wikidata Research Grant 2024 from Wikimedia Indonesia, whose
funding and commitment to open knowledge made this study possible. The work of Fajar J. Ekaputra is
supported by the Austrian Science Fund (FWF) Bilateral AI (Grant Nr. 10.55776/COE12) and Horizon
Europe PERKS (Grant Nr. 101120323).</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used GPT-4o in order to: Grammar and spelling
check. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and
take(s) full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-9">
      <title>References</title>
      <p>[1] York, NY, USA, 2024, p. 526–537. URL: https://remote-lib.ui.ac.id:2075/10.1145/3637528.3671882.
doi:10.1145/3637528.3671882.
[2] A. Roberts, C. Raffel, N. Shazeer, How much knowledge can you pack into the parameters of a
language model?, 2020. URL: https://arxiv.org/abs/2002.08910. arXiv:2002.08910.
[3] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih,
T. Rocktäschel, S. Riedel, D. Kiela, Retrieval-augmented generation for knowledge-intensive nlp
tasks, in: Proceedings of the 34th International Conference on Neural Information Processing
Systems, NIPS ’20, Curran Associates Inc., Red Hook, NY, USA, 2020.
[4] V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W. tau Yih, Dense
passage retrieval for open-domain question answering, 2020. URL: https://arxiv.org/abs/2004.04906.
arXiv:2004.04906.
[5] B. Sarmah, B. Hall, R. Rao, S. Patel, S. Pasquali, D. Mehta, Hybridrag: Integrating knowledge
graphs and vector retrieval augmented generation for eficient information extraction, 2024. URL:
https://arxiv.org/abs/2408.04948. arXiv:2408.04948.
[6] H. Abu-Rasheed, C. Weber, J. Zenkert, M. Dornhöfer, M. Fathi, Transferrable framework based
on knowledge graphs for generating explainable results in domain-specific, intelligent
information retrieval, Informatics 9 (2022). URL: https://www.mdpi.com/2227-9709/9/1/6. doi:10.3390/
informatics9010006.
[7] B. Peng, Y. Zhu, Y. Liu, X. Bo, H. Shi, C. Hong, Y. Zhang, S. Tang, Graph retrieval-augmented
generation: A survey, 2024. URL: https://arxiv.org/abs/2408.08921. arXiv:2408.08921.
[8] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,
Roberta: A robustly optimized bert pretraining approach, 2019. URL: https://arxiv.org/abs/1907.
11692. arXiv:1907.11692.
[9] D. Edge, H. Trinh, N. Cheng, J. Bradley, A. Chao, A. Mody, S. Truitt, J. Larson, From local to global: A
graph rag approach to query-focused summarization, 2024. URL: https://arxiv.org/abs/2404.16130.
arXiv:2404.16130.
[10] Y. Cao, Z. Gao, Z. Li, X. Xie, K. Zhou, J. Xu, Lego-graphrag: Modularizing graph-based
retrievalaugmented generation for design space exploration, 2025. URL: https://arxiv.org/abs/2411.05844.
arXiv:2411.05844.
[11] M. R. A. H. Rony, U. Kumar, R. Teucher, L. Kovriguina, J. Lehmann, Sgpt: A generative approach
for sparql query generation from natural language questions, IEEE Access 10 (2022) 70712–70723.
doi:10.1109/ACCESS.2022.3188714.
[12] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised
multitask learners, OpenAI Blog (2019). URL: https://cdn.openai.com/better-language-models/
language_models_are_unsupervised_multitask_learners.pdf.
[13] V. Emonet, J. Bolleman, S. Duvaud, T. M. de Farias, A. C. Sima, Llm-based sparql query generation
from natural language over federated knowledge graphs, 2024. URL: https://arxiv.org/abs/2410.
06062. arXiv:2410.06062.
[14] L. Kovriguina, R. Teucher, D. Radyush, D. Mouromtsev, Sparqlgen: One-shot prompt-based
approach for sparql query generation, in: International Conference on Semantic Systems, 2023.</p>
      <p>URL: https://api.semanticscholar.org/CorpusID:265309659.
[15] S. Liu, S. J. Semnani, H. Triedman, J. Xu, I. D. Zhao, M. S. Lam, Spinach: Sparql-based information
navigation for challenging real-world questions, 2024. URL: https://arxiv.org/abs/2407.11417.
arXiv:2407.11417.
[16] J. Ongris, E. Tjitrahardja, F. Darari, F. Ekaputra, Towards an Open NLI LLM-based System for
KGs: A Case Study of Wikidata, in: The 7th International Seminar on Research of Information
Technology and Intelligent Systems (ISRITI), 2024.
[17] H. Jiang, Q. Wu, X. Luo, D. Li, C.-Y. Lin, Y. Yang, L. Qiu, LongLLMLingua: Accelerating and
enhancing LLMs in long context scenarios via prompt compression, in: L.-W. Ku, A. Martins, V. Srikumar
(Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers), Association for Computational Linguistics, Bangkok, Thailand, 2024, pp.
1658–1677. URL: https://aclanthology.org/2024.acl-long.91. doi:10.18653/v1/2024.acl-long.91.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Appendix: Dataset Issues</title>
      <p>During our experiments, we identified several dataset inconsistencies, particularly in the QALD-9-Plus
dataset:
1. Inaccurate Ground Truth (1): The question “How many languages are spoken in Colombia?”
was associated with an incorrect ground truth SPARQL query: SELECT (COUNT(DISTINCT
?uri) AS ?c) WHERE { ?uri rdf:type dbo:Language . dbr:Colombia dbo:language
?uri }. However, there was no direct connection between dbr:Colombia and dbo:language,
leading to a result of 0, which is incorrect. Our system correctly generated the query: SELECT
COUNT(DISTINCT ?lang) WHERE { ?lang dbo:spokenIn dbr:Colombia }, which returned
97, the correct answer (a verification sketch for this pair of queries follows this list). However,
since the ground truth was incorrect, our correct answer was misclassified as wrong.
2. Inaccurate Ground Truth (2): The question “How many films did Hal Roach produce?”
was linked to an incomplete ground truth query: SELECT (COUNT(?uri) as ?c) WHERE {
?uri wdt:P162 wd:Q72792 . }. This query failed to restrict results to instances of films. In
contrast, our system generated: SELECT (COUNT(DISTINCT ?uri) AS ?count) WHERE { ?uri
wdt:P31 wd:Q11424 ; wdt:P162 wd:Q72792. }, which correctly filters for entities classified
as films (wd:Q11424). However, due to the ground truth’s lack of specificity, our accurate result
was marked incorrect.
3. Ambiguous Question (1): The question “How big is the total area of North Rhine-Westphalia?”
was ambiguous regarding the expected unit of measurement. The ground truth query
retrieved the area in square meters: SELECT ?tarea WHERE { dbr:North_Rhine-Westphalia
dbo:areaTotal ?tarea }. However, our system retrieved dbo:PopulatedPlace/areaTotal,
which provides the value in square kilometers. While both answers were correct, the unit
difference led to our result being classified as incorrect.
4. Ambiguous Question (2): The question “Give me all soccer clubs in Spain.” was also
ambiguous. Our system generated: SELECT DISTINCT ?uri WHERE { ?uri a dbo:SoccerClub
; dbo:country dbr:Spain }, which is semantically correct. However, the ground truth
query: SELECT DISTINCT ?uri WHERE { ?uri a dbo:SoccerClub { ?uri dbo:ground
dbr:Spain } UNION { ?uri &lt;http://dbpedia.org/property/ground&gt; ?ground FILTER
regex(?ground, "Spain") } }, applied additional constraints on stadium locations, making it
overly specific. As a result, our correct query was misclassified as incorrect.
      </p>
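      <p>To make issue 1 concrete, the following minimal sketch runs both queries against the public DBpedia endpoint and compares their counts. It is an illustration under our own assumptions, not part of the evaluation pipeline: it uses the third-party SPARQLWrapper library, spells out the prefixes explicitly, and writes our system’s query with an explicit AS alias for standards compliance.</p>
      <preformat>
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://dbpedia.org/sparql"
PREFIXES = """
PREFIX rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;
PREFIX dbo: &lt;http://dbpedia.org/ontology/&gt;
PREFIX dbr: &lt;http://dbpedia.org/resource/&gt;
"""

# Ground truth query from QALD-9-Plus: dbr:Colombia has no dbo:language
# links, so the count comes back as 0.
GROUND_TRUTH = PREFIXES + """
SELECT (COUNT(DISTINCT ?uri) AS ?c)
WHERE { ?uri rdf:type dbo:Language . dbr:Colombia dbo:language ?uri }
"""

# Our system's query, which counts languages via dbo:spokenIn.
SYSTEM = PREFIXES + """
SELECT (COUNT(DISTINCT ?lang) AS ?c)
WHERE { ?lang dbo:spokenIn dbr:Colombia }
"""

def count(query: str) -> int:
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    binding = sparql.query().convert()["results"]["bindings"][0]
    return int(binding["c"]["value"])

print("ground truth:", count(GROUND_TRUTH))  # 0
print("system:      ", count(SYSTEM))        # 97 at evaluation time
      </preformat>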
    </sec>
  </body>
</article>