                                Toward Exploring Knowledge Graphs with LLMs
                                Guangyuan Piao1 , Mike Mountantonakis2 , Panagiotis Papadakos2 ,
                                Pournima Sonawane1 and Aidan OMahony1
                                1
                                    Dell Technologies, Ireland
                                2
                                    W3C, ERCIM, France


                                               Abstract
                                               Interacting with knowledge graphs (KGs) is challenging for non-technical users with information needs
                                               who are unfamiliar with KG-specific query languages such as SPARQL and the underlying KG schema.
                                               Previous KG question answering systems require ground-truth pairs of questions and queries or fine
                                               tuning (Large) Language Models (LLMs) for a specific KG, which is time-consuming and demands deep
                                               expertise. In this poster, we present a framework for exploring KGs for question answering using LLMs
                                               in a zero-shot setting for non-technical end users, without the need for ground-truth pairs of questions
                                               and queries or fine-tuning LLMs. Additionally, we evaluate an example implementation in a simple
                                               yet challenging setting using LLMs exclusively based on the framework, without the extra effort of
                                               maintaining the embeddings or indexes of entities from KG for retrieving relevant ones to a given
                                               question. We share preliminary experimental results indicating that exploring a KG using LLM-generated
                                               SPARQL queries with reasonable complexity is possible in such a challenging setting.




                                1. Introduction
                                Exploring knowledge graphs (KGs) for question answering poses challenges, particularly for non-
                                technical users lacking sufficient knowledge of KG-specific query languages such as SPARQL
and Cypher. Previous KG question answering systems [1, 2] either require
ground-truth pairs of questions and queries or fine-tuning of (Large) Language Models (LLMs)
                                for any KG of interest. However, creating such ground-truth data is time-consuming and
                                demands significant effort and expertise. The requirement for fine-tuning necessitates machine
                                learning expertise and restricts the use of many proprietary LLMs, such as ChatGPT, which is
only accessible through APIs but exhibits outstanding performance. The goal of this poster is to
                                present a general framework for exploring KGs with LLMs for question answering, without
                                these requirements (Section 2). In addition, we present an example implementation with all
                                components defined in the framework, along with preliminary experimental results (Section 3).
                                In contrast to building and maintaining indexes or embeddings of entities for retrieving relevant
                                ones from a KG, we focus on a simpler yet more challenging setting: using LLMs exclusively
                                by prompting the chosen LLM to automatically infer entity IRIs (Internationalized Resource
                                Identifiers). Finally, we discuss some challenges and future work in Section 4.


                                SEMANTiCS’24: International Conference on Semantic Systems, September 17–19, 2024, Amsterdam, Netherlands
guangyuan.piao@dell.com (G. Piao); mike.mountantonakis@ercim.eu (M. Mountantonakis);
                                panagiotis.papadakos@ercim.eu (P. Papadakos); pournima.sonawane@dell.com (P. Sonawane);
                                aidan.omahony@dell.com (A. OMahony)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
2. Framework of Question Answering with a KG using LLMs
Figure 1 illustrates our framework for question answering with a knowledge graph using
LLMs. The top three components depict a straightforward pipeline. Specifically, the top-left
component indicates the input to an LLM. In addition to a question, a user can provide any
user-input context, such as a description about the KG. The extracted context refers to any context
automatically retrieved from the system. This can include, but is not limited to, the KG schema
or a subset of triples in the KG that might be relevant to the question. The LLM component
refers to any LLM, such as Code Llama [3], used for KG-specific query generation to answer
the question. The output component executes the generated query to obtain the answer. As
a special case, the LLM can also generate the output or answer directly based on the given
question and the context derived from the KG, without generating any queries, as in KG-RAG [4].
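These three basic components can be sketched in a few lines. The following is a minimal, hypothetical sketch: the helper names `call_llm` and `execute_sparql`, as well as the prompt template, are our own stand-ins for illustration rather than the exact implementation. The input component assembles the question with any user-input and extracted context, the LLM component generates a SPARQL query, and the output component executes it.

```python
# Sketch of the three basic pipeline components (input -> LLM -> output).
# `call_llm` and `execute_sparql` are hypothetical stand-ins; the paper's
# implementation uses Code Llama Instruct 7B and a SPARQL engine.

PROMPT_TEMPLATE = """\
You are given a knowledge graph with the following context.
{user_context}
{extracted_context}
Write a SPARQL query that answers the question below. Return only the query.
Question: {question}
"""

def run_pipeline(question, user_context, extracted_context,
                 call_llm, execute_sparql):
    """Assemble the prompt, ask the LLM for a SPARQL query, and execute it."""
    prompt = PROMPT_TEMPLATE.format(
        user_context=user_context,
        extracted_context=extracted_context,
        question=question,
    )
    query = call_llm(prompt)                 # LLM component: query generation
    return query, execute_sparql(query)      # output component: run the query
```

Both callables are injected so that any LLM or SPARQL backend can be plugged in without changing the pipeline itself.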
   In addition to these three basic components, the framework includes a set of optional com-
ponents indicated by dotted boxes. The context extractor aims to automatically extract any
useful context for answering the question. For example, it can extract a set of predicates or class
types that are relevant to the question. The context parser and enhancer process the output
of the extractor and enhance it by validating, updating or pruning as necessary. For example,
they can check whether the extracted predicates actually exist in the KG schema or revisit the
context extractor if necessary. KG-RAG [4] implements a context extractor based on embedding
similarities between the question sentence or a set of extracted entities and precomputed entity
embeddings with a small language model to retrieve relevant entities from the KG. A set of
triples associated with each entity is retrieved, and then parsed and pruned based on their
relevance to the given question. Auto-KGQA [5] implements the retrieval of relevant entities
by building and maintaining KG resource indexes for text- or embedding-based approaches.
For each entity, all its triples are retrieved along with their neighbors up to a predefined depth.
These triples are then parsed and pruned to construct a sub-graph containing the most relevant
triples to the question. The query parser and enhancer parse the LLM output, extract the
query, and refine the query if necessary. For instance, they can regenerate the query in cases
where it is not executable or returns no results. Auto-KGQA [5] prompts an LLM to generate
several SPARQL queries, parses the results, and then lets the LLM choose the best one.
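A query parser and enhancer of the kind described above might look like the following sketch. The fenced-output convention and the `regenerate` helper are assumptions for illustration, not the paper's exact implementation: the parser extracts the query from free-form LLM output, and the enhancer re-prompts the LLM whenever the query fails to execute or returns no results.

```python
import re

def extract_query(llm_output):
    """Pull a SPARQL query out of free-form LLM output (assumed format:
    the query may be wrapped in a ```sparql fenced block)."""
    m = re.search(r"```(?:sparql)?\s*(.*?)```", llm_output, re.DOTALL)
    return (m.group(1) if m else llm_output).strip()

def enhance_query(question, llm_output, execute, regenerate, max_rounds=3):
    """Retry loop: re-prompt the LLM when the query fails or returns nothing."""
    query = extract_query(llm_output)
    for _ in range(max_rounds):
        try:
            results = execute(query)
        except Exception:                 # not executable: treat as no results
            results = []
        if results:                       # executable and non-empty: accept
            return query, results
        # ask the LLM to revise the failing query (hypothetical callback)
        query = extract_query(regenerate(question, query))
    return query, []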
   In the framework, edges highlighted in blue and orange indicate repeatable loops. For example,
the loop of steps 5–8 can be repeated multiple times, with each iteration providing a query
as extracted context. The query extracted from the previous loop can be used in the next query
generation process to enhance it. The framework also allows for the extraction of different
types of context. That is, one can have several sets of context extractors, parsers, and enhancers




Figure 1: The framework of question answering with a KG using LLMs. It can include several sets of
shaded components to extract different contexts. The colored loops can be repeated multiple times.
(Figure content: the question "How many creatures do speak all three languages: abyssal, sylvan and elven?" with extracted context (related types: Beast, ...; related properties: hasLanguages, ...; context triples: Sard a Beast, ...), an intermediate SELECT ?creature query over :hasLanguages for the three languages, and the final SELECT (COUNT(?creature) AS ?count) query.)

Figure 2: Example workflow for generating the SPARQL query of a question with the framework.


(shaded boxes in Figure 1). For instance, one set can be used for extracting the KG schema,
while another can extract a set of triples relevant to the question.
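As one illustration of a context parser and enhancer, the following sketch validates LLM-inferred class types and predicates against the KG schema and revisits the extractor when non-existent terms are detected. The `rerun_extractor` callback is a hypothetical stand-in for re-prompting the LLM:

```python
def validate_context(inferred, schema_terms, rerun_extractor, max_retries=2):
    """Context parser/enhancer sketch: keep only class types and predicates
    that actually exist in the KG schema; revisit the extractor otherwise."""
    for _ in range(max_retries + 1):
        valid = [t for t in inferred if t in schema_terms]
        if len(valid) == len(inferred):   # everything checks out
            return valid
        inferred = rerun_extractor()      # hypothetical: re-prompt the LLM
    return valid                          # best effort after retries
```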


3. Example Implementation and Experimental Results
Here, we present an example implementation1 built upon the proposed framework with all
components, using Code Llama Instruct 7B [3] as our LLM. Specifically, we are interested in
exploring an implementation that relies solely on LLMs, i.e., without maintaining indexes or
embeddings of entities in the KG for retrieving relevant entities, which would otherwise require
extra effort and expertise [4, 5]. To this end, we consider questions with extracted context as
input to the LLM. We use two types of context extractors. The first extractor extracts class
types and predicates from the KG schema as context. The second one prompts the same LLM
to infer the top-k class types and predicates relevant to the question. Next, the context parser
and enhancer check the extracted class types and predicates, and rerun the context extractor if
non-existent class types or predicates are detected. Based on these extracted class types and
predicates, we can retrieve p triple(s) for each class type and predicate, which are provided as
context triples (π‘˜=5 and 𝑝=1 in our experiments). Afterwards, the LLM generates the output for
the given question and extracted context. Subsequently, the query parser and enhancer parse
the output to obtain the query and enhance it if necessary. Again, we prompt the same LLM to
check the generated query and add, remove, or modify it if necessary. Steps 5–8 can be
repeated multiple times before finalizing the SPARQL query as our output. Figure 2 illustrates
an example workflow of generating the final SPARQL query for Q14 in Listing 1.
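The retrieval of p context triple(s) per extracted class type and predicate could, for example, be done with simple SPARQL templates such as the following sketch. Prefix handling is omitted, and the exact queries used in the implementation may differ:

```python
def triple_sample_queries(class_types, predicates, p=1):
    """Build SPARQL queries that fetch p example triple(s) for each extracted
    class type and predicate, to be supplied as context triples
    (p=1 in the paper's experiments)."""
    queries = []
    for c in class_types:
        # one sample instance per class type
        queries.append(f"SELECT ?s WHERE {{ ?s a {c} . }} LIMIT {p}")
    for pred in predicates:
        # one sample subject/object pair per predicate
        queries.append(f"SELECT ?s ?o WHERE {{ ?s {pred} ?o . }} LIMIT {p}")
    return queries
```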
Experimental settings. We use a custom Bestiary KG [2], which contains diverse information
about over 4,000 creatures from a fantasy role-playing game, comprising 98,070 triples.
However, upon careful investigation of each question in the dataset, we noticed that the majority
of the 100 questions from [2] require high-complexity SPARQL queries. Using those questions for
evaluation in our challenging setting may not be feasible (as the majority cannot be answered)
and could discourage exploring this direction from the outset. We therefore empirically chose
eight questions that are answerable in our setting but have varying complexities. Listing 1 shows the set
of questions and the corresponding generated SPARQL queries for our discussion. It shows

1
    The source code, prompt templates, and examples are available at https://github.com/parklize/LLM4SPARQL.
      Listing 1: Eight example questions and their corresponding generated SPARQL queries.
Q0: which creatures not speaking draconic language do have chaotic good alignment?
SELECT ?creature WHERE {?creature a :Beast. ?creature :hasAlignment :chaoticGood.
FILTER NOT EXISTS {?creature :hasLanguages :DraconicL.}}

Q9: what creatures do have cold resist?
SELECT ?creature WHERE {?creature :hasResists :cold.}

Q13: what creatures do speak both common and undercommon languages?
SELECT ?creature WHERE {?creature :hasLanguages :CommonL. ?creature :hasLanguages :UndercommonL.}

Q14: how many creatures do speak all three languages: abyssal, sylvan and elven?
SELECT (COUNT(?creature) AS ?count) WHERE {?creature :hasLanguages :AbyssalL.
?creature :hasLanguages :SylvanL. ?creature :hasLanguages :ElvenL.}

Q58: what creatures speaking dwarven language do have armor class greater than 12?
SELECT ?creature WHERE {?creature a :Beast. ?creature :hasLanguages :DwarvenL. ?creature :hasACValue ?ac. FILTER( ?ac > 12)}

Q64: what is the average number of health points for creatures speaking gnome language?
SELECT (AVG(?hp) AS ?hp_AVG) WHERE {?creature rdf:type :Beast. ?creature :hasLanguages :GnomeL. ?creature :hasHPvalue ?hp.}

Q83: which creatures speaking necril and abyssal languages do have wisdom attribute more than 4?
SELECT ?creature WHERE {?creature rdf:type :Beast. ?creature :hasLanguages :NecrilL. ?creature :wis ?wis. FILTER(?wis > 4)}

Q94: what is the average dexterity attribute for Phoenix and Sleipnir?
SELECT (AVG(?dex) AS ?dex_AVG) WHERE {?beast rdf:type :Beast. ?beast :dex ?dex. FILTER(?beast = :Phoenix || ?beast = :Sleipnir)}




(Figure content: bar chart of Acc@10 over questions Q0, Q9, Q13, Q14, Q58, Q64, Q83, and Q94 for GraphSparqlQAChain, w/o query enhancer, and w/ query enhancer.)
Figure 3: Acc@10 for each of the eight questions, with the average indicated by a dashed line.


that even when using LLMs exclusively, we can still answer questions of reasonable complexity in a
zero-shot setting. This includes questions with negation (Q0), aggregation (Q64), or even those
with multiple ?s p o patterns combined with filtering (Q83), where ?s indicates a variable.
As LLMs can produce different answers each time, we use Acc@10, which measures the
percentage of correct answers obtained out of 10 runs for a given question, to evaluate the
performance. The most relevant work to ours is GraphSparqlQAChain2 , which uses only the KG
schema as extracted context in the prompt to generate the query for a given question. We use
this as our baseline. In addition, we include two variants of our implementation – one without
the query enhancer and the other with the enhancer component in our framework.
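For clarity, the Acc@10 metric can be computed as a straightforward fraction over repeated runs, as in this sketch of the definition above:

```python
def acc_at_n(run_answers, gold, n=10):
    """Acc@n: fraction of the n runs whose answer matches the ground truth
    for a given question (n=10 in the paper's experiments)."""
    assert len(run_answers) == n, "expected one answer per run"
    return sum(1 for a in run_answers if a == gold) / n
```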
   Results. Figure 3 shows the results for the eight questions. As illustrated by dashed lines, the
average Acc@10 scores over these questions using GraphSparqlQAChain and the two variants –
one without and one with the enhancer – are 0.125, 0.326, and 0.831, respectively. As we can see
from the figure, GraphSparqlQAChain, which uses only the KG schema as the context, could not
generate the majority of queries because it is not aware of the IRI patterns of entities in the KG.
The results with and without the query enhancer component clearly indicate that the enhancer
consistently improves the quality of generated queries. For example, the Acc@10 is zero for
both Q14 and Q94 without the query enhancer, while with the enhancer, it increases to 0.9 and
1.0, respectively. Q14 in Listing 1 shows an example where the initial query (without the blue part)
2
    https://python.langchain.com/docs/use_cases/graph/graph_sparql_qa
has been enhanced (with the blue part). For quantitative evaluation, we manually extended the
initial eight questions by adding 22 similar ones, resulting in a total of 30 questions. The average
Acc@10 scores using GraphSparqlQAChain, without the query enhancer, and with the query enhancer
are 0.14, 0.22, and 0.57, respectively (α < .05). Although it is clear that the performance improves
with the query enhancer, it is worth noting that this improvement comes with the extra cost of
prompting LLMs n more times, where n is a predefined parameter indicating how many times
we want to repeat the enhancing process.


4. Discussion and Future Work
In this work, we presented a general framework for exploring KGs with LLMs. In addition, we
investigated an example implementation using all components defined in the framework in a
challenging setting, which exclusively uses LLMs in a zero-shot setting without ground-truth
data and fine-tuning. While the example implementation eliminates the need for maintaining
entity embeddings for embedding-based entity retrieval, it may result in hallucinations, where
non-existent entities are used as subjects or objects in the generated queries. In addition,
answering questions that require complex SPARQL queries is challenging due to the current
limitations of LLMs in generating such queries. Further investigation with other LLMs, including
specialized open-source LLMs trained on open question-query datasets, is required. Additionally,
using all 100 questions from [2] for evaluation in our setting – without ground-truth and without
fine-tuning – is challenging. Those questions contain many complex queries, such as those
requiring regex patterns, and might preclude the interesting possibility of exploring this research
direction. Hence, a benchmark dataset with varying query complexities would be beneficial.


Acknowledgments
This work was funded by the GLACIATION Horizon Europe project (No. 101070141).


References
[1] S. Yang, M. Teng, X. Dong, F. Bo, LLM-based SPARQL generation with selected schema from
    large scale knowledge base, in: CCKS, Springer, 2023, pp. 304–316.
[2] L. Kovriguina, R. Teucher, D. Radyush, D. Mouromtsev, SPARQLGEN: One-shot prompt-based
    approach for SPARQL query generation, in: SEMANTiCS, 2023.
[3] B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin,
    et al., Code Llama: Open foundation models for code, arXiv:2308.12950 (2023).
[4] K. Soman, P. W. Rose, J. H. Morris, R. E. Akbas, B. Smith, B. Peetoom, C. Villouta-Reyes,
    G. Cerono, Y. Shi, A. Rizk-Jackson, S. Israni, C. A. Nelson, S. Huang, S. E. Baranzini,
    Biomedical knowledge graph-optimized prompt generation for large language models,
    arXiv:2311.17330 (2024).
[5] C. V. S. Avila, M. A. Casanova, V. M. Vidal, A framework for question answering on
    knowledge graphs using large language models, in: ESWC, 2024.