<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>X (G. G. Klager);</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Is GPT fit for KGQA? - Preliminary Results⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gerhard G. Klager</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Axel Polleres</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>WU Wien - Vienna University of Economics and Business</institution>
          ,
          <addr-line>Welthandelsplatz 1, Vienna, 1020</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>In this paper we report about preliminary results on running question answering benchmarks against the recently hyped conversational AI services such as ChatGPT: we focus on questions that are known to be possible to be answered by information in existing Knowledge graphs such as Wikidata. In a preliminary study we experiment, on the one hand, with questions from established KGQA benchmarks, and on the other hand, present a set of questions established in a student experiment, which should be particularly hard for Large Language Models (LLMs) to answer, mainly focusing on questions on recent events. In a second experiment, we assess how far GPT could be used for query generation in SPARQL. While our results are mostly negative for now, we hope to provide insights for further research in this direction, in terms of isolating and discussing the most obvious challenges and gaps, and to provide a research roadmap for a more extensive study planned as a current master thesis project.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Question Answering</kwd>
        <kwd>KGQA</kwd>
        <kwd>LLMs</kwd>
        <kwd>GPT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        comparison between existing and new QA-systems [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        At the same time, with the recent success of OpenAI’s ChatGPT[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and its many competitors
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], we see many applications of such large language models (LLMs), not only restricted to
question answering alone, but also in producing more or less useful code in programming and
query languages. Facing these developments, we may ask ourselves both (a) if such LLMs can
act as serious contenders to bespoke KGQA systems, and (b) whether LLMS could be used as
a supportive technology for query formulation in the context of KGQA. However, literature
covering this subject is still scarce and end-to-end QA-systems using LLMs such as ChatGPT in
a synergetic combination with KGQA have not yet been proposed in abundance.
      </p>
      <p>The aim of this paper is therefore to fill this gap by exploring the possibilities of using LLMs
such as ChatGPT in the task of KGQA and to challenge the status quo of existing benchmarks
aimed at training and evaluating KGQA-systems.</p>
      <p>
        In particular, we are interested in answering the following questions:
1. How does LLM-based QA difer from established KGQA approaches and what are the
respective strengths, weaknesses and challenges of the two methods?
2. Which components used in KGQA-systems could be enhanced using LLMs?
3. What types of questions are found in existing benchmarks for KGQA approaches and in
how far can these be used in benchmarking LLM-based QA approaches?
4. How can a comprehensive benchmark for LLM-based, KGQA-based but also combined QA
approaches be constructed that is challenging the current weaknesses of both approaches?
In order to get closer to answering these questions, we have started with a comprehensive
literature review. The goal of this literature review is to establish an overview of the already
existing diferent QA-systems and benchmarks on the one hand and to lay the foundations of
the approaches chosen to create our proposed new benchmark and QA-system as well as their
characteristics and main components. Our initial collection of articles[
        <xref ref-type="bibr" rid="ref1 ref11 ref12 ref13 ref2 ref4 ref5 ref6 ref7 ref8">2, 4, 11, 1, 8, 12, 13, 7, 6, 14,
15, 5, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41</xref>
        ]
contains both proposed solutions to
(KG)QA as well as existing benchmarks.
      </p>
      <p>While still ongoing, we provide an overview of our current status of identified benchmarks
and their main characteristics, as well as (components of) KGQA systems in Section 2 below.
In the next step, we plan to compare existing benchmarks and systems from the literature for
KGQA with LLM based question-answering. This shall be done by systematically comparing
the questions available in each benchmark, wrt. the (syntactic) structures of the questions
and – if available – corresponding structured queries, topic domains, and other characteristics
indicating the complexity of the question answering tasks (such as for instance aggregations
needed, etc.) and comparing the results of established KGQA and LLM based QA on each of
these benchmarks.</p>
      <p>For this first preliminary study, we proceed by selecting a mini-benchmark from
• sample subsets from the SimpleQuestions [42] and QALD-5 [43] benchmark datasets,
• a small dataset we manually designed to challenge ChatGPT
both of which having in mind that – in principle – the respective answers should intuitively be
ifndable in existing KGs such as Wikidata [44]. We present this mini-benchmark in Section 3.</p>
      <p>While certainly not yet representative of the field, we aim at drawing some initial conclusions
about LLMs/GPTs capabilities and challenges with respect to two separate subtasks of KGQA
• directly answering the natural language queries from our mini-benchmark vs.
• translating the natural language queries to SPARQL
We summarize the performance on both these subtasks in Section 3.4 and derive hypothesis for
further investigation.</p>
      <p>We conclude in Section 4 with an outlook for further work that also include a workplan and
main tasks to be executed in an ongoing Master thesis, wherefore we eagerly look forward to
feedback during the TEXT2KG workshop. As an overall goal in our research agenda, we plan
to design and implement a new hybrid approach making use of LLMs, such as ChatGPT (or
also other emerging, and hopefully open LLMs) to improve upon the weaknesses of existing
KGQA-systems without “LLM support”.</p>
      <p>Additionally, we hope our findings will serve as a base for new QA benchmarks aimed at
improving the training and evaluation of future, combined LLM-and-KGQA-systems.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work on KGQA</title>
      <p>With the growing attention given to KGQA in recent years we can also observe a large growth
of literature covering KGQA-systems and in terms of diferent methods and benchmarks to
evaluate these systems. This section is dedicated to provide an overview of this literature and
laying the foundation for our planned research.</p>
      <sec id="sec-2-1">
        <title>2.1. KGQA Benchmarks</title>
        <p>The topic of benchmarking in QA ranges from the methods of creating benchmarking datasets,
to the types of questions and queries used in a dataset, to methods to evaluate those benchmarks
themselves.</p>
        <p>Probably the most widely used family of question answering datasets is represented by
the Question Answering over Linked Data (QALD from here on) campaign. This series of
challenges aims at providing benchmarks for all QA-systems designed at using natural language
requests of a user to retrieve information stored as structured data such as the RDF data format.
Additionally, the challenge aims at comparing current state-of-the-art QA-systems with regard
to their individual strengths and shortcomings. In order to participate in the current QALD
challenge users can simply run their QA-system using the current challenge’s dataset before
storing their results in an XML file and upload it to the challenge’s website [ 45]. The QALD
challenge is currently taking place in its 10ℎ iteration [39] and accordingly provides 10 datasets
that could be used within further elaborations of our study [? ], a detailed analysis of these
datasets is on our agenda; for the moment, we have considered a sample from the 5ℎ QALD [43]
in our preliminary experiments, see below.</p>
        <p>As a more manageable starting point, SimpleQuestions is a dataset containing 100k questions
aimed at training and evaluating QA-systems with regard to solving the simple question
answering problem, which consists answering questions that can be rephrased as (single triple)
queries that ask for all objects linked to a questions given subject by its given relationship.
In this context, simple QA is a term used referring to the simplicity of the reasoning process
necessary to answer questions [46]. While SimpleQuestions was originally designed to be run
over FreebBase, Diefenbach et al. [42] have adapted/extended the original SimpleQuestions
dataset to Wikidata recently, which we include in our preliminary study since it is possible to
be tested against a large KG, available via a public SPARQL endpoint.</p>
        <p>As for further relevant QA benchmarks, Berant et. al. [47] created a new QA dataset named
WebQuestions. The WebQuestions dataset acts as an extension to the FREE917 dataset (again
based on Freebase) aimed at evaluating QA-systems. The authors created this dataset due to
the FREE917 dataset requiring logical forms, making it inherently more dificult to scale it
up due to the requirement of having expertise in annotating logical forms. Using the Google
Suggest API, the authors obtained questions beginning with a wh-word (where, who, when,
etc.) and containing exactly one entity. For each question, five candidate queries have been
created. After collecting 1M questions in this process, 100k randomly selected questions have
been submitted to Amazon Mechanical Turk (AMT from here on) where workers answered
questions detecting duplicates and filtering out questions that could not be answered. The
remaining dataset contained 5.810 questions. In particular, the approach to crowd-source and
to cross-check labeling in order to compare humans against QA systems may also be useful for
us in further elaborations of our study.</p>
        <p>As an alternative to simple questions the Large-Scale Complex Question Answering Dataset
2.0 [40] (LC-QuAD 2.0 from here on) is an extension to the original LCQuAD dataset [41]
containing 30k complex questions as well as their corresponding paraphrased versions and
SPARQL queries. This dataset is both compatible with Wikidata and DBpedia (2018). The dataset
was created by generating a number of SPARQL queries before verbalizing them into natural
language questions using the AMT. Afterward, these questions have been paraphrased to create
additional natural language questions. The LC-Quad 2.0 dataset contains 10 diferent question
types ranging from single fact questions which will be answered by returning either a subject or
object to complex questions requiring complex patterns, temporal information to be answered,
etc.</p>
        <p>Another approach of how to create a benchmark has been taken by the creators of the
WDAquaCore0Questions dataset which represents a collection of questions asked by users
testing the demo of the WDAqua-core-0 QA-system for Wikidata [48].</p>
        <p>
          Jiang and Usbeck analyzed 25 KGQA datasets with regard to five diferent KGs. Their study
showed that many available KGQA datasets are unfit to train KGQA-systems due to their
underlying assumptions or that these datasets are outdated and based on discontinued KGs.
Additionally, the authors share light on the dificulties and high costs related to the generation
of new datasets. Therefore, they propose an automated method to re-split datasets enabling
their generalization as well as a method to analyze existing KGQA datasets with regard to their
generalizability [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p>
          While many diferent benchmarks aimed at evaluating QA-systems for diferent KGs exist the
question of which benchmark one should use can be a dificult one to answer. To answer this
question Orogat, Liu, and El-Roby proposed CBench [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], a suite that enables users to analyze
existing benchmarks with regard to linguistic, syntactic, and structural properties of the dataset’s
questions and queries as well as to evaluate QA-systems. Additionally, the authors provide an
overview of diferent creation methods for benchmarks ranging from manual creation based on
heuristics to benchmarks created automatically from the KG in question.
        </p>
        <p>In light of recent developments, and while social media is full of examples, there is — to
the best of our knowledge — not yet a dedicated QA dataset originally tailored to LLMs and
GPT specifically. In order to fill this gap, we asked students of the Digital Economy masters’
program at the Vienna University of Economics and Business to generate a set of natural
language questions aimed at asking ChatGPT to formulate queries GPT-3 would fail upon but
suspected to be possible to answer with the information in publicly available KGs such as
Wikidata. As a hint, we emphasized that we suspect LLMs to struggle with (a) recent events
information beyond the training phase of the LLM (b) complex questions that require
nonobvious conceptual understanding and reasoning. The students’ task was to also find/formulate
the corresponding SPARQL queries and – in the light of recent advances of LLMs for also code
and query generation, attempt whether ChatGPT was able to create such queries. We report on
a selected subset of these questions in section 3 below.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Question Answering Systems</title>
        <p>With the existence of numerous QA-benchmarks it is no surprise that the literature presents an
abundance of diferent QA-systems as well. These systems range from ones limited to single
KGs to systems able to access multiple KGs, from language dependent to language independent
systems, and from simpler template-based systems to complex systems incorporating elements
of machine learning.</p>
        <p>
          Diefenbach et. al. propose a QA-system capable of querying multiple KGs
independent of the natural language used. Their approach has been evaluated on five well-known KGs
and five diferent languages using three diferent benchmarks. Their proposed QA-system first
performs entity recognition in terms of searching corresponding IRIs whose lexicalization is an
n-gram (consecutive elements in a text) in the asked natural language questions question. After
removing stop words from the set of IRIs queries that could represent possible interpretations
of the question are constructed before being ranked based on multiple aspects such as the
number of words matching the words in the original question. Next, a logistic regression based
on labeled SPARQL queries will be trained to compute the confidence score for each query.
Last, the highest ranked query above a certain threshold will be used to answer the question. If
no query with confidence above the threshold is found, the whole question will be deemed
unanswerable. During their study, the authors discovered performance diferences in their
approach wrt. diferent (natural) languages used and link these diferences to the quality of the
available data for each language [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>
          Vakulenko et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] take a quite diferent approach based on the usage of unsupervised
message passing (QAmp from here on) which consists of two phases: in the first phase called
question interpretation, the relevant sets of entities and predicates necessary for answering the
input question are again being identified and their confidence scores are being computed. In the
second phase, the so-called answer inference phase, these confidence scores a propagated and
aggregated over the underlying KG’s structure, providing a confidence distribution over a set of
possible answers which is then be used to locate the corresponding answer entities, rather than
translating the query to SPARQL.
        </p>
        <p>
          Yani et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] propose yet another a method to detect entities and their position on triples that
have been mentioned in a complex question. Their approach is capable of not only detecting
the entity name but also of determining in which triple the entity is located and if the given
entity is a head or tail of the triple.
        </p>
        <p>Shin et. al. [15] notice that QA systems sufer notably from the divergence of the unstructured
data composing natural language questions and the structured data composing a KG. Existing
approaches to deal with this issue use lexicons in order to cover diferently represented data.
Since these lexicons only consider representations for entity and relation mentions the authors
propose a new predicate constraint lexicon restricting subject and object types for a predicate.
This so-called Predicate Constraints based Question Answering (PCQA from here on) lexicon
does not make use of any templates. Rather the authors generated query graphs focusing on
matching relations in order to cover diverse types of questions.</p>
        <p>
          Another QA-system proposed by Liang et. al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] is based on the idea of splitting the process of
translating natural language questions into SPARQL queries into five sub-tasks. First, a random
forest model is trained to identify a question’s type. Next, various entity recognition and
property mapping tools are used to map the question’s phrases before all possible triple patterns
are created based on these mapped resources. Afterward, possible SPARQL are generated by
combining these triple patterns before a Tree-LSTM based ranking model is used to select the
most plausible SPARQL query representing the correct intention behind the natural language
question. Possible SPARQL queries are then constructed by combining these triple patterns
in the query generation step. In order to select the correct SPARQL query among a number
of candidate queries for each question, a ranking model based on Tree-LSTM is used in the
query ranking step. The ranking model takes into account both the syntactical structure of
the question and the tree representation of the candidate queries to select the most plausible
SPARQL query representing the correct intention for the respective question.
        </p>
        <p>As this short overview shows, the main tasks in many KGQA systems firstly involve
entity/property recognition and matching to respective IRIs in the KG. Secondly, some but not
all QA systems proceed by formulating SPARQL queries from these entities. Our following
preliminary experiment is therefore tailored to mainly challenge GPT in terms of whether these
tasks can be adequately supported by (currently existing) LLMs.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Benchmarking LLMs</title>
      <p>In order to evaluate the raw performance of large language models we decided to use two
of OpenAI’s large language models. The first model we used is the GPT 3.5-based ChatGPT.
Additionally, we used OpenAI’s older GPT 3-based davinci model to give a comparison to
ChatGPT’s results and to possibly detect structural characteristics of LLMs in the context of
question answering.</p>
      <p>For this we first let each system/model answer all natural questions directly and secondly
indirectly by first generating corresponding SPARQL queries for Wikidata, before subsequently
attempting to retrieve their results. This process was done for (i) the student dataset aimed at
providing questions that cannot be answered by ChatGPT, (ii) a subset of the SimpleQuestions
adaptation for Wikidata [42], and (iii) a subset of the QALD-5 dataset [43], with the student
dataset consisting of 14 questions and the two subsets consisting of 15 randomly drawn questions
from the original datasets. Extending this study to analogously test further KGQA benchmarks
is on our agenda.</p>
      <p>We limited our study to the Wikidata KG. This decision has been made for multiple reasons.
First, while other popular KGs, such as Freebase, stopped their operations, Wikidata is one of
the most popular, and most actively maintained KGs. Besides accounting for the KGs relevancy,
this could also mean that Wikidata is better suited to be used when answering information on
current events, an expected weakness of LLMs wrt. QA. Secondly, this study aims at uncovering
LLMs limitations wrt. KGQA. This renders Wikidata specifically challenging since the task
of entity recognition can be assumed to be harder for Wikidata than for other KGs such as
DBpedia. This is due Wikidata’s numeric identifiers, LLMs should not be able to derive the
correct identifiers directly from the question asked, which could be blurred by the inherent
semantics of language-based URIs [49] as used for instance in DBpedia. While limiting ourselves
to one KG for this preliminary study allowed us to obtain first insights results, expanding and
comparing our research wrt. to a comparison with other/multiple KGs is on our agenda.</p>
      <sec id="sec-3-1">
        <title>3.1. Question Answering, Query Generation and Query Execution</title>
        <p>All necessary computations and all necessary programming in this study has been done by R
scripts [50], using the openai package (version 0.4.0) [51] in combination with OpenAI’s API to
interact with ChatGPT and davinci. At this point, it must be noted that OpenAI’s API allows
the usage of diferent temperature options to control how deterministic the behavior of the
LLM should be. While the chosen setting might potentially have a significant influence on
the results, we used the default temperature setting of 1 in this study to replicate especially
ChatGPTs behavior when accessed through its Web interface, to which we have been able to
record diferences regardless. Secondly, upon trying to execute our scripts with a temperature
of 0, supposedly meaning fully deterministic behavior, a later mentioned problem of the LLMs
getting stuck at certain questions and repeating previous answers or query structures occurred
for most of the questions, rendering the results useless. At this point, we cannot confirm whether
this is a result of high server load or if the temperature setting is at least partially responsible
for this. Further investigation of the significance of the temperature setting is therefore on our
agenda. In this context, note that question 12 and 15 in the sample of the QALD-5 dataset (cf.
Table 3) are identical. Since we drew 15 random questions, all having diferent questions IDs,
this indicates that the original dataset contains duplicates. We did deliberately not remove those
for now, in order determine whether duplicates yield diferent answers: while slightly diferent
in their wording, the direct natural language answers’ content has been identical, and likewise
the generated SPARQL queries were identical in our preliminary experimental run. A more
in depth investigation in how far repeated "runs" of the experiment yield diferent results or
improvements, also in the context of adapting GPT’s temperature parameter, is on our agenda.</p>
        <p>In order to generate both the answers to a natural language question ( ) and its
corresponding SPARQL query the following natural language prompt was used:
“Please write me a SPARQL query on Wikidata without comments to answer the
following Question:  .”.</p>
        <p>Since the used LLMs usually add comments within their queries the passage “without comments”
has been added to eliminate these comments, which allows for easier processing and
subsequently easier execution of these queries. Finally, we used the WikidataQueryServiceR package
(version 1.0.0) [52] to execute the generated queries and to retrieve their results.</p>
        <p>We note that, besides the elimination of unwanted comments the selected prompt has been
carefully designed in a way that aims at preventing the prompt to influence the LLMs answers
besides their structural representation: as it was the aim of this preliminary study to determine
how LLM’s as a standalone option lend themselves to QA and KGQA tasks, we started with a
static, uniform prompt. The resulting findings should then form a foundation based on which
further insights on the topic of prompt engineering could be derived, i.e., how advanced methods
including specifically engineered prompts or hybrid QA approaches that dynamically generate
prompts could perform KGQA tasks. While exploring these questions further the lies outside of
the current study’s scope, we consider it a potentially important direction for future work.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Performance Evaluation</title>
        <p>In order to evaluate ChatGPT’s and GPT 3’s performance both their answers given in natural
language and the results of the queries generated by the LLMs have been assigned one out of
three possible grades: Correct, Incorrect, and No answer. Correct marks a case where the LLMs
were in fact able to answer the given question correctly. Incorrect results mark cases where
they were able to answer the question but did so incorrectly. Finally, no answer is assigned to
cases where the LLM’s where unable to generate an answer to the question asked. We assume
that this will be the case when the LLMs are asked about events happening outside of their
training period (hence after September 2021).</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Results</title>
        <sec id="sec-3-3-1">
          <title>3.3.1. Student Dataset</title>
          <p>In this section, we will show the results generated by ChatGPT and GPT 3 on the student dataset,
as well as our subsamples of the SimpleQuestions dataset and the QALD-5 challenge dataset.
As described earlier, we generated SPARQL queries for each of the questions in the
studentgenerated dataset, retrieved their results as well as the direct answers NLAs) to the NLQs given
by ChatGPT and GPT 3, and evaluated them. Table 1 shows the questions in this dataset.</p>
          <p>Unsurprisingly, ChatGPT was unable to answer most of the dataset’s questions correctly.
However, ChatGPT acknowledged its limitations wrt. dates and added a disclaimer at the
beginning of its answers stating that its knowledge is limited to dates up to September 2021.1
1Some further experiments also show that this behavior could seemingly – in the tested GPT versions – sometimes
be worked around by prompt reformulation, typically leading to a factually wrong answer.
Out of 14 questions with our standardized prompt, ChatGPT was able to provide 8 NLAs, three
out of which were correct.</p>
          <p>As expected, incorrect answers have been given wrt. changes that happened after 2021, such as
when ChatGPT was asked to name the "most recent Minecraft Java Edition version" (1.19.3 at the
time of writing) to which it responded with "As of September 2021, the most recent version of
Minecraft Java Edition is 1.17.1.".</p>
          <p>Another interesting observation is that ChatGPT seemingly can sometimes get “stuck” at
a given question. Consider the second and third questions in our student dataset: ChatGPT
stated that it is not capable of answering the second question Who won the football worldcup
2022? due to it not having taken place by ChatGPT’s knowledge. However, ChatGPT gave the
same answer to the third question Give me all Austrian female actors that are aged over 50?.
The same anomaly occurred with question 7 "What is the most recent MineCraft Java Edition
version? and question 8 How many people do live on earth?. We so far did not entirely clarify,
whether this behavior was due to an API issue, or due to the sequential nature of the model
itself, where diferent answers are obviously depending on the order of interactions. Regardless
of this, ChatGPT was able to generate SPARQL queries for all NLQs within the student dataset.
Out of these 14 queries 13 were syntactically correct and could indeed be executed. However, 10
out of these 13 queries returned no results. The three remaining queries returning results have
been the queries for question 1 "Who is the current president of the United States? " (Listing 1),
question 2 "Who won the football worldcup 2022? " ((Listing 2)), and question 8 "How many people
do live on earth? " (Listing 3).</p>
          <p>The former, shown in listing 1 correctly returned Joe Biden, and — looking at the ORDER BY
and LIMIT combination, indeed semantically attempts to retrieve the most recent president. We
should note though, that question 1 was – as opposed to the other student questions – provided
by the instructor upfront, as an example of a question that was correctly translated by GPT,
having in mind to find a likely common example question referring to current data, but also
probably available verbatim in SPARQL examples that the LLM has been trained upon.</p>
          <p>Listing 1: ChatGPT generated query for: Who is the current president of the united states?
SELECT ? p r e s i d e n t L a b e l
WHERE {
wd : Q30 p : P6 [ ps : P6 ? p r e s i d e n t ; pq : P580 ? s t a r t _ t i m e ] .</p>
          <p>FILTER NOT EXISTS { ? p r e s i d e n t p : P582 [ ] }
SERVICE w i k i b a s e : l a b e l { bd : s e r v i c e P a r a m</p>
          <p>w i k i b a s e : l a n g u a g e " [AUTO_LANGUAGE } , en " . }
} ORDER BY DESC ( ? s t a r t _ t i m e )
LIMIT 1</p>
          <p>Listing 2: ChatGPT generated query for: Who won the football worldcup 2022?
SELECT ? t e a m L a b e l
WHERE {
? cup wdt : P31 wd : Q16510064 ;
wdt : P585 ? d a t e ;
wdt : P1346 ? team .</p>
          <p>FILTER ( YEAR ( ? d a t e ) = 2 0 2 2 )</p>
          <p>SERVICE w i k i b a s e : l a b e l { bd : s e r v i c e P a r a m w i k i b a s e : l a n g u a g e " en " }
} LIMIT 10</p>
          <p>Listing 3: ChatGPT generated query for: How many people do live on earth?
SELECT (COUNT( ? item ) as ? count )
WHERE {</p>
          <p>? item wdt : P31 wd : Q5 .</p>
          <p>} LIMIT 10</p>
          <p>On the contrary, the query for question 2, shown in Listing 2, correctly retrieves (somewhat
arguably generalizing on the question) winners of sports events in 2022, which could be correct,
but among ChatGPTs chosen LIMIT of 10, only persons and no teams (and certainly not the
football worldcup winning team) were retrieved. We note that question 2 has been a common
example to "challenge" ChatGPT, where earlier incarnations (such as davinci) would answer
"Brazil" as a statistically probable, but wrong answer and ChatGPT would, as mentioned above,
refuse an answer, hinting on having only training data up to 2021.</p>
          <p>Finally, as for question 8 shown in Listing 2, while the query returned was able to return a
number, its result of 10546377 missed the target number of ∼ 8 billion people by a significant
amount and in fact, in this case interestingly, the resulting query simply counts all items that
belong to the class human within Wikidata.</p>
          <p>In order to test these results for robustness we executed the whole process a second time.
In this second run, the before-mentioned problem of ChatGPT getting stuck at a question
did not occur at all. Additionally, ChatGPT was now able to answer 11 out of 15, i.e. one
additional question, question 11 regarding "the oldest painting in the world" now delivered a
result. However, the number of correct answers only increased by one.</p>
          <p>Surprisingly, ChatGPT’s performance wrt. query generation sufered significantly, with the
chatbot on the second still being able to generate 15 queries, but out of which only 10 were
syntactically correct. This time, only one of these queries (question 1) returned the desired</p>
          <p>Listing 4: davinci generated query for: How many people do live on earth?
# d e f a u l t V i e w : Map
SELECT ? c o u n t r y (SAMPLE( ? p o p u l a t i o n ) AS ? t o t a l P o p u l a t i o n )
WHERE {
? c o u n t r y wdt : P36 wd : Q458 .</p>
          <p>? c o u n t r y wdt : P1082 ? p o p u l a t i o n .
} GROUP BY ? c o u n t r y
HAVING( ? t o t a l P o p u l a t i o n &gt; 0 )</p>
          <p>LIMIT 10
results.</p>
          <p>An important observation here is that ChatGPT was unable to generate queries for each
question when asked manually through its web interface during an initial tryout of the chatbot.
At this point, it must be noted that a new OpenAI account, with no prior interactions with
ChatGPT, has been used to generate the respective SPARQL queries using the OpenAI API (in
order to ensure no bias was added through personalized adaptation on the user account).</p>
          <p>Interestingly, upon using the GPT 3-based model davinci (as ChatGPT’s predecessor), we
were able to observe structural diferences between its results and the ones obtained by using
ChatGPT.</p>
          <p>First, the NLAs generated by davinci did not contain a disclaimer wrt. questions spanning
outside of its training time frame. Also, it becomes obvious that ChatGPT has been trained
with more recent data than davinci. While ChatGPT was able to correctly answer the question
For which team does Lionel Messi play? with Paris Saint-Germain (PSG), davinci answered this
question with Barcelona. We note that overall davinci produced a significantly larger number of
factually wrong answers than ChatGPT, which may be partially due to its outdated training
data, and partially due to the smaller training date leading to more "made up" answers. We do
note though, that a detailed investigation (comparing factually wrong vs. outdated answers vs.
“accidentally right” guesses ) in detail is yet on our agenda.</p>
          <p>Next, davinci was unable to fully comply with our limitation of not adding any comments to
the generated query, while ChatGPT consistently complied with our instructions. Additionally,
11 out of the 14 queries generated by davinci had some sort of syntactical error, making them
not executable. See the query for question 8 How many people do live on earth? as an example
for both of these phenomena:</p>
          <p>While this query can simply be syntactically fixed — by removing the whitespace between
the ? and the term population within the second line — it still diverts significantly from the
original intention of the asked question.</p>
          <p>Lastly, the problem of ChatGPT, i.e., getting stuck at a question during its first run, did
not occur when using davinci. We assume that this might be related to the amount of trafic
ChatGPT experienced while generating the queries, i.e., rather being related to an API problem
than the model itself, since this problem does not seem to be consistent in its occurrence and
did not occur when using ChatGPT via its web interface.</p>
          <p>Overall, davinci’s answers appeared, as expected, a lot more arbitrary and outdated than
those of the newer ChatGPT model in the preliminary study of our student dataset.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. SimpleQuestions</title>
          <p>We again generated SPARQL queries for each of the questions in the sample of the
SimpleQuestions dataset, retrieved their results as well as the answers to the NLQs given by ChatGPT and
davinci and evaluated them.</p>
          <p>Table 2 shows the questions listed in this sample of the SimpleQuestions dataset.2</p>
          <p>The results for this dataset mostly mirror the observations made for the students dataset.
However, this time 6 out of 15 answers generated by ChatGPT are results of the chatbot being
stuck at a previous question. Out of the 15 generated queries 14 queries were executable but
none of them returned any result.</p>
          <p>Again, a second run has been done which led to significantly better results. In this second run
ChatGPT did not get stuck on a single question and 10 of the provided answers were correct.
Similarly, all of the 15 generated queries were syntactically correct, yet still only three out of the
15 queries returned the correct results, while the query for question 14 What type of tv program
is the flintstones returned a result completely detached from the questions (commune of Italy)
which can however be interpreted as one possible answer to question 13 where is just married
iflmed? , hinting at the ChatGPT API got again stuck during the query generation this time on a
previous run.</p>
          <p>A possible reason for the comparatively higher success rate observed for this dataset (less
emphasis on current or recent data) could be not only its age, with the dataset being unchanged
since 2017, but also the general nature of its questions, with many questions being instructions
to give an example of something or having many correct possible answers. Take for instance
2Please note missing question marks or diferent capitalization stem from a random sample of the original dataset
without any modifications from our side.
Listing 5: ChatGPT generated query for: How many missions does the Soyuz programme have?
SELECT (COUNT( ? m i s s i o n ) AS ? c o u n t M i s s i o n s )
WHERE {
? m i s s i o n wdt : P31 wd : Q209343 .</p>
          <p>? m i s s i o n wdt : P361 wd : Q127846 .</p>
          <p>} LIMIT 10
question 11 what is a film in the crime fiction genre . With numerous crime fiction films existing
and their success or popularity not being a limitation stated in the question it can be assumed
that ChatGPT should be able to name at least one item fitting the definition of being a film and
belonging to the genre crime fiction. From this, we can expect ChatGPT to be able to answer
most questions in this dataset correctly that do not ask for elements subject to change, such as
the CEO of a company or which player currently plays for a certain team.
3.3.3. QALD-5
Lastly, we analogously again generated SPARQL queries for each of the questions in a sample
of the QALD-5 dataset (shown in Table 3), retrieved their results, and analyzed direct answers
to the NLQs given by ChatGPT and davinci.</p>
          <p>When directly answering the questions in this subset of QALD-5, ChatGPT did not get stuck
on any of the 15 questions and was able to answer 9 of the 15 questions correctly. However,
while ChatGPT was able to generate syntactically correct queries for all of the 15 questions, the
only query returning a result was the query for question 10 How many missions does the Soyuz
programme have?. Yet, the only element of this query closely related to the actual question is
the count function.
3Note: the duplicate question 12+15 were discussed in Section 3.1.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Summary of Results</title>
        <p>Summarizing the results of our initial experiments, overall, we admittedly are only at the start
of our research. Yet, we have already gained valuable insights into the potential and gaps when
trying to leverage LLMs for (factual) question answering, with a focus on questions the answers
of which should be retrievable from KGs.</p>
        <p>How does LLM-based QA difer from established KGQA approaches and what are the
respective strengths, weaknesses, and challenges of the two methods?
While LLMs show promising results wrt. QA it became clear that these models are
limited by various factors. Most importantly, both ChatGPT’s and davinci’s limitations in direct
question answering were mostly related to outdated training data, such that they performed
particularly well on older QA benchmarks.</p>
        <p>Unsurprisingly, the newer ChatGPT model performed significantly better on both the direct
question answering tasks and also in particular in terms of the syntactical correctness of
queries; we may expect further significant advances in the just released GPT4 model.
Additionally, some unexpected behaviors resulted from inexplicable efects of interacting with
the OpenAI APIs’, in terms of order-dependent answers that appeared to be actually "stuck"
answers to prior queries. Unfortunately, we could not yet determine whether these were related
to simple API bugs or due to the model; however really open LLMs would certainly allow
investigating order-dependency or alike in a much more transparent manner, than OpenAI’s
current, closed business model that in fact may switch to a paid only approach.
A summary of both ChatGPT’s and davinci’s results wrt. direct question answering can be
found in Tables 5 and 6.</p>
        <p>In terms of query formulation, ChatGPT produced a high share of syntactically correct queries,
but very few reflecting the actual question; we do hypothesize that this is largely due to a lack
of explicit entity recognition, i.e., recognizing correct relevant IRIS (i.e., in the case of Wikidata
relevant Q- and P-identifiers of entities and properties. A more in depth analysis of the resulting
queries in terms of semantic distance of the extracted identifiers, or investigating in how far
LLMS can be used for supporting the entity recognition subtask in isolation as part of a KGQA
pipeline is on our agenda.</p>
        <p>Especially for the latter point, one has to assume that correct query formulations so far
rather stem from verbatim SPARQL examples in the training data for common questions than
from an actual understanding of the entities and query structure. We may still assume that the
quality of such queries will improve in the future, even now already we encountered hardly
any syntactical errors.</p>
        <p>A summary of ChatGPT’s results wrt. query generation can be found in Table 7.
Which components used in KGQA-systems could be enhanced using LLMs?
While the LLMs in questions showed mediocre results by themselves they potentially
inherit the capabilities to improve already existing QA-approaches. We believe that LLMs
could provide especially useful in the task of entity recognition which forms part of many
existing KGQA-systems. Using LLMs to find synonyms for words occurring in the question
asked, extracting the questions underlying meaning, and using them in combination with query
generation templates or by implementing extensive prompt engineering to give the LLMs hints
on how to structure their queries.</p>
        <p>
          What types of questions are found in existing benchmarks for KGQA approaches and
in how far can these be used in benchmarking LLM-based QA approaches?
While diferent benchmarks use diferent categories to categorize their questions some
studies provide a holistic categorization of the questions and queries provided in diferent
benchmark datasets. A summary of the used (sub)datasets questions categorized in accordance
to CBench’s wh-questions classification [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] can be found in Table 4 while Tables 5 and 6 provide
a summary of the correctly answered questions by their type.
        </p>
        <p>The results of our study show that LLMs have a particularly hard time answering questions
forming some type of count or, resp., asking for all entities of a certain category (particularly
though, in terms of KGQA also because knowledge in common KGs is typically incomplete).
Aside from this, LLMs struggle to answer questions including recent events due to their limited
training period.</p>
        <p>An additional analysis wrt. to the questions’ categories and patterns in the LLMs’ results will
be conducted in future work.</p>
        <p>How can a comprehensive benchmark for LLM-based, KGQA-based but also
combined QA approaches be constructed that is challenging the current weaknesses of
both approaches?
While it became obvious that a comprehensive benchmark must contain questions
aimed at recent/current events, these types of questions are not only harder to fact-check
but inherit additional complications due to the resulting need of constant adaption of the
benchmark. Additionally, a comprehensive KGQA benchmark must include questions that
require the KGQA-system to form some sort of arithmetic or logical linking, such as counting
entities related to a word in the asked question, etc.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>In our preliminary study, we analyzed the performance of two of OpenAI’s large language
models davinci (GPT 3) and ChatGPT (GPT 3.5) against a set of questions established by students
and two subsets of the established benchmark datasets SimpleQuestions and QALD-5. We used
both models to answer the questions in each dataset, as well as to generate SPARQL queries
aimed at retrieving these answers from the knowledge base Wikidata. Our results demonstrate
the limitations of large language models, which mainly lies in their training time frame as
well as their stability. Additionally, we show that LLMs are in principle capable of generating
functioning queries. While being able to consistently generate structurally and syntactically
correct queries, they however demonstrate bad performance wrt. entity detection, resulting
in the generated queries not returning the desired results. Therefore, the question remains
open how large language models can be used in combination with existing question answering</p>
    </sec>
    <sec id="sec-5">
      <title>5. Tables</title>
      <p>What
When
Where
Which
Who
Whom
Whose
How
Yes/No
Requests
Topical
Sum
What
When
Where
Which
Who
Whom
Whose
How
Yes/No
Requests
Topical
Sum
Correctly answered questions GPT 3.
systems and specifically how existing approaches can be used to substitute the LLM’s deficits
regarding entity detection. The presented preliminary paper comprises the first results of a
recently started master thesis project. Starting from these initial insights, we look forward to
discussing routes ahead at the workshop and collecting feedback for our ongoing experiments.</p>
      <p>Student</p>
      <p>SimpleQuestions</p>
      <p>QALD-5</p>
      <p>Student</p>
      <p>SimpleQuestions</p>
      <p>QALD-5
s10844-022-00724-6.
[14] G. Mai, K. Janowicz, L. Cai, R. Zhu, B. Regalia, B. Yan, M. Shi, N. Lao, iSE-KGE/i : A
location-aware knowledge graph embedding model for geographic question answering
and spatial semantic lifting, Transactions in GIS 24 (2020) 623–655. URL: https://doi.org/
10.1111/tgis.12629. doi:10.1111/tgis.12629.
[15] S. Shin, X. Jin, J. Jung, K.-H. Lee, Predicate constraints based question answering over
knowledge graph, Information Processing &amp;amp Management 56 (2019) 445–462. URL:
https://doi.org/10.1016/j.ipm.2018.12.003. doi:10.1016/j.ipm.2018.12.003.
[16] J. Gomes, R. C. de Mello, V. Ströele, J. F. de Souza, A study of approaches to
answering complex questions over knowledge bases, Knowledge and Information Systems
64 (2022) 2849–2881. URL: https://doi.org/10.1007/s10115-022-01737-x. doi:10.1007/
s10115-022-01737-x.
[17] P. Do, T. H. V. Phan, Developing a BERT based triple classification model using knowledge
graph embedding for question answering system, Applied Intelligence 52 (2021) 636–651.</p>
      <p>URL: https://doi.org/10.1007/s10489-021-02460-w. doi:10.1007/s10489-021-02460-w.
[18] W. Jin, B. Zhao, H. Yu, X. Tao, R. Yin, G. Liu, Improving embedded knowledge graph
multi-hop question answering by introducing relational chain reasoning, Data Mining and
Knowledge Discovery 37 (2022) 255–288. URL: https://doi.org/10.1007/s10618-022-00891-8.
doi:10.1007/s10618-022-00891-8.
[19] R. Wang, M. Wang, J. Liu, M. Cochez, S. Decker, Structured query construction via
knowledge graph embedding, Knowledge and Information Systems 62 (2019) 1819–1846.</p>
      <p>URL: https://doi.org/10.1007/s10115-019-01401-x. doi:10.1007/s10115-019-01401-x.
[20] U. Sawant, S. Garg, S. Chakrabarti, G. Ramakrishnan, Neural architecture for
question answering using a knowledge graph and web corpus, Information Retrieval
Journal 22 (2019) 324–349. URL: https://doi.org/10.1007/s10791-018-9348-8. doi:10.1007/
s10791-018-9348-8.
[21] H. Jung, W. Kim, Automated conversion from natural language query to SPARQL query,
Journal of Intelligent Information Systems 55 (2020) 501–520. URL: https://doi.org/10.1007/
s10844-019-00589-2. doi:10.1007/s10844-019-00589-2.
[22] D. Diefenbach, V. Lopez, K. Singh, P. Maret, Core techniques of question answering systems
over knowledge bases: a survey, Knowledge and Information Systems 55 (2017) 529–569.</p>
      <p>URL: https://doi.org/10.1007/s10115-017-1100-y. doi:10.1007/s10115-017-1100-y.
[23] S. Kafle, N. de Silva, D. Dou, An overview of utilizing knowledge bases in neural networks
for question answering, Information Systems Frontiers 22 (2020) 1095–1111. URL: https:
//doi.org/10.1007/s10796-020-10035-2. doi:10.1007/s10796-020-10035-2.
[24] Y.-M. Kim, T.-H. Lee, S.-O. Na, Constructing novel datasets for intent detection and
ner in a korean healthcare advice system: guidelines and empirical results, Applied
Intelligence 53 (2022) 941–961. URL: https://doi.org/10.1007/s10489-022-03400-y. doi:10.
1007/s10489-022-03400-y.
[25] E. Erdem, M. Kuyu, S. Yagcioglu, A. Frank, L. Parcalabescu, B. Plank, A. Babii, O.
Turuta, A. Erdem, I. Calixto, E. Lloret, E.-S. Apostol, C.-O. Truică, B. Šandrih, S.
MartinčićIpšić, G. Berend, A. Gatt, G. Korvel, Neural natural language generation: A
survey on multilinguality, multimodality, controllability and learning, Journal of
Artificial Intelligence Research 73 (2022) 1131–1207. URL: https://doi.org/10.1613/jair.1.12918.
doi:10.1613/jair.1.12918.
[26] T. Adewumi, F. Liwicki, M. Liwicki, State-of-the-art in open-domain conversational AI:
A survey, Information 13 (2022) 298. URL: https://doi.org/10.3390/info13060298. doi:10.
3390/info13060298.
[27] D. M. Korngiebel, S. D. Mooney, Considering the possibilities and pitfalls of generative
pre-trained transformer 3 (GPT-3) in healthcare delivery, npj Digital Medicine 4 (2021).</p>
      <p>URL: https://doi.org/10.1038/s41746-021-00464-x. doi:10.1038/s41746-021-00464-x.
[28] Y. Matveev, O. Makhnytkina, P. Posokhov, A. Matveev, S. Skrylnikov, Personalizing
hybrid-based dialogue agents, Mathematics 10 (2022) 4657. URL: https://doi.org/10.3390/
math10244657. doi:10.3390/math10244657.
[29] Y. Yang, J. Cao, Y. Wen, P. Zhang, Multiturn dialogue generation by modeling
sentencelevel and discourse-level contexts, Scientific Reports 12 (2022). URL: https://doi.org/10.
1038/s41598-022-24787-1. doi:10.1038/s41598-022-24787-1.
[30] Z. Ahmad, A. Ekbal, S. Sengupta, P. Bhattacharyya, Neural response generation for task
completion using conversational knowledge graph, PLOS ONE 18 (2023) e0269856. URL:
https://doi.org/10.1371/journal.pone.0269856. doi:10.1371/journal.pone.0269856.
[31] D. Jannach, L. Chen, Conversational recommendation: A grand AI challenge, AI Magazine
43 (2022) 151–163. URL: https://doi.org/10.1002/aaai.12059. doi:10.1002/aaai.12059.
[32] S. Huh, Are ChatGPT's knowledge and interpretation ability comparable to those of
medical students in korea for taking a parasitology examination?: a descriptive study,
Journal of Educational Evaluation for Health Professions 20 (2023) 1. URL: https://doi.org/
10.3352/jeehp.2023.20.01. doi:10.3352/jeehp.2023.20.01.
[33] G. Caldarini, S. Jaf, K. McGarry, A literature survey of recent advances in
chatbots, Information 13 (2022) 41. URL: https://doi.org/10.3390/info13010041. doi:10.3390/
info13010041.
[34] V. Shankar, S. Parsana, An overview and empirical comparison of natural language
processing (NLP) models and an introduction to and empirical application of autoencoder
models in marketing, Journal of the Academy of Marketing Science 50 (2022) 1324–1350.</p>
      <p>URL: https://doi.org/10.1007/s11747-022-00840-3. doi:10.1007/s11747-022-00840-3.
[35] N. Alswaidan, M. E. B. Menai, A survey of state-of-the-art approaches for emotion
recognition in text, Knowledge and Information Systems 62 (2020) 2937–2987. URL:
https://doi.org/10.1007/s10115-020-01449-0. doi:10.1007/s10115-020-01449-0.
[36] S.-E. Kim, Y.-S. Lim, S.-B. Park, Strong influence of responses in training dialogue response
generator, Applied Sciences 11 (2021) 7415. URL: https://doi.org/10.3390/app11167415.
doi:10.3390/app11167415.
[37] N. Tsinganos, P. Fouliras, I. Mavridis, Applying BERT for early-stage recognition of
persistence in chat-based social engineering attacks, Applied Sciences 12 (2022) 12353.</p>
      <p>URL: https://doi.org/10.3390/app122312353. doi:10.3390/app122312353.
[38] H. Snyder, Literature review as a research methodology: An overview and guidelines,
Journal of Business Research 104 (2019) 333–339. URL: https://doi.org/10.1016/j.jbusres.
2019.07.039. doi:10.1016/j.jbusres.2019.07.039.
[39] R. Usbeck, X. Yan, A. Perevalov, L. Jiang, J. Schulz, A. Kraft, C. Moeller, J. Huang,
J. Reineke, A.-C. N. Ngomo, M. Saleem, A. Both, Semantic Web Journal under
submission (2023). URL: https://semantic-web-journal.net/content/qald-10-%E2%80%
94-10th-challengequestion-answering-over-linked-data.
[40] M. Dubey, D. Banerjee, A. Abdelkawi, J. Lehmann, Lc-quad 2.0: A large dataset for complex
question answering over wikidata and dbpedia, in: Proceedings of the 18th International
Semantic Web Conference (ISWC), Springer, 2019.
[41] P. Trivedi, G. Maheshwari, M. Dubey, J. Lehmann, Lc-quad: A corpus for complex question
answering over knowledge graphs, in: Proceedings of the 16th International Semantic
Web Conference (ISWC), Springer, 2017, pp. 210–218.
[42] D. Diefenbach, T. P. Tanon, K. D. Singh, P. Maret, Question answering benchmarks for
wikidata, in: Proceedings of the ISWC 2017 Posters &amp; Demonstrations and Industry Tracks
co-located with 16th International Semantic Web Conference (ISWC 2017), Vienna, Austria,
October 23rd - to - 25th, 2017., 2017. URL: http://ceur-ws.org/Vol-1963/paper555.pdf.
[43] C. Unger, C. Forescu, V. Lopez, A.-C. Ngonga Ngomo, E. Cabrio, P. Cimiano, S. Walter,</p>
      <p>Question answering over linked data (qald-5), 2015.
[44] D. Vrandečić, M. Krötzsch, Wikidata: A free collaborative knowledgebase, Commun. ACM
57 (2014) 78–85. URL: https://doi.org/10.1145/2629489. doi:10.1145/2629489.
[45] V. Lopez, C. Unger, P. Cimiano, E. Motta, Evaluating question answering over linked data,
Web Semantics Science Services And Agents On The World Wide Web 21 (2013) 3–13.
doi:10.1016/j.websem.2013.05.006.
[46] A. Bordes, N. Usunier, S. Chopra, J. Weston, Large-scale simple question answering with
memory networks, 2015. URL: https://arxiv.org/abs/1506.02075. doi:10.48550/ARXIV.
1506.02075.
[47] R. F. Jonathan Berant, Andrew Chou, P. Liang, Semantic parsing on freebase from
questionanswer pairs, Proceedings of the 2013 Conference on Empirical Methods in Natural
Language Processing (2013) 1533–1544.
[48] D063520, Github - wdaquacore0questions, 2017. URL: https://github.com/WDAqua/</p>
      <p>WDAquaCore0Questions, accessed on February 28, 2023.
[49] S. de Rooij, W. Beek, P. Bloem, F. van Harmelen, S. Schlobach, Are names meaningful?
quantifying social meaning on the semantic web, in: P. Groth, E. Simperl, A. J. G. Gray,
M. Sabou, M. Krötzsch, F. Lécué, F. Flöck, Y. Gil (Eds.), The Semantic Web - ISWC 2016 - 15th
International Semantic Web Conference, Kobe, Japan, October 17-21, 2016, Proceedings,
Part I, volume 9981 of Lecture Notes in Computer Science, 2016, pp. 184–199. URL: https:
//doi.org/10.1007/978-3-319-46523-4_12. doi:10.1007/978-3-319-46523-4\_12.
[50] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation
for Statistical Computing, Vienna, Austria, 2022. URL: https://www.R-project.org/.
[51] I. Rudnytskyi, openai: R Wrapper for OpenAI API, 2023. URL: https://CRAN.R-project.org/
package=openai, r package version 0.4.0.
[52] M. Popov, WikidataQueryServiceR: API Client Library for ’Wikidata Query Service’, 2020.</p>
      <p>URL: https://CRAN.R-project.org/package=WikidataQueryServiceR, r package version
1.0.0.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Orogat</surname>
          </string-name>
          , I. Liu,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Roby</surname>
          </string-name>
          ,
          <article-title>Cbench: Towards better evaluation of question answering over knowledge graphs, 2021</article-title>
          . URL: https://arxiv.org/abs/2105.00811. doi:
          <volume>10</volume>
          .48550/ARXIV. 2105.00811.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Diefenbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Both</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Maret</surname>
          </string-name>
          ,
          <article-title>Towards a question answering system over the semantic web</article-title>
          ,
          <year>2018</year>
          . URL: https://arxiv.org/abs/
          <year>1803</year>
          .00832. doi:
          <volume>10</volume>
          .48550/ARXIV.
          <year>1803</year>
          .
          <volume>00832</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dhandapani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vadivel</surname>
          </string-name>
          ,
          <article-title>Question answering system over semantic web</article-title>
          ,
          <source>IEEE Access 9</source>
          (
          <year>2021</year>
          )
          <fpage>46900</fpage>
          -
          <lpage>46910</lpage>
          . URL: https://doi.org/10.1109/access.
          <year>2021</year>
          .
          <volume>3067942</volume>
          . doi:
          <volume>10</volume>
          .1109/ access.
          <year>2021</year>
          .
          <volume>3067942</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Vakulenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D. F.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Polleres</surname>
          </string-name>
          , M. de Rijke, M. Cochez,
          <article-title>Message passing for complex question answering over knowledge graphs</article-title>
          ,
          <source>in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management</source>
          ,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          ,
          <year>2019</year>
          . URL: https://doi.org/10.1145/3357384.3358026. doi:
          <volume>10</volume>
          .1145/3357384.3358026.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stockinger</surname>
          </string-name>
          ,
          <string-name>
            <surname>T. M. de Farias</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Anisimova</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Gil</surname>
          </string-name>
          ,
          <article-title>Querying knowledge graphs in natural language</article-title>
          ,
          <source>Journal of Big Data</source>
          <volume>8</volume>
          (
          <year>2021</year>
          ). URL: https://doi.org/10.1186/ s40537-020-00383-w. doi:
          <volume>10</volume>
          .1186/s40537-020-00383-w.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bao</surname>
          </string-name>
          , L. Liu,
          <article-title>Simple question answering over knowledge graph enhanced by question pattern classification</article-title>
          ,
          <source>Knowledge and Information Systems</source>
          <volume>63</volume>
          (
          <year>2021</year>
          )
          <fpage>2741</fpage>
          -
          <lpage>2761</lpage>
          . URL: https://doi.org/10.1007/s10115-021-01609-w. doi:
          <volume>10</volume>
          .1007/ s10115-021-01609-w.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Krisnadhi</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Budi</surname>
          </string-name>
          ,
          <article-title>A better entity detection of question for knowledge graph question answering through extracting position-based patterns</article-title>
          ,
          <source>Journal of Big Data</source>
          <volume>9</volume>
          (
          <year>2022</year>
          ). URL: https://doi.org/10.1186/s40537-022-00631-1. doi:
          <volume>10</volume>
          .1186/s40537-022-00631-1.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. E.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. Z.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mahmood</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y. Zhang,</surname>
          </string-name>
          <article-title>Conversational question answering: a survey</article-title>
          ,
          <source>Knowledge and Information Systems</source>
          <volume>64</volume>
          (
          <year>2022</year>
          )
          <fpage>3151</fpage>
          -
          <lpage>3195</lpage>
          . URL: https://doi.org/10.1007/s10115-022-01744-y. doi:
          <volume>10</volume>
          .1007/s10115-022-01744-y.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Mercer</surname>
          </string-name>
          ,
          <article-title>The rise of chat gpt: The future of conversational ai</article-title>
          , https://medium.com/@conan.
          <article-title>mercer/the-rise-of-chat-gpt-the-future-of-conversationalai-</article-title>
          <string-name>
            <surname>91622b9db303</surname>
          </string-name>
          ,
          <year>2023</year>
          .
          <source>Accessed on March 06</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Garg</surname>
          </string-name>
          ,
          <article-title>ChatGPT alternatives that will blow your mind in 2023 - writesonic</article-title>
          .com, https://writesonic.com/blog/chatgpt-alternatives/,
          <source>2023. Accessed on March 06</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Usbeck</surname>
          </string-name>
          ,
          <article-title>Knowledge graph question answering datasets and their generalizability</article-title>
          ,
          <source>in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , ACM,
          <year>2022</year>
          . URL: https://doi.org/10.1145/3477495.3531751. doi:
          <volume>10</volume>
          .1145/3477495.3531751.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Acheampong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Nunoo-Mensah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Transformer models for text-based emotion detection: a review of BERT-based approaches</article-title>
          ,
          <source>Artificial Intelligence Review</source>
          <volume>54</volume>
          (
          <year>2021</year>
          )
          <fpage>5789</fpage>
          -
          <lpage>5829</lpage>
          . URL: https://doi.org/10.1007/s10462-021-09958-2. doi:
          <volume>10</volume>
          .1007/ s10462-021-09958-2.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Abbasiantaeb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Momtazi</surname>
          </string-name>
          ,
          <article-title>Entity-aware answer sentence selection for question answering with transformer-based language models</article-title>
          ,
          <source>Journal of Intelligent Information Systems</source>
          <volume>59</volume>
          (
          <year>2022</year>
          )
          <fpage>755</fpage>
          -
          <lpage>777</lpage>
          . URL: https://doi.org/10.1007/s10844-022-00724-6. doi:
          <volume>10</volume>
          .1007/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>