<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Instruction-Tuned Language Models as Judges for SPARQL Query Correctness in Knowledge Graph Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aleksandr Gashkov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Eltsova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aleksandr Perevalov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Both</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CBZ München GmbH</institution>
          ,
          <addr-line>Heilbronn</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Web &amp; Software Engineering (WSE) Research Group, Leipzig University of Applied Sciences (HTWK Leipzig)</institution>
          ,
          <addr-line>Karl-Liebknecht-Straße 132, 04277 Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Nowadays, the research community pays increasing attention to the challenge of trustworthy Knowledge Graph Question Answering (KGQA) systems, as users expect high-quality, correct answers to natural-language questions over continuously growing Knowledge Graphs (KGs). However, modern KGQA systems still generate many incorrect SPARQL queries, leading to many incorrect answers presented to users. In this paper, we follow our long-term research agenda of providing an approach that advances the trustworthiness of KGQA systems by filtering out incorrect query candidates (following the principle: no answer is better than a wrong answer). The approach presented in this paper is based on LLMs that help to distinguish between correct and incorrect query candidates. Here, we aim to create a general approach that is, firstly, independent of the used (a) language(s), (b) KGs, and (c) LLMs, and, secondly, can improve the answer quality of any KGQA system. For our experiments, we used LLMs from the following families: DeepSeek, Llama, Mistral, OpenAI, and Qwen. The LLMs were applied as post-processing SPARQL query filters to the two state-of-the-art multilingual KGQA systems QAnswer and MST5. The approach was evaluated using the multilingual Wikidata-based dataset QALD-9-plus. The experimental results indicate a reasonable quality improvement for all languages when using the approach presented in this paper.</p>
      </abstract>
      <kwd-group>
        <kwd>Question Answering over Knowledge Graphs</kwd>
        <kwd>SPARQL Validation</kwd>
        <kwd>Trustworthiness</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Multilingual Approach</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The use of large language models (LLMs) in many areas of NLP, including question answering
(QA), is a recent trend in the research community (e.g., [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5">1, 2, 3, 4, 5</xref>
        ]). KGQA systems are
dedicated to bridging the gap between Linked Data and end-users by converting natural-language
(NL) questions into structured queries (e.g., SPARQL queries). Multilingual KGQA aims
to retrieve answers from a KG for questions in multiple languages.
      </p>
      <p>[Figure 1: The traditional process vs. the proposed query filtering. A KGQA system (black box)
computes, for a natural-language question q, a query candidate as a SPARQL query intended to
reflect the user’s demands, fetches data from the knowledge graph, and generates a (possible)
natural-language answer for q. In the proposed extension, an LLM-driven query validator checks
whether the predicted query reflects the intention of q: if yes, the query candidate is kept; if no,
it is removed. In the depicted exemplary scenario, the computed SPARQL query was incorrect
w.r.t. the given q, so instead of showing the incorrect answer to the user, the QA system returns
“answer could not be computed”.]</p>
      <p>
        During answer generation for a question, a (monolingual or multilingual) KGQA system
usually creates a ranked list of SPARQL queries that are (hopefully) suitable for retrieving the
correct answers to a particular question from a knowledge graph. Thereafter, a ranked Top-k of
the retrieved answers becomes visible to the end-users (usually k is 1). The KGQA system is
supposed to always generate a correct answer based on the information from the deployed KG.
However, even the best real-world KGQA systems often provide erroneous answers: according to
Zimina et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the Precision of the analyzed systems varies from 0.22 to 0.66. Hence, there is still the
challenge that some incorrect queries could be prioritized over correct ones. This leads to a
decrease in the quality and, therefore, the trustworthiness of a KGQA system. Nevertheless,
research often overlooks the fact that trustworthiness is important, especially if the KGQA
quality is not high enough, which is typically the case for non-English KGQA systems. Therefore,
our approach aims at removing incorrect SPARQL queries, such that, in the worst case, the user
is presented with no answer rather than a wrong one.
      </p>
      <p>This paper tackles the mentioned challenge by introducing a SPARQL query filtering approach
that uses LLMs to differentiate between correct and incorrect SPARQL queries. The approach
is language- and KG-agnostic and can significantly improve the results of any KGQA system.
If the system generates a list of at least two query candidates, the improvement achieved by our
approach can be even more considerable due to re-ranking the candidates in the list. The general
idea of the approach introduced by this paper is presented in Figure 1.</p>
      <p>Therefore, our research is aimed at answering the following research questions:
RQ1: To what extent is it possible to provide a generalized validation process for SPARQL
queries that increases the quality and trustworthiness of the answers of a KGQA system?
RQ2: How can we create a language- and KGQA-agnostic validation process?
RQ3: What is the best possible result using state-of-the-art LLMs?</p>
      <p>These research questions are intended to highlight the possibility of using LLMs for the task
of validating a SPARQL query against an NL question.</p>
      <p>
        Our approach was evaluated on the well-known multilingual QALD-9-plus [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ] dataset
(English, German, and Spanish languages) and two real KGQA systems (QAnswer [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and MST5
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]). The experiments were conducted by utilizing various LLMs of different sizes, ranging
from 7B to 123B parameters. The obtained results show a strong impact on the quality with
respect to both scores and all languages.
      </p>
      <p>This paper has the following structure. First, we describe the related work (see Section 2)
followed by the presentation of our approach in Section 3. Section 4 highlights the used QA
system, dataset, and LLMs and describes the setup and execution of our experiments, whose data
are evaluated and analyzed in Section 5. Section 6 briefly describes limitations and discusses
the results. Section 7 concludes the paper and outlines future work. The data is available in the
online appendix at: https://github.com/WSE-research/Validation-2025-Data.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        QA is one of the most important research fields in natural language processing (NLP). With
the advent of LLMs, the research interest in the QA problems has been increasing. Currently,
there are many papers covering the benefits of exploiting the LLMs for many KGQA tasks
(e.g., [
        <xref ref-type="bibr" rid="ref11 ref12 ref2 ref4 ref5">2, 4, 5, 11, 12</xref>
        ], etc.). However, the problem of multilingualism in the field of
KGQA remains underinvestigated, although it is very important for both researchers and users. Most
research in the area of KGQA is still focused on monolingual (i.e., English) settings, since both
building a large-scale KG and annotating QA data are expensive for each new language. Hence,
our literature review indicates that multilingualism in KGQA is still a major challenge due to, on the one
hand, the saturation of the KGQA field with work on English data (the inherent challenges of
translating datasets and the reliance on English-only knowledge bases) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and, on the other
hand, the scarcity of both the multilingual KGQA systems [
        <xref ref-type="bibr" rid="ref14 ref15 ref16">14, 15, 16</xref>
        ] and multilingual datasets
[
        <xref ref-type="bibr" rid="ref13 ref14 ref15 ref17 ref18 ref19 ref20 ref21">13, 14, 15, 17, 18, 19, 20, 21</xref>
        ]. However, recently, there has been a rising demand for multilingual
QA systems, which motivates researchers to focus on the problem of multilingual QA. To bridge
this gap, many multilingual solutions for QA use machine translation (MT) for translating input
questions (e.g., [
        <xref ref-type="bibr" rid="ref22 ref23">22, 23</xref>
        ]), which can be easily integrated into a monolingual system. However,
such a pipeline depends heavily on the quality of the MT methods used and cannot provide
users with good answer quality [
        <xref ref-type="bibr" rid="ref21 ref22">21, 22</xref>
        ] due to the limitation of a small set of languages covered by
existing KGs [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>
        Other solutions utilize cross-lingual knowledge transfer or implement multilingual LMs (e.g.,
[
        <xref ref-type="bibr" rid="ref24 ref25 ref26 ref5">5, 24, 25, 26</xref>
        ]). Despite being a promising way to compensate for the lack of multilingual data, these approaches
do not always produce acceptable results (e.g., an increase of the F1 score by 0-7%), as they can incur
the risk of negative transfer when there exists a large language shift [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. Another problem we
faced when analyzing the contributions of other researchers is that the data provided by them
could not be compared properly due to different focuses, languages, metrics used, etc.
      </p>
      <p>
        The problem of query validation is also a novel one in the field of KGQA. Query validation
is understood as the process of checking the validity of the provided query with respect to the
asked question, which can improve both the quality and performance of a QA system, being
beneficial for knowledge-intensive and expert-reliant tasks that require evidence to validate
generated text outputs. However, there is just a very limited number of studies on answer
and/or query validation in the context of KGQA systems (e.g., [
        <xref ref-type="bibr" rid="ref15 ref28">15, 28</xref>
        ]). On the other hand,
some approaches to semantic parsing that treat it as a problem of
semantic graph generation and re-ranking have recently appeared [
        <xref ref-type="bibr" rid="ref29 ref30 ref31">29, 30, 31</xref>
        ]. The recent implementation of LLMs in
similar tasks [
        <xref ref-type="bibr" rid="ref32 ref33 ref34 ref35">32, 33, 34, 35</xref>
        ] could be a promising direction for enhancing the answer (or query)
validation systems.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Approach</title>
      <p>Our approach deals with filtering incorrect SPARQL query candidates generated by a KGQA
system in response to a natural-language question. Questions are considered in multiple
languages, which makes our approach more general. Our approach’s core is to employ
instruction-tuned LLMs for binary classification tasks as filters eliminating incorrect SPARQL queries. In
this work, we tackle the problem of query validation considering a KGQA system as a
black box where the input is a question and the output is an answer in a natural-language form (cf.
Figure 1). It provides a “user” with one answer (the top-ranked candidate) and cannot affect a
KGQA system in any way. Hence, we do not evaluate the quality of an entire QA system, but only
the quality of a query validation module.</p>
      <p>Let S represent a KGQA system, s.t., S : q_i^l → A_i, where:
• Input: q_i^l denotes a natural-language question written in a specific language l (e.g.,
German), where i represents an identifier of the question in a dataset.
• Output: A_i = {SPARQL_1, SPARQL_2, . . . , SPARQL_n} represents the output of the KGQA
system for the question q_i^l. A_i is an ordered collection (i.e., list) of SPARQL query
candidates, which may be (1) empty, (2) contain one or multiple correct queries, or (3)
consist entirely of incorrect queries.</p>
      <p>Each question q_i has a list of ground-truth answers G_i defined by a dataset (it can be empty).
Afterward, a SPARQL query produced by S returns another list of answers G′_i as the prediction.
Therefore, we evaluate the correctness of a query with a function Eval that (1) takes the answers
generated by a SPARQL query, G′_i, and the ground-truth answers G_i as input, (2) calculates the
F1 score over the provided answer sets, and (3) assigns a label = {correct, incorrect} that
indicates the correctness of the answer of this query as follows:
Eval(G_i, G′_i) = correct, if F1 score(G_i, G′_i) = 1.0; incorrect, otherwise. (1)
Therefore, to enhance the QA quality by filtering SPARQL query candidates, we need to create
a function Filter that represents a binary classifier, s.t., Filter : (q_i^l, SPARQL_j) → label. Since
the filtering function Filter does not reorder the list but eliminates list items marked as incorrect, the
correct query can only be placed at the top of the list if all incorrect ones before it are removed.</p>
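      <p>The correctness check of Equation (1) can be sketched as follows (a minimal Python illustration; the function names f1_score and evaluate_query are ours, not from the paper):</p>

```python
def f1_score(gold: set, predicted: set) -> float:
    """F1 score over two answer sets (ground truth vs. query result)."""
    if not gold and not predicted:
        return 1.0  # both empty: treat as a perfect match
    tp = len(gold & predicted)  # answers shared by both sets
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def evaluate_query(gold: set, predicted: set) -> str:
    """A query candidate counts as correct only if its answer set matches
    the ground truth exactly, i.e., the F1 score is 1.0; otherwise incorrect."""
    return "correct" if f1_score(gold, predicted) == 1.0 else "incorrect"

print(evaluate_query({"Q2331679"}, {"Q2331679"}))          # → correct
print(evaluate_query({"Q2331679"}, {"Q2331679", "Q571"}))  # → incorrect
```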
      <p>Verbalization and Binary Classification of SPARQL Queries. To create the filtering
function Filter, we exploit LLMs (cf. Section 4.1). Many KGs do not provide human-readable URIs for
their entities (e.g., Abraham Lincoln is denoted as Q911 in Wikidata); therefore, we suppose that
SPARQL queries for such KGs should be verbalized, i.e., transformed into NL-like representations
using the labels of the corresponding entities from a given KG (e.g., Wikidata).</p>
      <p>[Figure 2: Two exemplary zero-shot validation prompts. Each prompt starts with “Review the
provided SPARQL query and the question.”, presents the raw SPARQL query and the question, lists
the Wikidata labels of the URIs used in the query (knowledge injection), and ends with “Are the
query and the question identical? Answer Yes or No.”. The first example pairs the German question
“Wer ist der Autor des Buches Traumdeutung?” (Who is the author of the book The Interpretation
of Dreams?) with a query using wd:Q571 (Buch/book), wdt:P50 (Autor/author), and wd:Q2331679
(Stanley Deser); the second pairs “Which awards did Douglas Hofstadter win?” with the query
SELECT DISTINCT ?uri WHERE { wd:Q319308 wdt:P166 ?uri . } and the labels wd:Q319308 -
Douglas Hofstadter, wdt:P166 - award received.]</p>
      <sec id="sec-3-3">
        <p>For example, the NL question What country is Mount Everest in? has the following SPARQL
representation SELECT DISTINCT ?o1 WHERE { wd:Q513 wdt:P17 ?o1 . } and the
following low-level verbalization: The query is: SELECT DISTINCT ?o1 WHERE { wd:Q513
wdt:P17 ?o1 . }. The labels in the query are: wd:Q513 - Mount Everest, wdt:P17 - country.</p>
        <p>The knowledge injection provided with a prompt grants the LLMs information about the textual
representation of the URIs.</p>
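        <p>A minimal sketch of how such a knowledge-injection prompt can be assembled (the helper name build_prompt and the exact string layout are ours; cf. the prompt examples in Figure 2):</p>

```python
def build_prompt(query, labels, question):
    """Builds a zero-shot validation prompt with knowledge injection:
    the raw SPARQL query plus human-readable labels for its URIs."""
    label_lines = ",\n".join(f"{uri} - {label}" for uri, label in labels.items())
    return (
        "Review the provided SPARQL query and the question.\n"
        f"The query:\n{query}\n"
        f"The question: {question}\n"
        f"The labels in the query are:\n{label_lines}.\n"
        "Are the query and the question identical? Answer Yes or No."
    )

prompt = build_prompt(
    "SELECT DISTINCT ?o1 WHERE { wd:Q513 wdt:P17 ?o1 . }",
    {"wd:Q513": "Mount Everest", "wdt:P17": "country"},
    "What country is Mount Everest in?",
)
```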
        <p>Evaluation of QA Quality. For measuring the effect of our approach (i.e., the SPARQL query
filtering) on the QA quality, we use relative metrics of answer quality, which are calculated
based on the QA results before and after applying the approach (in this paper, we take
into consideration only the top-1 query). It is worth mentioning that the QALD-9-plus
benchmark is supposed to have at least one correct answer for each question, and each top-1
SPARQL query generated by a KGQA system is treated as a prediction. In this particular case,
Precision@1 and Recall@1 are always the same and equal to the F1@1 score.</p>
        <p>
          We use the Answer Trustworthiness Score (ATS) to estimate the trustworthiness of a QA system
(following the definition in [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]), where for all questions q in a dataset D, a score per question
is computed, summed up, and normalized to the range of −1 to +1. Following the statement
“no answer is better than a wrong answer”, there is no penalty if a KGQA system returns no result
(i.e., systems showing fewer incorrect answers to users achieve a higher score).
        </p>
        <p>A QA system can trivially achieve an average ATS of 0 just by responding with no answer
to all questions in D. To achieve a positive ATS, a QA system must provide more correct than
incorrect answers (cf. Figure 5). Thus, the ATS is stricter than other common metrics and
an ideal metric for measuring the quality of KGQA systems. In this paper, we use the relative
ATS change to measure the impact of the validation process.</p>
        <p>The second metric, relative recall (RR), shows how many correct answers were preserved in
the answer pool. It ranges from 0.0 to 1.0; the higher the metric, the better the quality of the
validator. A value of 0.0 means that all correct answers were removed from the answer
pool (cf. Figure 4).</p>
        <p>As mentioned above, in this particular case, all the metrics – Precision, Recall, and F1 – are
equal, therefore, we do not need to calculate Precision and F1.</p>
        <p>Quality and the Validation Process. Obviously, the quality of the answers after validation
strictly depends on the quality of the KGQA results.</p>
        <p>The baseline for the ATS is its value calculated for the QA system before validation, which is
also treated as the lower bound ATS_low. The upper bound for the ATS (i.e., the maximal achievable
value), ATS_high, is calculated for a perfect process outcome: all incorrect answers are removed and
all correct ones are preserved (cf. Figure 4, Figure 5). If a QA system produces c correct answers
and w incorrect ones, then the bounds can be calculated with the following formulas:
ATS_low = (c − w) / (c + w),    ATS_high = c / (c + w).</p>
        <p>After defining the bounds, we can define a new metric: the relative ATS change. Let ATS′ be the
ATS of the system after validation. Then, the relative ATS change can be easily found as:
rATS = (ATS′ − ATS_low) / (ATS_high − ATS_low).
rATS is less dependent on the quality of the KGQA system and is, thus, more robust and informative.</p>
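        <p>The bound and relative-change computations can be sketched in Python (an illustration under our notation; the example numbers are hypothetical):</p>

```python
def ats(correct, incorrect, total):
    """Answer Trustworthiness Score: +1 per correct answer, -1 per incorrect
    answer, 0 per unanswered question, normalized by the dataset size."""
    return (correct - incorrect) / total

def relative_ats_change(c, w, ats_after):
    """rATS = (ATS' - ATS_low) / (ATS_high - ATS_low); the bounds assume that
    all c + w questions received an answer before filtering."""
    ats_low = (c - w) / (c + w)   # baseline: no filtering at all
    ats_high = c / (c + w)        # perfect filtering: every incorrect answer removed
    return (ats_after - ats_low) / (ats_high - ats_low)

# Hypothetical outcome: 130 correct / 308 incorrect before filtering; afterwards
# 100 correct answers survive and 50 incorrect ones slip through the filter.
ats_after = ats(100, 50, 130 + 308)
print(round(relative_ats_change(130, 308, ats_after), 2))  # → 0.74
```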
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <sec id="sec-4-1">
        <title>4.1. Material</title>
        <p>In this part, we briefly describe the components used to manifest the experimental environment:
QA systems, Dataset, and LLMs.</p>
        <p>
          The KGQA systems QAnswer and MST5. Out of many existing QA systems, we have
chosen the state-of-the-art system QAnswer because of its multilingualism (the system supports 8
languages), support for multiple KGs (including Wikidata), robustness (cf. [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ]), portability, and
accessibility (cf. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]). Additionally, it demonstrates high precision and recall, i.e., a high answer
quality [
          <xref ref-type="bibr" rid="ref20 ref36 ref37 ref38 ref39">20, 37, 36, 38, 39</xref>
          ], and it provides an API for asking a question and receiving the
corresponding ranked query candidate list (up to 60 candidates) [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ].
        </p>
        <p>
          MST5 presents a new strategy for multilingual KGQA. It emphasizes incorporating and
utilizing additional knowledge, such as entity link tags and linguistic context, via a
transformerbased model [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. In the MST5 approach, linguistic context and entity information are
extracted from the input NL question. Then, the extracted information is concatenated with the
input before being passed on to the language model. The language model generates the resulting
SPARQL query. MST5 significantly outperforms the competing systems (DeepPavlov-2023 [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ],
QAnswer [
          <xref ref-type="bibr" rid="ref36 ref41">36, 41</xref>
          ], etc.) on all supported languages but also achieves comparable results on
most supported languages except the low-resource languages [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>
          QALD-9-plus Dataset. The scarcity of datasets for KGQA, especially multilingual
benchmarks, is a crucial problem in the field, indicated in recent research (e.g., [
          <xref ref-type="bibr" rid="ref10 ref13 ref14 ref17 ref18 ref19 ref20 ref21">10, 13, 14, 17, 18,
19, 20, 21</xref>
          ], etc.). The QALD datasets represent a series of well-established benchmarks for
multilingual KGQA. QALD-9 [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] consists of 558 questions accompanied by a textual
representation in multiple languages, the corresponding SPARQL query (over DBpedia), the answer
entity URI, and the answer type. QALD-9-plus2 [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] is an extension of the QALD-9 dataset where
Spanish was added via [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], and the translation quality for existing languages was significantly
optimized by validations of native speakers. Therefore, the dataset supports English, German,
Russian, French, Spanish, Armenian, Belarusian, Lithuanian, Bashkir, and Ukrainian. Moreover,
QALD-9-plus also supports the Wikidata knowledge graph.
        </p>
        <p>
          Other multilingual datasets – RuBQ 2.0 [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ], MCWQ [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], and Mintaka [
          <xref ref-type="bibr" rid="ref42">42</xref>
          ] – have some flaws
and restrictions, e.g., (1) missing ground truth (Mintaka), (2) another set of languages or few
languages (RuBQ 2.0 and MCWQ), (3) machine-translated questions without any post-editing
(RuBQ 2.0), etc. These drawbacks do not allow us to use them in our experiments because of
the non-comparability of the data.
        </p>
        <p>LLMs. We used LLMs of five different publishers: OpenAI, DeepSeek, Qwen, Mistral, and
Llama.</p>
        <p>
          OpenAI’s LLMs (e.g., GPT-4) represent a significant advancement in the field of AI, offering
substantial improvements over their predecessors in terms of multimodal capabilities (processing
image and text inputs and producing text outputs), context window size, tokenization efficiency,
and processing speed [
          <xref ref-type="bibr" rid="ref43">43</xref>
          ]. This is a transformer-based model pre-trained to predict the next
token in a document. The model o1-mini is, according to its developers (OpenAI)3, a
cost-efficient reasoning model that excels at STEM, especially math and coding. Both OpenAI models
are multilingual.
        </p>
        <p>
          The Qwen 2 series, released in 2024, is a versatile suite of foundational and instruction-tuned
language models, ranging from 0.5 to 72 billion parameters [
          <xref ref-type="bibr" rid="ref44">44</xref>
          ]. Qwen 2.5 is grounded in the
transformer architecture and trained using next-token prediction. During the experiments and
evaluation, the LM showed robust multilingual capabilities and was proficient in approximately
30 languages, including English, Spanish, and German.
2https://github.com/KGQA/QALD_9_plus
3https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/
        </p>
        <p>Mistral Small 34 is a pre-trained and instruction-tuned model catering to generative AI tasks
that require robust language and instruction-following performance. According to the developers,
this new model was designed to saturate performance at a size suitable for local deployment.
In particular, Mistral Small 3 has far fewer layers than competing models, substantially reducing
the time per forward pass.</p>
        <p>Mistral Large 5 is a new cutting-edge text generation model that can be used for complex
multilingual reasoning tasks including “text understanding”. The developers claim that it is
natively fluent in English, French, Spanish, German, and Italian, with a nuanced understanding
of grammar and cultural context.</p>
        <p>The Meta Llama 3.36 multilingual large language model (LLM) is an instruction-tuned
generative model with 70B parameters (text in/text out). The developers point out that the Llama 3.3
instruction-tuned text-only model is optimized for multilingual dialogue use cases. It is an
auto-regressive LM that uses an optimized transformer architecture.</p>
        <p>
          DeepSeek-R1 [
          <xref ref-type="bibr" rid="ref45">45</xref>
          ] incorporates multi-stage training and cold-start data before reinforcement
learning. According to its developers, DeepSeek-R1 is currently optimized for Chinese and
English, which may result in language mixing issues when handling queries in other languages.
The developers also observed that the model is sensitive to prompts and, therefore, advise users
to directly describe the problem and to specify the output format using a zero-shot setting for
optimal results.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experimental Design</title>
        <p>In our experiments, we used two sets of data, each obtained as the first SPARQL query given
respectively by QAnswer and MST5 as the answer to each QALD-9-plus question (including both
test and train splits) in three languages: English, German, and Spanish. The first set, obtained
by QAnswer, consists of 130 correct and 308 incorrect queries for English, 104 correct and 461
incorrect queries for German, and 65 correct and 375 incorrect queries for Spanish. MST5
provided us with a set of 81 correct and 90 incorrect queries for English, 73 correct and 125
incorrect queries for German, and 81 correct and 135 incorrect queries for Spanish. QAnswer
produced more queries, both correct and incorrect, than MST5; moreover, its set of incorrect
queries is nearly three to five times larger than the set of correct ones.</p>
        <p>In the first step (Step 1), all queries were evaluated using Equation 1 to find the initial quality metrics
of both KGQA systems. Then, we formed a prompt (cf. Figure 2) with the knowledge injection, sent it to
all LLMs involved in the experiments, and evaluated the metrics after filtering.</p>
        <p>In the next step (Step 2), we performed the validation itself and evaluated the quality after query
filtering as well as the validation effectiveness.</p>
        <p>As described in Section 4.1, we use five groups of LLMs, namely the models of OpenAI,
DeepSeek, Qwen, Llama, and Mistral. The detailed experimental setup for Step 1 and Step 2 is described
in the following subsections.
4https://mistral.ai/en/news/mistral-small-3
5https://mistral.ai/news/mistral-large
6https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct
4.3. Step 1 – Query Evaluation
At this step, we execute all ground-truth SPARQL queries from QALD and the queries produced
by both KGQA systems on Wikidata to obtain the answer sets. A SPARQL query returns a number
of rows, one for each match. A match can be an entity, a predicate, a literal, or a set of entities
and/or predicates and/or literals. The sets returned by the gold-standard query and the candidate
must be exactly the same for the candidate to be evaluated as correct. As all the QALD questions have
a non-empty answer set, all candidates, both correct and incorrect, contribute to the metric
calculation.</p>
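        <p>The exact-match comparison of answer sets can be illustrated on SPARQL SELECT results in the standard JSON serialization (a sketch; the example bindings are invented):</p>

```python
def result_set(sparql_json):
    """Flattens a SPARQL SELECT JSON result into a set of per-row value
    tuples so that two results can be compared for exact equality."""
    variables = sparql_json["head"]["vars"]
    return {
        tuple(row.get(var, {}).get("value") for var in variables)
        for row in sparql_json["results"]["bindings"]
    }

gold = {"head": {"vars": ["o1"]},
        "results": {"bindings": [{"o1": {"type": "uri",
                                         "value": "http://www.wikidata.org/entity/Q837"}}]}}
candidate = {"head": {"vars": ["o1"]}, "results": {"bindings": []}}
print(result_set(gold) == result_set(candidate))  # → False: no exact match, so incorrect
```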
        <p>The models from all groups were taken “as-is” and were instructed with zero-shot prompts
that use the knowledge-injection technique. The prompts contain a question, a raw SPARQL query,
and (URI, label) tuples, which constitute the knowledge-injection part retrieved from Wikidata (see
Figure 2). Based on the aforementioned information, the models are instructed to generate “yes”
or “no”, corresponding to a correct or incorrect result. The temperature parameter was set to
0, where possible, and the other parameters were kept at their default values.</p>
        <p>The GPT models were used via the official OpenAI Python library7. Other models were
executed on a local server powered by two NVIDIA L40S GPUs (48GB VRAM).
4.4. Step 2 – Query Filtering
For the evaluation of the effect of SPARQL query filtering, we calculate metrics such as RR@1
and the relative change of the Answer Trustworthiness Score (rATS@1). First, we do not estimate
the quality of the QA systems but evaluate only the quality of query validation. Therefore, the
traditional metrics, like Precision, Recall, F1 score, etc., are not applicable. Second, the idea of
most QA systems is that they provide the user with only one answer, usually generated from the
top SPARQL query in a ranked list. Hence, the main task of the classifier is to filter out the
incorrect answers. Third, since MST5 provides only one query candidate, we can use metrics
for only the top candidate (@1) to properly compare the applicability of our approach to two
different QA systems.</p>
        <p>We used the three languages that are both present in the dataset and supported by QAnswer and MST5
(English, German, and Spanish) and that have enough data for the experiments. Both systems provide
good-quality queries.</p>
        <p>Unfortunately, we cannot automatically evaluate the semantics of a SPARQL query, so we
consider all semantic flaws leading to no response from Wikidata as unrecoverable errors.</p>
        <p>Query filtering was done as follows. If the LLM answered “no” to the question “Are the question
and the query identical?”, the query was removed, and the list of queries became empty because
we took into consideration only the top-1 SPARQL query. If the LLM answered “yes”, the query
was kept. For filtering, we used the same procedures as for the evaluation before.</p>
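        <p>The filtering rule described above can be sketched as follows (a minimal illustration; the function name is ours):</p>

```python
def filter_top_candidate(llm_verdict, candidates):
    """Keeps the top-1 SPARQL candidate only if the LLM answered "yes";
    otherwise the list becomes empty (no answer is better than a wrong one)."""
    if not candidates:
        return []
    return candidates[:1] if llm_verdict.strip().lower().startswith("yes") else []

query = "SELECT DISTINCT ?uri WHERE { wd:Q319308 wdt:P166 ?uri . }"
print(filter_top_candidate("Yes", [query]))  # the query is kept
print(filter_top_candidate("No", [query]))   # → []
```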
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation and Analysis</title>
      <p>7https://github.com/openai/openai-python
[Figure 3: (a) Results for QAnswer and (b) results for MST5: the number of preserved correct and
removed incorrect query candidates per model and language (de, en, es). The bars show the number
of questions answered by the corresponding QA system (the negative values count incorrect query
candidates, the positive values correct query candidates in the original answer of the systems). The
colors denote: the number of correct queries preserved during the filtering (as intended), correct
queries filtered out (not intended), incorrect queries preserved after filtering (not intended), and
incorrect candidates filtered out (as intended). Therefore, the more correct query candidates are
preserved, the better the result (in the ideal case, there is a complete solid green bar); the more
incorrect query candidates are removed (a perfect result, i.e., all incorrect queries removed, is
indicated by a full light green bar below the zero line), the better the performance of the query
validator.]
In the following, c and w denote the number of correct and incorrect answers before filtering,
respectively; c′ and w′ the numbers of correct and incorrect answers after filtering, respectively;
and rATS@1 and RR@1 the relative change of ATS@1 and the relative recall in the validation
process. The statistics of the validation process are demonstrated in Figure 4. We determine the
best-performing model while aiming at rATS@1 and RR@1.</p>
      <p>As the ATS reflects the idea of “no answer is better than a wrong answer”, the results after
filtering demonstrate huge improvements, showing that our approach has a very strong impact
on the QA trustworthiness given the reference QAnswer and MST5 systems. Both Table 1 and
Figure 4 demonstrate that the LLMs of the smallest sizes (all models of 7B and 14B parameters)
tend to estimate all candidates as incorrect and, therefore, eliminate them. In this case, the
ATS equals 0 after the filtering, so the ΔATS@1 is not very high; moreover, the P@1 usually equals
0 or is slightly above 0, i.e., users always get from a QA system an answer like “Sorry, the correct
answer could not be computed”, which may not satisfy them.</p>
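<p>The “no answer is better than a wrong answer” principle can be illustrated by an ATS-style aggregation. The weights below (+1 for a correct answer, −2 for a wrong one, 0 for no answer) are our assumptions for illustration only; the essential property is merely that a wrong answer scores worse than silence:</p>

```python
def answer_trustworthiness(outcomes, reward=1.0, penalty=2.0):
    """Illustrative ATS-style score over per-question outcomes.

    Each outcome is "correct", "incorrect", or "none" (no answer shown).
    The concrete weights are assumed, not taken from the paper.
    """
    score = 0.0
    for outcome in outcomes:
        if outcome == "correct":
            score += reward
        elif outcome == "incorrect":
            score -= penalty
        # "none" contributes 0: no answer is better than a wrong answer
    return score
```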
      <p>The larger LLMs of all groups demonstrate much better improvements regarding both metrics;
however, they tend to preserve nearly all correct queries while also keeping many incorrect ones.
Therefore, while demonstrating a rather high value of P@1, they lose in ΔATS@1. Another
notable observation is that fewer correct answers were preserved for English than for
German and Spanish. Moreover, the ΔATS@1 for English is also lower than for German and
Spanish. The reason for this phenomenon could be that there were fewer incorrect
and more correct queries before filtering for English, i.e., the initial quality of the QA systems
(without validation) is higher for English than for the other languages. Therefore, integrating
our validation approach into a QA system can also significantly improve its non-English output.</p>
      <p>Regarding the ΔATS@1 metric, the best improvement is demonstrated by GPT-o1-mini;
however, the model preserves less than 50% of the correct answers (46.2% for English, 41.3% for
German, and 36.9% for Spanish on QAnswer; the values on MST5 are lower; the value of P@1
reflects these facts).</p>
      <p>Regarding both metrics, Llama 3.3, Mistral Large 3, and Qwen 2.5 (72B) demonstrate the best
improvement for all languages. However, there is no LLM at the moment which is close to an
ideal result: all correct queries are preserved, and all incorrect ones are filtered out. We should
also point out that, according to our results, the implementation of our approach currently
grants more benefits to QAnswer.</p>
      <p>Figure 5 illustrates the nature of ΔATS@1. Based on the results obtained with all models of Qwen 2.5
8 for both QA systems and all three languages, this figure presents the lowest possible value (ΔATS@1low),
the achieved value (ΔATS@1), and the maximal possible value (ΔATS@1high). We chose these models for
illustration because of their four different sizes and because they demonstrate the main trends
described above. In other words, this graphic represents the relative improvement of a QA
system exploiting our approach. According to the results presented in Table 1 and Figure 5, our
approach can improve a QA system when applying an LLM of any publisher and size: even the
smaller models (7B and 14B) grant an increase of ATS for all languages on both exploited
state-of-the-art QA systems, while the larger models provide much higher quality in terms of
both metrics. Our approach is also able to provide a ΔATS@1 close to its maximal value
(e.g., with Qwen 2.5 with 32B parameters for English on QAnswer and for German on MST5, cf.
Figures 5a and 5d).
8The chart with all data for all models can be found in our online appendix https://github.com/WSE-research/
Validation-2025-Data.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Limitations and Discussion</title>
      <p>In this paper, our attention was concentrated on the ability of LLMs to play the role of a “query
validator”, i.e., to distinguish between correct and incorrect query candidates generated by a
KGQA system from a natural-language question. Before discussing the results, we intend to
point out some limitations of this research. First of all, we utilized only one dataset,
QALD-9-plus, because of the scarcity of good-quality multilingual benchmarks (Sections 2 and 4.1). The
second limitation is the use of only two modern QA systems, realizing different strategies
of answering questions over a KG. Another limitation concerns the choice of only three
frequently used institutional languages from the Indo-European language family. Finally, we
used only one variant of the prompt for this research and leave the exploration of this direction
for future work. However, being language-agnostic, portable, robust, and easily reproducible,
our approach provides a wide field to investigate all these limitations in future research; e.g.,
the experiments could be reproduced on other benchmarks (datasets, QA systems, LLMs)
and, therefore, other languages; different types of prompts (language-specific, LLM-specific,
multi-shot, etc.) may be used to identify the best solution for each case.</p>
      <p>In this study, we relied on gold-standard (ground-truth) queries only to evaluate the quality
of the validation. The models used as validators do not depend on the ground truth.</p>
      <p>Our results prove that our approach has a strong impact on the validation quality in all three
languages. All LLMs demonstrate significant improvement w.r.t. both metrics used. While the
smallest models (7B parameters) show a tendency to filter out all candidates, i.e., the ATS was
below 0 before the filtering and equals 0 after it, most of the larger models are able to filter out
more incorrect candidates while preserving the correct ones. Post-experiment analysis has shown
that the integration of larger LLMs into our approach further improves the overall quality of the
QA systems. These observations highlight a crucial trade-off concerning the size of LLMs: the
larger LLMs provide better output; however, they require more and more computing resources.
But are these costs correlated with the obtained results, and do they really provide the expected
quality improvement? The answers to these questions might be a subject of dedicated research.
However, every researcher can decide which LLM to use for their specific research aims.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions and Future Work</title>
      <p>In this paper, we presented an easy-to-realize but effective approach for improving the quality
of Question Answering over Knowledge Graphs. In particular, our approach is able to remove
incorrect query candidates, s.t. the number of incorrect results shown to the users is significantly
reduced – an argument that strongly enhances the trustworthiness of QA systems. Moreover,
we concentrated our work on developing an approach that also applies to non-English questions
without using machine translation. Summing up, the unique features of our approach are as
follows:</p>
      <p>(1) The decision process is system-agnostic: it is built on top of query candidates represented
in the SPARQL format, as is typical in the field of KGQA (answer to RQ1). Hence,
our approach can be integrated into any existing QA system to improve its answer quality
(i.e., its trustworthiness).</p>
      <p>(2) We created a language-agnostic approach. Hence, it can be transferred to other languages
without changing the process itself. The only requirement is the availability of language-specific
labels in the considered Knowledge Graph. The obtained results demonstrate that our
approach is applicable to other languages and will improve the quality for questions represented
in other languages as well, with an even higher increase of trustworthiness than for English (answer
to RQ2).</p>
      <p>(3) All LLMs, both larger and smaller, can be exploited for our approach, so that users have
the choice of the technology being used. Our experiments show a strong quality improvement
for all the LLM families we used in our research; moreover, the larger models (32B parameters
or more) demonstrated more impressive results. However, their deployment might imply
a much higher computational time investment and/or cost per interaction (answer to RQ3).
In this research, we did not observe an advantage of the commercial LLMs over the
open-source LLMs.</p>
      <p>
        Future work might deal with experiments both with a language-specific prompt and an
LLM-specific prompt. Our approach could also be extended by using additional KG properties.
Moreover, an interesting direction would be a combined usage of LLMs by generating SPARQL
queries from NL questions and validating them (e.g., filtering out the incorrect ones).
Furthermore, a promising direction to improve the question answering results for non-English systems
would be to solve the problem of labels’ non-availability for not frequently used or low-resource
languages. Future studies will additionally include nDCG@k metrics (for a value of , e.g., set
to 5 or 10) to demonstrate more benefits of the proposed approach and its efectiveness in query
validation and filtering strategies (like [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]). Additionally, measuring the impact in comparison
to other quality-improving components (e.g., while integrating our approach as a component in
KGQA frameworks like Qanary [
        <xref ref-type="bibr" rid="ref46 ref47">46, 47, 48</xref>
        ]) is a promising topic while aiming for balancing
metrics like quality, costs, and runtime.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Grammarly to check grammar and
spelling. After using this tool/service, the authors reviewed and edited the content as needed
and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kocoń</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Cichecki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kaszyca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kochanek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Szydło</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Baran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bielaniewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gruza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Janz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kanclerz</surname>
          </string-name>
          , et al.,
          <article-title>ChatGPT: Jack of all trades, master of none</article-title>
          ,
          <source>Information Fusion</source>
          <volume>99</volume>
          (
          <year>2023</year>
          )
          <fpage>101861</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Omar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Mangukiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kalnis</surname>
          </string-name>
          , E. Mansour,
          <article-title>ChatGPT versus traditional question answering for knowledge graphs: Current status and future directions towards knowledge graph chatbots</article-title>
          ,
          <source>arXiv preprint arXiv:2302.06466</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Scells</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Koopman</surname>
          </string-name>
          , G. Zuccon,
          <article-title>Can ChatGPT write a good boolean query for systematic review literature search?</article-title>
          ,
          <source>in: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1426</fpage>
          -
          <lpage>1436</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Qi,
          <article-title>Can ChatGPT replace traditional KBQA models? An in-depth analysis of the question answering performance of the GPT LLM family</article-title>
          , in: International Semantic Web Conference, Springer,
          <year>2023</year>
          , pp.
          <fpage>348</fpage>
          -
          <lpage>367</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          , G. Qi,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Z.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <article-title>Benchmarking large language models in complex question answering attribution using knowledge graphs</article-title>
          ,
          <source>arXiv preprint arXiv:2401.14640</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Zimina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Järvelin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Peltonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ranta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nummenmaa</surname>
          </string-name>
          ,
          <article-title>Traqula: Transparent Question Answering over RDF through linguistic analysis</article-title>
          ,
          <source>in: International Conference on Web Engineering</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Perevalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Diefenbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Usbeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Both</surname>
          </string-name>
          ,
          <article-title>QALD-9-plus: A multilingual dataset for question answering over DBpedia and Wikidata translated by native speakers</article-title>
          ,
          <source>in: 2022 IEEE 16th International Conference on Semantic Computing (ICSC)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>229</fpage>
          -
          <lpage>234</lpage>
          . doi:10.1109/ICSC52841.2022.00045.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Soruco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Collarana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Both</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Usbeck</surname>
          </string-name>
          ,
          <article-title>QALD-9-ES: A Spanish Dataset for Question Answering Systems</article-title>
          ,
          <source>Studies on the Semantic Web</source>
          , IOS Press BV,
          <year>2023</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>52</lpage>
          . doi:10.3233/SSW230004.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Diefenbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Both</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Maret</surname>
          </string-name>
          ,
          <article-title>Towards a question answering system over the semantic web</article-title>
          ,
          <source>Semantic Web</source>
          <volume>11</volume>
          (
          <year>2020</year>
          )
          <fpage>421</fpage>
          -
          <lpage>439</lpage>
          . doi:10.3233/SW-190343.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vollmers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Zahera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moussallem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <article-title>MST5 - multilingual question answering over knowledge graphs</article-title>
          ,
          <source>CoRR abs/2407.06041</source>
          (
          <year>2024</year>
          ). doi:10.48550/ARXIV.2407.06041. arXiv:2407.06041.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mecharnia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>d'Aquin</surname>
          </string-name>
          ,
          <article-title>Performance and limitations of fine-tuned LLMs in SPARQL query generation</article-title>
          ,
          <source>in: Proceedings of the Workshop on Generative AI and Knowledge Graphs (GenAIK)</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>69</fpage>
          -
          <lpage>77</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Meloni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Motta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Osborne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Recupero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Salatino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vahdati</surname>
          </string-name>
          ,
          <article-title>Large language models for scientific question answering: An extensive analysis of the SciQA benchmark</article-title>
          , in: European Semantic Web Conference, Springer,
          <year>2024</year>
          , pp.
          <fpage>199</fpage>
          -
          <lpage>217</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aralikatte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hershcovich</surname>
          </string-name>
          ,
          <article-title>Multilingual compositional Wikidata questions</article-title>
          ,
          <source>arXiv preprint arXiv:2108.03509</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Perevalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Both</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <article-title>Multilingual question answering systems for knowledge graphs-a survey</article-title>
          ,
          <source>Semantic Web</source>
          <volume>15</volume>
          (
          <year>2024</year>
          )
          <fpage>2089</fpage>
          -
          <lpage>2124</lpage>
          . URL: https://www.semantic-web-journal.net/system/files/swj3530.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Perevalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gashkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eltsova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Both</surname>
          </string-name>
          ,
          <article-title>Language models as SPARQL query filtering for improving the quality of multilingual question answering over knowledge graphs</article-title>
          , in: K. Stefanidis,
          <string-name>
            <given-names>K.</given-names>
            <surname>Systä</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Heil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kondylakis</surname>
          </string-name>
          , E. Quintarelli (Eds.), Web Engineering, Springer Nature Switzerland, Cham,
          <year>2024</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L.-A.</given-names>
            <surname>Kafee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Biswas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Keet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. K.</given-names>
            <surname>Vakaj</surname>
          </string-name>
          , G. de Melo,
          <article-title>Multilingual knowledge graphs and low-resource languages: A review</article-title>
          ,
          <source>Transactions on Graph Data and Knowledge</source>
          <volume>1</volume>
          (
          <year>2023</year>
          )
          <fpage>10</fpage>
          -
          <lpage>1</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Usbeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. H.</given-names>
            <surname>Gusmita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saleem</surname>
          </string-name>
          ,
          <article-title>9th challenge on question answering over linked data (QALD-9)</article-title>
          , in: Semdeep/NLIWoD@ISWC,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Saxena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chakrabarti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Talukdar</surname>
          </string-name>
          ,
          <article-title>Question answering over temporal knowledge graphs</article-title>
          ,
          <source>arXiv preprint arXiv:2106.01515</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aralikatte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hershcovich</surname>
          </string-name>
          ,
          <article-title>Compositional generalization in multilingual semantic parsing over Wikidata</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>937</fpage>
          -
          <lpage>955</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gashkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perevalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eltsova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Both</surname>
          </string-name>
          ,
          <article-title>Improving question answering quality through language feature-based SPARQL query candidate validation</article-title>
          , volume
          <volume>13261</volume>
          of
          <source>Lecture Notes in Computer Science</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>217</fpage>
          -
          <lpage>235</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>E.</given-names>
            <surname>Loginova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Varanasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <article-title>Towards end-to-end multilingual question answering</article-title>
          ,
          <source>Information Systems Frontiers</source>
          <volume>23</volume>
          (
          <year>2021</year>
          )
          <fpage>227</fpage>
          -
          <lpage>241</lpage>
          . doi:10.1007/s10796-020-09996-1.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Perevalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Both</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Diefenbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Ngonga Ngomo</surname>
          </string-name>
          ,
          <article-title>Can machine translation be a reasonable alternative for multilingual question answering systems over knowledge graphs?</article-title>
          ,
          <source>in: Proceedings of the ACM Web Conference 2022, WWW '22</source>
          ,
          Association for Computing Machinery,
          <year>2022</year>
          , p.
          <fpage>977</fpage>
          -
          <lpage>986</lpage>
          . doi:10.1145/3485447.3511940.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>N.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perevalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kuchelev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moussallem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Ngonga Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Both</surname>
          </string-name>
          ,
          <article-title>Lingua franca - entity-aware machine translation approach for question answering over knowledge graphs</article-title>
          , in:
          <source>Knowledge Capture Conference, K-CAP '23</source>
          ,
          Association for Computing Machinery,
          <year>2023</year>
          , p.
          <fpage>122</fpage>
          -
          <lpage>130</lpage>
          . doi:10.1145/3587259.3627567.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <article-title>Improving zero-shot cross-lingual transfer for multilingual question answering over knowledge graph</article-title>
          , in:
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rumshisky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hakkani-Tur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bethard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cotterell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>5822</fpage>
          -
          <lpage>5834</lpage>
          . doi:10.18653/v1/2021.naacl-main.465.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>V.</given-names>
            <surname>Konovalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gulyaev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sorokin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kuratov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Burtsev</surname>
          </string-name>
          ,
          <article-title>Exploring the BERT cross-lingual transfer for reading comprehension</article-title>
          ,
          <source>in: Computational Linguistics and Intellectual Technologies</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>445</fpage>
          -
          <lpage>453</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>T.</given-names>
            <surname>Pires</surname>
          </string-name>
          ,
          <article-title>How multilingual is multilingual BERT?</article-title>
          ,
          <source>arXiv preprint arXiv:1906.01502</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Headden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Learn to cross-lingual transfer with meta graph learning across heterogeneous languages</article-title>
          ,
          <source>in: Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>2290</fpage>
          -
          <lpage>2301</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>G.</given-names>
            <surname>Maheshwari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Trivedi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lukovnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <article-title>Learning to rank query graphs for complex question answering over knowledge graphs</article-title>
          , in: International Semantic Web Conference, Springer,
          <year>2019</year>
          , pp.
          <fpage>487</fpage>
          -
          <lpage>504</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>S. W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Semantic parsing via staged query graph generation: Question answering with knowledge base</article-title>
          ,
          <source>in: Proceedings of the Joint Conference of the 53rd Annual Meeting of the ACL and the 7th International Joint Conference on Natural Language Processing of the AFNLP</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Hasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. d.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Improved neural relation detection for knowledge base question answering</article-title>
          ,
          <source>in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing</source>
          (Volume 1: Long Papers)
          ,
          Association for Computational Linguistics
          ,
          <year>2017</year>
          , pp.
          <fpage>1321</fpage>
          -
          <lpage>1331</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Täckström</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <article-title>A decomposable attention model for natural language inference</article-title>
          ,
          <source>in: Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>2249</fpage>
          -
          <lpage>2255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>N.</given-names>
            <surname>Wiratunga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Abeyratne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jayawardena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Massie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Nkisi-Orji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weerasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Fleisch</surname>
          </string-name>
          ,
          <article-title>CBR-RAG: case-based reasoning for retrieval augmented generation in LLMs for legal question answering</article-title>
          ,
          <source>in: International Conference on Case-Based Reasoning</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>445</fpage>
          -
          <lpage>460</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hakimov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Weiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schlangen</surname>
          </string-name>
          ,
          <article-title>Evaluating modular dialogue system for form filling using large language models</article-title>
          ,
          <source>in: Proceedings of the 1st Workshop on Simulating Conversational Intelligence in Chat (SCI-CHAT 2024)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>36</fpage>
          -
          <lpage>52</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ganesan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ravikumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Piplani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bhaumik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Padmanaban</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narasimhamurthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Adhikary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Deshapogu</surname>
          </string-name>
          ,
          <article-title>Automated answer validation using text similarity</article-title>
          ,
          <source>arXiv preprint arXiv:2401.08688</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>A.</given-names>
            <surname>Perevalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gashkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eltsova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Both</surname>
          </string-name>
          ,
          <article-title>Understanding SPARQL Queries: Are we already there? Multilingual Natural Language Generation based on SPARQL Queries and Large Language Models</article-title>
          , in: International Semantic Web Conference, Springer,
          <year>2024</year>
          , pp.
          <fpage>173</fpage>
          -
          <lpage>191</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>D.</given-names>
            <surname>Diefenbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Giménez-García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Both</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Maret</surname>
          </string-name>
          ,
          <article-title>QAnswer KG: designing a portable question answering system over RDF data</article-title>
          ,
          <source>in: European Semantic Web Conference</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>429</fpage>
          -
          <lpage>445</lpage>
          . doi:10.1007/978-3-030-49461-2_25.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Bisen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Alemayehu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Maret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Creighton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gorman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kundi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mgwgwi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Muhlenbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dinca-Panaitescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>El Morr</surname>
          </string-name>
          ,
          <source>Evaluation of Search Methods on Community Documents, Metadata and Semantic Research</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>I.</given-names>
            <surname>Rybin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Korablinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Efimov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Braslavski</surname>
          </string-name>
          ,
          <article-title>RuBQ 2.0: An innovated Russian question answering dataset</article-title>
          ,
          <source>in: The Semantic Web: 18th International Conference, ESWC 2021</source>
          , Springer,
          <year>2021</year>
          , pp.
          <fpage>532</fpage>
          -
          <lpage>547</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>L.</given-names>
            <surname>Siciliani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lops</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Semeraro</surname>
          </string-name>
          ,
          <article-title>MQALD: Evaluating the impact of modifiers in question answering over knowledge graphs</article-title>
          ,
          <source>Semantic Web</source>
          <volume>13</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zharikova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kornev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ignatov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Talimanchuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Evseev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Petukhova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Smilga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Karpov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shishkina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kosenko</surname>
          </string-name>
          , et al.,
          <article-title>DeepPavlov dream: platform for building generative AI assistants</article-title>
          ,
          <source>in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>599</fpage>
          -
          <lpage>607</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>K.</given-names>
            <surname>Shivashankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Benmaarouf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Steinmetz</surname>
          </string-name>
          ,
          <article-title>From graph to graph: AMR to SPARQL</article-title>
          ,
          <source>in: Proceedings of the 7th Natural Language Interfaces for the Web of Data (NLIWoD) co-located with the 19th European Semantic Web Conference (ESWC 2022)</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Aji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saffari</surname>
          </string-name>
          ,
          <article-title>Mintaka: A complex, natural, and multilingual dataset for end-to-end question answering</article-title>
          ,
          <source>in: Proceedings of the 29th International Conference on Computational Linguistics</source>
          ,
          <source>International Committee on Computational Linguistics</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1604</fpage>
          -
          <lpage>1619</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43] OpenAI,
          <article-title>GPT-4 technical report</article-title>
          ,
          <source>arXiv preprint arXiv:2303.08774</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          , et al.,
          <article-title>Qwen2 technical report</article-title>
          ,
          <source>arXiv preprint arXiv:2407.10671</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bi</surname>
          </string-name>
          , et al.,
          <article-title>DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning</article-title>
          ,
          <source>arXiv preprint arXiv:2501.12948</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>A.</given-names>
            <surname>Both</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Diefenbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shekarpour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cherix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lange</surname>
          </string-name>
          ,
          <article-title>Qanary - a methodology for vocabulary-driven open question answering systems</article-title>
          , in:
          <source>The Semantic Web. Latest Advances and New Domains</source>
          , Springer International Publishing,
          <year>2016</year>
          , pp.
          <fpage>625</fpage>
          -
          <lpage>641</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>A.</given-names>
            <surname>Both</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Diefenbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Lytra</surname>
          </string-name>
          ,
          <article-title>Rapid engineering of QA systems using the</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>