<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Question Answering over DBpedia with Fine-tuned Autoregressive Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tommaso Soru</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Saurav Joshi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sanju Tiwari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mehrzad Shahinmoghadam</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anand Panchbhai</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DBpedia Association;</institution>
          <addr-line>Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Liber AI Ltd;</institution>
          <addr-line>London</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Sharda University; Greater Noida</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Southern California; Information Sciences Institute; Marina del Rey</institution>
          ,
          <addr-line>CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>This paper presents a novel approach to question answering over DBpedia by fine-tuning autoregressive code generation models to translate natural language questions into SPARQL queries. We introduce the NSpM Dataset, a large-scale dataset of 7.7 million question–SPARQL pairs covering DBpedia's ontology, and fine-tune three open-source code generation LLMs (CodeGen-350M, StarCoder-1B, and CodeLlama-7B) on a curated subset of it. Despite limited training time, StarCoder-1B achieved the best performance across evaluation metrics. Results from the 1st Text2SPARQL Challenge at ESWC 2025 reveal that while the models partially learn SPARQL syntax, they struggle with query semantics and ranking quality.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic Web</kwd>
        <kwd>SPARQL</kwd>
        <kwd>semantic parsing</kwd>
        <kwd>question answering</kwd>
        <kwd>neural networks</kwd>
        <kwd>large language models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent years, large knowledge graphs (such as DBpedia [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Wikidata [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and Freebase [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) have
grown to contain billions of semantic triples, forming massive repositories of structured
knowledge. To democratize this knowledge, Question Answering (QA) systems have been developed to
let users retrieve information from these knowledge bases using natural language questions instead of
requiring them to write complex queries in formal languages like SPARQL [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ]. However, mapping
natural language questions onto the rich ontology of a large graph and generating an executable
SPARQL statement remains highly challenging, because a system must resolve lexical ambiguity, link
surface terms to the correct entities and relations, preserve graph-structural constraints, and generate
syntactically valid queries that support multi-hop and compositional reasoning [
        <xref ref-type="bibr" rid="ref7 ref8 ref9 ref10 ref11">7, 8, 9, 10, 11</xref>
        ].
      </p>
      <p>Early computational research on Text-to-SPARQL has predominantly relied on rule- and template-based
systems that align parsed questions with fixed SPARQL skeletons; representative examples are
the Template-Based Question Answering over Linked Data (TBSL) framework [12] and the
keyword-to-template generator described in [13], both of which require extensive manual engineering and fail to
generalize to paraphrases, unseen ontologies, and multi-hop queries. Semi-automatic approaches, notably the
skeleton-grammar neural-ranking model SPARQA [14] and the classifier-guided template-selection
method proposed in [15], learn and rank patterns from data, reducing authoring effort, yet they still inherit
rigidity from their finite template stores and remain brittle on out-of-distribution question forms.</p>
      <p>In this paper, we contribute to Text-to-SPARQL research through two primary advancements. First,
we introduce the NSpM dataset, a massive, automatically generated collection comprising 7.7 million
natural language question–SPARQL pairs that span all ontology classes within DBpedia. This dataset
is carefully assembled using a systematic template discovery and generation framework,
augmented by question paraphrasing and enhanced with commonsense knowledge validation to ensure
logical coherence and coverage of diverse and compositional query patterns. Second, we fine-tune
three specialized open-source code large language models, namely CodeGen-2.5 (350M), StarCoder
(1B), and CodeLlama (7B), on our dataset, enhancing their ability to generate structurally accurate
and ontology-aware SPARQL queries. By leveraging pretrained models explicitly designed for code
generation, our fine-tuned models effectively internalize query syntax and semantics, significantly
improving practical applicability in querying large-scale knowledge graphs. Both the NSpM dataset
and our fine-tuned models are released as open artifacts to further research and innovation in neural
query generation over structured knowledge bases.</p>
      <p>The paper is structured as follows. In Section 2, we review the work relevant to this research
field. In Section 3, we outline the approach from idea to execution. Section 4 presents the
pre- and post-submission evaluations, which we discuss in Section 5. Finally, we conclude.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        At the time of writing, many recent works have applied neural sequence models to the Text-to-SPARQL
task. For example, a pointer-network model with relation-aware attention was introduced in [16],
and large pre-trained language models such as T5, BART, and pointer-generator networks were shown to
effectively arrange entities and relations into SPARQL queries [17]. In a similar vein, SGPT (stacked
Transformer encoders + GPT-2) was proposed in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], but it often fails to capture complex multi-hop
graph relations, leading to mistakes in the generated triples. Prompt-based methods (e.g., one-shot
SPARQL generation by injecting relevant KG subgraphs into the model’s context window) have been
explored in [18], yet they still exhibit systematic errors such as the “triple flip” (swapping subject
and object positions in a triple). A triplet-order-sensitive pre-training scheme for T5 was introduced in [11]
to correct subject/object ordering. However, all of these approaches rely on costly fine-tuning and
specialized decoding schemes, making them resource-intensive and potentially brittle across different
knowledge graphs [16, 18, 11].
      </p>
      <p>Recent work on neural Text-to-SPARQL, particularly over the DBpedia knowledge graph, has
produced several end-to-end models evaluated extensively on DBpedia-specific benchmarks. For example,
an encoder–decoder model that integrates KG structure directly into the decoding architecture achieved
state-of-the-art F1 on LC-QuAD 1.0 [19]. A hybrid multi-head convolutional encoder combining CNN
and self-attention reported strong BLEU and F1 scores on QALD-9 and LC-QuAD 1.0 [20]. A universal
KG-QA system that casts question understanding as sequence generation achieved competitive
performance on LC-QuAD 1.0 and QALD-9 [21]. A two-stage ontology-guided prompting framework, which
first predicts a SPARQL skeleton and then completes it with knowledge-graph-specific information,
achieved 79.1% F1 on LC-QuAD 1.0 [22]. Earlier work proposed a silhouette-based two-step
architecture combining coarse query generation and graph search, showing substantial gains on LC-QuAD
1.0 [23].</p>
      <p>Large-scale QA benchmarks have been introduced to support training and evaluation. For example,
the LC-QuAD 2.0 dataset contains on the order of 30,000 complex natural-language questions paired
with SPARQL queries over DBpedia and Wikidata [9]. Its size, diversity, and dual-KG support make
it a unique resource for both academic research and industrial applications in KG-based dialogue,
interpretable AI, and semantic search. The QALD challenge series provides smaller, multilingual
benchmarks; QALD-9-Plus extends QALD-9 with high-quality translations of its DBpedia questions into
eight languages, including five underrepresented ones (Ukrainian, Armenian, Lithuanian,
Bashkir, and Belarusian) [24]. More recently, the DBpedia Neural QA (DBNQA) corpus was released,
offering approximately 894,499 English question–SPARQL pairs generated via templating for DBpedia
[25]. These resources cover a wide variety of query patterns and have become standard datasets for
benchmarking DBpedia QA methods.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Approach</title>
      <p>The central objective of our project was to evaluate the capability of pretrained autoregressive code
generation models to perform text-to-SPARQL translation. Our initial plan envisioned leveraging a
general-purpose GPT-style language model; however, upon review, the recent proliferation of
open-source models tailored to code generation prompted a pivot in strategy. These models, often trained on
rich corpora of source code, offer potentially advantageous inductive biases for structured language
tasks such as SPARQL query generation.</p>
      <sec id="sec-3-1">
        <title>3.1. Model Selection</title>
        <p>We performed a comparative review of several publicly available autoregressive models, focusing on:
• Model size and accessibility
• Training data composition and license constraints
• Prior performance on code-related tasks
Ultimately, we selected the following three models for fine-tuning:
• A Salesforce CodeGen model [26] with 350 million parameters
• A BigCode StarCoder model [27] with 1 billion parameters
• A Meta Code Llama model [28] with 7 billion parameters</p>
        <p>These models represent a diverse trade-off between model size, inference cost, and representational
capacity. We expected CodeLlama-7B in particular to provide a valuable upper-bound comparison due
to its substantial scale and recent development.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Dataset creation</title>
        <p>As an initial step, we created SPARQL query skeleton representations for core ontology classes within the
DBpedia ontology, such as Person and Work, and grouped them into five linguistic families: subordinate,
conjunctive/disjunctive, comparative, superlative, and numeric. A T5-base encoder–decoder model
[29], fine-tuned for question generation, turned each representation into a fluent natural-language
template while preserving its structure. This process yielded examples such as “Where is &lt;A&gt; born?”
(subordinate), “Who is the parent of &lt;A&gt; and &lt;B&gt;?” (conjunctive), “Has &lt;A&gt; had a child?” (comparative),
“What is the highest elevation of &lt;A&gt;?” (superlative), and “How many children did &lt;A&gt; have?” (numeric).
Collectively, these templates cover single-entity attributes, multi-entity relations, boolean conditions,
minimum-maximum value queries, and explicit counting patterns.</p>
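        <p>To make the pairing concrete, the sketch below shows how a template of the subordinate family could be instantiated with an entity. The template strings and helper are our illustration, not the actual NSpM pipeline code; dbo:birthPlace is a real DBpedia property used as an example.</p>

```python
# Hypothetical illustration of instantiating one subordinate-family
# template; names are ours, not the actual NSpM pipeline code.
TEMPLATE = {
    "question": "Where is <A> born?",
    "sparql": "select ?x where { dbr:<A> dbo:birthPlace ?x }",
}

def instantiate(template, entity_label, entity_uri):
    """Fill the <A> slot in both the question and the query."""
    question = template["question"].replace("<A>", entity_label)
    sparql = template["sparql"].replace("dbr:<A>", "dbr:" + entity_uri)
    return question, sparql
```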
        <p>Subsequent steps focused on quality assurance, to ensure a high-quality dataset for the
model to learn from and generalize. First, we computed cosine similarity and discarded paraphrase pairs
falling below a 0.90 threshold, thereby eliminating semantically divergent rewrites. Next, a commonsense
classifier removed logically implausible questions, and a duplicate-detection model suppressed
near-identical variants. Placeholders were then instantiated with actual DBpedia URIs, sampling additional
entities for multi-entity questions and inserting values that guarantee coherent boolean or aggregate
answers. After validation on Person and Work, the identical pipeline was applied to all the ontology
classes in DBpedia, such as Organisation, Place, and Event, yielding the NSpM corpus of 7.7 million
natural-language question and SPARQL query pairs used in our downstream Text-to-SPARQL
experiments.</p>
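        <p>The 0.90 cosine-similarity cut-off can be sketched as follows; the function names are ours, and in the pipeline the vectors would be sentence embeddings rather than the toy vectors shown here.</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def keep_paraphrase(emb_source, emb_rewrite, threshold=0.90):
    """Keep a rewrite only if its similarity to the source is at least the threshold."""
    return cosine(emb_source, emb_rewrite) >= threshold
```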
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Research Exploration</title>
        <sec id="sec-3-3-1">
          <title>3.3.1. Model architectures</title>
          <p>At the start of our research journey, we focused extensively on understanding the internal
architecture and design goals of the CodeGen family, particularly the 350M and 7B multi-language variants.
These models, built for translating natural language into executable code, offered a promising avenue
for our SPARQL translation task. We also examined the NSQL models, fine-tuned descendants of
CodeGen trained specifically for SQL query generation, which provided useful architectural and
prompt-formatting insights transferable to SPARQL [30].</p>
          <p>A significant insight emerged from analyzing the "the-stack" dataset, which underpins the CodeGen
models. Despite its impressive breadth, only ∼ 24,000 SPARQL-related samples were identified, compared
to millions for general code. This data scarcity emphasized the need for synthetic data augmentation
to supplement fine-tuning. Additionally, we studied NSQL’s technique of schema-augmented prompt
formatting and considered applying ontology-based embeddings from DBpedia to improve prompt
quality. On the engineering side, inference time for the 7B model exceeded 80 seconds on CPU,
necessitating deployment on dual RTX 3090 GPUs, which cut it to under 6 seconds.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Prompt engineering</title>
          <p>As we moved onto the next stage, the focus shifted to enhancing the formatting of prompts. We refined
the placement of task instructions and demarcated clear input-output separators. The inclusion of
schema metadata and prefixes was explored as a means to guide the model toward more accurate
generation of SPARQL queries. These changes were based on the hypothesis that clearer,
context-enriched prompts would improve the model’s understanding of the translation task.</p>
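          <p>A minimal sketch of the kind of prompt layout described above, with a task instruction, an optional schema hint, and explicit input/output separators. The exact format strings are assumptions for illustration, not the ones used in our experiments.</p>

```python
# Illustrative prompt builder; format strings are assumptions.
def build_prompt(question, schema_hint=""):
    parts = ["# Translate the question into a SPARQL query over DBpedia."]
    if schema_hint:
        # Optional schema metadata to guide generation.
        parts.append(f"# Schema: {schema_hint}")
    parts.append(f"# Question: {question}")
    parts.append("# SPARQL:")  # output separator: the model completes from here
    return "\n".join(parts)
```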
          <p>We also expanded our few-shot experimentation by varying domains and intent types. This allowed
us to test the generalization capabilities of the models when exposed to structurally and semantically
diverse examples. In parallel, we began evaluating model outputs using the QALD dataset to establish a
secondary benchmark beyond the Text2SPARQL dataset. This benchmarking effort is expected to shed
light on model adaptability across multiple query contexts.</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>3.3.3. Issues with instruct models</title>
          <p>We shifted our attention to models with instruction-following capabilities. The instruct-tuned variant of
CodeGen, namely codegen25-7b-instruct, was selected for experimentation under the hypothesis
that such models could better translate detailed user queries into structured outputs like SPARQL.
Preliminary trials revealed promising but mixed results – while the model demonstrated aptitude in
understanding instructions, it struggled with output length control and formatting.</p>
          <p>We addressed these issues through two means: first, by introducing the &lt;eom&gt; (End of Message)
token into the prompt, followed by a post-processing step to truncate outputs accordingly; and second,
by inspecting and adjusting the prompt-to-response formatting pipeline. This highlighted a broader
challenge – the pretraining corpus likely influenced the model to favor line-by-line SPARQL formatting
regardless of prompt style, revealing a deeper inductive bias in structure.</p>
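          <p>The &lt;eom&gt;-based truncation can be sketched as a simple post-processing function; the function name is ours, illustrating the idea rather than reproducing the exact script.</p>

```python
# Post-processing for instruct-model outputs: truncate everything
# after the first <eom> (End of Message) marker.
EOM = "<eom>"

def truncate_at_eom(generated):
    idx = generated.find(EOM)
    return generated[:idx].strip() if idx != -1 else generated.strip()
```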
        </sec>
        <sec id="sec-3-3-4">
          <title>3.3.4. Model fine-tuning</title>
          <p>We turned toward constructing a dedicated prompt-tuning dataset to support more targeted fine-tuning.
We chose the StarCoder-1b checkpoint as our initial model due to its manageable size and architecture
compatibility. The first step involved designing and implementing prompt formatting modules that
adhered to both the structural norms of code-based language models and the functional requirements
of DBpedia queries.</p>
          <p>In parallel, we began harvesting and transforming data from the NSpM dataset, which offers a
substantial repository of over 7 million (NL question, SPARQL query) pairs. We extracted and refined a
subset of 100,000 samples for initial experiments, ensuring compatibility with model expectations. The
refinement process involved standardizing schema references, enforcing a clean prompt-completion
structure, and enriching each prompt with auxiliary context to improve model comprehension.</p>
          <p>We initiated the actual fine-tuning process on StarCoder-1B using our customized dataset and a
modified version of the official fine-tuning script. This required adjustments to handle the structure of
our NSpM-derived samples and integrate them with the BigCode and HuggingFace training pipelines.
Despite encountering expected hardware constraints, we successfully launched training and saved
intermediate checkpoints at steps 1,000 and 2,000 to facilitate upcoming evaluation rounds, where a
step is the fine-tuning of the model’s weights using a batch of question-query pairs. Early runtime
profiling also helped us estimate the full duration of training, prompting discussions around possible
trade-ofs in data size and training duration.</p>
          <p>All fine-tuned models are available for download on the HuggingFace platform and via their
transformers library.1</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Training Setup</title>
        <p>Fine-tuning was conducted using the HuggingFace Transformers library. Key parameters included:
• Learning rate: 5e-5
• Optimizer: AdamW
• Batch size: 8
• Max sequence length: 512</p>
        <p>Due to hardware resource and availability limitations, we were only able to complete approximately
20% of the originally scheduled training steps. This affected model convergence, especially for the larger
models, which typically require longer training horizons to fully adapt. CodeLlama-7b, in particular,
could not be fine-tuned in full, due in part to its high memory footprint.</p>
        <p>In our setup, fine-tuning followed the standard causal language modeling objective, wherein the model
learns to predict the next token in the SPARQL output given the preceding tokens from the concatenated
prompt–completion pair. All models were initialized from their publicly released checkpoints, tokenized
using their native tokenizers, and trained with gradient accumulation to accommodate larger effective
batch sizes on the available GPUs.</p>
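        <p>As an illustration, the construction of one causal-LM training example from a prompt–completion pair can be sketched as below. Masking the prompt positions in the labels with the ignore index is a common variant of this setup and is shown as an assumption, not as the exact choice made in our runs.</p>

```python
# Sketch: concatenate prompt and completion token ids; optionally mask
# prompt positions in the labels with -100, the ignore index used by
# PyTorch cross-entropy (and HuggingFace label handling).
IGNORE_INDEX = -100

def make_example(prompt_ids, completion_ids, mask_prompt=True):
    input_ids = prompt_ids + completion_ids
    if mask_prompt:
        # Loss is computed only on the SPARQL completion tokens.
        labels = [IGNORE_INDEX] * len(prompt_ids) + completion_ids
    else:
        # Plain causal LM: loss on every position.
        labels = list(input_ids)
    return {"input_ids": input_ids, "labels": labels}
```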
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation</title>
      <p>The development of this project started around two years before the submission to the Text2SPARQL
challenge. For this reason, we conducted our own evaluation ahead of the one carried out by the
challenge organizers.</p>
      <sec id="sec-4-1">
        <title>4.1. Preliminary evaluation</title>
        <p>Before the official challenge dataset was made available, we conducted internal evaluations using
automatic metrics. Specifically, we employed the BLEU score to measure the lexical overlap between
generated SPARQL queries and their ground truth counterparts. While BLEU is commonly used in
natural language generation tasks, its application here provided a coarse approximation of semantic
similarity. Although it fails to capture query correctness in logical terms, it was useful in identifying
egregious deviations and monitoring general output fidelity during model development.</p>
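        <p>As a rough illustration of the metric, a simplified sentence-level BLEU (geometric mean of modified n-gram precisions with a brevity penalty) can be computed as below; this is a pedagogical stand-in, not the reference implementation we used for evaluation.</p>

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU: geometric mean of modified n-gram
    precisions up to max_n, with a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((c & r).values())  # clipped n-gram matches
        precisions.append(overlap / max(sum(c.values()), 1))
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_avg)
```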
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Challenge evaluation</title>
        <p>Besides the usual information retrieval metrics, namely precision, recall, and F1 score, performance
was also computed for the Normalized Discounted Cumulative Gain at rank k (NDCG). Precision, recall, and
F1-score are defined as follows:</p>
        <p>Precision = TP / (TP + FP) (1)</p>
        <p>Recall = TP / (TP + FN) (2)</p>
        <p>F1-score = 2 · Precision · Recall / (Precision + Recall) (3)</p>
        <p>where:
• TP: True Positives.
• FP: False Positives.
• FN: False Negatives.</p>
        <p>Discounted Cumulative Gain is defined as follows:</p>
        <p>DCG@k = ∑_{i=1}^{k} rel_i / log2(i + 1) (4)</p>
        <p>NDCG@k = DCG@k / IDCG@k (5)</p>
        <p>where:
• rel_i: Relevance score of the item at position i.
• DCG@k: Discounted Cumulative Gain at rank k. It measures ranking quality by assigning higher
importance to relevant items appearing earlier in the list.
• log2(i + 1): Logarithmic discount factor, which reduces the contribution of lower-ranked items.
• IDCG@k: Ideal DCG at rank k, i.e., the maximum possible DCG obtainable with a perfect ranking
of the top k results.
• NDCG@k: Normalized DCG, which ensures the score lies between 0 and 1. A value of 1.0 indicates
a perfect ranking.</p>
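        <p>The set-based metrics and NDCG defined above can be implemented directly from the formulas; the sketch below follows the definitions, with function names of our choosing.</p>

```python
import math

def set_prf(predicted, gold):
    """Set-based precision, recall, and F1 over answer sets."""
    pred, gold = set(predicted), set(gold)
    tp = len(pred & gold)  # true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def ndcg(relevances):
    """NDCG for relevance scores in ranked order; position i (1-indexed)
    is discounted by log2(i + 1), hence log2(i + 2) for 0-indexed i."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(relevances))
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(sorted(relevances, reverse=True)))
    return dcg / idcg if idcg else 0.0
```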
        <sec id="sec-4-2-2">
          <title>Results</title>
          <p>[Table 1: Set Recall, Set Precision, Set F1 Score, NDCG, and Set F1 + NDCG, overall and per language (EN/ES), for CodeGen-350m-ft, StarCoder-1b-ft, and CodeLlama-7b-ft.]</p>
          <p>The results obtained by our three fine-tuned models on the held-out challenge test set can be seen
in Table 1. The table presents a comparative evaluation of the three fine-tuned code generation models across
various retrieval metrics. StarCoder-1b-ft consistently outperforms the other configurations, achieving
the highest scores in set-based metrics such as Precision, Recall, and F1, as well as in their
language-specific (EN/ES) variants. While CodeGen-350m-ft and CodeLlama-7b-ft exhibit similar performance in
most metrics, the latter lags behind in Spanish-specific evaluations. Notably, all configurations yield an
NDCG of 0, suggesting poor ranking quality despite acceptable set-based retrieval performance. Within
the Text2SPARQL challenge leaderboard, our models ranked between 9th and 12th place.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>Performance was consistent across both languages. Despite promising intermediate results during
training, the models underperformed in generalization – likely due to incomplete fine-tuning.</p>
      <sec id="sec-5-1">
        <title>5.1. Qualitative Error Analysis</title>
        <p>The most frequent categories of failure, illustrated by the examples in this subsection, were:
• Incorrect variable bindings or predicate usage
• Hallucinated namespaces or entities
• Syntax errors in SPARQL structure</p>
        <p>Notably, the StarCoder-based model showed a higher rate of syntactically valid but semantically
incorrect queries, indicating partial learning of SPARQL structure.</p>
        <p>Prompt: # Is the emerald number of parking spaces of claridge icon ?
Actual: ask where{dbr:Claridge_Icon dbo:numberOfParkingSpaces dbr:The_Emerald_(building) }
Processed: ask where{dbr:Claridge_Icon dbo:numberOfParkingSpaces dbr:Emerald }</p>
        <p>Prompt: # Did the george houston house have the NRHP reference number ?
Actual: ask where{dbr:George_Houston_House dbo:nrhpReferenceNumber ?x }
Processed: ask where{dbr:The_George_Houston_House dbo:nrhpReferenceNumber ?x }</p>
        <p>Prompt: # How should I know the ingredient names for mache ?
Actual: select ?x where{dbr:Mache_(food) dbo:ingredientName ?x }
Processed: select ?x where{dbr:Mache dbo:ingredientName ?x }</p>
        <p>Prompt: # Which was the year of the first ascent of laila peak and michael peak ?
Actual: select ?x where{dbr:Laila_Peak_(Haramosh_Valley) dbo:firstAscentYear ?x . dbr:Michael_Peak dbo:firstAscentYear ?x }
Processed: select ?x where{dbr:Michael_Peak dbo:firstAscentYear ?x. dbr:Laila_Peak dbo:firstAscentYear ?x }</p>
        <p>Prompt: # Is soviet union date of liberation of ravensbrück concentration camp ?
Actual: ask where{dbr:Ravensbrück_concentration_camp dbo:liberationDate "Soviet Union, 30 April 1945" }
Processed: ask where{dbr:Ravensbrück_Concentration_Camp dbo:dateOfLiberation dbr:Soviet_Union }</p>
        <p>Prompt: # What is the using country of boliviano ?
Actual: select ?x where{dbr:Boliviano_(1864–1963) dbo:usingCountry ?x }
Processed: select ?x where{dbr:Boliviano dbo:usingCountry ?x }</p>
        <p>Prompt: # Does bishop’s college have a Campus Type ?
Actual: ask where{dbr:Bishop’s_College,_Colombo dbo:campusType ?x }
Processed: ask where{dbr:Bishop’s_College dbo:campusType ?x }</p>
        <p>Prompt: # Is beer actually lone tree brewery product ?
Actual: ask where{dbr:Lone_Tree_Brewery dbo:product dbr:Beer }
Processed: ask where{dbr:Lone_Tree_Brewery dbo:product ?x }</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Reflections</title>
        <p>The outcomes of this project reflect both the potential and limitations of autoregressive code models
for structured query generation. While our selection of models was theoretically motivated, practical
constraints significantly impacted performance.</p>
        <p>One of the promising signs was that even with limited fine-tuning, models adapted partially to the
SPARQL grammar. This suggests that autoregressive code models, particularly those with instruction
tuning, can generalize to formal languages given suitable conditioning.</p>
        <p>Model performance across English and Spanish was roughly equivalent. Since none of the models were
explicitly multilingual, this implies that alignment was primarily pattern-based rather than grounded
in semantic understanding. Future work could incorporate multilingual pretraining or contrastive
SPARQL alignment strategies.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Future improvements</title>
        <p>• Secure longer training cycles for larger models to enable full convergence
• Explore SPARQL-specific prompting patterns to enforce output structure
• Integrate syntactic constraints or symbolic verification at decoding time
• Investigate adapter-based multilingual extensions</p>
        <p>Securing longer training cycles for larger models remains a critical step toward achieving full
convergence. Limited compute budgets often lead to premature stopping, which hinders the model’s
ability to internalize more nuanced patterns, particularly in complex language generation tasks.
Future efforts should allocate sufficient computational resources to support multi-epoch training and
explore techniques such as curriculum learning and checkpoint averaging to accelerate and stabilize
convergence.</p>
        <p>Another promising direction involves exploring SPARQL-specific prompting patterns to enforce
output structure more reliably. Prompt engineering tailored to the syntactic and semantic characteristics
of SPARQL could significantly improve model alignment with query constraints. Strategies such as
providing canonical templates, embedding intermediate representations, or framing generation as
structured completion may help reduce syntactic errors and hallucinations.</p>
        <p>In addition, incorporating syntactic constraints or symbolic verification during decoding could
further enhance output validity. By integrating rule-based filters, grammar checkers, or programmatic
validators at inference time, the system can eliminate structurally invalid outputs before finalization.
This hybrid approach leverages both the generative strengths of autoregressive models and the precision
of symbolic systems, promoting trustworthiness in executable query generation.</p>
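        <p>As a sketch of such a post-hoc filter, a few cheap structural checks (query form, balanced braces, known prefixes) can already reject obviously malformed generations before execution; this is a heuristic illustration, not a full SPARQL grammar validator, and the prefix whitelist is an assumption matching the queries in our dataset.</p>

```python
import re

# Heuristic validity filter: not a full SPARQL grammar, just structural
# checks that reject obviously malformed generations before execution.
QUERY_FORM = re.compile(r"^\s*(select|ask)\b", re.IGNORECASE)
KNOWN_PREFIXES = {"dbr", "dbo"}  # assumed whitelist

def looks_valid(query):
    if not QUERY_FORM.match(query):
        return False  # unknown query form
    if query.count("{") != query.count("}"):
        return False  # unbalanced braces
    prefixes = set(re.findall(r"\b([a-z]+):[A-Za-z_]", query))
    return prefixes.issubset(KNOWN_PREFIXES)  # no hallucinated namespaces
```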
        <p>Lastly, investigating adapter-based multilingual extensions may allow efficient cross-lingual
generalization without retraining entire models. Adapters can be used to insert lightweight, language-specific
modules within a shared architecture, enabling parameter-efficient fine-tuning. This setup is particularly
advantageous in resource-constrained settings or when scaling support to additional languages with
limited data.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>We introduced a method for knowledge graph question answering based on fine-tuning large language
models for code generation. Moreover, we released a novel dataset of 7.7 million pairs of questions and
queries against the DBpedia knowledge base. Although our approach did not perform strongly in the
challenge, it was likely limited by the available resources. First, results could be further improved by utilizing 100%
of the data during fine-tuning; second, by securing longer training cycles for larger models to enable
full convergence. Exploring SPARQL-specific prompting patterns could help enforce output structure.
Lastly, integrating syntactic constraints or symbolic verification at decoding time may make the output
more robust.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was carried out over Summer 2023 during the Google Summer of Code programme at DBpedia.
The research belongs to project A Neural Question Answering Model for DBpedia, which spanned from
2018 to 2023. We thank all mentors and contributors who took part in it.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>In the writing of this manuscript, we employed generative artificial intelligence for sentence polishing,
proofreading, and rephrasing. The authors take full responsibility for the accuracy of the content.</p>
      <p>[8] S. Liang, K. Stockinger, T. M. de Farias, M. Anisimova, M. Gil, Querying knowledge graphs in
natural language, J. Big Data 8 (2021) 3.
[9] M. Dubey, D. Banerjee, A. Abdelkawi, J. Lehmann, LC-QuAD 2.0: A large dataset for complex question
answering over wikidata and dbpedia, in: The Semantic Web - ISWC 2019: 18th International
Semantic Web Conference, Auckland, New Zealand, October 26–30, 2019, Proceedings, Part II 18,
Springer, 2019, pp. 69–78.
[10] M. Mountantonakis, Y. Tzitzikas, Generating sparql queries over cidoc-crm using a two-stage
ontology path patterns method in llm prompts, J. Comput. Cult. Herit. 18 (2025). URL:
https://doi.org/10.1145/3708326. doi:10.1145/3708326.
[11] J. Qi, C. Su, Z. Guo, L. Wu, Z. Shen, L. Fu, X. Wang, C. Zhou, Enhancing sparql query generation
for knowledge base question answering systems by learning to correct triplets, Applied Sciences
14 (2024). URL: https://www.mdpi.com/2076-3417/14/4/1521. doi:10.3390/app14041521.
[12] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, P. Cimiano, Template-based
question answering over rdf data, in: Proceedings of the 21st International Conference on
World Wide Web, WWW '12, Association for Computing Machinery, New York, NY, USA, 2012, pp.
639–648. URL: https://doi.org/10.1145/2187836.2187923. doi:10.1145/2187836.2187923.
[13] S. Shekarpour, S. Auer, A.-C. Ngonga Ngomo, D. Gerber, S. Hellmann, C. Stadler, Generating sparql
queries using templates, volume 11, 2013. doi:10.3233/WIA-130275.
[14] Y. Sun, L. Zhang, G. Cheng, Y. Qu, Sparqa: Skeleton-based semantic parsing for complex
questions over knowledge bases, Proceedings of the AAAI Conference on Artificial Intelligence
34 (2020) 8952–8959. URL: https://ojs.aaai.org/index.php/AAAI/article/view/6426. doi:10.1609/aaai.v34i05.6426.
[15] A. Formica, I. Mele, F. Taglino, A template-based approach for question answering over knowledge
bases, Knowl. Inf. Syst. 66 (2023) 453–479. URL: https://doi.org/10.1007/s10115-023-01966-8.
doi:10.1007/s10115-023-01966-8.
[16] J. Zou, M. Yang, L. Zhang, Y. Xu, Q. Pan, F. Jiang, R. Qin, S. Wang, Y. He, S. Huang, et al., A Chinese
multi-type complex questions answering dataset over wikidata, arXiv preprint arXiv:2111.06086
(2021).
[17] D. Banerjee, P. A. Nair, J. N. Kaur, R. Usbeck, C. Biemann, Modern baselines for sparql semantic
parsing, in: Proceedings of the 45th International ACM SIGIR Conference on Research and
Development in Information Retrieval, 2022, pp. 2260–2265.
[18] L. Kovriguina, R. Teucher, D. Radyush, D. Mouromtsev, Sparqlgen: One-shot prompt-based
approach for sparql query generation, in: SEMANTiCS (Posters &amp; Demos), 2023.
[19] J. Lee, H. Shin, Sparkle: Enhancing sparql generation with direct kg integration in decoding,
Expert Systems with Applications (2025) 128263.
[20] Y.-H. Chen, E. J.-L. Lu, K.-H. Cheng, Integrating multi-head convolutional encoders with
cross-attention for improved sparql query translation, arXiv preprint arXiv:2408.13432 (2024).
[21] R. Omar, I. Dhall, P. Kalnis, E. Mansour, A universal question-answering platform for knowledge
graphs, Proceedings of the ACM on Management of Data 1 (2023) 1–25.
[22] L. Jiang, J. Huang, C. Möller, R. Usbeck, Ontology-guided, hybrid prompt learning for generalization
in knowledge graph question answering, arXiv preprint arXiv:2502.03992 (2025).
[23] S. Purkayastha, S. Dana, D. Garg, D. Khandelwal, G. S. Bhargav, A deep neural approach to kgqa
via sparql silhouette generation, in: 2022 International Joint Conference on Neural Networks
(IJCNN), IEEE, 2022, pp. 1–8.
[24] A. Perevalov, D. Diefenbach, R. Usbeck, A. Both, Qald-9-plus: A multilingual dataset for
question answering over dbpedia and wikidata translated by native speakers, in: 2022 IEEE 16th
International Conference on Semantic Computing (ICSC), IEEE, 2022, pp. 229–234.
[25] A.-K. Hartmann, E. Marx, T. Soru, Generating a large dataset for neural question answering over
the dbpedia knowledge base, in: Workshop on Linked Data Management, co-located with the
W3C WEBBR, volume 2018, 2018.
[26] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, C. Xiong, Codegen: An open
large language model for code with multi-turn program synthesis, arXiv preprint arXiv:2203.13474
(2022).
[27] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim,
et al., Starcoder: may the source be with you!, arXiv preprint arXiv:2305.06161 (2023).
[28] B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, R. Sauvestre, T. Remez,
et al., Code llama: Open foundation models for code, arXiv preprint arXiv:2308.12950 (2023).
[29] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring
the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning
Research 21 (2020) 1–67.
[30] A. Chopra, R. Azam, Enhancing natural language query to sql query generation through
classification-based table selection, in: L. Iliadis, I. Maglogiannis, A. Papaleonidas, E. Pimenidis,
C. Jayne (Eds.), Engineering Applications of Neural Networks, Springer Nature Switzerland, Cham,
2024, pp. 152–165.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Isele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jakob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jentzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Mendes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Morsey</surname>
          </string-name>
          , P. van Kleef,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <article-title>Dbpedia - a large-scale, multilingual knowledge base extracted from wikipedia</article-title>
          ,
          <source>Semantic Web</source>
          <volume>6</volume>
          (
          <year>2015</year>
          )
          <fpage>167</fpage>
          -
          <lpage>195</lpage>
          . URL: https://journals.sagepub.com/doi/abs/10.3233/SW-140134. doi:10.3233/SW-140134.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Vrandečić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          ,
          <article-title>Wikidata: A free collaborative knowledgebase</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>57</volume>
          (
          <year>2014</year>
          )
          <fpage>78</fpage>
          -
          <lpage>85</lpage>
          . doi:10.1145/2629489.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Bollacker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Paritosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sturge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <article-title>Freebase: a collaboratively created graph database for structuring human knowledge</article-title>
          ,
          <source>in: Proceedings of the 2008 ACM SIGMOD international conference on Management of data</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>1247</fpage>
          -
          <lpage>1250</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Soru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Marx</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moussallem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Publio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Valdestilhas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Esteves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. B.</given-names>
            <surname>Neto</surname>
          </string-name>
          ,
          <article-title>Sparql as a foreign language</article-title>
          , in:
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Fernández</surname>
          </string-name>
          , S. Hellmann (Eds.),
          <source>Proceedings of the Posters and Demos Track of the 13th International Conference on Semantic Systems - SEMANTiCS2017, number 2044 in CEUR Workshop Proceedings</source>
          , Aachen,
          <year>2017</year>
          . URL: http://ceur-ws.org/Vol-2044/paper14.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Soru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Marx</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Valdestilhas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Esteves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moussallem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Publio</surname>
          </string-name>
          ,
          <article-title>Neural machine translation for query construction and composition</article-title>
          ,
          <source>2nd Workshop on Neural Abstract Machines and Program Induction (NAMPI v2) at ICML 2018</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. R. A. H.</given-names>
            <surname>Rony</surname>
          </string-name>
          , U. Kumar,
          <string-name>
            <given-names>R.</given-names>
            <surname>Teucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kovriguina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <article-title>Sgpt: A generative approach for sparql query generation from natural language questions</article-title>
          ,
          <source>IEEE Access</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>70712</fpage>
          -
          <lpage>70723</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>KQA pro: A dataset with explicit compositional programs for complex question answering over knowledge base</article-title>
          , in: S. Muresan,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Villavicencio</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>6101</fpage>
          -
          <lpage>6119</lpage>
          . URL: https://aclanthology.org/2022.acl-long.422/. doi:10.18653/v1/2022.acl-long.422.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>