=Paper= {{Paper |id=Vol-3707/D2R224_paper_5 |storemode=property |title=Leveraging Small Language Models for Text2SPARQL Tasks to Improve the Resilience of AI Assistance |pdfUrl=https://ceur-ws.org/Vol-3707/D2R224_paper_5.pdf |volume=Vol-3707 |authors=Felix Brei,Johannes Frey,Lars-Peter Meyer |dblpUrl=https://dblp.org/rec/conf/d2r2/BreiFM24 }} ==Leveraging Small Language Models for Text2SPARQL Tasks to Improve the Resilience of AI Assistance== https://ceur-ws.org/Vol-3707/D2R224_paper_5.pdf
                                Leveraging small language models for Text2SPARQL
                                tasks to improve the resilience of AI assistance
                                Felix Brei1,∗ , Johannes Frey2,3 and Lars-Peter Meyer1,3
                                1 ETi Competence Center @ Institute for Applied Informatics, Germany, https://cc-eti.org
                                2 KMI Competence Center @ Institute for Applied Informatics, Germany, https://kmi-leipzig.de
                                3 Institute of Computer Science, Leipzig University, Germany, https://cs.uni-leipzig.de


                                                                         Abstract
                                                                         In this work we show that language models with fewer than one billion parameters can be used to
                                                                         translate natural language to SPARQL queries after fine-tuning. Using three different datasets, ranging
                                                                         from academic to real-world, we identify prerequisites that the training data must fulfill for the
                                                                         training to be successful. The goal is to empower users of semantic web technology to use AI assistance
                                                                         on affordable commodity hardware, making them more resilient against external factors.

                                                                         Keywords
                                                                         Language models, SPARQL generation, Question Answering




                                1. Introduction
                                The usage of Large Language Models (LLMs) has increased exponentially since the advent of
                                ChatGPT. According to Similarweb, the website of OpenAI alone was visited more than 1.6
                                billion times by February 20241 . In addition, Microsoft has launched several LLM-based AI
                                assistants called ’Copilots’ 2,3 , and Google has released its AI Bard, which is now known as
                                Gemini4,5 . This suggests that the big tech companies believe in the potential of LLMs to
                                become part of our daily lives, just like smartphones or computers in general. But do they
                                live up to these expectations?
                                   Several test suites have been developed to assess the generative capabilities of LLMs, for
                                example TruthfulQA [1], HellaSwag [2] or the Abstraction and Reasoning Corpus (ARC) [3].
                                These test suites, among others, are run regularly on the latest LLM releases, and the results
                                for open LLMs are presented publicly on the Huggingface OpenLLM leaderboard [4]. We can
                                see that the performance increases drastically over time, with Bloom [5] scoring an average of

                                Third International Workshop on Linked Data-driven Resilience Research (D2R2’24) co-located with ESWC 2024, May
                                27th, 2024, Hersonissos, Greece
                                ∗ Corresponding author.
                                Envelope-Open brei@infai.org (F. Brei); frey@informatik.uni-leipzig.de (J. Frey); lpmeyer@infai.org (L. Meyer)
                                Orcid 0009-0008-5245-6655 (F. Brei); 0000-0003-3127-0815 (J. Frey); 0000-0001-5260-5181 (L. Meyer)
                                © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073




                                1 https://www.similarweb.com/website/openai.com/#overview
                                2 https://blogs.microsoft.com/blog/2023/09/21/announcing-microsoft-copilot-your-everyday-ai-companion/
                                3 https://github.blog/2023-11-08-universe-2023-copilot-transforms-github-into-the-ai-powered-developer-platform/
                                4 https://blog.google/technology/ai/bard-google-ai-search-updates/
                                5 https://blog.google/technology/ai/google-gemini-ai/







46.06 in August 2023 and Smaug-72B6 holding the record in February 2024 with a score of 80.48,
only half a year later.
   These test suites, however, mostly cover natural language domains, like the task of continuing
a sentence, answering questions or extracting information from a paragraph of text. Based on the
experience from early experiments [6], a test suite [7] was developed that evaluates the capabilities
of LLMs to interface with knowledge graphs and assist in knowledge engineering tasks. While the
smaller open-source GPT4All models severely struggled, the state-of-the-art commercial LLMs
GPT4 and Claude showed promising results [8] and a trend of performance improvements over
the course of 2023 [9] in dealing with KGs in Turtle format.
   Alas, these results come with several caveats:

    1. The commercial LLMs that were tested are all hosted externally. This can be problematic
       regarding data protection, because a user has to send information to a third party.
    2. Because of their sheer size (GPT4 has one trillion parameters7 ), running these models
       locally is prohibitively costly and therefore not an option for many research institutes and
       other parties. On top of that, training a model of this size is also extremely expensive8 .
    3. Even these commercial models were, at the time of writing, still significantly challenged
       by SPARQL query generation or RML mapping generation [8, 10], indicating that all models
       need specific training or fine-tuning to handle those tasks in a reliable and efficient way.
    4. Since all these larger models are hosted on third-party platforms, users are at the mercy of
       the vendors to keep the services running and affordable. However, vendors have suddenly
       changed their licensing and cost models in the past9 , and undersea cables have been
       damaged10 , separating certain areas of the world from the internet and leaving local
       companies with only the computational resources they have on site.

   So we ask ourselves the following question: given a single task that we want to solve using
LLMs, is it possible to achieve performance similar to that of these large models with a much
smaller one? This would enable small businesses to use AI assistance with affordable hardware
they can host on site, increasing their resilience against outages, vendors changing their pricing
models, disruption due to trade embargoes, and other external factors.
   As a first step in this direction, we study the task of translating a natural language question
into a SPARQL query, because we think that this task enables people who are not familiar with
SPARQL to extract knowledge and insights from a knowledge graph that would otherwise be
inaccessible to them. The paper is organized as follows: first, we look at related research in this
field and explain where we fit into the big picture. Then we explain the setup of our experiments,
namely which model families were chosen and why, and which datasets we trained them on.
After that, we present and explain the results of our work, and finally, we draw conclusions and
give an outlook on the directions that our research will head next.

6 https://huggingface.co/abacusai/Smaug-72B-v0.1
7 https://www.semafor.com/article/03/24/2023/the-secret-history-of-elon-musk-sam-altman-and-openai
8 https://www.wired.com/story/openai-ceo-sam-altman-the-age-of-giant-ai-models-is-already-over/
9 https://www.theverge.com/2023/9/12/23870547/unit-price-change-game-development
10 https://edition.cnn.com/2024/03/04/business/red-sea-cables-cut-internet/index.html



2. Related Work
Current approaches focus on fine-tuning large language models. For example, the authors of
[11] propose a methodology for fine-tuning OpenLLaMA to generate SPARQL queries over life
science knowledge graphs, using data augmentation techniques such as providing meaningful
variable names and inline comments to improve the performance of the model in generating
accurate SPARQL queries. The authors of [12] use Llama as their basis for fine-tuning to
generate SPARQL queries over Wikidata.
   These two papers have shown that translating natural language to SPARQL queries is possible,
but they use models with at least three (OpenLLaMA) or seven (Llama) billion parameters,
respectively. The hardware required to train these models can be expensive, which is why we
want to explore models that are even smaller.
   Smaller models fine-tuned for one specific task are also able to beat the performance of
LLMs: SQLCoder-7B 11 , for example, performs better on SQL than the state-of-the-art GPT4.
Our research is comparable to that, but with far fewer parameters and SPARQL instead of SQL.
   [13] manages to fine-tune T5 on SPARQL queries for Wikidata, but to achieve these results,
the data had to be preprocessed in a way that is specific to T5. Furthermore, while that paper
explores other ways to tackle this task in general, it only looks at T5 rather than several model
families as we do.
   [14] gives a comprehensive overview and comparison of pre-trained LMs (PLMs),
non-pre-trained LMs (NPLMs), and LLMs, testing various fine-tuning methods using LLMs. [15]
fine-tunes a lightweight model for SPARQL generation using synthetic training data generated
by the FlexKBQA framework on a target knowledge graph (sampling structured query templates
that are converted into SPARQL query instances and translated into natural language questions
using LLMs). The lightweight model can perform further self-guided training on real queries
to address the distribution shift between synthetic and real queries. [16] uses a GPT model to
investigate which parts of the Text2SPARQL task are the hardest for the model to solve, so that
appropriate countermeasures can be taken.
   [17] proposes a whole new architecture specific to SPARQL generation, based on GPT. We
consider this direction promising for the future, but here we focus first on more foundational
research to understand which model families work best on a given dataset and why.


3. Experimental Setup
3.1. Model families
As mentioned in the introduction, the focus of our work is to fine-tune language models
that can be considered small by modern standards. We chose one billion parameters as an
arbitrary upper limit on the number of parameters; as a general guideline we consulted the
Steam Hard- and Software Survey12 and found that 57.22% of its users have a GPU with 8GB
of VRAM or more (January 2024). A model with fewer than one billion parameters should fit
into this amount of VRAM comfortably, showing that such models can be trained and run locally.
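As a back-of-envelope check of this claim (our own estimate, not from the survey; the bytes-per-parameter figure is an assumption), the weights of a sub-billion-parameter model stored in 16-bit precision occupy well under 8 GB:

```python
# Rough VRAM estimate for holding the model weights alone (assumption:
# 2 bytes per parameter in fp16; training additionally needs memory for
# gradients, optimizer state and activations on top of this figure).

def weights_vram_gb(n_params: float, bytes_per_param: float = 2.0) -> float:
    """Approximate gigabytes of VRAM needed to hold the raw weights."""
    return n_params * bytes_per_param / 1024**3

# FLAN-T5-Large, the largest model in our selection (783M parameters):
print(round(weights_vram_gb(783e6), 2))  # about 1.46 GB, far below 8 GB
```

Note that full fine-tuning with an Adam-style optimizer can require several times more memory than inference, which is why the sub-billion limit leaves useful headroom on an 8 GB card.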
11 https://huggingface.co/defog/sqlcoder-7b-2
12 https://store.steampowered.com/hwsurvey/



  Another consideration is the public availability of the models. We believe that research
should be available to anyone who is interested and this should be reflected in the choice of
models. Therefore, we only select models that are openly available on Huggingface.
  Following these criteria, we quickly observe that only three large model families fit the bill,
which we introduce briefly here. A full list of the evaluated models is given in table 1.

3.1.1. T5 and Flan-T5
In June 2020, Google released an LLM called Text-To-Text Transfer Transformer, or T5 for short
[18]. The base version consists of roughly 220 million parameters, with smaller and larger
versions available. With T5, Google wanted to provide a single LLM that can solve any NLP
task, like text classification, sentiment analysis and so on. A user must provide a prefix like
’Translate the following sentence to French:’ and the LLM then infers how to process the rest of
the prompt. In 2022, researchers at Google released new versions of T5 called FLAN-T5 [19]
(FLAN stands for fine-tuning language models [20]) which, according to the authors, should
outperform T5 on any given task.

3.1.2. BART
BART was developed by Facebook and released in October 2019 [21]. It consists of 139 million
parameters and combines a BERT-like encoder [22] with a GPT-like autoregressive
decoder [23]. In August 2020, a multilingual version called mBART was released [24]. The
authors put special emphasis on the fact that BART is just a pretrained model and needs to be
fine-tuned for a given specific task. We also included the mREBEL models, a specialized version
of BART for multilingual relation extraction [25], since they were fine-tuned with knowledge
graphs in mind.

3.1.3. M2M100 and NLLB-200
The M2M100 model was introduced in 2020 [26] as a many-to-many translation tool for 100
languages. The original version consists of 1.3 billion parameters which exceeds the upper
bound we imposed. But there is a distilled version available directly from the Facebook research
team at Huggingface called M2M100-418M13 which we use in our experiments.
   Its successor, the NLLB-200 model, was introduced in 2022 [27] and stands for ’no language
left behind’. Again we use the distilled version NLLB-200-Distilled-600M14 instead of the 3.3
billion full version of the model. As the authors state, the model is ’primarily intended for
research in machine translation’ which fits our bill perfectly.
   This leaves us with a selection of models to be assessed in our experiment that can be seen in
table 1.



13 https://huggingface.co/facebook/m2m100_418M
14 https://huggingface.co/facebook/nllb-200-distilled-600M



Table 1
Model names and their number of parameters, as used in our experiments.
              Name                        No. parameters
            T5-Small                           60.5M
            T5-Base                            223M
            T5-Large                           738M
            FLAN-T5-Small                       77M
            FLAN-T5-Base                       248M
            FLAN-T5-Large                      783M
            BART-Base                          139M
            BART-Large                         406M
            mBART-LARGE-50                     611M
            mREBEL-Base                        484M
            mREBEL-Large                       611M
            M2M100-418M                        418M
            NLLB200-Distilled-600M             600M


3.2. Datasets used for Fine Tuning and Evaluation
In order to study how well the models can be fine-tuned towards a target KG, we use three
evaluation datasets from different domains and with varying complexity. These datasets
consist of a number of natural language questions, each mapped to a SPARQL query w.r.t. the
target KG. For the first two datasets (organizational graph and CoyPu graph) we generate
questions and queries by sending the graph via the OpenAI API to GPT4 and prompting
it to generate tuples of natural language question, matching SPARQL query, and the expected
result of the query. These tuples are filtered by checking whether the results that the SPARQL
query returns match the expected results. Both datasets are then augmented by sending each
remaining question again to GPT and asking it to paraphrase the question, giving us a total of
two natural language questions per SPARQL query.
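The filtering step can be sketched as follows (a minimal reconstruction of our own, not the authors' code; the `execute` callable is a placeholder for a real SPARQL endpoint or an rdflib query call):

```python
# Keep only those generated tuples whose SPARQL query, when executed
# against the graph, actually returns the result GPT-4 predicted.
from typing import Callable, Iterable

def filter_tuples(
    tuples: Iterable[tuple[str, str, list]],   # (question, sparql, expected_result)
    execute: Callable[[str], list],
) -> list[tuple[str, str]]:
    kept = []
    for question, sparql, expected in tuples:
        try:
            actual = execute(sparql)
        except Exception:
            continue  # unparseable or failing queries are dropped
        # compare as sorted lists: row order in a SELECT result is not significant
        if sorted(map(str, actual)) == sorted(map(str, expected)):
            kept.append((question, sparql))
    return kept
```

The same comparison logic can later be reused for evaluating the fine-tuned models.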

3.2.1. Organizational Graph
Introduced in [6], this small knowledge graph uses established vocabularies to describe an
organization with departments and employees. There is a clear schema that maps person and
department names to their corresponding RDF resource, for example ”Anne Miller” maps to
:anne while ”Bob Tanner” maps to :bob . In this dataset and the next we also let the language
model omit the prefix definitions for the queries and assume they are already present in the
preamble of the executed SPARQL query. Using GPT4 we generated a dataset consisting of 69
datapoints, which were split into 53 tuples for training and 16 for testing.
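To make the shape of such a training pair concrete, here is a hypothetical example in the style of table 2 (the `:anne` naming scheme and the `foaf:` vocabulary are confirmed by the text and table 2; the exact question wording is illustrative):

```sparql
# Question: "What is the surname of Anne Miller?"
# Target query (prefix definitions omitted, as described above):
SELECT ?surname WHERE { :anne foaf:surname ?surname . }
```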

3.2.2. A subset of the CoyPu graph
The CoyPu project15 aims to improve supply chain resilience for corporations by combining
different data sources about public infrastructure, trades and trade agreements, events like
disasters and conflicts, and many more into a large knowledge graph. Querying this knowledge
graph has the potential to help businesses identify risks like single points of failure and mitigate
them. This usefulness, combined with the fact that the other two datasets have more of an
academic background, led us to choose a subset of the CoyPu knowledge graph as another
dataset for training. Creating a viable subset led to its own difficulties and hurdles

15 https://coypu.org/



however, which we defer to future work. This dataset contains 131 tuples in total, which
were split into 105 for training and 26 for testing.

3.2.3. QALD10
The Question Answering over Linked Data (QALD) dataset is a standard benchmark16 , with
QALD10 being the latest incarnation [28]. It consists of SPARQL queries along with matching
questions in different natural languages, w.r.t. Wikidata. In this work, we focus on English and
filter the dataset accordingly. This dataset is especially difficult for a language model to handle
because there is no clear indication of how to link entities from a given question, like ”Barack
Obama”, to their Wikidata entity ID (:Q76 ), which gives rise to a whole field of research called
Entity Linking [29].
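Filtering the benchmark down to English question/query pairs is straightforward; the sketch below assumes the published QALD JSON layout (a `questions` list whose items carry multilingual `question` entries and a `query.sparql` string), which should be verified against the actual release:

```python
# Extract (English question, SPARQL query) pairs from a QALD-style dict.
# The key names below follow the QALD JSON format as we understand it
# and should be treated as an assumption.

def english_pairs(qald: dict) -> list:
    pairs = []
    for item in qald.get("questions", []):
        sparql = item.get("query", {}).get("sparql")
        if not sparql:
            continue  # skip entries without a gold query
        for q in item.get("question", []):
            if q.get("language") == "en":
                pairs.append((q["string"], sparql))
    return pairs
```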

3.3. Fine-tuning
For every evaluation dataset individually, we perform fine-tuning of the selected models using
PyTorch (100 epochs). Since a single fine-tuning run does not hold much statistical significance
and involves random parameters, we perform ten isolated training runs. For each run we
shuffle the training data with a predetermined random seed to make the results reproducible.
Specifically, each run is given an ID from 𝑅01 to 𝑅10 and the seeds are generated by calculating
the SHA512 sum of the ID and taking the first eight digits, so 𝑅01 results in the seed 99975818,
𝑅02 in 56899599, and so on.
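The seed derivation can be reproduced roughly as follows (our own reading of the scheme, since "first eight digits" of a hexadecimal digest is ambiguous; here we take the first eight decimal digits that occur in the hex string):

```python
# Derive a reproducible seed from a run ID such as "R01".
# This is a sketch of one plausible interpretation, not the authors' code.
import hashlib

def run_seed(run_id: str) -> int:
    digest = hashlib.sha512(run_id.encode("utf-8")).hexdigest()
    digits = "".join(ch for ch in digest if ch.isdigit())
    return int(digits[:8])
```

Because the hash is deterministic, the same run ID always yields the same seed, which is what makes the shuffled training order reproducible across machines.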


4. Results
In the following subsections we only include those language models in the plots that generated
at least one correct query. The T5 family consistently failed to generate a single correct query on
the organizational graph, which is why it is absent from the result tables and figures. In fact, no
T5 model produced a single correct result across all runs.
   To generate the datapoints for each plot, we interrupted the training every five epochs and
made the language models translate the questions from the evaluation split into SPARQL queries.
We then executed the queries and compared the result sets to determine whether the answers
are correct.
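Concretely, this comparison treats a SELECT result as an unordered collection of variable bindings (a minimal sketch of our own, not the authors' implementation):

```python
# Order-insensitive comparison of two SPARQL SELECT result sets, where each
# row is a dict mapping variable names to values. Duplicate rows are kept,
# so this compares multisets of rows rather than sets.

def results_match(gold: list, generated: list) -> bool:
    def canon(rows):
        return sorted(tuple(sorted(row.items())) for row in rows)
    return canon(gold) == canon(generated)
```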

4.1. Organizational Graph
Figure 1 shows that all models from the BART and M2M100 families manage to learn the
structure of the knowledge graph at least to a certain degree. Taking the best result for
each model, all models except NLLB-200 turn at least eleven of the sixteen questions into
correct SPARQL queries. The performance, however, fluctuates strongly during the course of
the training, which is indicative of overfitting.
   Repeating the experiment, we can see that performance varies somewhat depending on the
order in which the training data is ingested into the network. The statistics are shown in table 3 and the raw
16 https://www.nliwod.org/challenge



Table 2
An example of the errors that were made in the context of the organizational graph: the selected
variable is never bound, wrong entity ID :charles, wrong property foaf:firstName, wrong literal BobTanner
          Question       What is the surname of Bob Tanner?
      Gold answer        SELECT ?surname WHERE { :bob foaf:surname ?surname . }
   Generated query       SELECT ?surname WHERE { :charles foaf:firstName 'BobTanner' }




Figure 1: Number of correct SPARQL translations across epochs for the organizational graph for one
fine-tuning run per model. 16 questions were presented.


data is plotted on the left side of figure 3. We can see that for this dataset, BART-L performs best
(as do the other sizes of BART), with M2M100 close behind. Another thing we see from the
left plot in figure 3 is that, except for one outlier from mREBEL-L, the success of fine-tuning
is reliable and reproducible.
   Looking at common errors made during translation, we found that the best models rarely
generated SPARQL that could not be parsed; rather, they mixed up terms and injected parts
of the training data into the queries. An example is shown in table 2.

4.2. CoyPu
In figure 2 we can see that the performance during the first run of the experiment varies less
drastically than for the organizational graph. The standard deviations seen in table 5 are similar,
though, so we think this is just a coincidence. Again, the (FLAN-)T5 models never generate even
a single correct query, so they are excluded from subsequent runs.
   We can also see that for this dataset, M2M100 outperformed the other models while BART-L is
in fact one of the worst, a complete shift from the earlier results. This again shows



Table 3
Average number of correctly translated questions in the context of the organizational graph. Standard
deviation is also shown as a percentage of the average to measure the reliability of the fine-tuning.
               Model name       Average     Standard deviation       Std. dev percent
               BART              12.90             1.14                     8.80
               BART-L            13.30             0.64                    4.81
               mBART-50          12.80             0.75                     5.85
               mREBEL-L          11.10             1.92                    17.31
               M2M100            12.50             1.02                     8.20
               NLLB-200           8.20             0.75                     9.13

Table 4
An example of the errors that were made in the context of the CoyPu graph: latitude and longitude
were mixed up.
The ellipsis in the IRI was inserted by us to keep the line short. In fact, the language model
generated the correct IRI to use in the query.
                Question      What is the latitude of the port with the ID ’AUDKB’?
             Gold answer      SELECT ?latitude WHERE {