<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Preface of the First International TEXT2SPARQL Challenge (TEXT2SPARQL'25)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Edgard Marx</string-name>
          <email>edgard.marx@htwk-leipzig.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paulo Viviurka do Carmo</string-name>
          <email>paulo.carmo@htwk-leipzig.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcos Gôlo</string-name>
          <email>marcosgolo@usp.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Tramp</string-name>
          <email>sebastian.tramp@eccenca.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>First International TEXT2SPARQL Challenge, Co-Located with Text2KG at ESWC25</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Leipzig University of Applied Sciences</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of São Paulo</institution>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>eccenca GmbH</institution>
          ,
          <addr-line>Hainstr. 8, 04109 Leipzig, Germany (corresponding editor)</addr-line>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Adrian Brasoveanu (Modul University Vienna, Austria) • Aidan Hogan (DCC, Universidad de Chile) • Axel Ngonga (University of Paderborn, Germany) • Andreas Both (HTWK, Germany) • Gong Cheng (Nanjing University, China) • Gustavo Publio (Schwarz IT, Germany) • Muhammad Saleem (University of Paderborn, Germany) • Ricardo Usbeck (Leuphana Universität Lüneburg, Germany) • Ricardo Marcondes Marcacini (USP, Brazil) • Sanju Tiwari (Sharda University, India)</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>The TEXT2SPARQL challenge invited researchers to participate by sharing an endpoint capable of translating natural language questions into SPARQL queries. The challenge procedure consisted of three steps: registration, ask, and evaluation, as illustrated in Figure 1. This preface presents information about the challenge, the accepted papers, the new benchmark datasets, the evaluation metrics, and the ranking procedure in the following sections.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>https://aksw.org/SebastianTramp (S. Tramp)</p>
      <p>CEUR Workshop Proceedings (ISSN 1613-0073)</p>
    </sec>
    <sec id="sec-2">
      <title>New Benchmark Datasets</title>
      <p>
        The TEXT2SPARQL challenge introduced 250 new question/query pairs over two new benchmark
datasets: DB25, a DBpedia benchmark with English and Spanish queries over the 2015-10 core, and
CK25, a corporate dataset with a showcase ontology built from scratch to demonstrate the capabilities
of eccenca Corporate Memory. For the DBpedia benchmark, 200 question/query pairs were created
by automatically modifying pairs from QALD 1-8 and LC-QuAD 1.0. These queries were then rewritten
using GPT [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and manually checked and modified to improve their syntax and semantics, as shown in Figure
2. After the human check stage, 35 pairs were deemed initially valid; the remaining 165 were further
checked and modified until 100 question/query pairs were reached. Finally, these questions were translated
into Spanish. For the corporate dataset, 50 question/query pairs were manually curated, considering
classic stakeholders. For details on this new dataset, refer to [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It is essential to mention that for both
endpoints, we tried to use different SPARQL querying strategies (e.g., ASK, GROUP BY, ORDER BY) in
order to balance the endpoints’ evaluation.
      </p>
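      <p>To illustrate the kind of entry the benchmarks contain, here is a minimal sketch of a question/query pair together with a keyword check for the balancing strategies mentioned above. The pair itself is hypothetical, not taken from the released datasets, and the dictionary layout is an assumption for illustration only.</p>

```python
# Hypothetical DB25-style question/query pair (illustrative only; the
# real dataset entries and their YAML schema may differ).
pair = {
    "question": {
        "en": "How many films did Alfred Hitchcock direct?",
        "es": "¿Cuántas películas dirigió Alfred Hitchcock?",
    },
    "query": (
        "SELECT (COUNT(?film) AS ?count) WHERE { "
        "?film <http://dbpedia.org/ontology/director> "
        "<http://dbpedia.org/resource/Alfred_Hitchcock> }"
    ),
}

# The challenge balanced SPARQL strategies (ASK, GROUP BY, ORDER BY, ...);
# a simple case-insensitive keyword check classifies which one a query uses.
def uses_strategy(query: str, keyword: str) -> bool:
    return keyword.upper() in query.upper()

print(uses_strategy(pair["query"], "COUNT"))     # aggregation query
print(uses_strategy(pair["query"], "ORDER BY"))  # not an ordering query
```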
      <p>QALD: https://github.com/ag-sc/QALD; LC-QuAD: https://github.com/AskNowQA/LC-QuAD</p>
    </sec>
    <sec id="sec-3">
      <title>Evaluation Metrics and Ranking</title>
      <p>
        The pipeline presented in Figure 3 was used to evaluate the teams. We used Pytrec_eval [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], an
information retrieval evaluation tool, to compute information retrieval measures. The challenge team
obtained the gold question/query pairs from the YAML datasets and collected the predicted queries
from the participants’ endpoints. Both the gold and the predicted queries were then sent to the SPARQL
endpoints, and the retrieved answers were saved in JSON format. Each result was transformed into the
Pytrec_eval standard format, consisting of gold and predicted lists. Finally, Pytrec_eval compared the
two lists, which enabled us to calculate the metrics used for the final ranking.
      </p>
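      <p>The conversion from a SPARQL JSON answer into a flat list, as used before handing the data to Pytrec_eval, can be sketched as follows. The helper name and the exact flattening rule are assumptions for illustration; the JSON layout follows the standard SPARQL 1.1 query-results format.</p>

```python
# Minimal sketch of the list-conversion step (the helper name is
# illustrative, not the challenge team's actual code).
def bindings_to_list(sparql_json: dict) -> list[str]:
    """Flatten a SPARQL SELECT JSON result into a list of value strings."""
    rows = sparql_json["results"]["bindings"]
    var_names = sparql_json["head"]["vars"]
    return [row[v]["value"] for row in rows for v in var_names if v in row]

# A toy gold answer in SPARQL 1.1 JSON results format.
gold_json = {
    "head": {"vars": ["film"]},
    "results": {"bindings": [
        {"film": {"type": "uri", "value": "http://dbpedia.org/resource/Psycho"}},
        {"film": {"type": "uri", "value": "http://dbpedia.org/resource/Vertigo"}},
    ]},
}

print(bindings_to_list(gold_json))
```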
      <p>This challenge uses the precision, recall, and F1 metrics. Precision and recall are defined in Equations
(1) and (2): precision is the proportion of retrieved documents that are relevant to the user, and recall is
the proportion of relevant documents that were retrieved. F1 is the evenly weighted harmonic mean of
precision and recall, as shown in Equation (3).</p>
      <p>precision = |{relevant documents} ∩ {retrieved documents}| / |{retrieved documents}|   (1)</p>
      <p>recall = |{relevant documents} ∩ {retrieved documents}| / |{relevant documents}|   (2)</p>
      <p>F1 = 2 · precision · recall / (precision + recall)   (3)</p>
      <p>There are queries in both datasets where the order of the results matters, as indicated by a flag
in the YAML files. In these cases, we calculate nDCG, a normalized measure of the Discounted Cumulative
Gain (DCG) metric. The DCG considers the position i at which a document was retrieved and penalizes
its relevance rel_i by a logarithmically proportional reduction, as shown in Equation (4). To calculate the
nDCG, the DCG value is divided by the ideal DCG (IDCG), obtained by ranking the relevant documents in
the best possible order; the final score is obtained by averaging the nDCG scores of all retrieved
documents.</p>
      <p>DCG_p = Σ_{i=1}^{p} rel_i / log2(i + 1)   (4)</p>
      <p>Finally, the organizers compute the overall metric as the average of the F1 measure over every
question, except for those flagged as order-sensitive, for which nDCG is used instead, as shown in
Equation (5), where n is the number of questions. All these steps guarantee a final value between 0 and 1
that considers both the maximum retrieval of relevant documents and the order in which the documents
were retrieved.</p>
      <p>score = (1/n) Σ_{i=1}^{n} { nDCG_i if question i is order-sensitive; F1_i otherwise }   (5)</p>
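      <p>A minimal plain-Python sketch of these metrics, using binary relevance. The challenge itself computed them via Pytrec_eval; this standalone version only mirrors Equations (1)-(4) for clarity.</p>

```python
import math

def precision_recall_f1(relevant: set, retrieved: list):
    """Set-based precision, recall, and F1 (Equations 1-3)."""
    hits = len(relevant & set(retrieved))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def ndcg(relevant: set, retrieved: list) -> float:
    """Binary-relevance nDCG (Equation 4): DCG divided by the ideal DCG."""
    # Position i is 0-based here, so the discount is log2(i + 2).
    dcg = sum(1 / math.log2(i + 2)
              for i, doc in enumerate(retrieved) if doc in relevant)
    idcg = sum(1 / math.log2(i + 2) for i in range(len(relevant)))
    return dcg / idcg if idcg else 0.0

p, r, f1 = precision_recall_f1({"a", "b", "c"}, ["a", "d", "b"])
print(round(p, 4), round(r, 4), round(f1, 4))   # 0.6667 0.6667 0.6667
print(round(ndcg({"a", "b", "c"}, ["a", "d", "b"]), 4))
```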
      <sec id="sec-3-1">
        <title>Text2SPARQL baseline</title>
        <p>
          In recent years, large language models (LLMs) have become a central tool in text mining tasks due to
their ability to understand, synthesize, and generate natural language with high accuracy [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. In tasks
involving the translation of natural language into formal representations, such as generating SPARQL
queries from natural language questions (text-to-SPARQL), LLMs have demonstrated SoTA results [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
These models can be explored in various ways, ranging from the direct use of pre-trained versions to more sophisticated
approaches involving task-specific fine-tuning. As part of the challenge proposed in this workshop,
open-source LLMs were used as baselines, evaluating their performance in controlled and reproducible
settings to provide a solid foundation for comparison among participants.
        </p>
        <p>There is a need for datasets that accurately and diversely represent the target task, enabling effective
fine-tuning of language models. In the context of this challenge, the focus is on generating SPARQL
queries that target the DBpedia ontology. Over the past years, several datasets have been proposed for
the text-to-SPARQL task using DBpedia as a reference. Among them, we highlight four main sources
employed in our preparation: QALD 1-9, LC-QuAD 1.0, ParaQA, and Question-Sparql. These datasets
were merged into a unified corpus to train our models robustly. Only queries in English and Spanish
were used in these datasets, as these languages were the focus of the challenge. The organizers applied
a preprocessing pipeline that involved filtering out inconsistent, duplicate, or non-executable SPARQL
queries when tested against the DBpedia endpoint adopted in the challenge. This process ensured that
the training data reliably reflected the constraints and characteristics of the target knowledge base.</p>
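        <p>The preprocessing step described above can be sketched as follows. The function name and the validator are stand-ins: the organizers executed queries against the challenge's DBpedia endpoint, whereas this sketch uses a trivial bracket-balance check in its place.</p>

```python
# Sketch of the corpus-preprocessing pipeline (names are illustrative;
# the real validator executed each query against the DBpedia endpoint).
from typing import Callable

def clean_corpus(pairs: list[dict],
                 is_executable: Callable[[str], bool]) -> list[dict]:
    """Drop duplicate and non-executable question/query pairs."""
    seen = set()
    cleaned = []
    for pair in pairs:
        key = (pair["question"], pair["query"])
        if key in seen:
            continue  # exact duplicate of an earlier pair
        seen.add(key)
        if not is_executable(pair["query"]):
            continue  # query fails against the target endpoint
        cleaned.append(pair)
    return cleaned

raw = [
    {"question": "Q1", "query": "SELECT ?x WHERE { ?x a ?y }"},
    {"question": "Q1", "query": "SELECT ?x WHERE { ?x a ?y }"},  # duplicate
    {"question": "Q2", "query": "SELECT ?x WHERE {"},            # broken
]
# Stub validator: balanced braces stand in for real endpoint execution.
print(len(clean_corpus(raw, lambda q: q.count("{") == q.count("}"))))
```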
        <p>
          Qwen 2.5, a high-performance open-source LLM, was selected for fine-tuning [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The
Unsloth library was used, which implements an efficient fine-tuning strategy based on QLoRA (Quantized
Low-Rank Adaptation) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Training was conducted for a total of 100 steps, using a learning rate of 0.001,
which was chosen to strike a balance between training time and result quality. During the generation
phase, the organizers evaluated the model’s performance using four different temperature values (0.01,
0.25, 0.5, and 0.75) to assess the impact of variability on SPARQL query generation. Our results showed
that intermediate temperature values, particularly 0.25 and 0.5, outperformed the other settings. These
values introduced a moderate level of diversity that helped the model produce more
accurate and contextually appropriate queries without compromising the syntactic correctness of the
SPARQL language. Table 1 presents our baseline results.
        </p>
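        <p>The role of the temperature sweep can be illustrated with a minimal softmax sketch (pure Python on toy logits, not the actual generation code): low temperatures concentrate probability mass on the top token, producing near-deterministic output, while higher temperatures spread it, introducing the diversity discussed above.</p>

```python
import math

def softmax_with_temperature(logits: list[float],
                             temperature: float) -> list[float]:
    """Temperature-scaled softmax over next-token logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # toy next-token logits
for t in (0.01, 0.25, 0.5, 0.75):  # the temperatures evaluated in the challenge
    probs = softmax_with_temperature(logits, t)
    print(t, round(max(probs), 3))  # top-token probability shrinks as t grows
```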
        <p>Our best baseline result was achieved using the fine-tuned Qwen 2.5 14B model. In comparison, our
worst baseline relied on the pretrained Qwen 2.5 7B model. The results demonstrate that fine-tuning
significantly improved model performance, leading to the generation of more accurate and contextually
appropriate SPARQL queries. We highlight that smaller fine-tuned models outperform larger pretrained
models. All three models and the constructed dataset are publicly available on Hugging Face:
• Text2SPARQL-S refers to the small version (7B), which requires approximately 6 GB of GPU
memory. Model: https://huggingface.co/aksw/text2sparql-S;
• Text2SPARQL-M denotes the medium version (14B), which requires approximately 11 GB of
GPU memory. Model: https://huggingface.co/aksw/text2sparql-M.</p>
        <sec id="sec-3-1-1">
          <p>QALD: https://github.com/ag-sc/QALD;
LC-QuAD: https://github.com/AskNowQA/LC-QuAD;
ParaQA: https://huggingface.co/datasets/Orange/paraqa-sparqltotext;
Question-Sparql: https://huggingface.co/datasets/julioc-p/Question-Sparql;
constructed training corpus: https://huggingface.co/datasets/aksw/Text2SPARQL-Raw</p>
          <p>Table 1: Results of the teams WSE, INFAI, IIS-Q, IIS-L, MIPT, AIFB∗, DBPEDIA-SC∗, DBPEDIA-CL∗, DBPEDIA-CG∗, and the Challenge Baseline, averaged over the evaluation scenarios Corporate, DBpedia en, DBpedia es, DBpedia, and Overall. Best is bold, second best is underlined, and third best is in italic.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Text2SPARQL Awards</title>
        <p>Considering datasets and languages, we have four categories for the Text2SPARQL challenge awards:
1. Corporate
1st INFAI: Daniel Gerber, Lorenz Bühmann, Lars-Peter Meyer, Felix Brei, Claus Stadler
2nd IIS-Q: Daniel Henselmann, Rene Dorsch, and Andreas Harth
3rd IIS-L: Daniel Henselmann, Rene Dorsch, and Andreas Harth
2. DBpedia English
1st INFAI: Daniel Gerber, Lorenz Bühmann, Lars-Peter Meyer, Felix Brei, Claus Stadler
2nd IIS-Q: Daniel Henselmann, Rene Dorsch, and Andreas Harth
3rd AIFB: Jan Wardenga and Tobias Käfer
3. DBpedia Spanish
1st WSE: Aleksandr Perevalov and Andreas Both
2nd AIFB: Jan Wardenga and Tobias Käfer
3rd MIPT: Oleg Somov, Daniil Berezin, and Roman Avdeev
4. Overall
1st WSE: Aleksandr Perevalov and Andreas Both
2nd INFAI: Daniel Gerber, Lorenz Bühmann, Lars-Peter Meyer, Felix Brei, Claus Stadler
3rd IIS-Q: Daniel Henselmann, Rene Dorsch, and Andreas Harth</p>
      </sec>
      <sec id="sec-3-3">
        <title>Organizing Committee</title>
        <p>• Edgard Marx, Leipzig University of Applied Sciences (HTWK), Germany
• Sebastian Tramp, eccenca GmbH, Germany
• Diego Moussallem, Paderborn University, Germany
• Paulo Viviurka do Carmo, Leipzig University of Applied Sciences (HTWK), Germany
• Marcos Paulo Silva Gôlo, University of São Paulo, Brazil</p>
      </sec>
      <sec id="sec-3-4">
        <title>Acknowledgements</title>
        <p>The editors would like to thank the advisory team, authors, program committee, and other organizers
for their ongoing support in making this event a success.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] OpenAI,
          <source>GPT-4 Technical Report</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2303.08774. arXiv:2303.08774.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tramp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pietzsch</surname>
          </string-name>
          ,
          <article-title>The CK25 Corporate Knowledge Reference Dataset for Benchmarking Text 2 SPARQL Question Answering Approaches</article-title>
          ,
          <source>in: The 1st GOBLIN Workshop on Knowledge Graph Technologies, DBpedia Association</source>
          ,
          <year>2025</year>
          . URL: https://github.com/eccenca/ck25-dataset.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Van Gysel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>de Rijke</surname>
          </string-name>
          ,
          <article-title>Pytrec_eval: An extremely fast python interface to trec_eval</article-title>
          , in: SIGIR, ACM,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          , et al.,
          <article-title>A survey on evaluation of large language models</article-title>
          ,
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Perevalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Both</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Ngonga Ngomo</surname>
          </string-name>
          ,
          <article-title>Multilingual question answering systems for knowledge graphs-a survey</article-title>
          ,
          <source>Semantic Web</source>
          <volume>15</volume>
          (
          <year>2024</year>
          )
          <fpage>2089</fpage>
          -
          <lpage>2124</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          , et al.,
          <article-title>Qwen technical report</article-title>
          , arXiv preprint arXiv:2309.16609
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Dettmers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pagnoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holtzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>QLoRA: Efficient finetuning of quantized LLMs</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>36</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>