<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Termite Italian Text-to-SQL: A CALAMITA Challenge</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Federico</forename><surname>Ranaldi</surname></persName>
							<email>federico.ranaldi99@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="department">Human-Centric ART</orgName>
								<orgName type="institution">University of Rome Tor Vergata</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Elena</forename><forename type="middle">Sofia</forename><surname>Ruzzetti</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Human-Centric ART</orgName>
								<orgName type="institution">University of Rome Tor Vergata</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Dario</forename><surname>Onorati</surname></persName>
							<affiliation key="aff2">
								<orgName type="institution">University of Rome La Sapienza</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Fabio</forename><forename type="middle">Massimo</forename><surname>Zanzotto</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Human-Centric ART</orgName>
								<orgName type="institution">University of Rome Tor Vergata</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Leonardo</forename><surname>Ranaldi</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Human-Centric ART</orgName>
								<orgName type="institution">University of Rome Tor Vergata</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">School of Informatics</orgName>
								<orgName type="institution">University of Edinburgh</orgName>
								<address>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff3">
								<address>
									<addrLine>Dec 04 -06</addrLine>
									<postCode>2024</postCode>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Termite Italian Text-to-SQL: A CALAMITA Challenge</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">6038ABCAE59CB81CA7EA59E9C31B057F</idno>
					<idno type="arXiv">arXiv:2402.12554</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:34+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Text-to-SQL</term>
					<term>Italian LLMs</term>
					<term>CALAMITA</term>
					<term>CLiC-it</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Relational databases play an important role in business, science, and beyond. However, the operability of relational databases is restricted to users familiar with specific languages such as SQL, which limits the analytical power they could deliver. Although earlier techniques and large-scale datasets have been proposed to automatically generate SQL from natural language (the Text-to-SQL task), they are predominantly built in English and automatically constructed from surface web data. This limits evaluation and use in settings beyond English, and also limits fair assessment: given the origin of the datasets, the data may have already been seen in pre-training corpora.</p><p>In this work, we introduce Termite, a definitely unseen resource for evaluating Text-to-SQL in Italian. Specifically, we transfer evaluation pipelines beyond English, proposing novel, definitely unseen resources that avoid data-contamination phenomena while assessing the ability of models to perform Text-to-SQL tasks when natural language queries are written in Italian. We establish an evaluation grid based on execution accuracy. Our code and datasets are available at link.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Text-to-SQL is an important NLP task that maps input questions to meaningful, executable SQL queries, enabling users to interact with databases in a more intuitive and user-friendly way. Despite the substantial number of state-of-the-art systems <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3]</ref> and benchmarks <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6]</ref> for Text-to-SQL, most are in English, which limits their operability for non-English users.</p><p>Dou et al. <ref type="bibr" target="#b4">[5]</ref> proposed extensions of Spider <ref type="bibr" target="#b3">[4]</ref> beyond English. These still have significant limitations, because the language-specific resources were generated through automatic translation and cover only a few languages. On the other hand, publicly released resources could be translated and adapted to the Text-to-SQL task, but these are a likely vector of contamination, as they are often publicly available (e.g., Kaggle or Wikipedia, as in the case of <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b6">7]</ref>). Indeed, portions of these resources are included in the huge corpora employed for the pre-training of large language models (LLMs), i.e., the data-contamination phenomenon <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b11">12]</ref>.</p><p>To tackle these problems, in the context of CALAMITA <ref type="bibr" target="#b12">[13]</ref> we propose Termite (Text-to-SQL Repository Made Invisible to Engines), a novel Text-to-SQL resource created and conceived for Italian. 
We aim to reduce the possibility of performance gains due to data contamination while proposing a suitable resource for a specific language. In fact, in contrast to methods that translate native English benchmarks, Termite is designed to be used as an assessment pipeline: it remains a resource not exposed to search engines, as it is locked by an encryption key distributed with the dataset, reducing the risk of accidental inclusion in the training sets of new commercial or research LLMs.</p><p>Termite is structurally designed to resemble Spider. However, it complements Spider's extensions into other languages by proposing a series of databases originally hand-crafted in Italian. Specifically, part of the Termite content comes from a thorough reworking of databases initially designed by students of the University of Rome Tor Vergata. This aspect, combined with the invisibility to search engines, makes Termite a valuable resource for evaluating models on a practical and theoretically significant task.</p><p>Moreover, evaluating Text-to-SQL models in languages beyond English is essential for broadening their practical use and understanding their linguistic behavior. Assessing how these models handle the same problem presented in different languages is critical for gaining insights into their adaptability and consistency across multilingual contexts <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background</head><p>In this section, we provide a formal problem definition of Text-to-SQL (§2.1), addressing the typical aspects that make it more than a natural language understanding or code generation problem. Then, we discuss the potential impact of data contamination on this task and how our Termite serves as a measure against it, outlining several considerations that mitigate contamination risks (§2.2). Finally, in §2.3 we introduce the challenge that leverages our contribution through the Termite resource.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">The Task</head><p>Text-to-SQL is a fundamental task within Natural Language Processing (NLP) that involves not only understanding natural language questions and generating the corresponding SQL code, but also establishing a mapping between data expressed in natural language and data represented within the database schema. This requires the model to accurately link natural language terms with database structures such as tables, columns, and values, making it a more complex challenge than simple code generation or natural language understanding. The task is crucial in making relational database interactions accessible to users who may not be familiar with SQL syntax. The foundational work was based on rule-based and heuristic approaches <ref type="bibr" target="#b0">[1]</ref>, among others. Automatic Text-to-SQL pipelines became truly effective with the advent of neural network-based approaches. The shift towards neural models was facilitated by the introduction of resources such as Spider <ref type="bibr" target="#b3">[4]</ref> and the more recent <ref type="bibr" target="#b16">[17]</ref>, which delivered varied and complex natural-language-to-SQL demonstrations.</p><p>The most recent advancements in Text-to-SQL involve the use of Large Language Models (LLMs), which have demonstrated remarkable capabilities in handling various tasks without needing specific pre-training or fine-tuning tailored to each task.</p><p>Gao et al. 
<ref type="bibr" target="#b17">[18]</ref> and Pourreza and Rafiei <ref type="bibr" target="#b2">[3]</ref> showed that GPT models are effective Text-to-SQL coders on Spider, widely acknowledged as an effective benchmark for assessing performance on this specific task. On the same dataset, approaches that decompose the problem into smaller sub-problems via in-context learning have also been examined <ref type="bibr" target="#b2">[3]</ref>.</p><p>The emergence of LLMs as a key paradigm for the Text-to-SQL task has also led to a more in-depth study of various prompt engineering methods. These efforts aim to understand what best enhances a model's performance in Text-to-SQL translation. In <ref type="bibr" target="#b18">[19]</ref>, the performance of the GPT family is evaluated across different prompt scenarios, which vary in how much information about the database is provided to the model for the translation process. Results show that providing a specific set of additional information significantly improves the model's ability to generate accurate SQL queries <ref type="bibr" target="#b18">[19]</ref>.</p><p>This last aspect highlights how LLMs appear to be behaviourally influenced both by the in-context prompt <ref type="bibr" target="#b19">[20]</ref> and by the text used during pre-training <ref type="bibr" target="#b10">[11]</ref>. Consequently, if LLMs perform better on tasks whose data were already seen during the pre-training phase, we face an issue of data contamination.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Data Contamination in Modern Benchmarks</head><p>Data contamination is an increasingly recognized challenge in the field of machine learning, with a growing number of studies dedicated to its investigation. Several recent studies, such as <ref type="bibr" target="#b20">[21]</ref> and <ref type="bibr" target="#b21">[22]</ref>, have explored the issue of data contamination, proposing a comprehensive taxonomy of methods to detect and address it. Due to its nature, the Text-to-SQL task is susceptible to overestimation issues, particularly those related to data contamination. Therefore, a good practice when evaluating a model on this task is to ensure that there is no overlap between the test data and the pre-training data. On the other hand, this becomes challenging when dealing with closed-source models, where there is no clear knowledge of the pre-training data, as in the case of the GPT family <ref type="bibr" target="#b22">[23]</ref>. Hence, taking inspiration from Golchin and Surdeanu <ref type="bibr" target="#b23">[24]</ref> and Deng et al. <ref type="bibr" target="#b24">[25]</ref>, who treated the issue of data contamination in closed-source models, Ranaldi et al. <ref type="bibr" target="#b11">[12]</ref> proposed a novel method for detecting data contamination applied to Text-to-SQL. This consists of carefully comparing the model's performance on a novel test set (such as Termite) with that on a well-known test set (such as Spider), whose content is suspected to have been exposed to the model during pre-training. The results showed that GPT models exhibit a drop in performance on Termite compared to Spider. Furthermore, it was observed that even perturbing Spider by removing information from the dump provided with the prompt had no significant impact on performance. 
The study of test-set contamination continues to expand to other tasks, to the extent that an index of contaminated datasets <ref type="bibr" target="#b25">[26]</ref> has been established.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Termite</head><p>Our contribution complements <ref type="bibr" target="#b11">[12]</ref> in particular by introducing Termite. We aim to provide an Italian text-to-SQL dataset and a tool for analysing the contamination of Spider data for LLMs. Indeed, the structural complexity of Termite mirrors that of the Spider test set. Moreover, to prevent data contamination from compromising its usefulness, it is freely accessible, but its content is not provided in a fully transparent form.</p><p>In the following sections, we describe the composition of Termite in detail and provide a basic evaluation to facilitate usability and reproducibility. In addition, to encourage usability, we share the resources and code.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Dataset</head><p>Our main intent is to provide an evaluation resource for Text-to-SQL on data that is definitely unseen and, therefore, not present in well-known pre-training corpora. However, since several robust evaluation pipelines exist in the state of the art, the first step is understanding their structure and operation. Therefore, beyond the de-facto standard resources (§3.1), we introduce our Termite, conceived as a novel unseen Italian resource (§3.2).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Spider: Characteristics and Content</head><p>Among the best-known Text-to-SQL resources is Spider <ref type="bibr" target="#b3">[4]</ref>, the de-facto standard for training and testing systems on the Text-to-SQL task.</p><p>Spider is a collection of databases with associated sets of pairs of natural language (NL) questions and their corresponding SQL translations. Databases are represented inside the dataset in the form of SQL dumps, which include the CREATE TABLE operations and a limited number of INSERT DATA operations for each table.</p><p>NL questions are organized into four difficulty levels: EASY, MEDIUM, HARD, and EXTRA-HARD. For the definition of the hardness level, we refer to the categorization originally made in Spider <ref type="bibr" target="#b3">[4]</ref>. The difficulty of an NL question is assessed by considering the corresponding SQL query. Hence, the difficulty correlates with the number and kind of operations that the gold query contains: the presence of JOIN operations, aggregations, and WHERE conditions contributes to the hardness of the query. EASY queries do not involve more than one table. MEDIUM and HARD queries span multiple tables: MEDIUM queries contain only a JOIN or an aggregation operation, whereas HARD queries are more complex in terms of the number of JOINs and aggregations. Finally, EXTRA-HARD queries may contain nested queries and other operators like UNION and INTERSECT 1 .</p></div>
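The hardness levels above can be approximated with a simple heuristic over the gold SQL string. The following sketch is our own illustration (the function name and thresholds are assumptions, not Spider's official evaluation script):

```python
import re

def hardness(sql: str) -> str:
    """Rough, illustrative approximation of Spider-style hardness levels."""
    s = sql.upper()
    joins = s.count(" JOIN ")
    aggs = len(re.findall(r"\b(COUNT|SUM|AVG|MIN|MAX)\s*\(", s))
    nested = s.count("SELECT") > 1          # sub-queries
    set_ops = any(op in s for op in (" UNION ", " INTERSECT ", " EXCEPT "))
    if nested or set_ops:
        return "EXTRA-HARD"
    if joins == 0 and aggs == 0:
        return "EASY"                        # single table, no aggregation
    if joins + aggs == 1:
        return "MEDIUM"                      # only a JOIN or an aggregation
    return "HARD"
```

Real Spider hardness also weighs WHERE conditions, GROUP BY, and ORDER BY clauses; this sketch only captures the coarse distinctions described above.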
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Termite: a Text-to-SQL Repository Made Invisible to Engines</head><p>The driving idea behind proposing a novel resource for the Text-to-SQL task is to reduce the possibility of boosting performance due to data contamination. Indeed, publicly available datasets are not suitable for this purpose. Even when novel datasets are made available, they are built from publicly open-access resources such as Kaggle or Wikipedia (this is the case for recently developed datasets like BIRD <ref type="bibr" target="#b6">[7]</ref> or Spider itself). Hence, these do not guarantee that they are as new as required. The same issue may also affect hidden test sets. Moreover, since freely available datasets are easily accessed and tracked by engines, they are at risk of being contaminated in the near future, if they are not contaminated already.<note place="foot" n="1">More details are available on the official Spider repository.</note></p><p>To address these challenges, we propose Termite<ref type="foot" target="#foot_0">2</ref>. Termite aims to be a permanently fresh dataset. Termite will be invisible to search engines since it is locked under an encryption key delivered along with the resource. This reduces the risk of accidental inclusion in a novel training set for commercial or research GPTs.</p><p>Hence, following the characteristics of Spider, Termite contains hand-crafted databases in different domains. Each database has a balanced set of NL-SQL query pairs: we defined an average of 5 queries per hardness level. The entire dataset was designed to be comparable to the Spider Validation Set, not only in terms of database characteristics such as size and table count (Table <ref type="table" target="#tab_0">1</ref>) but also in terms of query difficulty, which was measured using the same definition provided by Spider. 
Moreover, as in Spider, during the construction of Termite we took care to write unambiguous, direct NL questions that a model can solve relying only on its linguistic proficiency and an analysis of the schema, with no external knowledge needed. The style adopted in the NL questions is plain and colloquial, in line with the style of Spider's NL questions. Spider and Termite are also comparable in terms of the number of tables and columns in each dataset. We curated the column names to make them similar to the ones in Spider, using a similar percentage of abbreviations and compound names (see Table <ref type="table" target="#tab_0">1</ref>). This equivalence is crucial to limit the influence of the dataset itself on the following evaluations and is further explored in Section 4.2.</p><p>However, there is a significant and fundamental difference between the two datasets, as Termite is neither openly available on the web nor easily retrievable, nor built on pre-existing openly available resources.</p><p>This aspect is crucial because the way it is made available substantially reduces the risk of falling into the LM contamination index (<ref type="bibr" target="#b25">[26]</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Comparing Hardness of Termite vs. Spider</head><p>When introducing a new dataset for benchmarking a particular task, it is important to ensure it aligns with the established and commonly used datasets within the community, to maintain consistency and comparability. Our Termite is designed to resemble Spider in terms of measurable aspects, like the number of columns and tables per database, as well as the lexicon used in the schema definition. However, it remains difficult to quantify via simple statistics how hard it is to translate a natural language question into an SQL statement.</p><p>To compare the hardness of Termite and Spider, we adopted a human-centered definition: if humans can translate questions into SQL queries on both Spider and Termite with the same level of challenge, then their hardness, at least for a SQL-proficient human annotator, is the same.</p><p>Therefore, ten annotators were asked to judge the equivalence, in terms of hardness, of the SQL translations that compose Spider and Termite by examining a random sample of queries from both datasets.</p><p>To measure the hardness of the two datasets, we designed a simple test. Given the Entity-Relationship schema of a database and a question in natural language, each annotator is asked to choose the correct SQL translation of the question among three options. The Appendix presents details on the construction of the test.</p><p>On both Spider and Termite, taking as joint annotation the answer chosen by the majority of annotators leads to almost perfect classification (0.975 accuracy on Spider and maximum accuracy on Termite). The average accuracy per annotator is 0.91 (±0.05) on Spider and 0.94 (±0.07) on Termite. Moreover, Fleiss' kappa coefficients are rather high for both Spider and Termite (0.79 and 0.85, respectively). Hence, we can conclude that humans do not find one dataset more difficult than the other. 
The two datasets can then be considered equivalent in terms of the hardness of translations.</p></div>
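The agreement figures above rely on Fleiss' kappa. As a minimal illustration (our own helper, not the authors' code), the coefficient can be computed from an items-by-categories count matrix as follows:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa. `counts` is a matrix with one row per item and one
    column per category; each cell holds how many annotators chose that
    category for that item."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    # observed agreement, averaged over items
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # chance agreement from the overall category proportions
    n_cats = len(counts[0])
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(n_cats)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)
```

With perfect agreement the coefficient is 1; values around 0.79 and 0.85, as reported above, indicate substantial to almost perfect agreement on the conventional Landis-Koch scale.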
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Methods</head><p>Current evaluation pipelines probe the behaviour of models by defining robust prompting strategies, since the generations these models deliver are strongly correlated with the in-context structures <ref type="bibr" target="#b18">[19]</ref>.</p><p>Thus, in §4.1, we introduce the suggested prompting technique for the Text-to-SQL task for an initial exploration of Termite. Furthermore, in §4.2, we define Execution Accuracy as the evaluation metric of choice for evaluating the model, as it offers a practical method for determining the correctness of SQL query generation within this framework.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Prompting LLMs in Italian for Text-to-SQL Translation</head><p>Given instructions in natural language, LLMs can translate a request into code (i.e., SQL queries) that answers it. Specifically, text-generation models have been trained to process both natural language and code, and they produce text-based outputs in response to the inputs they receive. For this reason, it is possible to frame Text-to-SQL as a translation task: given the dump of a database and a query in natural language, the model is asked to translate the latter into the corresponding SQL query, referring to the tables and columns of the considered database. The desideratum is an executable query, semantically equivalent to a gold human-generated query. In the next paragraphs, we first describe how GPT-3.5 (gpt-3.5-turbo) is prompted in order to obtain the translations.</p><p>Text-to-SQL as a Translation Task The OpenAI API enables interrogating a model in a multi-turn conversation format: chat models receive a series of messages as input and generate a message as output. We test the ability of GPT-3.5 on the Text-to-SQL task by framing each translation from natural language to SQL as a separate conversation.</p><p>The proposed approach, aimed at analysing the model's in-context learning abilities in zero-shot scenarios, is very similar to "Code Representation" <ref type="bibr" target="#b18">[19]</ref> and has been specifically tested in Italian <ref type="bibr" target="#b8">[9]</ref>.</p><p>In particular, the first message for a target database gives the model the dump of the database. In each dump, information about the database's tables is provided by the CREATE TABLE statements, which also encode the primary- and foreign-key constraints. In addition, some realistic data to fill the tables are provided by INSERT instructions. 
Given the dump, the model answers by producing an interpretation of it. Typically, this response contains an explanation of the dump's contents. For example, considering the database bowling in the Termite dataset, the first messages in the conversation are the following:</p><formula xml:id="formula_0">user: Considera il seguente database: ("Consider the following database:") CREATE TABLE "pista" [...]; CREATE TABLE "giocatori" [...];</formula><p>GPT-3.5: Questo database rappresenta una struttura per la gestione di un centro di bowling... ("This database represents a structure for managing a bowling center...") Then, given the dump and the model's interpretation of it, a message containing the natural language question to be translated is sent. In particular, the selected prompt ensures that the model translates natural language questions into SQL queries with a limited amount of non-SQL text. These steps are repeated for each question separately, so that translations are obtained independently. However, to ensure that the model's understanding of each database is comparable across all questions, the database dump and the same interpretation initially produced by the model are sent as context, in the form of preceding messages, before each translation is requested. Hence, building on the previous example, a conversation to translate a question on the bowling database would be completed by the following messages: user: Traduci in SQL la seguente query. Rispondi usando esclusivamente linguaggio SQL. Conta il numero di giocatori per partita. ("Translate the following query into SQL. Answer using only SQL. Count the number of players per game.")</p><p>GPT-3.5: SELECT ora_inizio,tenuta_il,id_pista, COUNT(*) FROM 'partita' GROUP BY ora_inizio,tenuta_il,id_pista;</p></div>
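The conversation layout described above can be sketched as follows; the helper name, exact message wording, and the commented-out API call are our assumptions rather than the authors' released code:

```python
def build_conversation(dump: str, interpretation: str, question: str) -> list:
    """Assemble the multi-turn context described above: database dump ->
    model's interpretation -> translation request (prompts in Italian)."""
    return [
        {"role": "user", "content": "Considera il seguente database:\n" + dump},
        {"role": "assistant", "content": interpretation},
        {"role": "user", "content": (
            "Traduci in SQL la seguente query. "
            "Rispondi usando esclusivamente linguaggio SQL.\n" + question)},
    ]

# With the OpenAI chat API, this message list would then be sent as, e.g.:
# client.chat.completions.create(model="gpt-3.5-turbo",
#                                messages=msgs, temperature=1)
```

Because the dump and the cached interpretation are replayed as preceding messages for every question, each translation sees an identical context, as the section above requires.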
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Measuring Hardness of queries in Spider and Termite</head><p>We need to ensure that Spider and Termite are comparable in hardness. Termite is designed with a similar annotation protocol; however, similarity in terms of the hardness of the natural language questions used is hard to quantify. For this reason, we asked 10 SQL-proficient annotators to perform a simple yet effective test measuring how difficult it is for them to translate questions from both Spider and Termite. The main idea is that if they can translate Spider and Termite questions with the same level of accuracy, then the challenge level is similar on both datasets.</p><p>In particular, given an E-R database schema and a natural language utterance, each test question asks the annotator to choose, from three SQL query options, the one that satisfies the request. All three options are syntactically correct SQL queries, but the incorrect answers are semantically different from the correct one. The authors designed the first incorrect option by perturbing the correct answer: removing or replacing some operations or retrieved columns and changing field and table names with non-matching ones. The second incorrect answer is another query extracted from the same dataset as the correct one: the query most similar to the correct one under a Bag-of-Words (BoW) assumption. To retrieve this second distractor, the similarity of two queries is measured via the cosine similarity of their BoW vector representations.</p><p>The complete test is composed of 20 randomly selected queries from each dataset; hence, the resulting 40 questions are shared with the 10 SQL-proficient annotators: 60% of them are Computer Science Master's students, while the remaining ones already hold a degree. Five annotators work in a field that requires daily use of the SQL query language. Finally, we divided the test into two trials of 20 queries each. 
We administered the trials to the annotators at two different times to limit errors due to gradual loss of concentration.</p><p>Our approach is completely zero-shot, to minimize the effect that the prompt itself, rather than data contamination, can have on performance. Once the translation process is completed, the SQL code produced by the model is retrieved to evaluate whether the generated query satisfies the natural language query.</p></div>
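The bag-of-words distractor selection described above can be sketched as follows (a minimal illustration; whitespace tokenization and the helper names are our assumptions):

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two SQL strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def most_similar_distractor(gold: str, pool: list) -> str:
    """Pick the pool query closest to the gold one (the gold itself excluded),
    i.e. the second distractor described above."""
    return max((q for q in pool if q != gold), key=lambda q: bow_cosine(gold, q))
```

Choosing the most BoW-similar query as a distractor makes the option hard to dismiss on surface form alone, so the annotator has to reason about query semantics.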
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Execution Accuracy: the Evaluation Metric</head><p>The evaluation metric adopted is execution accuracy, introduced by Yu et al. <ref type="bibr" target="#b3">[4]</ref>, which assesses the correctness of the generated SQL query by executing it against the database and comparing the result with the expected output.</p><p>Execution Accuracy (EA) can be formally defined as follows:</p><p>Let 𝑞 represent the gold query and 𝑔 the generated query. Execution accuracy compares the execution results of 𝑔 and 𝑞 on a database 𝐷.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>𝐸𝐴(𝑔, 𝑞, 𝐷)</head><formula xml:id="formula_1">= 1 if 𝑔(𝐷) = 𝑞(𝐷), 0 if 𝑔(𝐷) ≠ 𝑞(𝐷)</formula><p>where 𝑔(𝐷) and 𝑞(𝐷) represent the outputs of the queries on 𝐷. Execution accuracy is 1 if the results are the same and 0 otherwise.</p><p>If the generated SQL query contains syntactic errors, it is considered definitively incorrect, as adherence to SQL grammar is part of the model's evaluation.</p><p>The execution accuracy metric is prone to false positives, as two different queries can return the same output under specific database record configurations. For this reason, in <ref type="bibr" target="#b11">[12]</ref>, the Test Suite Accuracy metric is adopted. Test Suite Accuracy, introduced in Zhong et al. <ref type="bibr" target="#b26">[27]</ref>, essentially involves computing execution accuracy for the same query across many randomly generated database record configurations, called a test suite.</p><p>In this paper, we propose EA as the evaluation metric because the way queries and database records are designed in Termite aims to minimize the occurrence of false positives. Additionally, to encourage experimentation with Termite, we recommend initially employing simple and computationally inexpensive evaluation metrics, in contrast to Test Suite Accuracy. Moreover, we suggest disregarding the query difficulty evaluation metric proposed by <ref type="bibr" target="#b3">[4]</ref>.</p><p>Hence, an automated script that evaluates generated SQL queries using Execution Accuracy is available at link. It can be run locally, as it is a lightweight program that executes queries on an SQL server and processes the output as our metric requires.</p></div>
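The metric above can be sketched with an in-memory SQLite database. This is our own minimal illustration, under two assumptions: the released script may use a different SQL engine, and here rows are compared ignoring order:

```python
import sqlite3

def execution_accuracy(generated: str, gold: str, setup_sql: str) -> int:
    """Return 1 if the generated and gold queries yield the same result on the
    database built by `setup_sql`, else 0. Rows are compared ignoring order
    (a simplifying assumption); invalid generated SQL counts as incorrect."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)  # CREATE TABLE + INSERT statements
        gold_rows = sorted(conn.execute(gold).fetchall())
        try:
            gen_rows = sorted(conn.execute(generated).fetchall())
        except sqlite3.Error:
            return 0  # syntactic errors are definitively incorrect
        return int(gen_rows == gold_rows)
    finally:
        conn.close()
```

Averaging this 0/1 score over all NL-SQL pairs of a database yields the per-database EA percentages reported in the experiments.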
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experiments</head><p>Our Termite aims to extend the Text-to-SQL evaluation pipeline to Italian while preserving data integrity and thus preventing possible contamination. To prove its operability, we propose a baseline assessment in §5.1 and discuss the obtained results in §5.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Experimental Setup</head><p>We systematically evaluated the performance of GPT-3.5 (gpt-3.5-turbo-16k) on the Termite dataset for the Text-to-SQL task. We employed the API to generate SQL translations for each query in the dataset, setting the temperature parameter to 1, which allows for flexibility and diversity in the model's output. For each natural language query, a translation request was sent to the model. The generated SQL query was then saved and subsequently processed according to the aforementioned metric (§4.2). </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Baseline Results</head><p>The results achieved in the baseline assessment reveal the intrinsic difficulty of the Text-to-SQL task. Table <ref type="table" target="#tab_1">2</ref> reports the Execution Accuracy percentages (EA_SCORE (%)) achieved by GPT-3.5 on each of the 10 databases that compose our Termite. An accuracy significantly exceeding 50% is observed only for the "galleria" and "farma" databases, which reach 69.15% and 62.50%, respectively.</p></div>
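For orientation, the per-database figures in Table 2 can be aggregated into a single overall score. The snippet below is our own back-of-the-envelope computation over those published numbers (the aggregate itself is not a figure reported above): it derives both the query-weighted and the unweighted (macro) mean EA.

```python
# Per-database (EA_SCORE %, number of queries) pairs, copied from Table 2.
results = {
    "bowling": (50.79, 24), "centri": (56.25, 19), "coronavirus": (40.00, 20),
    "farma": (62.50, 20), "farmacia": (50.00, 20), "galleria": (69.15, 23),
    "hackathon": (46.25, 19), "pratica": (50.11, 22), "recensioni": (20.00, 18),
    "voli": (56.25, 17),
}

total_queries = sum(n for _, n in results.values())  # matches the 202 queries of Table 1
weighted_ea = sum(s * n for s, n in results.values()) / total_queries
macro_ea = sum(s for s, _ in results.values()) / len(results)
```

Both aggregates land close to 50%, consistent with the per-database picture of a challenging benchmark.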
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Limitations &amp; Future Works</head><p>The idea of Termite is to propose a new resource conceived and realized for the Italian language. In discussing our contribution, we presented the motivations that support our choices regarding encryption and baseline evaluation.</p><p>In future developments, however, we plan to extend our contribution to languages beyond Italian. We also aim to propose efficient alignment techniques that enable smaller models to cope with demanding tasks such as Text-to-SQL, adopting teacher-student alignment techniques <ref type="bibr" target="#b27">[28,</ref><ref type="bibr" target="#b28">29]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusions</head><p>We have introduced Termite, a resource that, to the best of our knowledge, is unique in that its databases and queries were natively conceived in Italian. Its structural alignment with well-known datasets like Spider makes it a solid benchmarking tool for analysing Text-to-SQL results when the test set languages differ.</p><p>Additionally, its uniqueness lies in the fact that it is not publicly accessible by search engines, making it less exposed to the increasingly prominent issue of data contamination, particularly when dealing with closed-source large language models.</p><p>Extending Termite to include queries whose complexity is driven not only by the SQL query itself but also by tasks such as commonsense and arithmetic reasoning would further enrich the dataset. This is in line with approaches like Archer <ref type="bibr" target="#b29">[30]</ref>, which address these additional challenges.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Spider and Termite fact sheet. Termite is designed to be comparable to the validation set of Spider.</figDesc><table><row><cell></cell><cell cols="2">Dataset</cell></row><row><cell></cell><cell cols="2">Spider Termite</cell></row><row><cell>#DB</cell><cell>20</cell><cell>10</cell></row><row><cell>avg #TABLES per DB</cell><cell>4.2</cell><cell>4.0</cell></row><row><cell>avg #COLUMNS per TABLE</cell><cell>5.46</cell><cell>5.56</cell></row><row><cell>#QUERY</cell><cell>1035</cell><cell>202</cell></row><row><cell>avg #QUERY per DB</cell><cell>51.75</cell><cell>20.2</cell></row><row><cell>avg #FK/#COLUMNS per DB</cell><cell>0.16</cell><cell>0.13</cell></row><row><cell>avg #Compound/#COLUMNS per DB</cell><cell>0.63</cell><cell>0.51</cell></row><row><cell>avg #Abbr/#COLUMNS per DB</cell><cell>0.10</cell><cell>0.12</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Execution Accuracy (EA_SCORE (%)) achieved by GPT-3.5 and Number of Queries for each Database</figDesc><table><row><cell>Name</cell><cell>EA_SCORE (%)</cell><cell>Queries</cell></row><row><cell>bowling</cell><cell>50.79</cell><cell>24</cell></row><row><cell>centri</cell><cell>56.25</cell><cell>19</cell></row><row><cell>coronavirus</cell><cell>40.00</cell><cell>20</cell></row><row><cell>farma</cell><cell>62.50</cell><cell>20</cell></row><row><cell>farmacia</cell><cell>50.00</cell><cell>20</cell></row><row><cell>galleria</cell><cell>69.15</cell><cell>23</cell></row><row><cell>hackathon</cell><cell>46.25</cell><cell>19</cell></row><row><cell>pratica</cell><cell>50.11</cell><cell>22</cell></row><row><cell>recensioni</cell><cell>20.00</cell><cell>18</cell></row><row><cell>voli</cell><cell>56.25</cell><cell>17</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">The repository is available here under GPL-3.0 license. To access, use the password "youshallnotpass".</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>We would like to express our gratitude to the Human-Centric ART team for their valuable collaboration in the creation of the Termite dataset. Special thanks go to the annotators, whose work was essential in affirming the comparability between Termite and Spider. Finally, we extend our appreciation to the Computer Science students of the University of Rome Tor Vergata for providing the original hand-crafted databases, which were subsequently the subject of extensive reworking and refinement.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Translating questions to SQL queries with generative parsers discriminatively reranked</title>
		<author>
			<persName><forename type="first">A</forename><surname>Giordani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Moschitti</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/C12-2040" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of COLING 2012: Posters, The COLING 2012 Organizing Committee</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Kay</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Boitet</surname></persName>
		</editor>
		<meeting>COLING 2012: Posters, The COLING 2012 Organizing Committee<address><addrLine>Mumbai, India</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="401" to="410" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">PICARD: Parsing incrementally for constrained auto-regressive decoding from language models</title>
		<author>
			<persName><forename type="first">T</forename><surname>Scholak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Schucher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bahdanau</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.emnlp-main.779</idno>
		<ptr target="https://aclanthology.org/2021.emnlp-main.779" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and</title>
				<editor>
			<persName><forename type="first">M.-F</forename><surname>Moens</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><forename type="middle">W.-T</forename><surname>Yih</surname></persName>
		</editor>
		<meeting>the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and<address><addrLine>Punta Cana, Dominican Republic</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="9895" to="9901" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">DIN-SQL: Decomposed in-context learning of text-to-SQL with self-correction</title>
		<author>
			<persName><forename type="first">M</forename><surname>Pourreza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Rafiei</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=p53QDxSIc5" />
	</analytic>
	<monogr>
		<title level="m">Thirty-seventh Conference on Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task</title>
		<author>
			<persName><forename type="first">T</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yasunaga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Radev</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D18-1425</idno>
		<ptr target="https://aclanthology.org/D18-1425" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">E</forename><surname>Riloff</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Chiang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Hockenmaier</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tsujii</surname></persName>
		</editor>
		<meeting>the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics<address><addrLine>Brussels, Belgium</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="3911" to="3921" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Multispider: Towards benchmarking multilingual text-to-sql semantic parsing</title>
		<author>
			<persName><forename type="first">L</forename><surname>Dou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Che</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-G</forename><surname>Lou</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2212.13492" />
		<idno type="arXiv">arXiv:2212.13492</idno>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Can LLM already serve as a database interface? a BIg bench for large-scale database grounded text-to-SQLs</title>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Hui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Qu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Geng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Huo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=dI4wzAE6uV" />
	</analytic>
	<monogr>
		<title level="m">Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Hui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Qu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Geng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Huo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">C C</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.03111</idno>
		<title level="m">Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">I</forename><surname>Magar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Schwartz</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2203.08242</idno>
		<title level="m">Data contamination: From memorization to exploitation</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Prompting llms in italian language for text-to-sql translation</title>
		<author>
			<persName><forename type="first">F</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">S</forename><surname>Ruzzetti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of CLIC 2023</title>
				<meeting>CLIC 2023</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">The dark side of the language: Pretrained transformers in the DarkNet</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Nourbakhsh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">S</forename><surname>Ruzzetti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Patrizi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Onorati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mastromattei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Fallucchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">M</forename><surname>Zanzotto</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2023.ranlp-1.102" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing</title>
				<editor>
			<persName><forename type="first">R</forename><surname>Mitkov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Angelova</surname></persName>
		</editor>
		<meeting>the 14th International Conference on Recent Advances in Natural Language Processing<address><addrLine>Varna, Bulgaria</addrLine></address></meeting>
		<imprint>
			<publisher>INCOMA Ltd</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="949" to="960" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Pre-Cog: Exploring the relation between memorization and performance in pre-trained language models</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">S</forename><surname>Ruzzetti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">M</forename><surname>Zanzotto</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2023.ranlp-1.103" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing</title>
				<editor>
			<persName><forename type="first">R</forename><surname>Mitkov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Angelova</surname></persName>
		</editor>
		<meeting>the 14th International Conference on Recent Advances in Natural Language Processing<address><addrLine>Varna, Bulgaria</addrLine></address></meeting>
		<imprint>
			<publisher>INCOMA Ltd</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="961" to="967" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Investigating the impact of data contamination of large language models in text-to-SQL translation</title>
		<author>
			<persName><forename type="first">F</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">S</forename><surname>Ruzzetti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Onorati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Giannone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Favalli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Romagnoli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">M</forename><surname>Zanzotto</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2024.findings-acl.827" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics ACL 2024, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">L.-W</forename><surname>Ku</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Martins</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Srikumar</surname></persName>
		</editor>
		<meeting><address><addrLine>Bangkok, Thailand</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="13909" to="13920" />
		</imprint>
	</monogr>
	<note>and virtual meeting</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian</title>
		<author>
			<persName><forename type="first">G</forename><surname>Attanasio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Borazio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Croce</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Francis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gili</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Musacchio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nissim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Patti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rinaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Scalena</surname></persName>
		</author>
		<ptr target="https://ceur-ws.org" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<meeting>the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)<address><addrLine>Pisa, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024-12-06">December 4-6, 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Does the English matter? elicit cross-lingual abilities of large language models</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pucci</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.mrl-1.14</idno>
		<ptr target="https://aclanthology.org/2023.mrl-1.14" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL), Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Ataman</surname></persName>
		</editor>
		<meeting>the 3rd Workshop on Multi-lingual Representation Learning (MRL), Association for Computational Linguistics<address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="173" to="183" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">A tree-of-thoughts to broaden multi-step reasoning across languages</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pucci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">S</forename><surname>Ruzzetti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">M</forename><surname>Zanzotto</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2024.findings-naacl.78</idno>
		<ptr target="https://aclanthology.org/2024.findings-naacl.78" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: NAACL 2024, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Duh</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Gomez</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Bethard</surname></persName>
		</editor>
		<meeting><address><addrLine>Mexico City, Mexico</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="1229" to="1241" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Empowering crosslingual abilities of instruction-tuned large language models by translation-following demonstrations</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pucci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Freitas</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2024.findings-acl.473" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics ACL 2024, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">L.-W</forename><surname>Ku</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Martins</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Srikumar</surname></persName>
		</editor>
		<meeting><address><addrLine>Bangkok, Thailand</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="7961" to="7973" />
		</imprint>
	</monogr>
	<note>and virtual meeting</note>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Hui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Qu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Geng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Huo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">C C</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2305.03111" />
		<idno type="arXiv">arXiv:2305.03111</idno>
		<title level="m">Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhou</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2308.15363</idno>
		<title level="m">Text-to-SQL empowered by large language models: A benchmark evaluation</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>


<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">When large language models contradict humans? Large language models&apos; sycophantic behaviour</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pucci</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2311.09410</idno>
		<ptr target="https://arxiv.org/abs/2311.09410" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Unveiling the spectrum of data contamination in language models: A survey from detection to remediation</title>
		<author>
			<persName><forename type="first">C</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Heng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cohan</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2024.findings-acl.951" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: ACL 2024</title>
				<editor>
			<persName><forename type="first">L.-W</forename><surname>Ku</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Martins</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Srikumar</surname></persName>
		</editor>
		<meeting><address><addrLine>Bangkok, Thailand</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="16078" to="16092" />
		</imprint>
	</monogr>
	<note>and virtual meeting</note>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">How much are large language models contaminated? a comprehensive survey and the llmsanitize library</title>
		<author>
			<persName><forename type="first">M</forename><surname>Ravaut</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Jiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Joty</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2404.00699</idno>
		<ptr target="https://arxiv.org/abs/2404.00699" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">GPT&apos;s family</title>
		<author>
			<persName><surname>OpenAI</surname></persName>
		</author>
		<ptr target="https://platform.openai.com/docs/models" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Golchin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Surdeanu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2308.08493</idno>
		<ptr target="https://arxiv.org/abs/2308.08493" />
		<title level="m">Time travel in LLMs: Tracing data contamination in large language models</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<title level="m" type="main">Investigating data contamination in modern benchmarks for large language models</title>
		<author>
			<persName><forename type="first">C</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gerstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cohan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2311.09783</idno>
		<ptr target="https://arxiv.org/abs/2311.09783" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<ptr target="https://hitz-zentroa.github.io/lm-contamination/" />
		<title level="m">Contaminated datasets index</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note>Accessed 2024-09-23</note>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Semantic evaluation for text-to-SQL with distilled test suites</title>
		<author>
			<persName><forename type="first">R</forename><surname>Zhong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Klein</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.emnlp-main.29</idno>
		<ptr target="https://aclanthology.org/2020.emnlp-main.29" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</title>
				<editor>
			<persName><forename type="first">B</forename><surname>Webber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Cohn</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</editor>
		<meeting>the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="396" to="411" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Aligning large and small language models via chain-of-thought reasoning</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Freitas</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2024.eacl-long.109" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics</title>
		<title level="s">Long Papers</title>
		<editor>
			<persName><forename type="first">Y</forename><surname>Graham</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Purver</surname></persName>
		</editor>
		<meeting>the 18th Conference of the European Chapter of the Association for Computational Linguistics<address><addrLine>St. Julian&apos;s, Malta</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1812" to="1827" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Modeling easiness for training transformers with curriculum learning</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ranaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pucci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">M</forename><surname>Zanzotto</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2023.ranlp-1.101" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing</title>
				<editor>
			<persName><forename type="first">R</forename><surname>Mitkov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Angelova</surname></persName>
		</editor>
		<meeting>the 14th International Conference on Recent Advances in Natural Language Processing<address><addrLine>Varna, Bulgaria</addrLine></address></meeting>
		<imprint>
			<publisher>INCOMA Ltd</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="937" to="948" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<title level="m" type="main">Archer: A human-labeled text-to-SQL dataset with arithmetic, commonsense and hypothetical reasoning</title>
		<author>
			<persName><forename type="first">D</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lapata</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">Z</forename><surname>Pan</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2402.12554" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
