<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Termite Italian Text-to-SQL: A CALAMITA Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Federico Ranaldi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Sofia Ruzzetti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dario Onorati</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Massimo Zanzotto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonardo Ranaldi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Human-Centric ART, University of Rome Tor Vergata</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Informatics, University of Edinburgh</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Rome La Sapienza</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Relational databases play an important role in business, science, and beyond. However, the operability of relational databases is restricted to users familiar with specific languages such as SQL, which limits the analytical power that they could deliver. Although earlier techniques have been proposed to automatically generate SQL from natural language, such as large-scale Text-to-SQL datasets, they are predominantly built in English and are automatically constructed from surface web data. This limits evaluation and use in settings beyond English, and also limits fair assessment, given the origin of the datasets, as the data may have already been seen in pre-training corpora. In this work, we introduce Termite, a definitely unseen resource for evaluating Text-to-SQL in Italian. Specifically, we transfer evaluation pipelines beyond English, proposing novel, definitely unseen resources that avoid data-contamination phenomena while assessing the ability of models to perform Text-to-SQL tasks when natural language queries are written in Italian. We establish an evaluation grid based on execution accuracy. Our code and datasets are available at link.</p>
      </abstract>
      <kwd-group>
        <kwd>Text-to-SQL</kwd>
        <kwd>Italian LLMs</kwd>
        <kwd>CALAMITA</kwd>
        <kwd>CLiC-it</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Text-to-SQL is an important NLP task, which maps input questions to meaningful and executable SQL queries, enabling users to interact with databases in a more intuitive and user-friendly way. Despite the substantial number of state-of-the-art systems [1, 2, 3] and benchmarks [
        <xref ref-type="bibr" rid="ref29">4, 5, 6</xref>
        ] for Text-to-SQL, most of them are in English, and this limits their operability for non-English users.
      </p>
      <p>
        Dou et al. [5] proposed extensions of Spider [
        <xref ref-type="bibr" rid="ref29">4</xref>
        ] beyond English. This still presents significant limitations because the resources in specific languages were generated from automatic translations for a few languages. On the other hand, publicly released resources could be translated and adapted to the Text-to-SQL task, but these could be a source of contamination as they are often publicly available (e.g., Kaggle or Wikipedia, as in the case of [
        <xref ref-type="bibr" rid="ref29">4, 7</xref>
        ]). Indeed, portions of these resources are included in the huge corpora employed to conduct the pre-training phases of large language models (LLMs), i.e., the data-contamination phenomenon [
        <xref ref-type="bibr" rid="ref1">8, 9, 10, 11, 12</xref>
        ].
      </p>
      <p>To tackle these problems, in the context of CALAMITA [13], we propose Termite (Text-to-SQL Repository Made Invisible to Engines), a novel Text-to-SQL resource created and conceived for the Italian language. We aim to reduce the possibility of increased performance due to data contamination while proposing a suitable resource for a specific language. In fact, in contrast to native English benchmark translation methods, Termite is designed to be used as an assessment pipeline, ensuring that it remains a resource not exposed to search engines, as it is locked by an encryption key distributed with the dataset, reducing accidental inclusion in a new commercial or search LLM training set.</p>
      <p>Termite is structurally designed to resemble Spider. However, it complements Spider’s extensions into other languages by proposing a series of databases originally hand-crafted in Italian. Specifically, part of the Termite content comes from a thorough reworking of databases initially designed by students from the University of Rome Tor Vergata. This aspect, enriched by the invisibility to search engines, makes Termite a valuable resource for evaluating models on a practical and theoretically significant task.</p>
      <p>Moreover, evaluating Text-to-SQL models in languages beyond English is essential for broadening their practical use and understanding their linguistic behavior. Assessing how these models handle the same problem presented in different languages is critical for gaining insights into their adaptability and consistency across multilingual contexts [9, 14, 15, 16].</p>
      <sec id="sec-1-1">
        <title>2. Background</title>
        <p>In this section, we provide a formal problem definition of Text-to-SQL (§2.1), addressing typical aspects that define it beyond a natural language understanding or code generation task. Then, we discuss the impact of data contamination on this task and how our Termite serves as a measure against it, outlining several considerations that mitigate contamination risks (§2.2). Finally, in §2.3 we introduce the challenges that leverage our contribution through the Termite resource.</p>
        <sec id="sec-1-1-1">
          <title>2.1. The Task</title>
          <p>Text-to-SQL is a fundamental task within Natural Language Processing (NLP) that involves not only understanding natural language queries and generating corresponding SQL code, but also establishing a mapping between data expressed in natural language and data represented within the database schema. This requires the model to accurately link natural language terms with database structures such as tables, columns, and values, making it a more complex challenge than simple code generation or natural language understanding.</p>
          <p>
            This task is crucial in making relational database interactions more accessible to users who may not be familiar with SQL syntax. The foundational work was based on rule-based and heuristic approaches [1], inter alia. The actual automatic processing of Text-to-SQL pipelines became meaningful with the advent of neural network-based approaches. The shift towards neural models was facilitated by the introduction of resources such as Spider [
            <xref ref-type="bibr" rid="ref29">4</xref>
            ] and the more recent [17], which delivered various and complex natural language to SQL demonstrations.
          </p>
          <p>The most recent advancements in Text-to-SQL involve the use of Large Language Models (LLMs), which have demonstrated remarkable capabilities in handling various tasks without needing specific pre-training or fine-tuning tailored to each task.</p>
          <p>
            Gao et al. [
            <xref ref-type="bibr" rid="ref21">18</xref>
            ] and Pourreza and Rafiei [3] have shown that GPTs are effective Text-to-SQL coders on Spider, widely acknowledged as an effective benchmark for assessing performance on this specific task. On the same dataset, approaches that deconstruct the problem into smaller ones via in-context learning are actually examined [3].
          </p>
          <p>The emergence of LLMs as a key paradigm for the Text-to-SQL task has also led to a more in-depth study of various prompt engineering methods. These efforts aim to understand what best enhances a model’s performance in text-to-SQL translation. In [19], the performance of the GPT family is evaluated across different prompt scenarios, which vary based on how much information about the database is provided to the model for the translation process. Results show that providing a specific set of additional information significantly improves the model’s ability to generate accurate SQL queries [19].</p>
          <p>This last aspect highlights how LLMs appear to be behaviourally influenced by both the in-context prompt [20] and the text used during pre-training [11]. Consequently, if LLMs perform better on tasks with data that were already seen during the pre-training phase, we would face an issue of data contamination.</p>
        </sec>
        <sec id="sec-1-1-2">
          <title>2.2. Data Contamination in Modern Benchmarks</title>
          <p>Data contamination is an increasingly recognized challenge in the field of machine learning, with a growing number of studies dedicated to its investigation. Several recent studies, such as [21] and [22], have explored the issue of data contamination, proposing a comprehensive taxonomy of methods to detect and address it. Due to its nature, the text-to-SQL task is susceptible to overestimation issues, particularly related to data contamination. Therefore, a good practice when evaluating a model on this task is to ensure that there is no overlap between the test data and the pre-training data. On the other hand, this becomes challenging when dealing with closed-source models, where there is no clear knowledge of the pre-training data, such as in the case of the GPT family [23].</p>
          <p>Hence, taking inspiration from Golchin and Surdeanu [24] and Deng et al. [25], who treated the issue of Data Contamination in closed-source models, Ranaldi et al. [12] proposed a novel method for detecting Data Contamination applied to text-to-SQL. This consists in carefully comparing the model’s performance on a novel test set (such as Termite) with that on a well-known test set (such as Spider), whose content is suspected to have been exposed to the model’s pre-training data. The results showed that GPT models exhibit a drop in performance on Termite compared to Spider. Furthermore, it was observed that even perturbing Spider by removing information from the dump provided with the prompt had no significant impact on performance. The study of contaminated test sets continues to expand into other tasks, to the extent that an index of contaminated datasets [26] has been established.</p>
        </sec>
        <sec id="sec-1-1-3">
          <title>2.3. Termite</title>
          <p>Our contribution complements [12], in particular by introducing Termite. We aim to provide an Italian text-to-SQL dataset and a tool for analysing the contamination of the Spider data for LLMs. Indeed, the structural complexity of Termite mirrors that of the Spider test set. Moreover, to prevent data contamination from compromising its usefulness, it is freely accessible, but its content is not provided in a fully transparent form.</p>
          <p>In the following sections, we describe the composition of Termite in detail and provide a basic evaluation to facilitate usability and reproducibility. In addition, to encourage usability, we share the resources and code.</p>
        </sec>
      </sec>
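<p>The mapping described in §2.1 can be made concrete with a tiny worked instance. The following sketch is illustrative only: the schema, the natural language question, and the gold query are invented for this example, not drawn from Termite or Spider.</p>

```python
import sqlite3

# A toy Text-to-SQL instance: the schema, question, and gold query below
# are invented for this sketch, not taken from any dataset.
DUMP = """
CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT, team TEXT);
INSERT INTO players VALUES (1, 'Anna', 'red'), (2, 'Luca', 'red'), (3, 'Sara', 'blue');
"""

# NL question: "How many players are in each team?"
# Answering it requires linking "players" and "team" to schema elements:
GOLD_SQL = "SELECT team, COUNT(*) FROM players GROUP BY team ORDER BY team;"

def run(dump: str, query: str) -> list:
    """Build an in-memory database from the dump and execute the query."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(dump)
    try:
        return conn.execute(query).fetchall()
    finally:
        conn.close()

print(run(DUMP, GOLD_SQL))  # [('blue', 1), ('red', 2)]
```

<p>The executable gold query is exactly what makes execution-based evaluation (§4.2) possible: correctness can be checked by comparing query outputs rather than query strings.</p>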
    </sec>
    <sec id="sec-2">
      <title>3. Dataset</title>
      <p>Our main intent is to provide an evaluation resource for Text-to-SQL on data that is definitely unknown and, therefore, not present in well-known pre-training corpora. However, since several robust evaluation pipelines exist in the state of the art, the first step is understanding their structure and operation. Therefore, beyond the de-facto standard resources (§3.1), we introduce our Termite, conceived as a novel unseen Italian resource (§3.2).</p>
      <sec id="sec-2-0">
        <title>3.1. Spider: Characteristics and Content</title>
        <p>
          Among the best-known Text-to-SQL resources is Spider [
          <xref ref-type="bibr" rid="ref29">4</xref>
          ]. This resource is the de-facto standard for training and testing systems on the Text-to-SQL task.
        </p>
        <p>Spider appears as a collection of databases and associated sets of pairs of natural language (NL) questions and the corresponding SQL translations. Databases are structurally represented inside the dataset in the form of SQL dumps, which include the CREATE TABLE operations and a limited number of INSERT DATA operations for each table.</p>
        <p>
          NL questions are organized into four difficulty levels: EASY, MEDIUM, HARD, and EXTRA-HARD. For the definition of the hardness level, we refer to the categorization originally made in Spider [
          <xref ref-type="bibr" rid="ref29">4</xref>
          ]. The difficulty of an NL question is assessed by considering the corresponding SQL query. Hence, the difficulty is correlated with the number and kind of operations that the gold query contains: the presence of JOIN operations, aggregations, and WHERE conditions contributes to the hardness of the query. EASY queries do not involve more than one table. MEDIUM and HARD queries span multiple tables: MEDIUM queries contain only a JOIN or aggregation operation, whereas HARD queries are more complex in terms of both the number of JOINs and aggregations. Finally, EXTRA-HARD queries may contain nested queries and other operators like UNION and INTERSECT (more details are available on the official Spider repository).
        </p>
      </sec>
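<p>The categorization above can be approximated by counting surface patterns in the gold query. The following heuristic is an illustrative simplification for this discussion, not Spider’s official categorization script, which parses the SQL rather than pattern-matching it.</p>

```python
import re

def hardness(sql: str) -> str:
    """Rough approximation of the Spider-style hardness buckets described
    above, based on counting JOINs, aggregations, nesting, and set
    operators in the gold query (an illustrative simplification)."""
    q = sql.upper()
    joins = len(re.findall(r"\bJOIN\b", q))
    aggs = len(re.findall(r"\b(?:COUNT|SUM|AVG|MIN|MAX)\s*\(", q))
    nested = re.search(r"\(\s*SELECT\b", q) is not None
    set_ops = re.search(r"\b(?:UNION|INTERSECT|EXCEPT)\b", q) is not None
    if nested or set_ops:
        return "EXTRA-HARD"   # nested queries, UNION, INTERSECT, ...
    if joins + aggs >= 2:
        return "HARD"         # multiple JOINs and/or aggregations
    if joins + aggs == 1:
        return "MEDIUM"       # exactly one JOIN or aggregation
    return "EASY"             # single table, no special operations

print(hardness("SELECT COUNT(*) FROM partita"))  # MEDIUM
```

<p>Such a counter is enough to sanity-check the per-level balance of a new set of NL-SQL pairs against the definitions given above.</p>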
      <sec id="sec-2-1">
        <title>3.2. Termite: a Text-to-SQL Repository Made Invisible to Engines</title>
        <p>The driving idea for proposing a novel resource for the Text-to-SQL task is to reduce the possibility of boosting performance due to data contamination. Indeed, publicly available datasets are not suitable for this purpose. Even though novel datasets are made available, they are built from publicly open-access resources such as Kaggle or Wikipedia (this is the case for recently developed datasets like BIRD [7] or Spider itself). Hence, these do not guarantee that they are as new as required. The same issue may also be faced for hidden test sets. Moreover, since freely available datasets are easily accessed and tracked by engines, they are at risk of being contaminated in the near future, if they are not already contaminated.</p>
        <p>To address these challenges, we propose Termite (the repository is available here under the GPL-3.0 license; to access it, use the password "youshallnotpass"). Termite aims to be a permanently fresh dataset. Termite will be invisible to search engines since it is locked under an encryption key delivered along with the resource. This trick will reduce the accidental inclusion in a novel training set for commercial or research GPTs.</p>
        <p>Hence, by following the characteristics of Spider, Termite contains hand-crafted databases in different domains. Each database has a balanced set of NL-SQL query pairs: we defined an average of 5 queries per hardness level.</p>
        <p>The entire dataset was designed to be comparable to the Spider Validation Set, not only in terms of database characteristics such as size and table count (Table 1) but also in terms of query difficulty, which was measured using the same definition provided by Spider. Moreover, as in Spider, during the construction of Termite, we took care to write unambiguous, direct NL questions that can be solved by a model relying only on its linguistic proficiency and an analysis of the schema, with no external knowledge needed. The style adopted in the NL questions is plain and colloquial, in line with the style of Spider’s NL questions. Spider and Termite are also comparable in terms of the number of tables and columns in each dataset. We curated the column names to make them similar to the ones in Spider, using a similar percentage of abbreviations and compound names (see Table 1). This equivalence will be crucial to limit the influence of the dataset itself on the following evaluations and will be further explored in Section 4.2.</p>
        <p>Table 1 reports, for each dataset: #DB, avg #TABLES per DB, avg #COLUMNS per TABLE, #QUERY, avg #QUERY per DB, avg #FK/#COLUMNS per DB, avg #Compound/#COLUMNS per DB, and avg #Abbr/#COLUMNS per DB.</p>
        <p>However, there is a significant and fundamental difference between the two datasets, as Termite is not openly available on the web or easily retrievable, nor built on pre-existing openly available resources. This aspect is crucial because the way it is made available certainly reduces the risk of falling into the LM contamination index [26].</p>
      </sec>
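<p>A consumer of a key-locked resource of this kind first has to unlock it locally. As a minimal sketch, assuming the resource ships as a password-protected ZIP archive (an assumption of this illustration, not a statement about Termite’s actual packaging):</p>

```python
import zipfile

def unlock_dataset(archive: str, password: str, dest: str) -> list:
    """Extract a (possibly password-protected) ZIP archive to `dest` and
    return the sorted member names. The password is only consulted for
    members that are actually encrypted."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest, pwd=password.encode("utf-8"))
        return sorted(zf.namelist())
```

<p>For example, `unlock_dataset("termite.zip", "youshallnotpass", "termite/")`, where the archive and destination names are hypothetical. Note that the standard-library `zipfile` module only decrypts the legacy ZipCrypto scheme; AES-encrypted archives require a third-party package such as pyzipper.</p>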
      <sec id="sec-2-3">
        <title>3.3. Comparing Hardness of Termite vs. Spider</title>
        <p>When introducing a new dataset for benchmarking a particular task, it is important to ensure it aligns with the established and commonly used datasets within the community, to maintain consistency and comparability.</p>
        <p>Our Termite is designed to resemble Spider in terms of measurable aspects, like the number of columns and tables per database, as well as the lexicon used in the schema definition. However, it remains difficult to quantify via simple statistics how hard it is to understand how to translate a natural language question into an SQL statement.</p>
        <p>To compare the hardness of Termite and Spider, we therefore adopted a human-centered definition: if humans can translate questions into SQL queries on both Spider and Termite with the same level of challenge, then their hardness, at least for a SQL-proficient human annotator, is the same.</p>
        <p>Therefore, ten annotators were asked to judge the equivalence in terms of hardness of the SQL translations that compose Spider and Termite by examining a random sample of queries from both datasets. To measure the hardness of the two datasets, we designed a simple test. Given an Entity-Relationship schema of a database and a question in natural language, each annotator is asked to choose, among three options, the correct SQL translation of the question. The Appendix presents details on the construction of the test.</p>
        <p>On both Spider and Termite, taking as joint annotation the answer chosen by the majority of annotators leads to almost perfect classification (0.975 accuracy on Spider and maximum accuracy on Termite). The average accuracy per annotator is 0.91 (±0.05) on Spider and 0.94 (±0.07) on Termite. Moreover, Fleiss’s Kappa coefficients are rather high for both Spider and Termite (0.79 and 0.85, respectively). Hence, we can conclude that humans do not find one dataset more difficult than the other. The two datasets can then be considered equivalent in terms of the hardness of translations.</p>
      </sec>
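<p>Chance-corrected agreement figures of this kind can be reproduced with a small routine. Below is a minimal sketch of Fleiss’ kappa over an items-by-categories count matrix; the toy ratings in the example are invented, not the study’s data.</p>

```python
def fleiss_kappa(counts: list) -> float:
    """Fleiss' kappa for a matrix of per-item category counts.
    Each row is one test item; each cell is how many annotators chose
    that category; every row must sum to the same number of raters."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    # Per-item observed agreement.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_item) / n_items
    # Expected (chance) agreement from the category marginals.
    n_cats = len(counts[0])
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_cats)]
    p_e = sum(p * p for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)

# Perfect agreement of 3 raters on 2 items yields kappa = 1.0:
print(fleiss_kappa([[3, 0], [0, 3]]))  # 1.0
```

<p>Feeding the per-question answer counts collected from the ten annotators into such a routine yields the kappa coefficients reported above.</p>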
      <sec id="sec-2-5">
        <title>4. Methods</title>
        <p>Current evaluation pipelines exploit the behaviour of models by defining robust prompting strategies, since the generations delivered by these models are strongly correlated with the in-context structures [19]. Thus, in §4.1, we introduce the prompting technique for the Text-to-SQL task. Furthermore, in §4.2, we measure the hardness of the queries in Spider and Termite, and we define Execution Accuracy as the evaluation metric of choice for evaluating the model, as it offers a practical method for determining the correctness of SQL query generation within this framework.</p>
        <sec id="sec-2-5-1">
          <title>4.1. Prompting LLMs in Italian for Text-to-SQL Translation</title>
          <p>Given instructions in natural language, LLMs can translate the request into code (i.e., SQL queries) to answer the given request. Specifically, models for generating text have undergone training to process both natural language and code. As a result of the inputs they receive, these models produce text-based outputs. For this reason, it is possible to frame Text-to-SQL as a translation task: given a dump for a database and a query in natural language, the model is asked to translate the latter into the corresponding SQL query, referring to the tables and columns of the considered database. The desideratum is an executable query, semantically equivalent to a gold human-generated query. In the next paragraphs, we describe how GPT-3.5 (gpt-3.5-turbo) is prompted in order to obtain the translations.</p>
          <p>Text-to-SQL as a Translation Task. The OpenAI APIs enable interrogating a model in a multi-turn conversation format: chat models receive a series of messages as input and generate a message as output. We test the ability of GPT-3.5 on the Text-to-SQL task by framing each translation from natural language to SQL as a separate conversation. The proposed approach, aimed at analysing the model’s in-context learning abilities in zero-shot scenarios, is very similar to “Code Representation” [19] and has been specifically tested in Italian [9].</p>
          <p>In particular, the first message for a target database gives the model the dump of the database. In each dump, information about the database’s tables is provided by the CREATE TABLE statements. In the CREATE instructions, the constraints of the primary and foreign keys are also encoded. In addition, some realistic data to fill the tables are provided by INSERT instructions. Given the dump, the model answers by producing an interpretation of the dump. Typically, this model response contains an explanation of the dump’s contents. For example, considering the database bowling in the Termite dataset, the first messages in the conversation are the following:</p>
          <p>user: Considera il seguente database: CREATE TABLE "pista" [...]; CREATE TABLE "giocatori" [...]; (“Consider the following database: ...”)</p>
          <p>GPT-3.5: Questo database rappresenta una struttura per la gestione di un centro di bowling... (“This database represents a structure for managing a bowling centre...”)</p>
          <p>Then, given the dump and the model’s interpretation of it, a message containing the natural language question to be translated is sent. In particular, the selected prompt ensures that the model translates natural language questions into SQL queries with a limited amount of text that is not SQL. These steps are repeated for each question separately, to obtain translations independently. However, to ensure that the model’s understanding of each database is comparable across all questions, the database dump and the same interpretation initially produced by the model are sent as context, in the form of preceding messages, before each translation is requested. Hence, building from the previous example, a conversation to translate a question on the bowling database would be completed by the following messages:</p>
          <p>user: Traduci in SQL la seguente query. Rispondi usando esclusivamente linguaggio SQL. Conta il numero di giocatori per partita. (“Translate the following query into SQL. Answer using only SQL. Count the number of players per game.”)</p>
          <p>GPT-3.5: SELECT ora_inizio, tenuta_il, id_pista, COUNT(*) FROM 'partita' GROUP BY ora_inizio, tenuta_il, id_pista;</p>
          <p>Our approach is completely zero-shot, to minimize the effect that the prompt itself, rather than data contamination, can have on performance. Once the translation process is completed, the SQL code produced by the model is retrieved to evaluate whether or not the generated query satisfies the natural language question.</p>
        </sec>
      </sec>
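<p>The conversation protocol above can be sketched as follows. This is a minimal illustration following the shape of the OpenAI Chat Completions API; the helper names and the truncated dump strings are assumptions of this sketch, not the paper’s actual code.</p>

```python
def build_conversation(dump: str, interpretation: str, question: str) -> list:
    """Rebuild the fixed context (database dump plus the model's own earlier
    interpretation of it) and append one natural language question,
    mirroring the message protocol described in Section 4.1."""
    return [
        {"role": "user", "content": f"Considera il seguente database:\n{dump}"},
        {"role": "assistant", "content": interpretation},
        {
            "role": "user",
            "content": "Traduci in SQL la seguente query. Rispondi usando "
                       f"esclusivamente linguaggio SQL. {question}",
        },
    ]

def translate(client, dump: str, interpretation: str, question: str) -> str:
    """One independent, zero-shot conversation per question.
    `client` is an `openai.OpenAI` instance (Chat Completions API)."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=build_conversation(dump, interpretation, question),
    )
    return response.choices[0].message.content
```

<p>Keeping the context construction separate from the API call makes the protocol easy to inspect: the same dump and the same cached interpretation are replayed before every question, so the model’s view of the database is identical across translations.</p>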
      <sec id="sec-2-7">
        <title>4.2. Measuring Hardness of Queries in Spider and Termite</title>
        <p>We need to ensure that Spider and Termite are comparable in hardness. Termite is designed with a similar annotation protocol; however, a similarity in terms of the hardness of the natural language questions used is hard to quantify. For this reason, we asked 10 SQL-proficient annotators to perform a simple yet effective test to measure how difficult it is for them to translate questions both from Spider and from Termite. The main idea is that if they can translate both Spider and Termite questions with the same accuracy level, then the challenge level is similar on both datasets.</p>
        <p>In particular, given an E-R database schema and a natural language utterance, each test question asks the annotator to choose from three SQL query options that satisfy the request. All three options are syntactically correct SQL queries, but the incorrect answers are semantically different from the correct one. The authors designed the first incorrect option by perturbing the correct answer: removing or replacing some operations or retrieved columns and changing the field and table names with non-matching ones. The second incorrect answer is another query extracted from the same dataset as the correct one. The selected query is the most similar to the correct one under the Bag-of-Words assumption: to retrieve this option, the similarity of two queries is measured via the cosine similarity of their BoW vector representations.</p>
        <p>The complete test is composed of 20 randomly selected queries from each dataset. Hence, the resulting 40 questions were shared with 10 SQL-proficient annotators: 60% of them are Computer Science Master students, and the remaining are already graduated. Five annotators work in a field that requires daily use of the SQL query language. Finally, we divided the test into two trials of 20 queries each and administered it to the annotators at two different times, to limit errors due to gradual loss of concentration.</p>
        <p>
          Execution Accuracy: the Evaluation Metric. The evaluation metric adopted is execution accuracy, introduced by Yu et al. [
          <xref ref-type="bibr" rid="ref29">4</xref>
          ], which assesses the correctness of the generated SQL query by executing it against the database and comparing the result with the expected output.
        </p>
        <p>The Execution Accuracy (EA) can be formally defined as follows. Let q represent the gold query and q̂ represent the generated query. The execution accuracy compares the execution results of q and q̂ on a database D: EA(q, q̂, D) = 1 if q(D) = q̂(D), and EA(q, q̂, D) = 0 if q(D) ≠ q̂(D), where q(D) and q̂(D) represent the outputs of the queries on D. Execution accuracy is therefore 1 if the results are the same and 0 otherwise. In case of syntactic errors in the generated SQL query, it is considered definitively incorrect, as adherence to SQL grammar is part of the model’s evaluation.</p>
        <p>
          The execution accuracy metric is prone to false positives, as two different queries can return the same output under specific database record configurations. For this reason, in [12], the Test Suite Accuracy metric is adopted. Test Suite Accuracy, introduced in Zhong et al. [27], essentially involves performing execution accuracy on the same query across many randomly generated database record configurations, called a Test Suite. In this paper, we propose EA as the evaluation metric because the way queries and database records are designed in Termite aims to minimize the occurrence of false positives. Additionally, to encourage experimentation with Termite, we recommend initially employing simple and computationally inexpensive evaluation metrics, in contrast to Test Suite Accuracy. Moreover, we suggest disregarding the query difficulty evaluation metric proposed by [
          <xref ref-type="bibr" rid="ref29">4</xref>
          ].
        </p>
        <p>Hence, an automated script, available at link, evaluates generated SQL queries using Execution Accuracy as the metric. It can be run locally, as it is a lightweight program that executes queries on an SQL server and processes the output as our metric requires.</p>
      </sec>
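<p>The metric above can be sketched directly on an in-memory database. The following is a minimal illustration that uses SQLite instead of the SQL server mentioned above, with invented toy data; it also simplifies result comparison to ordered row lists.</p>

```python
import sqlite3

def execution_accuracy(dump: str, gold_sql: str, pred_sql: str) -> int:
    """EA(q, q-hat, D): build D from `dump`, run the gold and predicted
    queries, and return 1 iff both produce the same result. A prediction
    that fails to parse or execute counts as 0, as described above."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(dump)
        gold = conn.execute(gold_sql).fetchall()
        try:
            pred = conn.execute(pred_sql).fetchall()
        except sqlite3.Error:
            # Syntactic (or other execution) errors: definitively incorrect.
            return 0
        # Simplification: compares ordered row lists; without an ORDER BY,
        # two semantically equal queries could still differ in row order.
        return int(gold == pred)
    finally:
        conn.close()
```

<p>For example, a lowercase reformulation of the gold query scores 1, while a query with a typo in a keyword scores 0; averaging these 0/1 outcomes over a database’s question set yields the per-database EA_SCORE reported in §5.</p>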
    </sec>
    <sec id="sec-3">
      <title>5. Experiments</title>
      <p>Our Termite aims to extend the Text-to-SQL evaluation pipeline to Italian while preserving data integrity and thus preventing possible contamination. To prove its operability, we propose a baseline assessment in §5.1 and discuss the obtained results in §5.2.</p>
      <sec id="sec-3-0">
        <title>5.1. Experimental Setup</title>
        <p>We systematically evaluated the performance of GPT-3.5 (gpt-3.5-turbo-16k) on the Termite dataset for the Text-to-SQL task. We employed the API to generate SQL translations for each query in the dataset. To ensure consistency in the results, we set the temperature parameter to 1, allowing for greater flexibility and diversity in the model’s output.</p>
        <p>For each natural language query, a translation request was sent to the model. The generated SQL query was then saved and subsequently processed according to the aforementioned metric (§4.2).</p>
      </sec>
      <sec id="sec-3-0b">
        <title>5.2. Baseline Results</title>
        <p>The results achieved in the baseline assessment reveal the intrinsic challenges of text-to-SQL task performance. Table 2 reports the Execution Accuracy percentages (EA_SCORE (%)) achieved by GPT-3.5 on each of the 10 databases that compose our Termite. It can be observed that an acceptable accuracy, significantly exceeding 50%, is only seen for the "farma" and "galleria" databases, where 62% and 69% accuracy were achieved, respectively.</p>
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption>
            <p>Execution Accuracy (EA_SCORE (%)) achieved by GPT-3.5 and number of queries for each database.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Database Name</th><th>EA_SCORE (%)</th><th>Queries</th></tr>
            </thead>
            <tbody>
              <tr><td>bowling</td><td>50.79</td><td>24</td></tr>
              <tr><td>centri</td><td>56.25</td><td>19</td></tr>
              <tr><td>coronavirus</td><td>40.00</td><td>20</td></tr>
              <tr><td>farma</td><td>62.50</td><td>20</td></tr>
              <tr><td>farmacia</td><td>50.00</td><td>20</td></tr>
              <tr><td>galleria</td><td>69.15</td><td>23</td></tr>
              <tr><td>hackathon</td><td>46.25</td><td>19</td></tr>
              <tr><td>pratica</td><td>50.11</td><td>22</td></tr>
              <tr><td>recensioni</td><td>20.00</td><td>18</td></tr>
              <tr><td>voli</td><td>56.25</td><td>17</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-3-0c">
        <title>6. Limitations &amp; Future Works</title>
        <p>The idea of Termite is to propose a new resource conceived and realized for the Italian language. During the discussion of the contribution, we introduced the underlying motivations that support our choices regarding encryption and baseline evaluations.</p>
        <p>However, we plan to extend our contribution to languages beyond Italian in future developments. We also aim to propose efficient alignment techniques to enable smaller models to cope with more demanding tasks, such as text-to-SQL, by adopting teacher-student alignment techniques [28, 29].</p>
      </sec>
      <sec id="sec-3-1">
        <title>7. Conclusions</title>
        <p>We have introduced Termite, a resource that, to the best of our knowledge, is unique in that the databases and queries were natively conceived in Italian. Its structural alignment with well-known datasets like Spider makes it a solid benchmarking tool for analysing Text-to-SQL results when the test set languages differ.</p>
        <p>Additionally, its uniqueness lies in the fact that it is not publicly accessible by search engines, making it less exposed to the increasingly prominent issue of data contamination, particularly when dealing with closed-source large language models.</p>
        <p>Extending Termite to include queries whose complexity is driven not only by the SQL query itself but also by tasks such as commonsense and arithmetic reasoning would further enrich the dataset. This is in line with approaches like those seen in Archer [30], which address these additional challenges.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Acknowledgments</title>
        <p>
          We would like to express our gratitude to the Human-Centric ART team for their valuable collaboration in the creation of the Termite dataset. Special thanks go to the annotators, whose work was essential in affirming the comparability between Termite and Spider. Finally, we extend our appreciation to the Computer Science students of the University of Rome Tor Vergata for providing the original hand-crafted databases, which were subsequently the subject of extensive reworking and refinement.
ation for Computational Linguistics: NAACL 2024, marks for large language models, 2024. URL: https:
Association for Computational Linguistics, Mex- //arxiv.org/abs/2311.09783. arXiv:2311.09783.
ico City, Mexico, 2024, pp. 1229–1241. URL: https: [26] Contaminated datasets index, https://hitz-zentroa.
//aclanthology.org/2024.findings-naacl.78. doi: 10. github.io/lm-contamination/, 2023. Accessed:
202418653/v1/2024.findings-naacl.78. 09-23.
[16] L. Ranaldi, G. Pucci, A. Freitas, Empowering cross- [27] R. Zhong, T. Yu, D. Klein, Semantic evaluation for
lingual abilities of instruction-tuned large language text-to-SQL with distilled test suites, in: B. Webber,
models by translation-following demonstrations, in: T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020
L.-W. Ku, A. Martins, V. Srikumar (Eds.), Findings Conference on Empirical Methods in Natural
Lanof the Association for Computational Linguistics guage Processing (EMNLP), Association for
ComACL 2024, Association for Computational Linguis- putational Linguistics, Online, 2020, pp. 396–411.
tics, Bangkok, Thailand and virtual meeting, 2024, URL: https://aclanthology.org/2020.emnlp-main.29.
pp. 7961–7973. URL: https://aclanthology.org/2024. doi:10.18653/v1/2020.emnlp-main.29.
ifndings-acl.473. [28] L. Ranaldi, A. Freitas, Aligning large and small
[17] J. Li, B. Hui, G. Qu, J. Yang, B. Li, B. Li, B. Wang, language models via chain-of-thought reasoning,
B. Qin, R. Cao, R. Geng, N. Huo, X. Zhou, C. Ma, in: Y. Graham, M. Purver (Eds.), Proceedings of the
G. Li, K. C. C. Chang, F. Huang, R. Cheng, Y. Li, 18th Conference of the European Chapter of the
Can llm already serve as a database interface? a Association for Computational Linguistics (Volume
big bench for large-scale database grounded text- 1: Long Papers), Association for Computational
to-sqls, 2023. URL: https://arxiv.org/abs/2305.03111. Linguistics, St. Julian’s, Malta, 2024, pp. 1812–1827.
arXiv:2305.03111. URL: https://aclanthology.org/2024.eacl-long.109.
[
          <xref ref-type="bibr" rid="ref21">18</xref>
          ] D. Gao, H. Wang, Y. Li, X. Sun, Y. Qian, B. Ding, [29] L. Ranaldi, G. Pucci, F. M. Zanzotto, Modeling
easJ. Zhou, Text-to-sql empowered by large lan- iness for training transformers with curriculum
guage models: A benchmark evaluation, 2023. learning, in: R. Mitkov, G. Angelova (Eds.),
ProarXiv:2308.15363. ceedings of the 14th International Conference on
[19] D. Gao, H. Wang, Y. Li, X. Sun, Y. Qian, B. Ding, Recent Advances in Natural Language Processing,
J. Zhou, Text-to-sql empowered by large language INCOMA Ltd., Shoumen, Bulgaria, Varna, Bulgaria,
models: A benchmark evaluation, 2023. URL: https: 2023, pp. 937–948. URL: https://aclanthology.org/
//arxiv.org/abs/2308.15363. arXiv:2308.15363. 2023.ranlp-1.101.
[20] L. Ranaldi, G. Pucci, When large language models [30] D. Zheng, M. Lapata, J. Z. Pan, Archer: A
contradict humans? large language models’ syco- human-labeled text-to-sql dataset with
arithphantic behaviour, 2024. URL: https://arxiv.org/abs/ metic, commonsense and hypothetical
reason2311.09410. arXiv:2311.09410. ing, 2024. URL: https://arxiv.org/abs/2402.12554.
[21] C. Deng, Y. Zhao, Y. Heng, Y. Li, J. Cao, X. Tang, arXiv:2402.12554.
        </p>
        <p>A. Cohan, Unveiling the spectrum of data
contamination in language model: A survey from
detection to remediation, in: L.-W. Ku, A. Martins,
V. Srikumar (Eds.), Findings of the Association for
Computational Linguistics ACL 2024, Association
for Computational Linguistics, Bangkok, Thailand
and virtual meeting, 2024, pp. 16078–16092. URL:
https://aclanthology.org/2024.findings-acl.951.
[22] M. Ravaut, B. Ding, F. Jiao, H. Chen, X. Li, R. Zhao,</p>
        <p>C. Qin, C. Xiong, S. Joty, How much are large
language models contaminated? a comprehensive
survey and the llmsanitize library, 2024. URL: https:
//arxiv.org/abs/2404.00699. arXiv:2404.00699.
[23] OpenAI, Gpt’s family, 2023. URL: https://platform.</p>
        <p>openai.com/docs/models.
[24] S. Golchin, M. Surdeanu, Time travel in llms:
Tracing data contamination in large language
models, 2024. URL: https://arxiv.org/abs/2308.08493.</p>
        <p>arXiv:2308.08493.
[25] C. Deng, Y. Zhao, X. Tang, M. Gerstein, A. Cohan,</p>
        <p>Investigating data contamination in modern
bench</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] A. Giordani, A. Moschitti, Translating questions to SQL queries with generative parsers discriminatively reranked, in: Proceedings of COLING 2012: Posters, The COLING 2012 Organizing Committee, Mumbai, India, 2012, pp. 401–410. URL: https://aclanthology.org/C12-2040.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] T. Scholak, N. Schucher, D. Bahdanau, PICARD: Parsing incrementally for constrained auto-regressive decoding from language models, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 9895–9901. URL: https://aclanthology.org/2021.emnlp-main.779. doi:10.18653/v1/2021.emnlp-main.779.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] M. Pourreza, D. Rafiei, DIN-SQL: Decomposed in-context learning of text-to-SQL with self-correction, in: Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL: https://openreview.net/forum?id=p53QDxSIc5.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, Z. Zhang, D. Radev, Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 3911–3921. URL: https://aclanthology.org/D18-1425. doi:10.18653/v1/D18-1425.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] L. Dou, Y. Gao, M. Pan, D. Wang, W. Che, D. Zhan, J.-G. Lou, MultiSpider: Towards benchmarking multilingual text-to-SQL semantic parsing, 2022. URL: https://arxiv.org/abs/2212.13492. arXiv:2212.13492.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] J. Li, B. Hui, G. Qu, J. Yang, B. Li, B. Li, B. Wang, B. Qin, R. Geng, N. Huo, X. Zhou, C. Ma, G. Li, K. Chang, F. Huang, R. Cheng, Y. Li, Can LLM already serve as a database interface? A BIg bench for large-scale database grounded text-to-SQLs, in: Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL: https://openreview.net/forum?id=dI4wzAE6uV.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] J. Li, B. Hui, G. Qu, J. Yang, B. Li, B. Li, B. Wang, B. Qin, R. Cao, R. Geng, N. Huo, X. Zhou, C. Ma, G. Li, K. C. C. Chang, F. Huang, R. Cheng, Y. Li, Can LLM already serve as a database interface? A big bench for large-scale database grounded text-to-SQLs, 2023. arXiv:2305.03111.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] I. Magar, R. Schwartz, Data contamination: From memorization to exploitation, 2022. arXiv:2203.08242.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] F. Ranaldi, et al., …language for text-to-SQL translation, in: Proceedings of CLiC 2023, 2023.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] L. Ranaldi, A. Nourbakhsh, E. S. Ruzzetti, A. Patrizi, D. Onorati, M. Mastromattei, F. Fallucchi, F. M. Zanzotto, The dark side of the language: Pre-trained transformers in the DarkNet, in: R. Mitkov, G. Angelova (Eds.), Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, INCOMA Ltd., Shoumen, Bulgaria, Varna, Bulgaria, 2023, pp. 949–960. URL: https://aclanthology.org/2023.ranlp-1.102.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] L. Ranaldi, E. S. Ruzzetti, F. M. Zanzotto, PreCog: Exploring the relation between memorization and performance in pre-trained language models, in: R. Mitkov, G. Angelova (Eds.), Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, INCOMA Ltd., Shoumen, Bulgaria, Varna, Bulgaria, 2023, pp. 961–967. URL: https://aclanthology.org/2023.ranlp-1.103.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] F. Ranaldi, E. S. Ruzzetti, D. Onorati, L. Ranaldi, C. Giannone, A. Favalli, R. Romagnoli, F. M. Zanzotto, Investigating the impact of data contamination of large language models in text-to-SQL translation, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Findings of the Association for Computational Linguistics ACL 2024, Association for Computational Linguistics, Bangkok, Thailand and virtual meeting, 2024, pp. 13909–13920. URL: https://aclanthology.org/2024.findings-acl.827.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Francis, et al., CALAMITA: Challenging the Abilities of LAnguage Models in ITAlian, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 – December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] L. Ranaldi, G. Pucci, Does the English matter? Elicit cross-lingual abilities of large language models, in: D. Ataman (Ed.), Proceedings of the 3rd Workshop on Multilingual Representation Learning (MRL), Association for Computational Linguistics, Singapore, 2023, pp. 173–183. URL: https://aclanthology.org/2023.mrl-1.14. doi:10.18653/v1/2023.mrl-1.14.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] L. Ranaldi, G. Pucci, F. Ranaldi, E. S. Ruzzetti, F. M. Zanzotto, A Tree-of-Thoughts to broaden multi-step reasoning across languages, in: K. Duh, H. Gomez, S. Bethard (Eds.), Findings of the Association for Computational Linguistics: NAACL 2024, Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 1229–1241. URL: https://aclanthology.org/2024.findings-naacl.78. doi:10.18653/v1/2024.findings-naacl.78.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>