=Paper= {{Paper |id=Vol-3878/130_calamita_long |storemode=property |title=Termite Italian Text-to-SQL: A CALAMITA Challenge |pdfUrl=https://ceur-ws.org/Vol-3878/130_calamita_long.pdf |volume=Vol-3878 |authors=Federico Ranaldi,Elena Sofia Ruzzetti,Dario Onorati,Fabio Massimo Zanzotto,Leonardo Ranaldi |dblpUrl=https://dblp.org/rec/conf/clic-it/RanaldiROZR24 }} ==Termite Italian Text-to-SQL: A CALAMITA Challenge== https://ceur-ws.org/Vol-3878/130_calamita_long.pdf
Termite Italian Text-to-SQL: A CALAMITA Challenge

Federico Ranaldi1,*,†, Elena Sofia Ruzzetti1, Dario Onorati3, Fabio Massimo Zanzotto1 and Leonardo Ranaldi1,2

1 Human-Centric ART, University of Rome Tor Vergata, Italy.
2 School of Informatics, University of Edinburgh, UK.
3 University of Rome La Sapienza, Italy.


Abstract

Relational databases play an important role in business, science, and beyond. However, the operability of relational databases is restricted to users familiar with specific languages such as SQL, which limits the analytical power they could deliver. Although earlier techniques, such as large-scale Text-to-SQL datasets, have been proposed to automatically generate SQL from natural language, they are predominantly built in English and are automatically constructed from surface web data. This limits evaluation and use in settings beyond English, and it also limits fair assessment: given the origin of the datasets, the data may already have been seen in pre-training corpora.

In this work, we introduce Termite, a definitely-unseen resource for evaluating Text-to-SQL in Italian. Specifically, we transfer evaluation pipelines beyond English, proposing novel, definitely-unseen resources that avoid data-contamination phenomena while assessing the ability of models to perform Text-to-SQL tasks when natural language queries are written in Italian. We establish an evaluation grid based on execution accuracy. Our code and datasets are available at link.

Keywords
Text-to-SQL, Italian LLMs, CALAMITA, CLiC-it



1. Introduction

Text-to-SQL is an important NLP task that maps input questions to meaningful, executable SQL queries, enabling users to interact with databases in a more intuitive and user-friendly way. Despite the substantial number of state-of-the-art systems [1, 2, 3] and benchmarks [4, 5, 6] for Text-to-SQL, most of them are in English, which limits their operability for non-English users.

Dou et al. [5] proposed extensions of Spider [4] beyond English. These still show significant limitations because the language-specific resources were generated from automatic translations, and only for a few languages. On the other hand, publicly released resources could be translated and adapted to the Text-to-SQL task, but, as they are often publicly available (e.g., Kaggle or Wikipedia, as in the case of [4, 7]), they could be a vehicle of contamination. Indeed, portions of these resources are included in the huge corpora employed for the pre-training phases of large language models (LLMs), i.e., the data-contamination phenomenon [8, 9, 10, 11, 12].

To tackle these problems, in the context of CALAMITA [13] we propose Termite (Text-to-SQL Repository Made Invisible to Engines), a novel Text-to-SQL resource created and conceived for Italian. We aim to reduce the possibility of increased performance due to data contamination while proposing a resource suited to a specific language. In fact, in contrast to native English benchmark translation methods, Termite is designed to be used as an assessment pipeline, ensuring that it remains a resource not exposed to search engines: it is locked by an encryption key distributed with the dataset, reducing its accidental inclusion in the training set of a new commercial or research LLM.

Termite is structurally designed to resemble Spider. However, it complements Spider's extensions into other languages by proposing a series of databases originally hand-crafted in Italian. Specifically, part of the Termite content comes from a thorough reworking of databases initially designed by students at the University of Rome Tor Vergata. This aspect, enriched by the invisibility to search engines, makes Termite a valuable resource for evaluating models on a practical and theoretically significant task.

Moreover, evaluating Text-to-SQL models in languages beyond English is essential for broadening their practical use and understanding their linguistic behavior. Assessing how these models handle the same problem presented in different languages is critical for gaining insights into their adaptability and consistency across multilingual contexts [9, 14, 15, 16].

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
federico.ranaldi99@gmail.com (F. Ranaldi)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2. Background

In this section, we provide a formal problem definition of Text-to-SQL (§2.1), addressing typical aspects that define it beyond a natural language understanding or code generation problem. Then, we discuss the potential impact




of data contamination on this task and how our Termite serves as a measure against it, outlining several considerations that mitigate contamination risks (§2.2). Finally, in §2.3 we introduce the challenges that leverage our contribution through the Termite resource.

2.1. The Task

Text-to-SQL is a fundamental task within Natural Language Processing (NLP) that involves not only understanding natural language queries and generating corresponding SQL code, but also establishing a mapping between data expressed in natural language and data represented within the database schema. This requires the model to accurately link natural language terms with database structures such as tables, columns, and values, making it a more complex challenge than simple code generation or natural language understanding.

This task is crucial in making relational database interactions more accessible to users who may not be familiar with SQL syntax. The foundational work was based on rule-based and heuristic approaches [1], inter alia. The automatic processing of Text-to-SQL pipelines became meaningful with the advent of neural network-based approaches. The shift towards neural models was facilitated by the introduction of resources such as Spider [4] and the more recent [17], which delivered varied and complex natural-language-to-SQL demonstrations.

The most recent advancements in Text-to-SQL involve the use of Large Language Models (LLMs), which have demonstrated remarkable capabilities in handling various tasks without needing specific pre-training or fine-tuning tailored to each task.

Gao et al. [18] and Pourreza and Rafiei [3] showed that GPTs are effective Text-to-SQL coders on Spider, widely acknowledged as an effective benchmark for assessing performance in this specific task. On the same dataset, approaches that decompose the problem into smaller ones via in-context learning have also been examined [3].

The emergence of LLMs as a key paradigm for the Text-to-SQL task has also led to a more in-depth study of various prompt-engineering methods. These efforts aim to understand what best enhances a model's performance in Text-to-SQL translation. In [19], the performance of the GPT family is evaluated across different prompt scenarios, which vary based on how much information about the database is provided to the model for the translation process. Results show that providing a specific set of additional information significantly improves the model's ability to generate accurate SQL queries [19].

This last aspect highlights how LLMs appear to be behaviourally influenced by both the in-context prompt [20] and the text used during pre-training [11]. Consequently, if LLMs perform better on tasks with data that were already seen during the pre-training phase, we would face an issue of data contamination.

2.2. Data Contamination in Modern Benchmarks

Data contamination is an increasingly recognized challenge in the field of machine learning, with a growing number of studies dedicated to its investigation. Several recent studies, such as [21] and [22], have explored the issue of data contamination, proposing a comprehensive taxonomy of methods to detect and address it. Due to its nature, the Text-to-SQL task is susceptible to overestimation issues, particularly those related to data contamination. Therefore, a good practice when evaluating a model on this task is to ensure that there is no overlap between the test data and the pre-training data. On the other hand, this becomes challenging when dealing with closed-source models, where there is no clear knowledge of the pre-training data, as in the case of the GPT family [23].

Hence, taking inspiration from Golchin and Surdeanu [24] and Deng et al. [25], who treated the issue of data contamination in closed-source models, Ranaldi et al. [12] proposed a novel method for detecting data contamination applied to Text-to-SQL. It consists of carefully comparing the model's performance on a novel test set (such as Termite) with that on a well-known test set (such as Spider) whose content is suspected to have been exposed to the model's pre-training data. The results showed that GPT models exhibit a drop in performance on Termite compared to Spider. Furthermore, it was observed that even perturbing Spider by removing information from the dump provided with the prompt had no significant impact on performance. The study of contamination in test sets continues to expand into other tasks, to the extent that an index of contaminated datasets [26] has been established.

2.3. Termite

Our contribution complements [12], in particular by introducing Termite. We aim to provide an Italian Text-to-SQL dataset and a tool for analysing the contamination of Spider data in LLMs. Indeed, the structural complexity of Termite mirrors that of the Spider test set. Moreover, to prevent data contamination from compromising its usefulness, it is freely accessible, but its content is not provided in a fully transparent form.

In the following sections, we describe the composition of Termite in detail and provide a basic evaluation to facilitate usability and reproducibility. In addition, to encourage usability, we share the resources and code.
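The detection recipe of §2.2, comparing a model's score on a well-known benchmark against its score on a definitely-unseen one, can be sketched as follows. This is a minimal illustration: the helper names, the accuracy values in the usage note, and the 0.05 tolerance are our own assumptions, not figures from [12].

```python
def contamination_gap(acc_known: float, acc_unseen: float) -> float:
    """Drop in execution accuracy when moving from a well-known test set
    (e.g., Spider) to a definitely-unseen one (e.g., Termite)."""
    return acc_known - acc_unseen


def possibly_contaminated(acc_known: float, acc_unseen: float,
                          tolerance: float = 0.05) -> bool:
    """Flag a suspicious performance drop. The default tolerance is an
    illustrative choice, not a value prescribed by the method."""
    return contamination_gap(acc_known, acc_unseen) > tolerance
```

For instance, `possibly_contaminated(0.72, 0.55)` would raise a flag. A large positive gap is only a signal, not a proof: §2.2 pairs it with prompt perturbations on the known benchmark to strengthen the diagnosis.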
3. Dataset

Our main intent is to provide an evaluation resource for Text-to-SQL on data that is definitely unknown and, therefore, not present in well-known pre-training corpora. However, since several robust evaluation pipelines exist in the state of the art, the first step is understanding their structure and operation. Therefore, beyond the de-facto standard resources (§3.1), we introduce our Termite, conceived as a novel unseen Italian resource (§3.2).

3.1. Spider: Characteristics and Content

Among the best-known Text-to-SQL resources is Spider [4]. This resource is the de-facto standard for training and testing systems on the Text-to-SQL task.

Spider is a collection of databases and associated sets of pairs of natural language (NL) questions and the corresponding SQL translations. Databases are structurally represented inside the dataset in the form of SQL dumps, which include the CREATE TABLE operations and a limited number of INSERT operations for each table.

NL questions are organized into four difficulty levels: EASY, MEDIUM, HARD, and EXTRA-HARD. For the definition of the hardness level, we refer to the categorization originally made in Spider [4]. The difficulty of an NL question is assessed by considering the corresponding SQL query. Hence, the difficulty is correlated with the number and kind of operations that the gold query contains: the presence of JOIN operations, aggregations, and WHERE conditions contributes to the hardness of the query. EASY queries do not involve more than one table. MEDIUM and HARD queries span multiple tables: MEDIUM queries contain only a JOIN or aggregation operation, whereas HARD queries are more complex in terms of the number of JOINs and aggregations. Finally, EXTRA-HARD queries may contain nested queries and other operators like UNION and INTERSECT.¹

3.2. Termite: a Text-to-SQL Repository Made Invisible to Engines

The driving idea behind proposing a novel resource for the Text-to-SQL task is to reduce the possibility of boosting performance due to data contamination. Indeed, publicly available datasets are not suitable for this purpose. Even when novel datasets are made available, they are built from publicly open-access resources such as Kaggle or Wikipedia (this is the case for recently developed datasets like BIRD [7], or Spider itself). Hence, these do not guarantee that they are as new as required. The same issue may also affect hidden test sets. Moreover, since freely available datasets are easily accessed and tracked by engines, they are at risk of being contaminated in the near future, if they are not already contaminated.

To address these challenges, we propose Termite.² Termite aims to be a permanently fresh dataset. Termite will be invisible to search engines since it is locked under an encryption key delivered along with the resource. This reduces the risk of accidental inclusion in a novel training set for commercial or research GPTs.

Hence, following the characteristics of Spider, Termite contains hand-crafted databases in different domains. Each database has a balanced set of NL-SQL query pairs: we defined an average of 5 queries per hardness level. The entire dataset was designed to be comparable to the Spider Validation Set, not only in terms of database characteristics such as size and table count (Table 1) but also in terms of query difficulty, which was measured using the same definition provided by Spider. Moreover, as in Spider, during the construction of Termite, we took care to write unambiguous, direct NL questions that can be solved by a model relying only on its linguistic proficiency and an analysis of the schema, with no external knowledge needed. The style adopted in the NL questions is plain and colloquial, in line with the style of Spider's NL questions. Spider and Termite are also comparable in terms of the number of tables and columns in each dataset. We curated the column names to make them similar to the ones in Spider, using a similar percentage of abbreviations and compound names (see Table 1). This equivalence will be crucial to limit the influence of the dataset itself on the following evaluations and will be further explored in Section 4.2.

However, there is a significant and fundamental difference between the two datasets, as Termite is neither openly available on the web or easily retrievable, nor built on pre-existing openly available resources. This aspect is crucial because the way it is made available considerably reduces the risk of its falling into the LM contamination index [26].

¹ More details are available on the official Spider repository.
² The repository is available here under the GPL-3.0 license. To access it, use the password "youshallnotpass".

3.3. Comparing Hardness of Termite vs. Spider

When introducing a new dataset for benchmarking a particular task, it is important to ensure it aligns with the established and commonly used datasets within the community, to maintain consistency and comparability.

Our Termite is designed to resemble Spider in terms of measurable aspects, like the number of columns and tables per database, as well as the lexicon used in the schema definition. However, it remains difficult to quantify via simple statistics how hard it is to understand
how to translate a natural language question into an SQL statement.

To compare the hardness of Termite and Spider, we adopted a human-centered definition: if humans can translate questions into SQL queries on both Spider and Termite with the same level of challenge, then their hardness, at least for a SQL-proficient human annotator, is the same.

Therefore, ten annotators were asked to judge the equivalence in terms of hardness of the SQL translations that compose Spider and Termite by examining a random sample of queries from both datasets.

To measure the hardness of the two datasets, we designed a simple test. Given an Entity-Relationship schema of a database and a question in natural language, each annotator is asked to choose, among three options, the correct SQL translation of the question. Appendix ?? presents details on the construction of the test.

On both Spider and Termite, taking as joint annotation the answer chosen by the majority of annotators leads to almost perfect classification (0.975 accuracy on Spider and maximum accuracy on Termite). The average accuracy per annotator is 0.91 (±0.05) on Spider and 0.94 (±0.07) on Termite. Moreover, Fleiss's Kappa coefficients are rather high (0.79 and 0.85, respectively) for Spider and Termite. Hence, we can conclude that humans do not find one dataset more difficult than the other. The two datasets can then be considered equivalent in terms of the hardness of translations.

                                 Spider   Termite
  #DB                                20        10
  avg #TABLES per DB                4.2       4.0
  avg #COLUMNS per TABLE           5.46      5.56
  #QUERY                           1035       202
  avg #QUERY per DB               51.75      20.2
  avg #FK/#COLUMNS per DB          0.16      0.13
  avg #Compound/#COLUMNS per DB    0.63      0.51
  avg #Abbr/#COLUMNS per DB        0.10      0.12

Table 1: Spider and Termite fact sheet. Termite is designed to be comparable to the validation set of Spider.

4. Methods

Current evaluation pipelines probe the behaviour of models by defining robust prompting strategies, since the generations these models deliver are strongly correlated with the in-context structures [19].

Thus, in §4.1, we introduce the prompting technique suggested for an initial exploration of Termite on the Text-to-SQL task. Furthermore, in §4.2, we define Execution Accuracy as the evaluation metric of choice for evaluating the model, as it offers a practical method for determining the correctness of SQL query generation within this framework.

4.1. Prompting LLMs in Italian for Text-to-SQL Translation

Given instructions in natural language, LLMs can translate the request into code (i.e., SQL queries) that answers the given request. Specifically, models for generating text have undergone training to process both natural language and code. As a result of the inputs they receive, these models produce text-based outputs. For this reason, it is possible to frame Text-to-SQL as a translation task: given a dump of a database and a query in natural language, the model is asked to translate the latter into the corresponding SQL query, referring to tables and columns in the considered database. The desideratum is an executable query, semantically equivalent to a gold human-generated query. In the next paragraphs, we describe how GPT-3.5 (gpt-3.5-turbo) is prompted to obtain the translations.

Text-to-SQL as a Translation Task. The OpenAI APIs enable interrogating a model in a multi-turn conversation format: chat models receive a series of messages as input and generate a message as output. We test the ability of GPT-3.5 on the Text-to-SQL task by framing each translation from natural language to SQL as a separate conversation.

The proposed approach, aimed at analysing the model's in-context learning abilities in zero-shot scenarios, is very similar to "Code Representation" [19] and has been specifically tested in Italian [9].

In particular, the first message for a target database gives the model the dump of the database. In each dump, information about the database's tables is provided by the CREATE TABLE statements. In the CREATE instructions, the constraints of the primary and foreign keys are also encoded. In addition, some realistic data to fill the tables are provided by INSERT instructions. Given the dump, the model answers by producing an interpretation of the dump. Typically, this model response contains an explanation of the dump's contents. For example, considering the database bowling in the Termite dataset, the first messages in the conversation are the following:

  user: Considera il seguente database: CREATE TABLE "pista" [...]; CREATE TABLE "giocatori" [...];
  [user: "Consider the following database: ..."]
  GPT-3.5: Questo database rappresenta una struttura per la gestione di un centro di bowling...
  [GPT-3.5: "This database represents a structure for managing a bowling center..."]
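The conversation protocol above can be sketched in Python. This is a minimal sketch assuming the openai package; the helper names are our own, not the authors' actual script, and the prompts reuse the Italian wording of the example (English glosses are given in the comments).

```python
def build_messages(dump_sql: str, interpretation: str, question: str) -> list:
    """Assemble the multi-turn context: the database dump, the model's own
    interpretation of it, and finally the NL question to translate."""
    return [
        # "Consider the following database:"
        {"role": "user",
         "content": "Considera il seguente database:\n" + dump_sql},
        # The interpretation the model previously produced for this dump.
        {"role": "assistant", "content": interpretation},
        # "Translate the following query into SQL. Answer using SQL only."
        {"role": "user",
         "content": "Traduci in SQL la seguente query. "
                    "Rispondi usando esclusivamente linguaggio SQL. "
                    + question},
    ]


def translate_question(dump_sql: str, interpretation: str, question: str) -> str:
    """Run one independent conversation per question (requires the openai
    package and an OPENAI_API_KEY in the environment)."""
    from openai import OpenAI
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=build_messages(dump_sql, interpretation, question),
    )
    return response.choices[0].message.content
```

Each question starts a fresh conversation, but the dump and the same initial interpretation are replayed as preceding messages, so the model's view of the database stays identical across questions.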
Then, given the dump and the model's interpretation of it, a message containing the natural language question to be translated is sent. In particular, the selected prompt ensures that the model translates natural language questions into SQL queries with a limited amount of text that is not SQL. These steps are repeated for each question separately to obtain translations independently. However, to ensure that the model's understanding of each database is comparable across all questions, the database dump and the same interpretation initially produced by the model are sent as context, in the form of preceding messages, before each translation is requested. Hence, building on the previous example, a conversation to translate a question on the bowling database would be completed by the following messages:

  user: Traduci in SQL la seguente query. Rispondi usando esclusivamente linguaggio SQL. Conta il numero di giocatori per partita.
  [user: "Translate the following query into SQL. Answer using only SQL. Count the number of players per game."]
  GPT-3.5: SELECT ora_inizio, tenuta_il, id_pista, COUNT(*) FROM 'partita' GROUP BY ora_inizio, tenuta_il, id_pista;

4.2. Measuring the Hardness of Queries in Spider and Termite

We need to ensure that Spider and Termite are comparable in hardness. Termite is designed with a similar annotation protocol; however, similarity in terms of the hardness of the natural language questions used is hard to quantify. For this reason, we asked 10 SQL-proficient annotators to perform a simple yet effective test measuring how difficult it is for them to translate questions from both Spider and Termite. The main idea is that if they can translate both Spider and Termite questions with the same accuracy level, then the challenge level is similar on both datasets.

In particular, given an E-R database schema and a natural language utterance, each test question asks the annotator to choose, from three SQL query options, the one that satisfies the request. All three options are syntactically correct SQL queries, but the incorrect answers are semantically different from the correct one. The authors designed the first incorrect option by perturbing the correct answer: removing or replacing some operations or retrieved columns and changing field and table names with non-matching ones. The second incorrect answer is another query extracted from the same dataset as the correct one. The selected query is the most similar to the correct one under the Bag-of-Words (BOW) assumption. To retrieve this third option, the similarity of two queries is measured via the cosine similarity of their BOW vectors.

The complete test is composed of 20 randomly selected queries from each dataset. Hence, the resulting 40 questions are shared with 10 SQL-proficient annotators: 60% of them are Computer Science Master's students, and the remaining ones are already graduated. Five annotators work in a field that requires daily use of the SQL query language. Finally, we divided the test into two trials of 20 queries each. We administered it to the annotators at two different times to limit errors due to gradual loss of concentration.

Our approach is completely zero-shot to minimize the effect that the prompt itself, rather than data contamination, can have on performance. Once the translation process is completed, the SQL code produced by the model is retrieved to evaluate whether or not the generated query satisfies the natural language query.

Execution Accuracy: the Evaluation Metric. The evaluation metric adopted is execution accuracy, introduced by Yu et al. [4], which assesses the correctness of the generated SQL query by executing it against the database and comparing the result with the expected output.

Execution Accuracy (EA) can be formally defined as follows. Let q represent the gold query and g represent the generated query. Execution accuracy compares the execution results of g and q on a database D:

  EA(g, q, D) = 1 if g(D) = q(D), and 0 if g(D) ≠ q(D),

where g(D) and q(D) represent the outputs of the queries on D. Execution accuracy is 1 if the results are the same and 0 otherwise.

If the generated SQL query contains syntactic errors, it is considered incorrect, as adherence to SQL grammar is part of the model's evaluation.

The execution accuracy metric is prone to false positives, as two different queries can return the same output under specific database record configurations. For this reason, in [12], the Test Suite Accuracy metric is adopted. Test Suite Accuracy, introduced in Zhong et al. [27], essentially involves performing execution accuracy on the same query across many randomly generated database record configurations, called a Test Suite.

In this paper, we propose EA as an evaluation metric because the way queries and database records are designed in Termite aims to minimize the occurrence of false positives. Additionally, to encourage experimentation with Termite, we recommend initially employing simple and computationally inexpensive evaluation metrics, in contrast to Test Suite Accuracy. Moreover, we suggest disregarding the query difficulty evaluation metric.
                                                             ric proposed by [4].
representations.
   Hence, an automated script, available at the accompanying link, evaluates generated SQL queries using Execution Accuracy as the metric. It can be run locally, as it is a lightweight program that executes queries on an SQL server and processes the output as our metric requires.

5. Experiments

Our Termite aims to extend the Text-to-SQL evaluation pipeline to Italian while preserving data integrity and thus preventing possible contamination. To prove its operability, we propose a baseline assessment in §5.1 and discuss the obtained results in §5.2.

5.1. Experimental Setup

We systematically evaluated the performance of GPT-3.5 (gpt-3.5-turbo-16k) on the Termite dataset for the Text-to-SQL task. We employed the API to generate SQL translations for each query in the dataset, setting the temperature parameter to 1, which allows for greater flexibility and diversity in the model's output. For each natural language query, a translation request was sent to the model; the generated SQL query was then saved and subsequently scored according to the aforementioned metric (§4.2).

    Database Name         EA_SCORE (%)        Queries
    bowling                    50.79             24
    centri                     56.25             19
    coronavirus                40.00             20
    farma                      62.50             20
    farmacia                   50.00             20
    galleria                   69.15             23
    hackathon                  46.25             19
    pratica                    50.11             22
    recensioni                 20.00             18
    voli                       56.25             17

Table 2
Execution Accuracy (EA_SCORE (%)) achieved by GPT-3.5 and number of queries for each database.

5.2. Baseline Results

The results achieved in the baseline assessment reveal the intrinsic challenges of the text-to-SQL task. Table 2 reports the Execution Accuracy percentages (EA_SCORE (%)) achieved by GPT-3.5 on each of the 10 databases that compose our Termite. It can be observed that an accuracy significantly exceeding 50% is only reached on the "farma" and "galleria" databases, where 62% and 69% accuracy were achieved, respectively.

6. Limitations & Future Works

The idea of Termite is to propose a new resource conceived and realized for the Italian language. During the discussion of the contribution, we introduced the underlying motivations that support our choices regarding encryption and baseline evaluations.
   However, we plan to extend our contribution to languages beyond Italian in future developments. We also aim to propose efficient alignment techniques that enable smaller models to cope with demanding tasks such as text-to-SQL by adopting teacher-student alignment strategies [28, 29].

7. Conclusions

We have introduced Termite, a resource that, to the best of our knowledge, is unique in that its databases and queries were natively conceived in Italian. Its structural alignment with well-known datasets like Spider makes it a solid benchmarking tool for analysing Text-to-SQL results when the test set languages differ.
   Additionally, its uniqueness lies in the fact that it is not publicly accessible by search engines, making it less exposed to the increasingly prominent issue of data contamination, particularly when dealing with closed-source large language models.
   Extending Termite to include queries whose complexity is driven not only by the SQL itself but also by tasks such as commonsense and arithmetic reasoning would further enrich the dataset. This is in line with approaches such as Archer [30], which address these additional challenges.

Acknowledgments

We would like to express our gratitude to the Human-Centric Art team for their valuable collaboration in the creation of the Termite dataset. Special thanks go to the annotators, whose work was essential in affirming the comparability between Termite and Spider. Finally, we extend our appreciation to the Computer Science students of the University of Rome Tor Vergata for providing the original hand-crafted databases, which were subsequently the subject of extensive reworking and refinement.
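The generation loop of the experimental setup (§5.1) can be sketched as follows. The model name and temperature follow the paper; the client code, prompt wording, and helper names (`build_prompt`, `translate_to_sql`) are illustrative assumptions based on the `openai` Python package, not the exact setup used.

```python
def build_prompt(schema: str, question: str) -> list:
    """Construct the chat messages sent for one natural language query."""
    return [
        {"role": "system",
         "content": "Translate the user's question into a single SQL query "
                    "for the given database schema. Return only the SQL."},
        {"role": "user", "content": f"Schema:\n{schema}\n\nQuestion: {question}"},
    ]


def translate_to_sql(client, schema: str, question: str) -> str:
    """One translation request per query; temperature 1 as in the paper's setup.

    `client` is assumed to be an OpenAI client instance, e.g.
    `from openai import OpenAI; client = OpenAI()`.
    """
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-16k",
        temperature=1,
        messages=build_prompt(schema, question),
    )
    return response.choices[0].message.content.strip()
```

Each returned query would then be saved and scored with the Execution Accuracy script described in §4.2.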
References

 [1] A. Giordani, A. Moschitti, Translating questions to SQL queries with generative parsers discriminatively reranked, in: M. Kay, C. Boitet (Eds.), Proceedings of COLING 2012: Posters, The COLING 2012 Organizing Committee, Mumbai, India, 2012, pp. 401–410. URL: https://aclanthology.org/C12-2040.
 [2] T. Scholak, N. Schucher, D. Bahdanau, PICARD: Parsing incrementally for constrained auto-regressive decoding from language models, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 9895–9901. URL: https://aclanthology.org/2021.emnlp-main.779. doi:10.18653/v1/2021.emnlp-main.779.
 [3] M. Pourreza, D. Rafiei, DIN-SQL: Decomposed in-context learning of text-to-SQL with self-correction, in: Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL: https://openreview.net/forum?id=p53QDxSIc5.
 [4] T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, Z. Zhang, D. Radev, Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task, in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 3911–3921. URL: https://aclanthology.org/D18-1425. doi:10.18653/v1/D18-1425.
 [5] L. Dou, Y. Gao, M. Pan, D. Wang, W. Che, D. Zhan, J.-G. Lou, Multispider: Towards benchmarking multilingual text-to-SQL semantic parsing, 2022. URL: https://arxiv.org/abs/2212.13492. arXiv:2212.13492.
 [6] J. Li, B. Hui, G. Qu, J. Yang, B. Li, B. Li, B. Wang, B. Qin, R. Geng, N. Huo, X. Zhou, C. Ma, G. Li, K. Chang, F. Huang, R. Cheng, Y. Li, Can LLM already serve as a database interface? A big bench for large-scale database grounded text-to-SQLs, in: Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL: https://openreview.net/forum?id=dI4wzAE6uV.
 [7] J. Li, B. Hui, G. Qu, J. Yang, B. Li, B. Li, B. Wang, B. Qin, R. Cao, R. Geng, N. Huo, X. Zhou, C. Ma, G. Li, K. C. C. Chang, F. Huang, R. Cheng, Y. Li, Can LLM already serve as a database interface? A big bench for large-scale database grounded text-to-SQLs, 2023. arXiv:2305.03111.
 [8] I. Magar, R. Schwartz, Data contamination: From memorization to exploitation, 2022. arXiv:2203.08242.
 [9] F. Ranaldi, E. S. Ruzzetti, et al., Prompting LLMs in Italian language for text-to-SQL translation, in: Proceedings of CLiC-it 2023, 2023.
[10] L. Ranaldi, A. Nourbakhsh, E. S. Ruzzetti, A. Patrizi, D. Onorati, M. Mastromattei, F. Fallucchi, F. M. Zanzotto, The dark side of the language: Pre-trained transformers in the DarkNet, in: R. Mitkov, G. Angelova (Eds.), Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, INCOMA Ltd., Shoumen, Bulgaria, Varna, Bulgaria, 2023, pp. 949–960. URL: https://aclanthology.org/2023.ranlp-1.102.
[11] L. Ranaldi, E. S. Ruzzetti, F. M. Zanzotto, PreCog: Exploring the relation between memorization and performance in pre-trained language models, in: R. Mitkov, G. Angelova (Eds.), Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, INCOMA Ltd., Shoumen, Bulgaria, Varna, Bulgaria, 2023, pp. 961–967. URL: https://aclanthology.org/2023.ranlp-1.103.
[12] F. Ranaldi, E. S. Ruzzetti, D. Onorati, L. Ranaldi, C. Giannone, A. Favalli, R. Romagnoli, F. M. Zanzotto, Investigating the impact of data contamination of large language models in text-to-SQL translation, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Findings of the Association for Computational Linguistics ACL 2024, Association for Computational Linguistics, Bangkok, Thailand and virtual meeting, 2024, pp. 13909–13920. URL: https://aclanthology.org/2024.findings-acl.827.
[13] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Francis, J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Rinaldi, D. Scalena, CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[14] L. Ranaldi, G. Pucci, Does the English matter? Elicit cross-lingual abilities of large language models, in: D. Ataman (Ed.), Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL), Association for Computational Linguistics, Singapore, 2023, pp. 173–183. URL: https://aclanthology.org/2023.mrl-1.14. doi:10.18653/v1/2023.mrl-1.14.
[15] L. Ranaldi, G. Pucci, F. Ranaldi, E. S. Ruzzetti, F. M. Zanzotto, A tree-of-thoughts to broaden multi-step reasoning across languages, in: K. Duh, H. Gomez, S. Bethard (Eds.), Findings of the Association for Computational Linguistics: NAACL 2024, Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 1229–1241. URL: https://aclanthology.org/2024.findings-naacl.78. doi:10.18653/v1/2024.findings-naacl.78.
[16] L. Ranaldi, G. Pucci, A. Freitas, Empowering cross-lingual abilities of instruction-tuned large language models by translation-following demonstrations, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Findings of the Association for Computational Linguistics ACL 2024, Association for Computational Linguistics, Bangkok, Thailand and virtual meeting, 2024, pp. 7961–7973. URL: https://aclanthology.org/2024.findings-acl.473.
[17] J. Li, B. Hui, G. Qu, J. Yang, B. Li, B. Li, B. Wang, B. Qin, R. Cao, R. Geng, N. Huo, X. Zhou, C. Ma, G. Li, K. C. C. Chang, F. Huang, R. Cheng, Y. Li, Can LLM already serve as a database interface? A big bench for large-scale database grounded text-to-SQLs, 2023. URL: https://arxiv.org/abs/2305.03111. arXiv:2305.03111.
[18] D. Gao, H. Wang, Y. Li, X. Sun, Y. Qian, B. Ding, J. Zhou, Text-to-SQL empowered by large language models: A benchmark evaluation, 2023. arXiv:2308.15363.
[19] D. Gao, H. Wang, Y. Li, X. Sun, Y. Qian, B. Ding, J. Zhou, Text-to-SQL empowered by large language models: A benchmark evaluation, 2023. URL: https://arxiv.org/abs/2308.15363. arXiv:2308.15363.
[20] L. Ranaldi, G. Pucci, When large language models contradict humans? Large language models' sycophantic behaviour, 2024. URL: https://arxiv.org/abs/2311.09410. arXiv:2311.09410.
[21] C. Deng, Y. Zhao, Y. Heng, Y. Li, J. Cao, X. Tang, A. Cohan, Unveiling the spectrum of data contamination in language model: A survey from detection to remediation, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Findings of the Association for Computational Linguistics ACL 2024, Association for Computational Linguistics, Bangkok, Thailand and virtual meeting, 2024, pp. 16078–16092. URL: https://aclanthology.org/2024.findings-acl.951.
[22] M. Ravaut, B. Ding, F. Jiao, H. Chen, X. Li, R. Zhao, C. Qin, C. Xiong, S. Joty, How much are large language models contaminated? A comprehensive survey and the LLMSanitize library, 2024. URL: https://arxiv.org/abs/2404.00699. arXiv:2404.00699.
[23] OpenAI, GPT's family, 2023. URL: https://platform.openai.com/docs/models.
[24] S. Golchin, M. Surdeanu, Time travel in LLMs: Tracing data contamination in large language models, 2024. URL: https://arxiv.org/abs/2308.08493. arXiv:2308.08493.
[25] C. Deng, Y. Zhao, X. Tang, M. Gerstein, A. Cohan, Investigating data contamination in modern benchmarks for large language models, 2024. URL: https://arxiv.org/abs/2311.09783. arXiv:2311.09783.
[26] Contaminated datasets index, https://hitz-zentroa.github.io/lm-contamination/, 2023. Accessed: 2024-09-23.
[27] R. Zhong, T. Yu, D. Klein, Semantic evaluation for text-to-SQL with distilled test suites, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 396–411. URL: https://aclanthology.org/2020.emnlp-main.29. doi:10.18653/v1/2020.emnlp-main.29.
[28] L. Ranaldi, A. Freitas, Aligning large and small language models via chain-of-thought reasoning, in: Y. Graham, M. Purver (Eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, St. Julian's, Malta, 2024, pp. 1812–1827. URL: https://aclanthology.org/2024.eacl-long.109.
[29] L. Ranaldi, G. Pucci, F. M. Zanzotto, Modeling easiness for training transformers with curriculum learning, in: R. Mitkov, G. Angelova (Eds.), Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, INCOMA Ltd., Shoumen, Bulgaria, Varna, Bulgaria, 2023, pp. 937–948. URL: https://aclanthology.org/2023.ranlp-1.101.
[30] D. Zheng, M. Lapata, J. Z. Pan, Archer: A human-labeled text-to-SQL dataset with arithmetic, commonsense and hypothetical reasoning, 2024. URL: https://arxiv.org/abs/2402.12554. arXiv:2402.12554.