<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>SEBD</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Text-to-SQL with Large Language Models: The Promise and Pitfalls</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Sala</string-name>
          <email>luca.sala@unimore.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Sullutrone</string-name>
          <email>giovanni.sullutrone@unimore.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sonia Bergamaschi</string-name>
          <email>sonia.bergamaschi@unimore.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Large Language Models, Text-to-SQL, Relational Databases, SQL</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Modena and Reggio Emilia</institution>,
          <addr-line>UNIMORE</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>32</volume>
      <fpage>23</fpage>
      <lpage>26</lpage>
      <abstract>
        <p>The emergence of Large Language Models (LLMs) represents a fundamental change in the ever-evolving field of natural language processing (NLP). Over the past few years, the enhanced capabilities of these models have led to their widespread use across various fields, in both practical applications and research contexts. In particular, as data science intersects with LLMs, new research opportunities and insights emerge, notably in translating text into Structured Query Language (Text-to-SQL). The application of this technology to such a task poses a unique set of opportunities and related issues that have significant implications for information retrieval. This discussion paper delves into these intricacies and limitations, focusing on challenges that jeopardise efficacy and reliability. This research investigates scalability, accuracy, and the concerning issue of hallucinated responses, questioning the trustworthiness of LLMs. Furthermore, we point out the limits of the current usage of test datasets created for research purposes in capturing real-world complexities. Finally, we consider the performance of Text-to-SQL with LLMs from different perspectives. Our investigation identifies the key challenges faced by LLMs and proposes viable solutions to facilitate the exploitation of these models to advance data retrieval, bridging the gap between academic research and real-world application scenarios.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Text-to-SQL</kwd>
        <kwd>Relational Databases</kwd>
        <kwd>SQL</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        In recent years, natural language processing (NLP) has been fundamentally changed by the rise
of Large Language Models (LLMs). Models such as BERT (Bidirectional Encoder Representations
from Transformers) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and GPT (Generative Pretrained Transformer) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], trained on massive
corpora of written data, have shown impressive capabilities in grasping semantic relationships
and solving complex tasks.
      </p>
      <p>This has made them powerful tools for human-computer interaction that require extensive
semantic and domain knowledge in order to be up to par with the requirements of real-world
applications. In particular, their ability to interpret natural language requests and translate
them into executable SQL statements has the potential to revolutionize database querying,
from bridging the gap between complex database systems and end-users to making data-driven
insights more accessible to a broader audience.</p>
      <p>
        Furthermore, the impressive adaptability and learning capabilities of LLMs offer the promise of
continuous improvement in query understanding and processing. As these models are exposed
to more domain-specific data, their effectiveness in handling queries in various specialized
fields, from healthcare to financial services, is expected to improve [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This not only enhances
the accuracy of query conversion, but also opens up possibilities for personalized database
interactions, where the model adjusts to the user’s language and query patterns.
      </p>
      <p>This paper explores the current limitations of Text-to-SQL systems powered by LLMs. It
focuses on potential pitfalls and readily applicable solutions to improve performance for
real-world use cases. The structure is as follows: Section 2 provides background on Text-to-SQL
with LLMs; Section 3 addresses challenges, limitations, and solutions; followed by conclusions
and future perspectives.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Background</title>
      <sec id="sec-3-1">
        <title>2.1. Text-to-SQL</title>
        <p>
          The inherent complexity of the Text-to-SQL task comes from the fundamental differences
between natural language and SQL. Natural language is characterized by ambiguity, flexibility,
and implicit context, whereas SQL adheres to a strict, formal syntax and requires explicit
representation of relationships within a database schema. Early approaches relied heavily on
handcrafted rules and grammars [
          <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
          ], leading to systems that were difficult to generalize to new
domains. With the rise of machine learning, new techniques started to take shape, employing
elements like sequence-to-sequence models to learn the mapping between natural language
and SQL, showing improved robustness compared to previous ones [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>
          Since LLMs are neural network-based models pre-trained on massive text corpora, enabling
them to capture rich linguistic patterns and world knowledge, their advent has further
revolutionized the Text-to-SQL field. Key to their success is the Transformer architecture, which
excels at processing sequential data and modeling long-range dependencies [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>
          Their pre-training process exposes them to diverse language usage and domain knowledge
[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] that can be readily made available to convert natural language into queries. Furthermore,
LLMs can effectively model the logical structure of SQL, handling complex elements like nested
structures and aggregations [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Notably, these models display potential for zero-shot or
few-shot learning in Text-to-SQL, suggesting they can generate SQL queries for new database
schemas with minimal or even no additional fine-tuning, thus increasing their adaptability [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>
          The integration of LLMs with Text-to-SQL is currently a thriving area of research. Benchmarks
like WikiSQL [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], Spider [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], and BIRD [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] play a crucial role in driving progress and providing
standard evaluation metrics. These datasets consist of paired natural language questions and
corresponding SQL translations across various domains.
        </p>
        <p>
          Diverse strategies have been explored to harness the power of this technology. Among
them, [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] used an incremental pre-training procedure and fine-tuning on task-specific labeled
data. Additionally, interest has been placed in In-Context learning (ICL) [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], where LLMs
are prompted with natural language instructions, examples, and carefully engineered input
sequences to generate the SQL output [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Finally, researchers are exploring hybrid approaches
that combine the strengths of LLMs with decoding constraints or intermediate representations
to enhance the structure and controllability of the generated SQL queries [
          <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
          ].
        </p>
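        <p>As a concrete illustration of the ICL setting, consider the following minimal sketch (ours, not drawn from the cited works; the schema, demonstrations, and prompt layout are arbitrary assumptions), which assembles a few-shot prompt from schema text and example pairs:</p>
        <preformat>
# Minimal sketch of In-Context Learning (ICL) for Text-to-SQL. The
# schema, demonstrations, and prompt layout are illustrative assumptions.

EXAMPLES = [
    ("How many singers do we have?",
     "SELECT COUNT(*) FROM singer;"),
    ("List the names of singers older than 30.",
     "SELECT name FROM singer WHERE age > 30;"),
]

def build_prompt(schema, question):
    """Compose instructions, schema, demonstrations, and the new question."""
    parts = ["Translate the question into a SQL query for this schema.",
             schema, ""]
    for q, sql in EXAMPLES:
        parts.append(f"Q: {q}")
        parts.append(f"SQL: {sql}")
    parts.append(f"Q: {question}")
    parts.append("SQL:")
    return "\n".join(parts)

schema = "CREATE TABLE singer(id INTEGER, name TEXT, age INTEGER);"
prompt = build_prompt(schema, "What is the average age of singers?")
# 'prompt' would then be sent to any LLM completion endpoint.
        </preformat>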
      </sec>
      <sec id="sec-3-2">
        <title>2.2. The Need for Text-to-SQL</title>
        <p>
          Relational databases, characterized by their efficient, structured, and reliable data management
capabilities, have been instrumental in supporting transactional data storage and critical business
operations for decades. In 2022, the market value of relational databases was an impressive
USD 55.9 billion, with forecasts predicting a growth to USD 161.4 billion by 2032, showcasing
a compound annual growth rate (CAGR) of 12.50% [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. This substantial growth underscores
the continuing reliance on relational databases in the digital age and highlights the increasing
amount of data being processed and stored.
        </p>
        <p>However, accessing and analyzing this vast reservoir of data poses a challenge, particularly
for non-experts. The traditional method of interacting with databases through structured query
languages such as SQL requires a deep understanding of database schemas and precise command
syntax. Through NL querying, users can communicate with databases in plain text, bypassing
the need to master complex query languages.</p>
        <p>Integrating Text-to-SQL capabilities into data management systems can therefore significantly
accelerate the data exploration process, enabling faster decision-making and insight discovery.
It allows users to ask iterative questions, refine their queries based on previous results, and
explore data relationships and patterns without the bottleneck of formulating precise SQL
queries.</p>
        <p>In summary, the need for Text-to-SQL technologies is driven by the growing complexity and
volume of data stored in relational databases and the necessity to make this data accessible
to a wider audience. As such, investing in and developing these technologies is crucial for
organizations aiming to stay competitive in the data-driven landscape of the 21st century.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Addressing Challenges and Limitations</title>
      <sec id="sec-4-1">
        <title>3.1. Response Time and Performance</title>
        <p>In the realm of database interaction, response time, the time elapsed before receiving a query
result, plays a vital role in ensuring smooth operation and a seamless user experience. The
introduction of LLMs for query generation shifts our perspective on these metrics, placing
emphasis on their inference speed (inference being the process of obtaining a response from the
trained LLM for the user’s query or prompt), as they act as an additional translation layer between
the user’s requests and the extracted data. Understanding response time from this perspective
requires a nuanced look at factors like Time to First Token (TTFT), which indicates the model’s
initial responsiveness, and Time Per Output Token (TPOT), which determines how efficiently it
generates subsequent parts of the query. Together, TTFT and TPOT give us latency, a measure
of the total time needed to produce a complete response or, in our case, the converted SQL query.
Throughput, on the other hand, quantifies the server’s ability to produce output tokens across
multiple requests. While these metrics offer valuable insights, it is important to acknowledge
that the hardware used to deploy any LLM has the biggest impact on these factors, making
them highly susceptible to the specific context of application.</p>
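        <p>To make these definitions concrete, the following small sketch (ours; the TTFT and TPOT values are arbitrary assumptions) shows how latency decomposes into the time to the first token plus the per-token cost of the remaining ones:</p>
        <preformat>
# Worked sketch of the latency decomposition described above.
# The TTFT and TPOT values are arbitrary assumptions for illustration.

def latency_seconds(ttft, tpot, output_tokens):
    """Total time to produce a complete response: time to first token
    plus per-token time for the remaining tokens."""
    return ttft + tpot * (output_tokens - 1)

def throughput_tokens_per_s(total_output_tokens, wall_time):
    """Server-level rate of output-token production across requests."""
    return total_output_tokens / wall_time

# A 60-token SQL query with TTFT = 0.4 s and TPOT = 0.03 s:
print(latency_seconds(0.4, 0.03, 60))       # 2.17 s end-to-end
print(throughput_tokens_per_s(600, 2.5))    # 240 tokens/s across requests
        </preformat>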
        <p>A significant gap in current research is the lack of direct comparisons between the time
it takes a language model to create a query versus the time it takes a human to do the same
task. This obscurity hinders our understanding of the potential advantages this technology
offers. User expectations have been shaped by the immediate feedback search engines provide; it
follows naturally that benchmarks should also account for the desire for fast responses.</p>
        <p>
          Evaluating Text-to-SQL performance extends beyond the raw capabilities of the LLM; the
methods used for assessment play a decisive role. Benchmarks like Spider [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] offer insufficient
analysis of how models compare to human performance in this task. The BIRD benchmark [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]
partially addresses this shortcoming by incorporating human ratings but omits crucial elements
such as the number of attempts and time required for humans to write valid SQL queries.
Incorporating these measurements would enable a more in-depth comparison between model
and human efficiency.
        </p>
        <p>
          As database complexity grows, the interplay between response time and performance becomes
even more critical. Maintaining responsiveness without compromising reliability demands
advanced techniques. Ironically, methods designed to improve LLM accuracy can sometimes
worsen response time. The Chain of Thought (CoT) approach [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], for example, helps tackle
complex queries by breaking them into sub-problems, while techniques like Least-to-Most [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]
and Self-Consistency [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] involve repeated questioning to gain clarity and improve precision.
Although beneficial for complex queries, this subdivision into steps introduces variability into
both the computational resources needed and the overall time taken to generate a response.
This presents a challenge in ensuring predictability and efficiency.
        </p>
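        <p>The cost implication is easy to see in a minimal sketch of a Self-Consistency-style loop (ours; generate and execute are caller-supplied stand-ins for an LLM call and a database driver, not taken from [18]):</p>
        <preformat>
# Minimal sketch of Self-Consistency for Text-to-SQL: sample several
# candidate queries, execute each, and keep the query whose execution
# result wins the majority vote. 'generate' and 'execute' are caller-
# supplied hooks (an LLM call and a database driver, respectively).
from collections import Counter

def self_consistent_sql(question, generate, execute, k=5):
    candidates = [generate(question) for _ in range(k)]  # k-fold cost
    results = {}
    for sql in candidates:
        try:
            results[sql] = tuple(execute(sql))  # hashable result set
        except Exception:
            continue  # discard non-executable candidates
    if not results:
        return None
    winner = Counter(results.values()).most_common(1)[0][0]
    return next(s for s, r in results.items() if r == winner)
        </preformat>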
        <p>
          One possible workaround is to use specialized inference engines like the Language
Processing Unit (LPU) introduced by Groq [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], which shows 3-18x improvements in output token
throughput compared to traditional providers. Furthermore, it guarantees a consistent
Time to First Token, drastically reducing the variability of responses.
        </p>
        <p>Balancing the benefits of advanced LLM techniques with the need for predictable and efficient
database interactions remains a critical area for ongoing research and development in the field
of NLP and database management.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Scalability</title>
        <p>
          The rapid expansion of available data and the increasing complexity of databases present
significant challenges for applying LLMs to the task of Text-to-SQL. Current models struggle
with large databases and real-world datasets that often contain inconsistencies or ’noisy’ values
[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Additionally, the inherent complexity of databases, combined with the limited context
window, which determines how much information they can hold in memory, can lead to
significant compression of the prompt, hindering their understanding of the underlying data
structure.
        </p>
        <p>Current methodologies, in fact, base the pre-trained model’s grounding on two main elements:
schema linking and example value sampling.</p>
        <p>
          Schema linking identifies references to database elements (tables, columns, etc.) within
the natural language query to be added to the prompt [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. As databases scale, queries may
reference a broader range of tables, making schema linking more difficult and forcing a stricter
selection, impacting overall performance [
          <xref ref-type="bibr" rid="ref20 ref9">9, 20</xref>
          ].
        </p>
        <p>
          Value sampling aims to provide the LLM with representative examples from the linked tables
[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. However, with larger tables, these samples may not adequately reflect the full distribution
of data, potentially misleading the LLM.
        </p>
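        <p>A naive sketch of these two grounding ingredients follows (ours; the keyword-overlap linker is far simpler than the methods of [9, 20] and serves only to illustrate the mechanism):</p>
        <preformat>
# Naive sketch of the two grounding ingredients: schema linking via
# keyword overlap and value sampling via a LIMIT query. Real linkers
# are far stronger; this only illustrates the mechanism.
import sqlite3

def link_tables(question, schema):
    """schema: dict mapping table name to list of column names."""
    words = set(question.lower().replace("?", "").split())
    return [t for t, cols in schema.items()
            if t in words or words.intersection(c.lower() for c in cols)]

def sample_values(conn, table, n=3):
    return conn.execute(f"SELECT * FROM {table} LIMIT {n}").fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer(id INTEGER, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?, ?)",
                 [(1, "Ada", 35), (2, "Ben", 28), (3, "Cleo", 41)])

schema = {"singer": ["id", "name", "age"]}
linked = link_tables("list the age of every singer", schema)
grounding = {t: sample_values(conn, t) for t in linked}
# 'grounding' is serialized into the prompt; with large tables these few
# rows may not reflect the full value distribution, as noted above.
        </preformat>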
        <p>Fortunately, the ongoing evolution of these models suggests that scalability issues may be
addressed intrinsically as these models improve.</p>
        <p>
          Starting from early models like GPT-3.5-turbo, which had a context window of 4096 tokens,
and GPT-4 with 8192 tokens, significant progress has been made in GPT-3.5-turbo-16k-0613
and GPT-4-32k-0613 with their limits increased to 16384 and 32768 tokens, respectively. Two
of today’s most advanced models, Claude 3 [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] and Gemini 1.5 Pro [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], ofer even more
impressive context windows, up to 200,000 tokens for the former and up to 1 million tokens for
the latter.
        </p>
        <p>
          A potential drawback for long context models, however, is the performance drop in specific
positions of their memory which could result in a loss of task-essential information. It has
been observed that performance is often highest when relevant information is located at the
beginning or end of the input context, while it degrades significantly otherwise [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ].
        </p>
        <p>
          However, the most recent models claim to have mitigated the problem. Gemini 1.5 Pro
achieves near-perfect (&gt;99%) recall up to multiple millions of tokens in all modalities, i.e., text,
video, and audio, and even maintains this recall performance when extending to 10M tokens
in all three modalities [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. Additionally Claude 3 Opus not only achieved near-perfect recall,
surpassing 99% accuracy, but in some cases, it even identified the limitations of the evaluation
itself by recognizing that the ”needle” sentence used to test the information retrieval capability
appeared to be artificially inserted into the original text by a human [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ].
        </p>
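        <p>For illustration, a tiny sketch of such a ”needle in a haystack” recall probe (ours; ask_llm is a hypothetical hook around any long-context model, and the filler and needle strings are arbitrary):</p>
        <preformat>
# Tiny sketch of a "needle in a haystack" recall probe like those cited
# above: hide one fact at a chosen depth of a long context and check
# whether the model can repeat it. 'ask_llm' is a hypothetical hook.
def needle_test(ask_llm, total=2000, depth=0.5):
    filler = "The sky was clear and the grass was green."
    needle = "The magic number is 7421."
    context = [filler] * total
    context[int(depth * (total - 1))] = needle  # bury the fact at 'depth'
    prompt = " ".join(context) + " What is the magic number?"
    return "7421" in ask_llm(prompt)

# Sweeping depth over 0.0, 0.25, 0.5, 0.75, 1.0 exposes the positional
# degradation described in [23].
        </preformat>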
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Hallucinations</title>
        <p>The term ”hallucinations”, in the context of LLMs, refers to instances where the model generates
inaccurate or misleading information. This phenomenon can arise due to various factors, such
as the inherent complexities of natural language, biases within the training data, and limitations
of the model itself. Hallucinations represent a challenge in the field of Text-to-SQL, where
accuracy and precision in relation to the underlying database and its schema are paramount.</p>
        <p>Within these systems, hallucinations manifest when the LLM fabricates incorrect assumptions
about the database structure or invents non-existent tables, columns, or data values. These
hallucinations pose a serious threat to the model’s performance and reliability, as they can lead
to SQL queries that are either invalid or generate incorrect results.</p>
        <p>
          Researchers have observed that hallucinations involving the creation of fictional table data are
a particularly prevalent issue in large-scale databases [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Even when schema linking techniques
are employed to align the generated query with the structure of the target database, these
problems persist.
        </p>
        <p>
          Mitigating hallucinations is an active area of research that has seen various interesting
proposals. Recent solutions include techniques like response selectors that use beam search (a
decoding strategy that, instead of selecting only the single most likely word at each step, keeps
track of multiple likely sequences) to choose executable SQL queries as the final answer [
          <xref ref-type="bibr" rid="ref14 ref24">24, 14</xref>
          ]. Another technique is to use
an output calibration step that encompasses, among others, a fuzzy search to find the closest
matching columns to potentially resolve invalid ones [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. A new avenue of research, however,
is the use of Uncertainty Quantification (UQ) to assess the confidence of an LLM’s generated
output, as UQ methods can assign confidence scores to different parts of the model’s output.
[
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] shows empirically that UQ techniques allow relatively inexpensive fact-checking. This
could have a twofold application: to highlight possible hallucinated terms in the converted
query to be changed by the user or as additional information for a self-correction procedure of
the model itself.
        </p>
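        <p>A minimal sketch combining two of the mitigations above, executable-candidate selection and fuzzy column calibration, follows (ours; in practice the candidate list would come from a beam-search decoder):</p>
        <preformat>
# Sketch of two mitigations discussed above: (1) keep the first candidate
# query the database will accept, (2) repair a hallucinated column name
# by fuzzy-matching it against the real schema. Illustrative only.
import sqlite3
from difflib import get_close_matches

def first_executable(conn, candidates):
    for sql in candidates:
        try:
            conn.execute("EXPLAIN QUERY PLAN " + sql)  # SQLite dry run
            return sql
        except sqlite3.Error:
            continue  # invalid candidate: hallucinated table/column, etc.
    return None

def calibrate_column(bad_column, real_columns):
    """Map an invented column to its closest real counterpart, if any."""
    hits = get_close_matches(bad_column, real_columns, n=1, cutoff=0.6)
    return hits[0] if hits else bad_column

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer(id INTEGER, name TEXT, age INTEGER)")
print(first_executable(conn, ["SELECT nam FROM singer",     # rejected
                              "SELECT name FROM singer"]))  # accepted
print(calibrate_column("ages", ["id", "name", "age"]))      # 'age'
        </preformat>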
      </sec>
      <sec id="sec-4-4">
        <title>3.4. Dataset Representativity</title>
        <p>In the realm of NLP and database query creation, various datasets and benchmarks have been
developed in order to fill the gap between human language and structured database queries.</p>
        <p>
          Among these are the ATIS [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] and GEO [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] datasets, which contain fewer than 500 unique
SQL queries. On the other hand, WikiSQL [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] includes a larger number of queries and
significantly larger tables, but it only covers basic queries. Spider [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] aims to address the limitations
of WikiSQL by incorporating more complex, multi-table queries and a broader diversity of
SQL queries, thus improving the ability of models to understand and generate intricate SQL
commands from natural language inputs. Following Spider, BIRD [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] aims to further advance
this domain by focusing on being more realistic in collecting data from real-world scenarios,
while retaining all the complexity and variability of such data in the dataset.
        </p>
        <p>However, BIRD is not without its limitations. Firstly, it exhibits bias in the generation of
NL questions, primarily due to the presumed background knowledge of the user regarding
the database’s structure and terminology. This assumption can lead to a gap between ad-hoc
and real-world query formulations, as a typical user may not recall specific details about the
database or might use incorrect terms.</p>
        <p>Including non-experts in the creation of NL questions or limiting their schema knowledge are
two potential ways to mitigate these biases. Such an approach may yield questions that more
closely represent those posed by a larger user base.</p>
        <p>
          Secondly, in BIRD, tables or fields that are inaccessible, due to user privileges or absence, are
not explored, raising concerns about its practicality in real-world scenarios. One way to
better capture real-world facets is by intentionally including non-implementable queries. This
intentional introduction of real-world imperfections would enable more robust testing. To
this end, we suggest introducing one promising strategy proposed in [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]. Applied in the
Text-to-SQL field, this would entail fine-tuning a model on a dataset where such queries are
intentionally tagged with an ”I don’t know” response. This approach encourages models to
recognize the limits of their ability and avoid the tendency to ”hallucinate” solutions that violate
database constraints or permissions. The key insight is that a model capable of acknowledging
its limitations is likely to be far more valuable in a practical setting than one that produces
incorrect or misleading results.
        </p>
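        <p>A minimal sketch of how such refusal-tagged training pairs could be produced follows (ours; the refusal string and the accessibility check are illustrative assumptions, not the procedure of [29]):</p>
        <preformat>
# Sketch of building fine-tuning pairs where unanswerable questions are
# tagged with an explicit refusal, as suggested above. Illustrative only.
IDK = "I don't know: the request cannot be answered on this schema."

def make_pair(question, gold_sql, accessible_tables, referenced_tables):
    """Emit a refusal target when the query touches tables the user
    cannot access (or that do not exist); otherwise keep the gold SQL."""
    if referenced_tables.issubset(accessible_tables):
        return {"prompt": question, "completion": gold_sql}
    return {"prompt": question, "completion": IDK}

# A query over a restricted 'salaries' table becomes a refusal example:
pair = make_pair("Show every employee salary",
                 "SELECT salary FROM salaries;",
                 accessible_tables={"employees"},
                 referenced_tables={"salaries"})
        </preformat>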
        <p>Furthermore, existing Text-to-SQL datasets and benchmarks often underutilize the vast
knowledge and contextual understanding capabilities of LLMs. While these models excel at
incorporating domain knowledge, datasets currently lack queries designed to test these abilities,
even though non-expert users may naturally create questions incorporating cultural references
(e.g. ”list movies released in the year of the dragon”) or requiring the translation of colloquial
terms into precise expressions (e.g. ”show me sales figures for the summer months”). This gap
represents a significant missed opportunity.</p>
      </sec>
      <sec id="sec-4-5">
        <title>3.5. Knowledge Acquisition Methods</title>
        <p>
          For accurate Text-to-SQL conversion in professional settings, models must incorporate
field-specific linguistic, domain, and mathematical knowledge [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]. The first enables the model to
deal with terminology that may differ between the question and the underlying schema, the
second allows the conversion of domain-specific concepts, and the last provides the implicit
mathematical or SQL operations needed to solve complex requests.
        </p>
        <p>Current solutions utilize either fine-tuning (FT) or In-Context Learning (ICL).</p>
        <p>Fine-tuning is the more traditional approach for adapting LLMs to specific tasks. It involves
updating a pre-trained model’s weights through gradient descent using a related labeled dataset.
ICL, on the other hand, guides model behavior without weight updates by providing input-output
pairs within the prompt itself, demonstrating the desired response for the task. Both
methodologies have intrinsic cost considerations.</p>
        <p>FT, in spite of the improved data efficiency afforded by the pre-trained weights, still needs a
non-negligible amount of high-quality labeled data to work correctly, and it results in a model
specialized for the specific task at hand, hindering its use for multiple concurrent downstream
tasks. Furthermore, the high computational expenditure of tuning an LLM cannot be ignored. This
methodology, however, provides a clear view of the costs, since they are limited to the additional
training phase.</p>
        <p>
          ICL, instead, has the drawback of processing the additional examples provided at each
execution, increasing memory usage and time to first token, and resulting in a model whose
performance lags behind the fine-tuning procedure [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and is highly sensitive to wording [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ]
and pair ordering [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ]. The retrieval of relevant examples from a database must also be
accounted for in the resource consumption. This combination of elements makes the long-term
costs and effectiveness of in-context learning more opaque.
        </p>
        <p>
          Recently, a new idea has been proposed as a third option. [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] introduced the use of
Parameter-Efficient Fine-Tuning (PEFT), specifically Low-Rank Adaptation (LoRA) [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ], to create a
model-agnostic framework to efficiently adapt pre-trained models to the task at hand by changing
only a small number of parameters. Additionally, to overcome the limits of fine-tuning for multiple
domains, a ”Plugin hub” has been introduced to enable both the hot-swapping of specialized weights
to tackle different databases and the creation of plugins (i.e. weights) starting from merged field-related
ones [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ].
        </p>
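        <p>As an illustration, a minimal PEFT setup with the Hugging Face peft library might look as follows (ours; the base model and hyperparameters are assumptions and do not reproduce the framework of [25]):</p>
        <preformat>
# Sketch of Low-Rank Adaptation (LoRA): freeze the base model and train
# only small low-rank update matrices. Base model and hyperparameters
# are illustrative, not the configuration used in [25].
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in base model
config = LoraConfig(
    r=8,                        # rank of the update matrices
    lora_alpha=16,              # scaling factor for the updates
    target_modules=["c_attn"],  # attention projection in GPT-2
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only a tiny fraction is trainable
# The trained adapter is small enough to be stored per database and
# hot-swapped at run-time, mirroring the "plugin hub" idea of [25].
        </preformat>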
        <p>Regardless of the chosen methodology, a critical challenge lies in efficiently acquiring and
providing the necessary field-specific knowledge to the model.</p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref34">34, 35</xref>
          ], different examples are annotated and used as a fine-tuning source. This, however,
has high generation costs, since expert human annotators are needed to instill a diverse
and accurate understanding into the data and, therefore, the model.
        </p>
        <p>[36] tries to solve this by utilizing publicly available resources to retrieve relevant field
information. This ”bank” of knowledge is then used to guide the model towards the correct
schema linking and conversion. The proposed methodology does mitigate the incurred initial
cost, but the bank creation, without constant updates or improvements, can miss useful data or
lag behind fast evolving fields. Another issue is that, without careful filtering during the set up
of the knowledge archive, the extraction process may generate noisy or conflicting information
with a negative impact on the following retrieval operations.</p>
        <p>
          One possible solution to obtain the best of both worlds would be to use the recent
advancements in LLM tool usage to enable the creation of the bank of knowledge at run-time. In
particular, we envision a pipeline where the model, given the natural language prompt, is able to
actively scour the internet to extract the knowledge needed for a correct conversion. This could
be applied both for augmenting existing datasets and at inference time to help translate the
user’s intention into a query. This pipeline can also be easily merged with the proposed solution
in [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] to create an ever-evolving ”plugin hub” that is able to adapt to new terminologies,
concepts or requirements.
        </p>
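        <p>A high-level sketch of the envisioned pipeline follows (ours; every hook is a hypothetical placeholder for a web-search tool, an LLM distillation step, and the converter itself, rather than an existing API):</p>
        <preformat>
# High-level sketch of the envisioned run-time knowledge pipeline. All
# hooks (search, extract, generate) are hypothetical placeholders.
def text_to_sql_with_knowledge(question, schema, search, extract, generate,
                               knowledge_bank):
    pages = search(question)                # tool call: scour the web
    facts = extract(question, pages)        # distill field-specific facts
    knowledge_bank.extend(facts)            # the bank grows at run-time
    prompt = "\n".join([schema, *facts, question])
    return generate(prompt)                 # LLM produces the SQL query
        </preformat>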
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Conclusion and Future Perspective</title>
      <p>In this paper, we have shown that LLMs have the potential to bridge the gap between natural
language and SQL queries. However, this promise demands additional research to be truly
realised. While this new technology demonstrates an impressive ability to interpret and
translate natural language into structured queries, it comes with several significant challenges that
must be acknowledged. These include the need to effectively mitigate hallucinations, ensure
scalability for complex databases, reduce response times to practical levels, and develop
robust methods for integrating domain-specific knowledge. Constructing representative training
datasets is also paramount, ensuring the models can adapt to diverse linguistic expressions,
handle unanswerable queries, and reflect the nuances of real-world user interactions. By
systematically overcoming these hurdles, we can pave the way for truly intuitive and
accessible database interaction tools, fostering widespread data democratization and significantly
enhancing decision-making processes across various domains.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Acknowledgments</title>
      <p>This work was supported by the PNRR project Italian Strengthening of Esfri RI Resilience
(ITSERR) funded by the European Union – NextGenerationEU (CUP:B53C22001770006).</p>
      <p>[35] J. Herzig, J. Berant, Don’t paraphrase, detect! Rapid and effective data collection for
semantic parsing, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019
Conference on Empirical Methods in Natural Language Processing and the 9th
International Joint Conference on Natural Language Processing (EMNLP-IJCNLP),
Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3810–3820. URL:
https://aclanthology.org/D19-1394. doi:10.18653/v1/D19-1394.
[36] L. Dou, Y. Gao, X. Liu, M. Pan, D. Wang, W. Che, D. Zhan, M.-Y. Kan, J.-G. Lou,
Towards knowledge-intensive text-to-SQL semantic parsing with formulaic knowledge, 2023.
arXiv:2301.01067.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018). URL: https://arxiv.org/abs/1810.04805.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, arXiv preprint arXiv:2005.14165 (2020). URL: https://arxiv.org/abs/2005.14165.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1-67. URL: http://jmlr.org/papers/v21/20-074.html.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] C. Févotte, J. Idier, Algorithms for nonnegative matrix factorization with the β-divergence, Neural Computation 23 (2011). doi:10.1162/NECO_a_00168.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] S. Bergamaschi, F. Guerra, S. Rota, Y. Velegrakis, A hidden Markov model approach to keyword-based search over relational databases, in: Conceptual Modeling - ER 2011: 30th International Conference, ER 2011, Brussels, Belgium, October 31 - November 3, 2011. Proceedings 30, Springer, 2011, pp. 411-420.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] V. Zhong, C. Xiong, R. Socher, Seq2SQL: Generating structured queries from natural language using reinforcement learning, arXiv preprint arXiv:1709.00103 (2017).</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2023. arXiv:1706.03762.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] A. Roberts, C. Raffel, N. Shazeer, How much knowledge can you pack into the parameters of a language model?, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 5418-5426. URL: https://aclanthology.org/2020.emnlp-main.437. doi:10.18653/v1/2020.emnlp-main.437.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] J. Li, B. Hui, G. Qu, J. Yang, B. Li, B. Li, B. Wang, B. Qin, R. Geng, N. Huo, et al., Can LLM already serve as a database interface? A big bench for large-scale database grounded text-to-SQLs, Advances in Neural Information Processing Systems 36 (2024).</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] S. Chang, E. Fosler-Lussier, How to prompt LLMs for text-to-SQL: A study in zero-shot, single-domain, and cross-domain settings, 2023. arXiv:2305.11853.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, et al., Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task, arXiv preprint arXiv:1809.08887 (2018).</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] H. Li, J. Zhang, H. Liu, J. Fan, X. Zhang, J. Zhu, R. Wei, H. Pan, C. Li, H. Chen, CodeS: Towards building open-source language models for text-to-SQL, 2024. arXiv:2402.16347.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] M. Pourreza, D. Rafiei, DIN-SQL: Decomposed in-context learning of text-to-SQL with self-correction, 2023. arXiv:2304.11015.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] H. Li, J. Zhang, C. Li, H. Chen, RESDSQL: Decoupling schema linking and skeleton parsing for text-to-SQL, 2023. arXiv:2302.05965.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] S. Singh, Relational database market research report: Information by type (in-memory, disk-based, and others), by deployment (cloud-based, and on-premises), by end user (BFSI, IT &amp; telecom, retail &amp; e-commerce, manufacturing, healthcare, and others), and by region (North America, Europe, Asia-Pacific, and rest of the world) - market forecast till 2032, https://www.marketresearchfuture.com/reports/relational-database-market-18851, 2024. Accessed: 2024-03-02.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, 2023. arXiv:2201.11903.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. Le, E. Chi, Least-to-most prompting enables complex reasoning in large language models, 2023. arXiv:2205.10625.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, D. Zhou, Self-consistency improves chain of thought reasoning in language models, 2023. arXiv:2203.11171.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] Inference speed is the key to unleashing AI's potential, 2024. https://wow.groq.com/inference-speed-is-the-key-to-unleashing-ai-potential/ [Accessed: 17 Mar 2024].</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Pourreza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rafiei</surname>
          </string-name>
          ,
          <article-title>DIN-SQL: Decomposed in-context learning of text-to-sql with self-correction</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <article-title>Introducing the next generation of Claude</article-title>
          ,
          <year>2024</year>
          . https://www.anthropic.com/news/claude-3-family [Accessed: 13 Mar 2024].
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Reid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Savinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Teplyashin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lepikhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lillicrap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-B.</given-names>
            <surname>Alayrac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Soricut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lazaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Firat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schrittwieser</surname>
          </string-name>
          , et al.,
          <article-title>Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context</article-title>
          ,
          <source>arXiv preprint arXiv:2403.05530</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hewitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Paranjape</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bevilacqua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Lost in the middle: How language models use long contexts</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>12</volume>
          (
          <year>2024</year>
          )
          <fpage>157</fpage>
          -
          <lpage>173</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Suhr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shaw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Exploring unexplored generalization challenges for cross-database semantic parsing</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>8372</fpage>
          -
          <lpage>8388</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>FinSQL: Model-agnostic LLMs-based text-to-sql framework for financial analysis</article-title>
          , arXiv e-prints (
          <year>2024</year>
          ) arXiv-2401.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>E.</given-names>
            <surname>Fadeeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rubashevskii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shelmanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Petrakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mubarak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tsymbalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kuzmin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Baldwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Panov</surname>
          </string-name>
          ,
          <article-title>Fact-checking the output of large language models via token-level uncertainty quantification</article-title>
          ,
          <year>2024</year>
          . arXiv:2403.04696.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>C. T.</given-names>
            <surname>Hemphill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Godfrey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. R.</given-names>
            <surname>Doddington</surname>
          </string-name>
          ,
          <article-title>The ATIS spoken language systems pilot corpus</article-title>
          ,
          <source>in: Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990</source>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>C.</given-names>
            <surname>Finegan-Dollak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. K.</given-names>
            <surname>Kummerfeld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ramanathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sadasivam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Radev</surname>
          </string-name>
          ,
          <article-title>Improving text-to-sql evaluation methodology</article-title>
          ,
          <source>arXiv preprint arXiv:1806.09029</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>K.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wallace</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tomlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <article-title>Unfamiliar finetuning examples control how language models hallucinate</article-title>
          , arXiv e-prints (
          <year>2024</year>
          ) arXiv-2403.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>L.</given-names>
            <surname>Dou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Che</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-Y.</given-names>
            <surname>Kan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-G.</given-names>
            <surname>Lou</surname>
          </string-name>
          ,
          <article-title>Towards knowledge-intensive text-to-sql semantic parsing with formulaic knowledge</article-title>
          ,
          <source>arXiv preprint arXiv:2301.01067</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>A.</given-names>
            <surname>Webson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pavlick</surname>
          </string-name>
          ,
          <article-title>Do prompt-based models really understand the meaning of their prompts?</article-title>
          ,
          <year>2022</year>
          . arXiv:2109.01247.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>T. Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wallace</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Calibrate before use: Improving few-shot performance of language models</article-title>
          ,
          <year>2021</year>
          . arXiv:2102.09690.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>LoRA: Low-rank adaptation of large language models</article-title>
          ,
          <year>2021</year>
          . arXiv:2106.09685.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Berant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Building a semantic parser overnight</article-title>
          , in:
          <string-name>
            <given-names>C.</given-names>
            <surname>Zong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Strube</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing</source>
          (Volume 1: Long Papers)
          ,
          <source>Association for Computational Linguistics</source>
          , Beijing, China,
          <year>2015</year>
          , pp.
          <fpage>1332</fpage>
          -
          <lpage>1342</lpage>
          . URL: https://aclanthology.org/P15-1129. doi:10.3115/v1/P15-1129.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>