Text-to-SQL with Large Language Models: Exploring the Promise and Pitfalls

Luca Sala1,∗,†, Giovanni Sullutrone1,† and Sonia Bergamaschi1,†
1 University of Modena and Reggio Emilia, UNIMORE

Abstract
The emergence of Large Language Models (LLMs) represents a fundamental change in the ever-evolving field of natural language processing (NLP). Over the past few years, the enhanced capabilities of these models have led to their widespread use across various fields, in both practical applications and research contexts. In particular, as data science intersects with LLMs, new research opportunities and insights emerge, notably in translating text into Structured Query Language (Text-to-SQL). The application of this technology to this task poses a unique set of opportunities and related issues with significant implications for information retrieval. This discussion paper delves into these intricacies and limitations, focusing on challenges that jeopardise efficacy and reliability. It investigates scalability, accuracy, and the concerning issue of hallucinated responses, questioning the trustworthiness of LLMs. Furthermore, we point out the limits of current test datasets, created for research purposes, in capturing real-world complexities. Finally, we consider the performance of Text-to-SQL with LLMs from different perspectives. Our investigation identifies the key challenges faced by LLMs and proposes viable solutions to facilitate the exploitation of these models to advance data retrieval, bridging the gap between academic research and real-world application scenarios.

Keywords
Large Language Models, Text-to-SQL, Relational Databases, SQL

1. Introduction

In recent years, natural language processing (NLP) has been fundamentally changed by the rise of Large Language Models (LLMs).
Models like BERT (Bidirectional Encoder Representations from Transformers) [1] and GPT (Generative Pretrained Transformer) [2], trained on massive corpora of written data, have shown impressive capabilities in grasping semantic relationships and solving complex tasks. This has made them powerful tools for human-computer interaction, which requires extensive semantic and domain knowledge to meet the demands of real-world applications. In particular, their ability to interpret natural language requests and translate them into executable SQL statements has the potential to revolutionize database querying, from bridging the gap between complex database systems and end-users to making data-driven insights more accessible to a broader audience.

SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy
∗ Corresponding author.
† These authors contributed equally.
luca.sala@unimore.it (L. Sala); giovanni.sullutrone@unimore.it (G. Sullutrone); sonia.bergamaschi@unimore.it (S. Bergamaschi)
ORCID: 0000-0002-4833-8882 (L. Sala); 0009-0006-5556-1827 (G. Sullutrone); 0000-0001-8087-6587 (S. Bergamaschi)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

Furthermore, the impressive adaptability and learning capabilities of LLMs offer the promise of continuous improvement in query understanding and processing. As these models are exposed to more domain-specific data, their effectiveness in handling queries in various specialized fields, from healthcare to financial services, is expected to improve [3]. This not only enhances the accuracy of query conversion but also opens up possibilities for personalized database interactions, where the model adjusts to the user's language and query patterns.
This paper explores the current limitations of Text-to-SQL systems powered by LLMs. It focuses on potential pitfalls and readily applicable solutions to improve performance for real-world use cases. The structure is as follows: Section 2 provides background on Text-to-SQL with LLMs; Section 3 addresses challenges, limitations, and solutions; followed by conclusions and future perspectives.

2. Background

2.1. Text-to-SQL

The inherent complexity of the Text-to-SQL task comes from the fundamental differences between natural language and SQL. Natural language is characterized by ambiguity, flexibility, and implicit context, whereas SQL adheres to a strict, formal syntax and requires explicit representation of relationships within a database schema. Early approaches relied heavily on handcrafted rules and grammars [4, 5], leading to systems that were difficult to generalize to new domains. With the rise of machine learning, new techniques started to take shape, employing elements like sequence-to-sequence models to learn the mapping between natural language and SQL, showing improved robustness compared to previous ones [6].

Since LLMs are neural network-based models pre-trained on massive text corpora, enabling them to capture rich linguistic patterns and world knowledge, their advent has further revolutionized the Text-to-SQL field. Key to their success is the Transformer architecture, which excels at processing sequential data and modeling long-range dependencies [7]. Their pre-training process exposes them to diverse language usage and domain knowledge [8] that can be readily made available to convert natural language into queries. Furthermore, LLMs can effectively model the logical structure of SQL, handling complex elements like nested structures and aggregations [9].
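To make the task concrete, consider a minimal, hypothetical example (the toy schema, data, and question below are invented for illustration): a single natural language question can require both an aggregation and a nested subquery in the generated SQL.

```python
import sqlite3

# Toy schema and data, invented purely to illustrate the Text-to-SQL task.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees(id INTEGER, name TEXT, dept TEXT, salary REAL);
INSERT INTO employees VALUES
  (1, 'Ada', 'R&D', 90000), (2, 'Bob', 'R&D', 60000), (3, 'Eve', 'Sales', 50000);
""")

# The natural language question and one SQL translation of it: answering
# the question needs an aggregation (AVG) inside a nested subquery.
question = "Which employees earn more than the average salary?"
sql = """
SELECT name FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
"""
print([r[0] for r in conn.execute(sql)])  # prints ['Ada']
```

A Text-to-SQL system must produce the `sql` string above given only `question` and the schema, which is precisely where the logical-structure modeling ability of LLMs becomes relevant.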
Notably, these models display potential for zero-shot or few-shot learning in Text-to-SQL, suggesting they can generate SQL queries for new database schemas with minimal or even no additional fine-tuning, thus increasing their adaptability [10]. The integration of LLMs with Text-to-SQL is currently a thriving area of research. Benchmarks like WikiSQL [6], Spider [11], and BIRD [9] play a crucial role in driving progress and providing standard evaluation metrics. These datasets consist of paired natural language questions and corresponding SQL translations across various domains. Diverse strategies have been explored to harness the power of this technology. Among them, [12] used an incremental pre-training procedure and fine-tuning on task-specific labeled data. Additionally, interest has been placed in in-context learning (ICL) [2], where LLMs are prompted with natural language instructions, examples, and carefully engineered input sequences to generate the SQL output [13]. Finally, researchers are exploring hybrid approaches that combine the strengths of LLMs with decoding constraints or intermediate representations to enhance the structure and controllability of the generated SQL queries [13, 14].

2.2. The need for Text-to-SQL

Relational databases, characterized by their efficient, structured, and reliable data management capabilities, have been instrumental in supporting transactional data storage and critical business operations for decades. In 2022, the market value of relational databases was an impressive USD 55.9 billion, with forecasts predicting growth to USD 161.4 billion by 2032, a compound annual growth rate (CAGR) of 12.50% [15]. This substantial growth underscores the continuing reliance on relational databases in the digital age and highlights the increasing amount of data being processed and stored. However, accessing and analyzing this vast reservoir of data poses a challenge, particularly for non-experts.
The traditional method of interacting with databases through structured query languages such as SQL requires a deep understanding of database schemas and precise command syntax. Through NL querying, users can communicate with databases in plain text, bypassing the need to master complex query languages. Integrating Text-to-SQL capabilities into data management systems can therefore significantly accelerate the data exploration process, enabling faster decision-making and insight discovery. It allows users to ask iterative questions, refine their queries based on previous results, and explore data relationships and patterns without the bottleneck of formulating precise SQL queries. In summary, the need for Text-to-SQL technologies is driven by the growing complexity and volume of data stored in relational databases and the necessity to make this data accessible to a wider audience. As such, investing in and developing these technologies is crucial for organizations aiming to stay competitive in the data-driven landscape of the 21st century.

3. Addressing Challenges and Limitations

3.1. Response Time and Performance

In the realm of database interaction, response time, the time elapsed before receiving a query result, plays a vital role in ensuring smooth operation and a seamless user experience. The introduction of LLMs for query generation shifts our perspective on these metrics, placing emphasis on their inference¹ speed as they act as an additional translation layer between the user's requests and the extracted data. Understanding response time from this perspective requires a nuanced look at factors like Time to First Token (TTFT), which indicates the model's initial responsiveness, and Time Per Output Token (TPOT), which determines how efficiently it generates subsequent parts of the query. Together, TTFT and TPOT give us latency, a measure of the total time needed to produce a complete response or, in our case, the converted SQL query.
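Under the simplifying assumption of strictly sequential token generation, the relation between these quantities can be sketched as follows (the timing figures in the example are illustrative, not measured):

```python
def latency_seconds(ttft: float, tpot: float, n_output_tokens: int) -> float:
    """Total time to produce a complete response: the first output token
    costs TTFT, and each subsequent token costs TPOT."""
    if n_output_tokens < 1:
        return 0.0
    return ttft + tpot * (n_output_tokens - 1)

# Illustrative figures: 0.4 s to first token, 30 ms per output token,
# and a generated SQL query 120 tokens long.
print(round(latency_seconds(0.4, 0.03, 120), 2))  # prints 3.97
```

The sketch makes the trade-off explicit: for short SQL outputs, TTFT dominates perceived responsiveness, while for long or multi-step generations TPOT governs the total wait.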
Throughput, on the other hand, quantifies the server's ability to produce output tokens across multiple requests. While these metrics offer valuable insights, it is important to acknowledge that the hardware used to deploy any LLM has the biggest impact on these factors, making them highly susceptible to the specific context of application.

¹ Inference refers to the process of getting a response from the trained LLM for the user's query or prompts.

A significant gap in current research is the lack of direct comparisons between the time it takes a language model to create a query and the time it takes a human to do the same task. This obscurity hinders our understanding of the potential advantages this technology offers. User expectations have been shaped by the immediate feedback search engines provide; it follows naturally that benchmarks should also account for the desire for fast responses. Evaluating Text-to-SQL performance extends beyond the raw capabilities of the LLM; the methods used for assessment play a decisive role. Benchmarks like Spider [11] offer insufficient analysis of how models compare to human performance on this task. The BIRD benchmark [9] partially addresses this shortcoming by incorporating human ratings but omits crucial elements such as the number of attempts and the time required for humans to write valid SQL queries. Incorporating these measurements would enable a more in-depth comparison between model and human efficiency. As database complexity grows, the interplay between response time and performance becomes even more critical. Maintaining responsiveness without compromising reliability demands advanced techniques. Ironically, methods designed to improve LLM accuracy can sometimes worsen response time.
The Chain of Thought (CoT) approach [16], for example, helps tackle complex queries by breaking them into sub-problems, while techniques like Least-to-Most [17] and Self-Consistency [18] involve repeated questioning to gain clarity and improve precision. Although beneficial for complex queries, this subdivision into steps introduces variability into both the computational resources needed and the overall time taken to generate a response. This presents a challenge in ensuring predictability and efficiency. One possible workaround is to use specialized inference engines like the Language Processing Unit (LPU) introduced by Groq [19], which shows 3-18x improvements in output token throughput compared to traditional providers. Furthermore, it guarantees a consistent Time to First Token, drastically reducing the variability of responses. Balancing the benefits of advanced LLM techniques with the need for predictable and efficient database interactions remains a critical area for ongoing research and development in the field of NLP and database management.

3.2. Scalability

The rapid expansion of available data and the increasing complexity of databases present significant challenges for applying LLMs to the task of Text-to-SQL. Current models struggle with large databases and real-world datasets that often contain inconsistencies or 'noisy' values [9]. Additionally, the inherent complexity of databases, combined with the limited context window, which determines how much information the model can hold in memory, can lead to significant compression of the prompt, hindering its understanding of the underlying data structure. Current methodologies, in fact, base the pre-trained model's grounding on two main elements: schema linking and example value sampling. Schema linking identifies references to database elements (tables, columns, etc.) within the natural language query to be added to the prompt [20].
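A minimal sketch of this grounding step, assuming a sqlite3 database and table names already selected by schema linking (the helper name, schema, and prompt wording are illustrative, not taken from the cited systems): each linked table's column list, plus a few sampled rows, is serialized into the prompt before the question.

```python
import sqlite3

def grounding_prompt(conn, question, linked_tables, n_samples=3):
    """Build the grounded portion of a Text-to-SQL prompt: for every table
    selected by schema linking, emit its column list plus a few sampled
    rows (example value sampling), then append the user's question."""
    parts = []
    for table in linked_tables:
        cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
        parts.append(f"Table {table}({', '.join(cols)})")
        for row in conn.execute(f"SELECT * FROM {table} LIMIT ?", (n_samples,)):
            parts.append(f"  e.g. {row}")
    parts.append(f"Question: {question}")
    parts.append("SQL:")
    return "\n".join(parts)

# Hypothetical single-table example.
conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE orders(id INTEGER, customer TEXT, total REAL);"
    "INSERT INTO orders VALUES (1, 'acme', 12.5), (2, 'globex', 7.0);"
)
print(grounding_prompt(conn, "Who placed the largest order?", ["orders"]))
```

The sketch also makes the scaling problem below tangible: every additional linked table and sampled row consumes context-window tokens, so larger schemas force stricter selection.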
As databases scale, queries may reference a broader range of tables, making schema linking more difficult and forcing a stricter selection, impacting overall performance [9, 20]. Value sampling aims to provide the LLM with representative examples from the linked tables [10]. However, with larger tables, these samples may not adequately reflect the full distribution of data, potentially misleading the LLM. Fortunately, the ongoing evolution of these models suggests that scalability issues may be addressed intrinsically as the models improve. Starting from early models like GPT-3.5-turbo, with a context window of 4096 tokens, and GPT-4, with 8192 tokens, significant progress has been made with GPT-3.5-turbo-16k-0613 and GPT-4-32k-0613, whose limits increased to 16384 and 32768 tokens, respectively. Two of today's most advanced models, Claude 3 [21] and Gemini 1.5 Pro [22], offer even more impressive context windows: up to 200,000 tokens for the former and up to 1 million tokens for the latter. A potential drawback for long-context models, however, is a performance drop for information in specific positions of their memory, which could result in a loss of task-essential information. It has been observed that performance is often highest when relevant information is located at the beginning or end of the input context, while it degrades significantly otherwise [23]. However, the most recent models claim to have mitigated the problem. Gemini 1.5 Pro achieves near-perfect (>99%) recall across multiple millions of tokens in all modalities, i.e., text, video, and audio, and even maintains this recall performance when extending to 10M tokens in all three modalities [22].
Additionally, Claude 3 Opus not only achieved near-perfect recall, surpassing 99% accuracy, but in some cases even identified the limitations of the evaluation itself, recognizing that the "needle" sentence used to test information retrieval capability appeared to be artificially inserted into the original text by a human [21].

3.3. Hallucinations

The term "hallucinations", in the context of LLMs, refers to instances where the model generates inaccurate or misleading information. This phenomenon can arise due to various factors, such as the inherent complexities of natural language, biases within the training data, and limitations of the model itself. Hallucinations represent a challenge in the field of Text-to-SQL, where accuracy and precision with respect to the underlying database and its schema are paramount. Within these systems, hallucinations manifest when the LLM fabricates incorrect assumptions about the database structure or invents non-existent tables, columns, or data values. These hallucinations pose a serious threat to the model's performance and reliability, as they can lead to SQL queries that are either invalid or generate incorrect results. Researchers have observed that hallucinations involving the creation of fictional table data are a particularly prevalent issue in large-scale databases [9]. Even when schema linking techniques are employed to align the generated query with the structure of the target database, these problems persist. Mitigating hallucinations is an active area of research that has seen various interesting proposals. Recent solutions include techniques like response selectors that use beam search² to choose executable SQL queries as the final answer [24, 14].

² A decoding strategy that, instead of selecting only the single most likely word at each step, keeps track of multiple likely sequences.
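A simplified, execution-guided version of such a response selector can be sketched as follows. It assumes candidate queries have already been produced by some decoding strategy (beam search, sampling, etc.); invalid candidates are discarded by attempting execution, and a majority vote over result sets picks the final answer. This is a sketch inspired by the cited selectors, not their exact algorithm.

```python
import sqlite3
from collections import Counter

def select_response(candidates, conn):
    """Keep only candidate SQL strings that actually execute against the
    database, then return the candidate whose result set occurs most often."""
    executed = []
    for sql in candidates:
        try:
            rows = tuple(conn.execute(sql).fetchall())
        except sqlite3.Error:
            continue  # discard invalid / hallucinated queries
        executed.append((sql, rows))
    if not executed:
        return None
    majority, _ = Counter(rows for _, rows in executed).most_common(1)[0]
    # Return the first candidate that produced the majority result set.
    return next(sql for sql, rows in executed if rows == majority)

# Hypothetical toy database and candidate pool.
conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE users(id INTEGER, age INTEGER);"
    "INSERT INTO users VALUES (1, 17), (2, 42);"
)
candidates = [
    "SELECT id FROM users WHERE age > 18",       # valid
    "SELECT id FROM users WHERE age >= 18",      # valid, same result here
    "SELECT id FROM user_table WHERE age > 18",  # hallucinated table name
]
print(select_response(candidates, conn))  # prints the first valid query
```

Note that execution alone filters out only invalid queries; the majority vote is what guards against candidates that run but compute the wrong thing.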
Another technique is to use an output calibration step that encompasses, among others, a fuzzy search to find the closest matching columns to potentially resolve invalid ones [25]. A new avenue of research, however, is the use of Uncertainty Quantification (UQ) to assess the confidence of an LLM's generated output, as UQ methods can assign confidence scores to different parts of the model's output. [26] shows empirically that UQ techniques allow relatively inexpensive fact-checking. This could have a twofold application: to highlight possibly hallucinated terms in the converted query for the user to change, or as additional information for a self-correction procedure of the model itself.

3.4. Dataset Representativity

In the realm of NLP and database query creation, various datasets and benchmarks have been developed to fill the gap between human language and structured database queries. Among these are the ATIS [27] and GEO [28] datasets, which contain fewer than 500 unique SQL queries. On the other hand, WikiSQL [6] includes a larger number of queries and significantly larger tables, but it only covers basic queries. Spider [11] aims to address the limitations of WikiSQL by incorporating more complex, multi-table queries and a broader diversity of SQL queries, thus improving the ability of models to understand and generate intricate SQL commands from natural language inputs. Following Spider, BIRD [9] aims to further advance this domain by focusing on being more realistic, collecting data from real-world scenarios while retaining all the complexity and variability of such data. However, BIRD is not without its limitations. Firstly, it exhibits bias in the generation of NL questions, primarily due to the presumed background knowledge of the user regarding the database's structure and terminology.
This assumption can lead to a gap between ad-hoc and real-world query formulations, as a typical user may not recall specific details about the database or might use incorrect terms. Including non-experts in the creation of NL questions or limiting their schema knowledge are two potential ways to mitigate these biases. This approach may guarantee that the generated queries more closely represent those of a larger user base. Secondly, in BIRD, tables or fields that are inaccessible due to user privileges, or simply absent, are not explored, raising concerns about its practicality in real-world scenarios. One way to better capture real-world facets is by intentionally including non-implementable queries. This intentional introduction of real-world imperfections would enable more robust testing. To this end, we suggest introducing one promising strategy proposed in [29]. Applied to the Text-to-SQL field, this would entail fine-tuning a model on a dataset where such queries are intentionally tagged with an "I don't know" response. This approach encourages models to recognize the limits of their ability and avoid the tendency to "hallucinate" solutions that violate database constraints or permissions. The key insight is that a model capable of acknowledging its limitations is likely to be far more valuable in a practical setting than one that produces incorrect or misleading results. Furthermore, existing Text-to-SQL datasets and benchmarks often underutilize the vast knowledge and contextual understanding capabilities of LLMs. While LLMs excel at incorporating domain knowledge, datasets currently lack queries designed to test these abilities. Consider that non-expert users may naturally create questions incorporating cultural references (e.g., "list movies released in the year of the dragon") or requiring the translation of colloquial terms into precise expressions (e.g., "show me sales figures for the summer months"). This gap represents a significant missed opportunity.

3.5.
Knowledge Acquisition Methods

For accurate Text-to-SQL conversion in professional settings, models must incorporate field-specific linguistic, domain, and mathematical knowledge [30]. The first enables the model to deal with terminology that may differ between the question and the underlying schema, the second allows the conversion of domain-specific concepts, and the last provides the implicit mathematical or SQL operations needed to solve complex requests. Current solutions utilize either fine-tuning (FT) or in-context learning (ICL). Fine-tuning is the more traditional approach for adapting LLMs to specific tasks. It involves updating a pre-trained model's weights through gradient descent using a related labeled dataset. ICL, on the other hand, guides model behavior without weight updates by providing input-output pairs within the prompt itself, demonstrating the desired response for the task. Both methodologies have intrinsic cost considerations. FT, despite the improved data efficiency afforded by pre-trained weights, still needs a non-negligible amount of high-quality labeled data to work correctly, and it yields a model specialized on the specific task at hand, hindering its use for multiple concurrent downstream tasks. Furthermore, the high computational expenditure of tuning an LLM cannot be ignored. This methodology, however, provides a clear view of the costs, since they are limited to the additional training phase. ICL, instead, has the drawback of processing the additional examples at each execution, increasing memory usage and time to first token; the resulting performance lags behind fine-tuning [2] and is highly sensitive to wording [31] and pair ordering [32]. The retrieval of relevant examples from a database must also be accounted for in the resource consumption. This combination of elements makes the long-term costs and effectiveness of in-context learning more opaque.
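The ICL setup described above amounts to careful prompt assembly; a minimal sketch follows (the instruction wording, schema serialization, and demonstration pairs are all illustrative):

```python
def icl_prompt(schema: str, demos: list[tuple[str, str]], question: str) -> str:
    """Assemble an in-context-learning prompt: task instruction, schema,
    demonstration input-output pairs, and finally the new question."""
    lines = [
        "Translate the question into SQL for the given schema.",
        f"Schema: {schema}",
        "",
    ]
    for q, sql in demos:
        lines += [f"Q: {q}", f"SQL: {sql}", ""]
    lines += [f"Q: {question}", "SQL:"]
    return "\n".join(lines)

demos = [("How many customers are there?", "SELECT COUNT(*) FROM customers;")]
prompt = icl_prompt("customers(id, name, country)", demos,
                    "How many customers are from Italy?")
print(prompt)
```

Because ICL is sensitive to demonstration ordering [32], even the order of `demos` in this sketch is a tunable choice, and every pair included is re-processed on every call, which is exactly the recurring cost discussed above.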
Recently, a third option has been proposed. [25] introduced the use of Parameter-Efficient Fine-Tuning (PEFT), specifically Low-Rank Adaptation (LoRA) [33], to create a model-agnostic framework that efficiently adapts pre-trained models to the task at hand by changing only a small number of parameters. Additionally, to overcome the limits of fine-tuning for multiple domains, a "plugin hub" has been introduced, enabling both the hot-swapping of specialized weights to tackle different databases and the creation of new plugins (i.e., weights) by merging field-related ones [25]. Regardless of the chosen methodology, a critical challenge lies in efficiently acquiring and providing the necessary field-specific knowledge to the model. In [34, 35], different examples are annotated and used as a fine-tuning source. This, however, has high generation costs, since expert human annotators are needed to instill a diverse and accurate understanding into the data and, therefore, the model. [36] tries to solve this by utilizing publicly available resources to retrieve relevant field information. This "bank" of knowledge is then used to guide the model towards the correct schema linking and conversion. The proposed methodology does mitigate the initial cost, but the bank, without constant updates or improvements, can miss useful data or lag behind fast-evolving fields. Another issue is that, without careful filtering during the setup of the knowledge archive, the extraction process may generate noisy or conflicting information, with a negative impact on subsequent retrieval operations. One possible solution to obtain the best of both worlds would be to use recent advancements in LLM tool usage to enable the creation of the bank of knowledge at run time. In particular, we envision a pipeline where the model, given the natural language prompt, is able to actively scour the internet to extract the knowledge needed for a correct conversion.
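The low-rank adaptation at the core of such swappable plugins can be illustrated with a toy NumPy forward pass (dimensions, scaling, and initialization follow the general LoRA formulation; this is a sketch, not the framework of [25]):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_lora(d_out: int, d_in: int, r: int = 4):
    """LoRA initialization: A is small random, B is zero, so the adapter
    starts as an exact no-op (W + BA == W)."""
    return rng.normal(scale=0.01, size=(r, d_in)), np.zeros((d_out, r))

def lora_forward(x, W, A, B, alpha: float = 16, r: int = 4):
    """Forward pass through a frozen weight W augmented by the trainable
    low-rank update (alpha / r) * B @ A -- only A and B are fine-tuned."""
    return x @ (W + (alpha / r) * (B @ A)).T

W = rng.normal(size=(8, 16))  # frozen pre-trained weight (toy dimensions)
A, B = init_lora(8, 16)       # adapter: r*(d_in + d_out) trainable params,
                              # far fewer than d_out*d_in at realistic sizes
x = rng.normal(size=(2, 16))

# With B initialized to zero the adapter leaves the model unchanged.
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)
```

Hot-swapping a domain's `(A, B)` pair while keeping `W` frozen is, in miniature, the plugin mechanism: one base model, many lightweight domain adapters.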
Such a pipeline could be applied both to augment existing datasets and at inference time, to help translate the user's intention into a query. It can also easily be merged with the solution proposed in [25] to create an ever-evolving "plugin hub" able to adapt to new terminologies, concepts, or requirements.

4. Conclusion and Future Perspective

In this paper, we have shown that LLMs have the potential to bridge the gap between natural language and SQL queries. However, this promise demands additional research to be truly realised. While this new technology demonstrates an impressive ability to interpret and translate natural language into structured queries, it comes with several significant challenges that must be acknowledged. These include the need to effectively mitigate hallucinations, ensure scalability for complex databases, reduce response times to practical levels, and develop robust methods for integrating domain-specific knowledge. Constructing representative training datasets is also paramount, ensuring the models can adapt to diverse linguistic expressions, handle unanswerable queries, and reflect the nuances of real-world user interactions. By systematically overcoming these hurdles, we can pave the way for truly intuitive and accessible database interaction tools, fostering widespread data democratization and significantly enhancing decision-making processes across various domains.

5. Acknowledgments

This work was supported by the PNRR project Italian Strengthening of Esfri RI Resilience (ITSERR) funded by the European Union – NextGenerationEU (CUP: B53C22001770006).

References

[1] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018). URL: https://arxiv.org/abs/1810.04805.
[2] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A.
Askell, et al., Language models are few-shot learners, arXiv preprint arXiv:2005.14165 (2020). URL: https://arxiv.org/abs/2005.14165.
[3] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67. URL: http://jmlr.org/papers/v21/20-074.html.
[4] C. Févotte, J. Idier, Algorithms for nonnegative matrix factorization with the β-divergence, Neural Computation 23 (2011). doi:10.1162/NECO_a_00168.
[5] S. Bergamaschi, F. Guerra, S. Rota, Y. Velegrakis, A hidden markov model approach to keyword-based search over relational databases, in: Conceptual Modeling–ER 2011: 30th International Conference, ER 2011, Brussels, Belgium, October 31-November 3, 2011. Proceedings 30, Springer, 2011, pp. 411–420.
[6] V. Zhong, C. Xiong, R. Socher, Seq2sql: Generating structured queries from natural language using reinforcement learning, arXiv preprint arXiv:1709.00103 (2017).
[7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2023. arXiv:1706.03762.
[8] A. Roberts, C. Raffel, N. Shazeer, How much knowledge can you pack into the parameters of a language model?, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 5418–5426. URL: https://aclanthology.org/2020.emnlp-main.437. doi:10.18653/v1/2020.emnlp-main.437.
[9] J. Li, B. Hui, G. Qu, J. Yang, B. Li, B. Li, B. Wang, B. Qin, R. Geng, N. Huo, et al., Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls, Advances in Neural Information Processing Systems 36 (2024).
[10] S. Chang, E.
Fosler-Lussier, How to prompt llms for text-to-sql: A study in zero-shot, single-domain, and cross-domain settings, 2023. arXiv:2305.11853.
[11] T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, et al., Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task, arXiv preprint arXiv:1809.08887 (2018).
[12] H. Li, J. Zhang, H. Liu, J. Fan, X. Zhang, J. Zhu, R. Wei, H. Pan, C. Li, H. Chen, Codes: Towards building open-source language models for text-to-sql, 2024. arXiv:2402.16347.
[13] M. Pourreza, D. Rafiei, Din-sql: Decomposed in-context learning of text-to-sql with self-correction, 2023. arXiv:2304.11015.
[14] H. Li, J. Zhang, C. Li, H. Chen, Resdsql: Decoupling schema linking and skeleton parsing for text-to-sql, 2023. arXiv:2302.05965.
[15] S. Singh, Relational database market research report: Information by type (in-memory, disk-based, and others), by deployment (cloud-based, and on-premises), by end user (bfsi, it & telecom, retail & e-commerce, manufacturing, healthcare, and others), and by region (north america, europe, asia-pacific, and rest of the world) – market forecast till 2032, https://www.marketresearchfuture.com/reports/relational-database-market-18851, 2024. Accessed: 2024-03-02.
[16] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, 2023. arXiv:2201.11903.
[17] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. Le, E. Chi, Least-to-most prompting enables complex reasoning in large language models, 2023. arXiv:2205.10625.
[18] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, D. Zhou, Self-consistency improves chain of thought reasoning in language models, 2023. arXiv:2203.11171.
[19] Inference speed is the key to unleashing ai's potential, 2024.
https://wow.groq.com/inference-speed-is-the-key-to-unleashing-ai-potential/ [Accessed: 17 Mar 2024].
[20] M. Pourreza, D. Rafiei, Din-sql: Decomposed in-context learning of text-to-sql with self-correction, Advances in Neural Information Processing Systems 36 (2024).
[21] Introducing the next generation of claude, 2024. https://www.anthropic.com/news/claude-3-family [Accessed: 13 Mar 2024].
[22] M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-B. Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser, et al., Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, arXiv preprint arXiv:2403.05530 (2024).
[23] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, P. Liang, Lost in the middle: How language models use long contexts, Transactions of the Association for Computational Linguistics 12 (2024) 157–173.
[24] A. Suhr, M.-W. Chang, P. Shaw, K. Lee, Exploring unexplored generalization challenges for cross-database semantic parsing, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 8372–8388.
[25] C. Zhang, Y. Mao, Y. Fan, Y. Mi, Y. Gao, L. Chen, D. Lou, J. Lin, Finsql: Model-agnostic llms-based text-to-sql framework for financial analysis, arXiv e-prints (2024) arXiv–2401.
[26] E. Fadeeva, A. Rubashevskii, A. Shelmanov, S. Petrakov, H. Li, H. Mubarak, E. Tsymbalov, G. Kuzmin, A. Panchenko, T. Baldwin, P. Nakov, M. Panov, Fact-checking the output of large language models via token-level uncertainty quantification, 2024. arXiv:2403.04696.
[27] C. T. Hemphill, J. J. Godfrey, G. R. Doddington, The atis spoken language systems pilot corpus, in: Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990, 1990.
[28] C. Finegan-Dollak, J. K. Kummerfeld, L. Zhang, K. Ramanathan, S. Sadasivam, R. Zhang, D.
Radev, Improving text-to-sql evaluation methodology, arXiv preprint arXiv:1806.09029 (2018).
[29] K. Kang, E. Wallace, C. Tomlin, A. Kumar, S. Levine, Unfamiliar finetuning examples control how language models hallucinate, arXiv e-prints (2024) arXiv–2403.
[30] L. Dou, Y. Gao, X. Liu, M. Pan, D. Wang, W. Che, D. Zhan, M.-Y. Kan, J.-G. Lou, Towards knowledge-intensive text-to-sql semantic parsing with formulaic knowledge, arXiv preprint arXiv:2301.01067 (2023).
[31] A. Webson, E. Pavlick, Do prompt-based models really understand the meaning of their prompts?, 2022. arXiv:2109.01247.
[32] T. Z. Zhao, E. Wallace, S. Feng, D. Klein, S. Singh, Calibrate before use: Improving few-shot performance of language models, 2021. arXiv:2102.09690.
[33] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, Lora: Low-rank adaptation of large language models, 2021. arXiv:2106.09685.
[34] Y. Wang, J. Berant, P. Liang, Building a semantic parser overnight, in: C. Zong, M. Strube (Eds.), Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Beijing, China, 2015, pp. 1332–1342. URL: https://aclanthology.org/P15-1129. doi:10.3115/v1/P15-1129.
[35] J. Herzig, J. Berant, Don't paraphrase, detect! rapid and effective data collection for semantic parsing, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3810–3820. URL: https://aclanthology.org/D19-1394. doi:10.18653/v1/D19-1394.
[36] L. Dou, Y. Gao, X. Liu, M. Pan, D. Wang, W. Che, D. Zhan, M.-Y. Kan, J.-G.
Lou, Towards knowledge-intensive text-to-sql semantic parsing with formulaic knowledge, 2023. arXiv:2301.01067.