Scalable Database-Driven KGs can help Text-to-SQL

Zhongqiu Li1,†, Zhenhe Wu1,2,‡,†, Mengxiang Li3, Zhongjiang He1, Ruiyu Fang1, Jie Zhang1, Yu Zhao1, Yongxiang Li1, Zhoujun Li2,* and Shuangyong Song1,*

1 Institute of Artificial Intelligence (TeleAI), China Telecom Corp Ltd, Xicheng District, Beijing, China
2 BeiHang University, No.37 XueYuan Road, HaiDian District, Beijing, China
3 China Telecom Corporation, No.31 Financial Street, Xicheng District, Beijing, China

Abstract
The text-to-SQL task aims to convert natural language questions into SQL queries. Large Language Models (LLMs) have demonstrated remarkable performance on this task, relying on in-context learning or Supervised Fine-Tuning (SFT). However, the heterogeneity of databases and the complexity of the knowledge acquisition process pose significant challenges for previous works. To address these, we propose a novel text-to-SQL framework that enhances the performance of LLMs through Knowledge Graphs (KGs). We construct the KGs based on schemas, which are structured representations of the relationships and attributes within the databases. Then, we utilize LLMs to extract descriptions and dependencies from historical queries, which are used to complete contextual knowledge in the KGs. We leverage a retrieval model to recall beneficial nodes and edges from the KGs and then employ LLMs to generate task-specific evidence. Based on the evidence and the retrieved information, we define a unified KGs-based schema for LLMs to generate SQL queries. We conduct experiments on the public datasets BIRD and Spider, and the results indicate that our framework significantly improves text-to-SQL performance.

Keywords
Text-to-SQL, Large Language Models, Knowledge Graph, Knowledge Generation

1. Introduction
The text-to-SQL task aims to generate Structured Query Language (SQL) queries for databases from natural language questions [1, 2].
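As a toy illustration of the task (a minimal schema and data of our own, not drawn from the paper's datasets), the following shows a natural language question and the SQL query a text-to-SQL system would be expected to produce:

```python
import sqlite3

# Toy schema and data, loosely modeled on the paper's running example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE frpm (school TEXT, free_meal_count REAL, enrollment REAL)")
conn.executemany(
    "INSERT INTO frpm VALUES (?, ?, ?)",
    [("Foothill High", 120.0, 800.0), ("Hilltop Middle", 30.0, 600.0)],
)

# Question: "Which schools have an eligible free rate above 0.1?"
# A text-to-SQL system should map "eligible free rate" to the
# ratio of the two columns and produce a query such as:
sql = "SELECT school FROM frpm WHERE free_meal_count / enrollment > 0.1"
print([row[0] for row in conn.execute(sql)])  # -> ['Foothill High']
```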
The process enables non-expert data analysts to automatically extract desired information from widely accessible databases using natural language. A database schema is the structural representation of a database. It defines the organization of data storage, including the tables, columns, data types, relationships, and constraints within the database. In different databases, schemas also include descriptions, as well as domain dependencies related to columns. Advanced studies focus on in-context learning to guide LLMs to understand and generate SQL queries. These approaches typically provide questions and schemas to LLMs as context, while adapting domain knowledge through few-shot learning or Supervised Fine-Tuning (SFT).

Posters, Demos, and Industry Tracks at ISWC 2024, November 13–15, 2024, Baltimore, USA
‡ Work done while interning at the Institute of Artificial Intelligence (TeleAI), China Telecom Corp Ltd.
† These authors contributed equally.
* Corresponding authors.
lizq48@chinatelecom.cn (Z. Li); wuxs97@163.com (Z. Wu); limengx@126.com (M. Li); hezj@chinatelecom.cn (Z. He); fangry@chinatelecom.cn (R. Fang); zhangj157@chinatelecom.cn (J. Zhang); zhaoy11@chinatelecom.cn (Y. Zhao); liyx25@chinatelecom.cn (Y. Li); lizj@buaa.edu.cn (Z. Li); songshy@chinatelecom.cn (S. Song)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

Figure 1: Framework of the Scalable Database-Driven KGs for Text-to-SQL. Schema, Data storage, and Query Views are fundamental components of a database. First, we use LLMs to construct KGs from schema and data values. Then, we retrieve knowledge that is closely related to the user question and generate evidence. Finally, the selected knowledge is used to complete the SQL generation.

However, practical applications encounter several challenges. Firstly, database structures and application scenarios differ, leading to significant variations in data styles that complicate schema representation. Secondly, SQL queries often involve large tables with an excessive number of columns, exceeding the typical context window of LLMs. Our paper converts schemas into nodes and edges within Knowledge Graphs (KGs). To enrich these KGs, we use LLMs to analyze database interaction history, extracting detailed descriptions and dependencies to complete the KGs. We then retrieve nodes and edges closely related to user queries from these KGs, generate relevant evidence, and use this information as context for generating precise SQL queries. Li et al. [3] discover the critical importance of supplementary evidence when handling complex database values. Their work explicitly defines "evidence" for the first time and annotates a standard dataset with evidence, namely BIRD.
Evidence typically consists of concise statements that link entity mentions in natural language to functions, tables, columns, and constraints in the database. It acts as an intermediary between user questions and the database schema, substantially improving models' understanding of the schema. However, the evidence they provide is manually annotated, which limits its applicability to other datasets; across most databases, evidence remains extremely scarce. To address this deficiency, we utilize LLMs to generate evidence through a KGs-based schema, which proves highly effective for SQL generation.

2. Our Approach
We introduce a scalable KGs-based text-to-SQL framework. As illustrated in Figure 1, our framework comprises four stages. The first stage is KGs construction, where we design two types of nodes and three types of predefined relationships based on database components. The second stage is knowledge completion, where we utilize LLMs to extract implicit knowledge from pairs of historical user questions and SQL queries, in order to expand the attributes and dependencies in the KGs. The third stage is knowledge selection and generation: we design a retrieval model to identify and select the most relevant knowledge from the expanded KGs, and then use this knowledge to generate evidence and context for SQL query formulation. The last stage is SQL generation, where we combine the user question and the previously obtained schema knowledge as context and use LLMs to generate SQL queries with a least-to-most method.

Knowledge Graph Construction. Our approach constructs the KGs based on tabular structures and columns. The graph mainly consists of two key entity types, i.e., Table and Column. The former is defined by the attributes "Table Name" and "Table Description". The latter is defined by the attributes "Column Name", "Column Description", "Data Type", "Example Value", and "Value Explanation".
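A minimal sketch of this construction step might look as follows. The node attributes follow the paper, but the dataclass layout and the SQLite introspection calls are our own illustrative assumptions; description fields are left empty here because they are filled in by the later LLM-based completion stage.

```python
import sqlite3
from dataclasses import dataclass, field

@dataclass
class ColumnNode:
    name: str
    data_type: str
    description: str = ""        # filled later by LLM-based knowledge completion
    example_values: list = field(default_factory=list)
    value_explanation: str = ""  # filled later by LLM-based knowledge completion

@dataclass
class TableNode:
    name: str
    description: str = ""
    columns: list = field(default_factory=list)

def build_kg_nodes(conn: sqlite3.Connection) -> list[TableNode]:
    """Create Table/Column nodes from database metadata; DISTINCT values
    supply the 'Example Value' attribute."""
    tables = []
    for (tname,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ):
        table = TableNode(name=tname)
        for _, cname, ctype, *_ in conn.execute(f"PRAGMA table_info({tname})"):
            examples = [v for (v,) in conn.execute(
                f"SELECT DISTINCT `{cname}` FROM `{tname}` LIMIT 3")]
            table.columns.append(ColumnNode(cname, ctype, example_values=examples))
        tables.append(table)
    return tables

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE frpm (CDSCode TEXT PRIMARY KEY, `Enrollment (K-12)` REAL)")
conn.execute("INSERT INTO frpm VALUES ('01100170109835', 1087.0)")
kg = build_kg_nodes(conn)
print(kg[0].name, [c.name for c in kg[0].columns])
```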
We design three predefined relational categories: the inter-table relationship, the inclusion relationship between tables and columns, and the foreign key relationship. Overall, we use the DISTINCT operation of databases to extract metadata (i.e., the above-mentioned entity types and relational categories) in order to obtain structured database information for constructing the KGs. Figure 1 illustrates our approach of using value examples as column attributes to manage the diversity of data storage formats among different databases, a diversity that stems from differing conventions for data representation.

Knowledge Completion with LLMs. We propose an LLM-based knowledge completion strategy that enriches the attributes of nodes and the relationships between nodes in the knowledge graph. Our method collects tuples {𝑞, 𝑠} comprising historical user questions and the corresponding SQL queries. These tuples are processed using LLMs to extract column knowledge as 𝑐𝑑 = {𝑐, 𝑑} and functional dependencies as 𝑐𝑟 = {𝑐ℎ, 𝑟, 𝑐𝑡}. Here 𝑐 denotes the column name and 𝑑 represents its description in the question; 𝑐ℎ and 𝑐𝑡 respectively denote the head and tail columns extracted from the SQL query, while 𝑟 describes the relationship between 𝑐ℎ and 𝑐𝑡 derived from the question. Both 𝑐𝑑 and 𝑐𝑟 serve as supplementary updates to the graph. To reduce the addition of redundant information, we assess the relevance of existing column knowledge before updating and employ LLMs to determine whether an update is necessary.

Knowledge Selection and Generation. As illustrated in Figure 1, we design a retrieval model to recall columns from the KGs, an approach proved effective in Dense Passage Retrieval (DPR) [4]. We convert each candidate column into continuous text by concatenating its attributes as c = c_name || c_description || c_example || c_explanation. Afterwards, we encode the question 𝑞 and the column 𝑐 using encoders to obtain embeddings 𝑂𝑞 and 𝑂𝑐, which are utilized for the interaction.
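The attribute concatenation and the token-level interaction used for scoring can be sketched as below. Random vectors stand in for the encoder's token embeddings (the real model is the fine-tuned dual encoder), and the `||` separator and function names are illustrative assumptions:

```python
import numpy as np

def column_to_text(name, description, example, explanation):
    # Concatenate column attributes into one retrievable passage.
    return "||".join([name, description, example, explanation])

def late_interaction_score(O_q: np.ndarray, O_c: np.ndarray) -> float:
    """Sum over question tokens of the max similarity to any column token
    (MaxSim-style late interaction)."""
    # O_q: (len_q, dim), O_c: (len_c, dim); rows assumed L2-normalized.
    return float((O_q @ O_c.T).max(axis=1).sum())

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
O_q = normalize(rng.normal(size=(5, 8)))   # 5 question-token embeddings
O_c = normalize(rng.normal(size=(7, 8)))   # 7 column-token embeddings
score = late_interaction_score(O_q, O_c)
print(round(score, 4))
```

Columns whose score exceeds a threshold are kept as candidates, mirroring the selection step described next.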
We calculate the final similarity score of 𝑂𝑞 and 𝑂𝑐 as follows:

$O_{q/c} = \mathrm{Normalize}\big(\mathrm{CNN}\big(\mathrm{BERT}_{Q/C}(q/c)\big)\big)$   (1)

$\mathit{score}_{q,c} = \sum_{i \in [|O_q|]} \max_{j \in [|O_c|]} O_{q_i} \cdot O_{c_j}$   (2)

where 𝑖 and 𝑗 index the token embeddings of 𝑂𝑞 and 𝑂𝑐. By setting a score threshold, we disambiguate the candidates to select the final columns. Through knowledge selection, we extract general knowledge from the KGs. We then input the KGs-based schema, the user question, and instructions into the LLMs to generate evidence, which is crucial for establishing the correlation between questions and the database. We design the evidence generation process in three steps. First, we utilize LLMs to extract key terms from user questions. These key terms typically include important entities, attribute names, and query conditions. In practice, these terms may not always be explicit and clear, and can sometimes be ambiguous or non-standardized. Next, based on these terms, we use LLMs to enumerate normalized schema blocks such as related tables, columns, and database function expressions. The LLMs map these terms to the database schema. During this mapping, the LLMs not only align the terms with standardized expressions but may also elucidate the dependencies in the query. For instance, they identify relevant tables and columns, determine logical relationships between these columns (such as ratios or comparisons), and may infer necessary data connections (such as joins) and filtering conditions. This detailed mapping facilitates the precise extraction of results that meet the specified conditions, ensuring the accuracy and completeness of the query process. Finally, we summarize the results of this process into a concise sentence as evidence.

Table 1
Execution Accuracy on BIRD and Spider.

Method                             | BIRD dev | Spider dev | Spider test
GPT-4                              | 47.97    | 81.18      | 83.40
GPT-4 + KGs                        | 51.20    | 82.15      | 85.61
GPT-4 + KGs + evidence (generated) | 54.33    | 84.79      | 86.51
GPT-4 + KGs + evidence (gold)      | 57.17    | −          | −

SQL Generation.
We employ LLMs with in-context learning to address SQL generation [5, 6, 7, 8]. Specifically, we organize the nodes and relationships extracted from the KGs into a redefined schema. We then input the redefined schema and the generated evidence into the LLMs as context. The models use this contextual information to decompose the user's question into several sub-questions and generate SQL queries for each sub-question using a least-to-most prompting strategy. To enhance the LLMs' SQL generation accuracy, we manually craft three examples to guide the model through few-shot learning, improving the overall effectiveness of the final SQL query.

3. Experiments and Conclusion
We run our experiments on two widely used benchmark text-to-SQL datasets: BIRD [3] and Spider 1.0 [9]. We use GPT-4 to complete the KGs and generate SQL queries. We fine-tune a BERT-based dual-tower model on the training sets; it is employed to generate embeddings for entities and relations within the KGs, thereby enhancing graph retrieval. As indicated in Table 1, the GPT-4 row uses the original database schema directly, rather than leveraging the schema KGs. In comparison, by organizing schema components into KGs and utilizing retrieval models to recall tables and columns, we observe notable enhancements of 3.23%, 0.97%, and 2.21% on BIRD dev, Spider dev, and Spider test, respectively. Secondly, employing generated evidence for SQL generation contributes additional improvements of 3.13%, 2.62%, and 0.90%.

References
[1] B. Qin, B. Hui, L. Wang, M. Yang, J. Li, B. Li, R. Geng, R. Cao, J. Sun, L. Si, F. Huang, Y. Li, A survey on text-to-sql parsing: Concepts, methods, and future directions, CoRR abs/2208.13629 (2022).
[2] R. Sun, S. Ö. Arik, H. Nakhost, H. Dai, R. Sinha, P. Yin, T. Pfister, Sql-palm: Improved large language model adaptation for text-to-sql, CoRR abs/2306.00739 (2023).
[3] J. Li, B. Hui, G. Qu, J. Yang, B. Li, B. Li, B. Wang, B. Qin, R. Geng, N. Huo, X. Zhou, C. Ma, G. Li, K. C.
Chang, F. Huang, R. Cheng, Y. Li, Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls (2024).
[4] V. Karpukhin, B. Oguz, S. Min, P. S. H. Lewis, L. Wu, S. Edunov, D. Chen, W. Yih, Dense passage retrieval for open-domain question answering, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, Association for Computational Linguistics, 2020, pp. 6769–6781.
[5] C. Tai, Z. Chen, T. Zhang, X. Deng, H. Sun, Exploring chain of thought style prompting for text-to-sql, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, Association for Computational Linguistics, 2023, pp. 5376–5393.
[6] D. Gao, H. Wang, Y. Li, X. Sun, Y. Qian, B. Ding, J. Zhou, Text-to-sql empowered by large language models: A benchmark evaluation, Proc. VLDB Endow. 17 (2024) 1132–1145.
[7] M. Pourreza, D. Rafiei, DIN-SQL: decomposed in-context learning of text-to-sql with self-correction, in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
[8] B. Wang, C. Ren, J. Yang, X. Liang, J. Bai, Q. Zhang, Z. Yan, Z. Li, MAC-SQL: A multi-agent collaborative framework for text-to-sql, CoRR abs/2312.11242 (2023).
[9] T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, Z. Zhang, D. R. Radev, Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task, in: E. Riloff, D. Chiang, J. Hockenmaier, J.
Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, Association for Computational Linguistics, 2018, pp. 3911–3921.