Netted?! How to Improve the Usefulness of Spider & Co.

Benjamin Hättasch¹, Nadja Geisler¹ and Carsten Binnig
Technical University of Darmstadt (TU Darmstadt), Department of Computer Science, Hochschulstraße 10, 64289 Darmstadt, Germany
¹ Both authors contributed equally to this research.

DESIRES 2021 – 2nd International Conference on Design of Experimental Search & Information REtrieval Systems, September 15–18, 2021, Padua, Italy
benjamin.haettasch@cs.tu-darmstadt.de (B. Hättasch); nadja.geisler@cs.tu-darmstadt.de (N. Geisler); carsten.binnig@cs.tu-darmstadt.de (C. Binnig)

Abstract
Natural language interfaces for databases (NLIDBs) are an intuitive way to access and explore structured data. That makes challenges like Spider (Yale's semantic parsing and text-to-SQL challenge) valuable, as they produce a series of approaches for NL-to-SQL translation. However, the resulting contributions leave something to be desired. In this paper, we analyze how useful the submissions to the leaderboard are for future research. We also present a prototypical implementation called UniverSQL that makes these approaches easier to use in information access systems. We hope that this lowered barrier encourages (future) participants of these challenges to add support for actual usage of their submissions. Finally, we discuss what could be done to improve future benchmarks and shared tasks for (not only) NLIDBs.

1. Introduction

In a world where ever more data is generated, processed, and relied upon, it becomes continually more significant that data is not only accessible to a small group of people. Information can be contained in text, relational databases, knowledge graphs, and many other formats—but users do not want to deal with heterogeneous sources. What they are interested in is accessing information in an easy manner. The borders between structured and unstructured information keep blurring: when using Google for factual questions, infoboxes might show the answer without the need to open a search result. That result might even be wrapped in a generated sentence when voice search was used, and nobody cares whether the sentence was extracted from a web page or generated from a database.

On the other hand, there are good reasons why these different ways of storing information exist. Information access methods should leverage the possibilities of each while providing convenient and ideally unified interfaces. With this goal in mind, natural language interfaces emerged as a data retrieval method, leveraging one of our most flexible and intuitive means of communication.

Relational databases are an essential type of information storage. To query them, users require knowledge of the domain, the query language (e.g., SQL), and the database schema. Contrarily, the vision for natural language interfaces to databases (NLIDBs) encompasses the ability of any user to interactively explore large datasets without help or extensive manual preparation work [1]. As one of the biggest challenges, the application of NLIDBs requires the means to translate natural language (NL) into SQL queries (NL2SQL); for a recent comprehensive overview of methods and open problems refer to Kim et al. [2]. However, before such NLIDBs can be used as one of many interfaces for information access (i.e., users can enter their information request using arbitrary words and get a correct answer without knowledge about the database), further research is needed.

Contributions  We show that current benchmarks, especially the Spider challenge [3] and the related challenges SparC [4] and CoSQL [5], are not sufficient to measure all relevant aspects and to support the emergence of ready-to-use NLIDBs. Yet, to foster research not only on NLIDBs but on systems that integrate and use them, we publish an API called UniverSQL¹ to integrate submissions to the challenges into research prototypes and existing systems. Its core functionality is a wrapper implementation to allow the execution of arbitrary queries on pre- or custom-trained models. We additionally provide two sample implementations of this wrapper for existing NL2SQL translators (EditSQL [6] and IRNet [7]). The code is published under an open source license. Finally, we provide an overview of the advantages and flaws of Spider and other benchmarks and provide ideas on how the evaluation of NLIDBs could advance.

We hope that this research encourages the use of NLIDBs and further development of approaches and benchmarks. Hopefully, this will help make more information accessible to everyone, regardless of their background.

¹ https://datamanagementlab.github.io/univerSQL/

Outline  The rest of the paper is organized as follows: After briefly describing the Spider challenge and its siblings in Section 2, we analyze how reproducible and usable the submissions to the shared tasks are in Section 3. In Section 4, we present our prototypical implementation UniverSQL that makes more of these systems usable for research. We examine strengths, weaknesses, and possible further developments of benchmarks in Section 5, before providing a brief final summary in Section 6.
2. What are Spider, SparC and CoSQL?

The Spider challenge [3] has become one of the standard evaluations for NLIDBs since its publication in 2018. So far, it has been cited 217 times and 71 submissions were made to the shared task. The dataset aims to surpass most existing datasets in size by at least one order of magnitude. At the same time, it covers a diverse set of simple and complex SQL queries. This provides the necessary basis for data-driven systems to translate joins, nestings etc., and challenges them to do so to achieve good performance on the development and test data splits.

Alongside the dataset, Spider provides a shared task: Since such a dataset is expensive to create, it is not feasible to create one every time an NLIDB is applied to a new database. The authors suggest that this problem is solved by NLIDB systems capable of generalizing to new databases and performing well across domains. This idea is not entirely new: systems by, e.g., Rangel et al. [8] or Wang et al. [9] already attempted to be domain-independent in one way or another. However, Spider is the first dataset of its size, complexity and quality. The split ensures that each database occurs in exactly one set (training, development, and test). This provides a concrete task description and evaluation process, allowing accurate and comparable measurements of success.

Yu et al. [3] also propose a way of categorizing SQL queries with regard to difficulty in the context of the translation task. The concept considers the number of SQL components, selections, and conditions to label a query as easy, medium, hard, or extra hard. A SQL query is estimated to be harder if it contains more SQL keywords; e.g., a query is considered to be hard if it contains nestings, the EXCEPT keyword, three (or more) columns in the SELECT statement, three (or more) WHERE conditions, or a GROUP BY over two columns. Even more structures or keywords in one query make it extra hard.
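To make the flavor of this categorization concrete, the sketch below labels a query by counting such indicators. It is a deliberately simplified illustration of the idea in Python; the authoritative rules and thresholds are those implemented in the official Spider evaluation script, not the ones chosen here.

```python
import re

# Keywords whose presence is treated as a sign of a harder query in the
# categorization described above (set operations, nestings, ...).
HARD_KEYWORDS = ("EXCEPT", "INTERSECT", "UNION")


def estimate_difficulty(sql: str) -> str:
    """Roughly label a SQL query as easy/medium/hard/extra hard (illustration only)."""
    s = sql.upper()
    nested = s.count("SELECT") > 1                      # sub-queries / nestings
    set_ops = any(k in s for k in HARD_KEYWORDS)        # EXCEPT, INTERSECT, UNION
    select_cols = s.split("FROM")[0].count(",") + 1     # columns in the (first) SELECT
    where_conds = len(re.findall(r"\bAND\b|\bOR\b", s)) + (1 if "WHERE" in s else 0)
    group_cols = s.split("GROUP BY")[1].count(",") + 1 if "GROUP BY" in s else 0

    hard_signals = sum([nested, set_ops, select_cols >= 3, where_conds >= 3, group_cols >= 2])
    if hard_signals >= 2:
        return "extra hard"
    if hard_signals == 1:
        return "hard"
    if where_conds > 1 or group_cols == 1 or "JOIN" in s:
        return "medium"
    return "easy"


print(estimate_difficulty("SELECT name FROM singer WHERE age > 30"))
print(estimate_difficulty("SELECT name FROM singer WHERE age > (SELECT AVG(age) FROM singer)"))
```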
The Spider shared task encourages the submission of models to show up in the leaderboard. There are two variants: the original task does not check value accuracy, but there is also a leaderboard for systems that handle/predict values (not just queries with placeholders).

SparC [4] is the multi-turn variant of Spider. It deals with cross-domain semantic parsing in context and is comparable to Spider in size, complexity and databases. However, queries are arranged in user interactions, providing dialogue-like context. Therefore, it is not sufficient to just translate the current NL utterance into SQL; information from previous queries has to be taken into account. Analogous to Spider, SparC features a leaderboard for variants with and without value handling.

CoSQL [5] takes the challenge to the level of a real conversational agent. It consists of both dialogues and annotated SQL queries simulating real-world DB exploration scenarios. Therefore, the system has to maintain a state. CoSQL defines several challenges: the simplest one mainly adds further context to interpret compared to SparC; the other ones cover the generation of suitable responses and intention detection/classification.

3. How reproducible and usable are the challenge submissions?

All three challenges (Spider, SparC and CoSQL) feature a public leaderboard where different approaches and their scores on the public development set as well as on the unpublished test set are listed. In this section, we investigate the state of the submissions, particularly with regard to how reproducible they are and whether they can be used outside of the exact task. An overview of our analysis can be found in Tables 1 and 2 (as of June 2021). We will briefly interpret those results.

Table 1
Analysis of the leaderboard entries for Spider (with (+v) and without (-v) value prediction), SparC & CoSQL. We checked how many different approaches are presented, how many of them reference a publication, and how often there is code to at least try to reproduce the approach.

                    Spider (-v)   Spider (+v)   SparC      CoSQL
Entries             62            7             17         10
Diff. appr.         51            5             15         8
- Publications      36 (58 %)     6 (86 %)      8 (47 %)   9 (90 %)
- Code              20 (32 %)     4 (57 %)      4 (24 %)   5 (50 %)

Table 2
Analysis of the available repositories for the different challenges. We report whether the repositories are empty or contain code, whether checkpoints/pre-trained models are provided for download, and whether the usage of the approach on own data/tables is in some way prepared.

                    Spider (-v)   Spider (+v)   SparC      CoSQL
Repositories        15            2             4          5
- Empty?            2 (13 %)      0 (0 %)       0 (0 %)    1 (20 %)
- Code?             13 (87 %)     2 (100 %)     3 (75 %)   4 (80 %)
- Checkpoints       9 (60 %)      2 (100 %)     3 (75 %)   2 (40 %)
- Own data?         2 (13 %)      0 (0 %)       0 (0 %)    0 (0 %)

Spider  The leaderboard for the primary Spider task (without value handling) featured 62 entries in June 2021. Some of them are only small variations of the same system; nevertheless, this boils down to 51 different approaches. Yet, only a little more than half (36 or 58 %) of those approaches are published in some way; the remaining approaches are anonymous or contain only the names of authors or institutions (so far). For 25 submissions, a link to code is provided, yet some repositories are empty or the link is invalid. In total, 20 approaches (32 %) have at least some code that could be used as a starting point for reproduction. Unfortunately, this is not evenly distributed: code is provided for only two of the top ten current submissions (and for four of the top twenty).

Two approaches deserve special mention: Shi et al. [10] provide a Jupyter notebook for the translation of user-specified queries on custom data² and the code of Lin et al. [11] allows interaction with pretrained checkpoints through a command line interface.

² https://github.com/awslabs/gap-text2sql/blob/main/rat-sql-gap/notebook.ipynb
Spider (with Value Prediction)  This variation of the task (additionally covering the value handling necessary for translating real NL queries) unfortunately received substantially fewer submissions (seven entries for five approaches, all but one with a publication). Four approaches provide code (only two of the six publications).

SparC  Although this challenge was published just nine months after the Spider challenge, it has received considerably fewer submissions so far. The leaderboard for the variant without value handling has 17 entries for 15 different approaches. For less than half of them (47 %), publications are referenced; for 24 % there is code; and three submissions provide pre-trained models for download. For the variant with value prediction, there are only two entries, out of which only one references a publication, and no code is provided at all. We therefore did not include this variant in Tables 1 and 2.

CoSQL  At the time of writing, the challenge had been public for around 20 months. For two of the three variants, there were only baseline implementations or entries without publications; only one entry included value handling. The main task received ten submissions by eight different approaches with a publication ratio of 90 %. For half of the approaches there is code, but in only two cases can checkpoints be downloaded, and there is no preparation at all for the use of the models outside the evaluation scripts.

Overall, we have to conclude that the reproducibility of the approaches submitted to the leaderboards of all challenges is at best mediocre. This is in line with problems of the community and especially of research in computer science, where reproducibility is still a challenge. ACM conferences try to tackle this through reproducibility challenges and badges in the ACM Digital Library.³ Yet, publishing code and artifacts that allow others to redo the experiments is still optional.

³ https://www.acm.org/publications/policies/artifact-review-and-badging-current

While it is surely not feasible to change the whole publishing and reviewing process at once, we think that shared tasks are a good place to start. Of course it is fine that submissions are anonymous until the approach has been reviewed and published. But we advocate that once names are revealed, it should also be necessary to reference publication and code. The authors of a challenge set the requirements for submissions to be included in a leaderboard—and they should take advantage of that.

Moreover, it should be honored when the authors of an approach or research prototype invest the extra time to make it directly usable for others and their research. A very good example (from a slightly different domain) is SentenceBERT [12]. Although it is an implementation accompanying a research paper, it is extremely easy to use: install via pip, import, specify which model to use. The installation scripts will install dependencies and the system will download required files/checkpoints, making it possible to build research on top of it in minutes.
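To illustrate how low this barrier is in practice, the snippet below is roughly all that is needed to go from installation to embeddings with the sentence-transformers package (the model name is just one of the publicly offered pre-trained checkpoints, chosen here as an example).

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# First use downloads the pre-trained checkpoint and caches it locally.
model = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = model.encode([
    "How many singers are older than 30?",
    "Show me all concerts in 2014.",
])
print(embeddings.shape)  # (2, embedding dimension)
```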
That case is already the cream of the crop; in many cases, significantly less effort would help: pinning the versions of dependencies (especially machine learning libraries often introduce breaking changes within months), running the code on a second machine under a different username, adding an installation script to download required external data, or adding environment variables for configuration. Each of these steps can make it substantially easier to run foreign code (or your own after a while). This is not about providing perfectly fast and robust industry-grade software for production use—that is something (academic) researchers usually cannot accomplish and also should not spend their time on—but about allowing quick prototypical usage to decide whether it is feasible to use an approach in research and maybe invest time in improving it. We therefore argue that shared tasks like Spider should require this in the future for submissions to their leaderboards, and find it a great pity that most of the current submissions are difficult to reproduce and even more difficult to utilize for further research.
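None of these measures requires sophisticated tooling. As a purely illustrative sketch (the environment variable names, default paths, and URL are placeholders and not taken from any of the systems discussed here), a few lines like the following already replace a "download checkpoint X and edit path Y" instruction in a README:

```python
import os
import urllib.request
from pathlib import Path

# Configuration via environment variables with defaults, so the same code runs
# unchanged on another machine or under another username. Both the URL and the
# path below are placeholders.
CHECKPOINT_URL = os.environ.get("NL2SQL_CHECKPOINT_URL", "https://example.org/model.pt")
CHECKPOINT_PATH = Path(os.environ.get("NL2SQL_CHECKPOINT_PATH", "checkpoints/model.pt"))


def ensure_checkpoint() -> Path:
    """Download the pre-trained model dump once, if it is not present yet."""
    if not CHECKPOINT_PATH.exists():
        CHECKPOINT_PATH.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(CHECKPOINT_URL, CHECKPOINT_PATH)
    return CHECKPOINT_PATH
```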
4. Does it translate?

As shown in the last section, in June 2021 there were 86 submissions in total for Spider and SparC. If one wants to build a system on top of them, one currently has to pick one of the best-performing approaches from the leaderboard, obtain the code, install dependencies, download pre-trained models (if any), and then find a way to run the code not on the benchmark data but on individual natural language queries. There has to be a better way.

The Spider and SparC challenges do not enforce a certain architecture (i.e., their aim is to foster research on all kinds of approaches to solve the task and not tie it down to, e.g., a hyper-parameter optimization for a fixed architecture). This has the downside of making it even harder to use the resulting approaches in other applications. As a community service, we therefore provide a simple API implementation called UniverSQL that can be used in prototypes for information access, i.e., ones that use NLIDBs (and maybe other components) but do not focus on implementing them. The idea is that this API can be used as a unified interface to NLIDBs regardless of their architecture. This allows researchers to concentrate on their task—and allows them to make use of approaches that would otherwise be difficult to use.

UniverSQL is a small Python application that serves as a translation server. The API allows unified access to the most important functionalities (select a database, select a translator, do the actual translation) and to some convenience and debugging functions like logging. It can be used for individual translations but also for (context-preserving) multi-turn interactions as in the SparC challenge. An overview of the available endpoints can be seen in Figure 1.

Figure 1: Endpoints of the UniverSQL API
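Since the figure is only reproduced here by its caption, the following hypothetical client snippet conveys the intended style of interaction with such a translation server; the base URL, endpoint paths, and JSON fields are assumptions made for illustration and not necessarily the exact routes UniverSQL exposes.

```python
import requests

# Assumed address of a locally running UniverSQL server.
BASE = "http://localhost:5000"

# Hypothetical endpoint names and payloads -- consult the UniverSQL
# documentation for the actual routes shown in Figure 1.
requests.post(f"{BASE}/translator", json={"name": "editsql"})        # select a translator
requests.post(f"{BASE}/database", json={"db_id": "concert_singer"})  # select a database

resp = requests.post(f"{BASE}/translate",
                     json={"utterance": "How many singers do we have?"})
print(resp.json())  # e.g. {"sql": "SELECT count(*) FROM singer"}

# Multi-turn use (as in SparC) would send follow-up utterances in the same session,
# so that the server can take previous queries into account.
```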
The core of UniverSQL is a wrapper implementation to allow running arbitrary queries on pre-trained models. We provide two sample implementations of this wrapper for systems from the Spider leaderboard: EditSQL [6] and IRNet [7]. UniverSQL also includes a script to set up these two systems and to download the required dependencies and model dumps. We publish our code together with extensive documentation on how to create wrappers for other NL2SQL approaches, as well as scripts for simple setup. We hope that this itself evolves into a challenge where researchers provide such a wrapper implementation and installation script for their approach, and we will therefore maintain a ready-to-use list as part of the published code.⁴

⁴ https://datamanagementlab.github.io/univerSQL/
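To give an impression of how little such a wrapper has to provide, here is a hypothetical skeleton; the class and method names are ours and merely illustrate the pattern of hiding a pre-trained model behind a uniform translation call, not the exact interface shipped with UniverSQL.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class NL2SQLWrapper(ABC):
    """Uniform facade around one pre-trained NL2SQL model (illustrative skeleton)."""

    @abstractmethod
    def load(self, checkpoint_path: str, db_schema_path: str) -> None:
        """Load the model dump and the schema of the target database."""

    @abstractmethod
    def translate(self, utterance: str, history: Optional[List[str]] = None) -> str:
        """Translate one NL utterance (plus optional dialogue context) into SQL."""


class EditSQLWrapper(NL2SQLWrapper):
    """Sketch of a concrete wrapper; a real one would call into the EditSQL code base."""

    def load(self, checkpoint_path: str, db_schema_path: str) -> None:
        ...  # set up the original EditSQL repository and restore the checkpoint

    def translate(self, utterance: str, history: Optional[List[str]] = None) -> str:
        ...  # run EditSQL inference here and return the predicted SQL string
        return "SELECT ..."
```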
5. What are we (still) missing?

Modern data-driven approaches would not be possible without large amounts of data, but curating and annotating it is out of scope for many researchers. Hence, it is not surprising that Spider and SparC, but also other datasets, have strongly advanced research in the field. However, we believe that further advancements are still possible.

We already outlined some flaws of Spider, such as the missing focus on reproducibility. Yet, we also want to highlight advantages like the manually annotated, high-quality data, which deservedly makes it currently the most important benchmark for NL2SQL translators.

In addition to Spider, there are other datasets and approaches for benchmarking NLIDBs, but, like Spider, they have some flaws. We will take a glance at some of them to outline typical problems:

The WikiSQL benchmark by Zhong et al. [13] is a large dataset (though smaller than Spider) that also features a leaderboard. Unfortunately, it consists of only a small number of unique query patterns [14] (in fact, half of the questions in the dataset are generated from one single pattern). In particular, it contains neither joins nor nestings. Furthermore, the NL questions are often of low quality (i.e., many are grammatically incorrect), some do not have a proper semantic meaning and make little sense when read by humans, and some NL questions do not have the same meaning as the associated SQL query.

Utama et al. [15] published ParaphraseBench, an approach that tries to measure translation difficulty by dividing queries into classes. The benchmark was manually curated but is quite small and covers only one table.

A recent paper by Gkini et al. [16] tries to benchmark existing translation systems. They focus on system aspects like execution times or resource consumption and not on translation accuracy. However, their analysis leaves some open questions: First, their dataset, which is unfortunately not publicly available (yet), appears to be quite small; it consists of only 216 keyword-based and 241 natural language queries. Second, although they cite Spider, they did not include the high-performing approaches from the leaderboard in their evaluation. Overall, this approach does not appear sufficient for an evaluation that takes the user's perspective into account.

Even if we combined all these approaches, the result would still not be the best way to evaluate NLIDBs. Therefore, we will conclude with a brief outlook on what is still needed and what would be possible in this area.

As mentioned before, it would probably boost the usage of the approaches if they allowed for direct/easy use. Enforcing this is not an inherent part of a benchmark but could be done as part of the setup of a shared task.

Much more difficult, but probably also even more important, is taking the user's perspective into account. One way to do so could be end-to-end benchmarks that do not only evaluate the translation accuracy but the real performance in a data exploration task from input to output (SparC and especially CoSQL do this to some extent). But there are many other highly interesting questions: We can measure the accuracy of a system like an NLIDB, but what accuracy should we strive for? Are all errors equally bad? Can a slightly wrong translation still be sufficient? What is the influence of a suboptimal translation? Will the user be satisfied by a system with 100 % translation accuracy? Or do they expect something that cannot be accomplished even by perfectly working systems? Answering such questions is hard; it can probably not always be automated, and it is difficult to frame the answer as a bunch of numbers. Yet, a framework to assess a system with respect to these kinds of questions would help to better decide which improvements are worth focusing on. We therefore hope that this user perspective will be considered more regularly in computer science research—not as a separate field of research but as an integral part that drives research in a direction suitable to support humans best in whatever they want to accomplish.

6. Conclusion

In this paper, we analyzed the reproducibility of the submissions to the Spider, SparC and CoSQL challenges and how well they are prepared for use in further research. Unfortunately, we found that code is available for only about 40 % of the submissions, and artifacts like pre-trained model dumps are provided for even fewer. Additionally, the code is in most cases only capable of the batch translation of the specific data required by the evaluation scripts of the challenges, but not prepared for use on other real-world data. We therefore presented a prototypical API implementation called UniverSQL that provides a simple interface for NL-to-SQL translation and boils the task of adapting an approach for individual translation down to implementing a simple wrapper class. The implementation is provided as open source software. Finally, we analyzed further shortcomings of Spider and other benchmarks and advocated for a stronger user perspective when designing similar benchmarks in the future.

Acknowledgments

We thank Janina Hartmann and Sebastian Bremser for their substantial contribution to the implementation.

This work has been supported by the German Federal Ministry of Education and Research as part of the project Software Campus 2.0 (TUDA), microproject INTEXPLORE, under grant ZN 01IS17050, by the German Research Foundation as part of the Research Training Group Adaptive Preparation of Information from Heterogeneous Sources (AIPHES) under grant No. GRK 1994/1, as well as the German Federal Ministry of Education and Research and the State of Hesse through the National High-Performance Computing Program.

This research and development project is/was partly funded by the German Federal Ministry of Education and Research (BMBF) within the "The Future of Value Creation – Research on Production, Services and Work" program (funding number 02L19C150) and managed by the Project Management Agency Karlsruhe (PTKA). The authors are responsible for the content of this publication.

References

[1] R. J. L. John, N. Potti, J. M. Patel, Ava: From data to insights through conversations, in: CIDR, 2017.
[2] H. Kim, B.-H. So, W.-S. Han, H. Lee, Natural language to SQL: Where are we today?, Proceedings of the VLDB Endowment 13 (2020) 1737–1750. doi:10.14778/3401960.3401970.
[3] T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, Z. Zhang, D. Radev, Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 3911–3921.
[4] T. Yu, R. Zhang, M. Yasunaga, Y. C. Tan, X. V. Lin, S. Li, H. Er, I. Li, B. Pang, T. Chen, E. Ji, S. Dixit, D. Proctor, S. Shim, J. Kraft, V. Zhang, C. Xiong, R. Socher, D. Radev, SParC: Cross-domain semantic parsing in context, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 4511–4523.
[5] T. Yu, R. Zhang, H. Er, S. Li, E. Xue, B. Pang, X. V. Lin, Y. C. Tan, T. Shi, Z. Li, et al., CoSQL: A conversational text-to-SQL challenge towards cross-domain natural language interfaces to databases, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019. URL: http://dx.doi.org/10.18653/v1/D19-1204. doi:10.18653/v1/D19-1204.
[6] R. Zhang, T. Yu, H. Y. Er, S. Shim, E. Xue, X. V. Lin, T. Shi, C. Xiong, R. Socher, D. Radev, Editing-based SQL query generation for cross-domain context-dependent questions, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong, China, 2019, pp. 5338–5349.
[7] J. Guo, Z. Zhan, Y. Gao, Y. Xiao, J.-G. Lou, T. Liu, D. Zhang, Towards complex text-to-SQL in cross-domain database with intermediate representation, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Association for Computational Linguistics, 2019, pp. 4524–4535.
[8] R. A. P. Rangel, O. Joaquín Pérez, B. Juan Javier González, A. Gelbukh, G. Sidorov, M. M. J. Rodríguez, A domain independent natural language interface to databases capable of processing complex queries, in: A. Gelbukh, Á. de Albornoz, H. Terashima-Marín (Eds.), MICAI 2005: Advances in Artificial Intelligence, Springer Berlin Heidelberg, Berlin, Heidelberg, 2005, pp. 833–842.
[9] Y. Wang, J. Berant, P. Liang, Building a semantic parser overnight, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Beijing, China, 2015, pp. 1332–1342. URL: https://www.aclweb.org/anthology/P15-1129. doi:10.3115/v1/P15-1129.
[10] P. Shi, P. Ng, Z. Wang, H. Zhu, A. H. Li, J. Wang, C. N. dos Santos, B. Xiang, Learning contextual representations for semantic parsing with generation-augmented pre-training, 2020. arXiv:2012.10309.
[11] X. V. Lin, R. Socher, C. Xiong, Bridging textual and tabular data for cross-domain text-to-SQL semantic parsing, 2020. arXiv:2012.12627.
[12] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2019, pp. 3982–3992. URL: https://arxiv.org/abs/1908.10084.
[13] V. Zhong, C. Xiong, R. Socher, Seq2SQL: Generating structured queries from natural language using reinforcement learning, CoRR abs/1709.00103 (2017).
[14] C. Finegan-Dollak, J. K. Kummerfeld, X. Lin, K. Ramanathan, S. Sadasivam, R. Zhang, D. R. Radev, Improving text-to-SQL evaluation methodology, CoRR abs/1806.09029 (2018).
[15] P. Utama, N. Weir, F. Basik, C. Binnig, U. Cetintemel, B. Hättasch, A. Ilkhechi, S. Ramaswamy, A. Usta, An end-to-end neural natural language interface for databases, 2018. arXiv:1804.00401.
[16] O. Gkini, T. Belmpas, G. Koutrika, Y. Ioannidis, An in-depth benchmarking of text-to-SQL systems, in: Proceedings of the 2021 International Conference on Management of Data, SIGMOD/PODS '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 632–644. URL: https://doi.org/10.1145/3448016.3452836. doi:10.1145/3448016.3452836.