<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Netted?! How to Improve the Usefulness of Spider &amp; Co.</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Benjamin Hättasch</string-name>
          <email>benjamin.haettasch@cs.tu-darmstadt.de</email>
          <uri>https://benjaminhaettasch.de/</uri>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nadja Geisler</string-name>
          <email>nadja.geisler@cs.tu-darmstadt.de</email>
          <uri>http://www.dm.tu-darmstadt.de/</uri>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carsten Binnig</string-name>
          <email>carsten.binnig@cs.tu-darmstadt.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Technical University of Darmstadt (TU Darmstadt), Department of Computer Science</institution>, <addr-line>Hochschulstraße 10, 64289 Darmstadt</addr-line>, <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>Natural language interfaces for databases (NLIDBs) are an intuitive way to access and explore structured data. That makes challenges like Spider (Yale's semantic parsing and text-to-SQL challenge) valuable, as they produce a series of approaches for NL-to-SQL translation. However, the resulting contributions leave something to be desired. In this paper, we analyze the usefulness of the leaderboard submissions for future research. We also present a prototypical implementation called UniverSQL that makes these approaches easier to use in information access systems. We hope that this lowered barrier encourages (future) participants of these challenges to add support for actual usage of their submissions. Finally, we discuss what could be done to improve future benchmarks and shared tasks for (not only) NLIDBs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In a world where ever more data is generated, processed, and relied upon, it becomes continually more significant that data is not only accessible to a small group of people. Information can be contained in text, relational databases, knowledge graphs, and many other formats—but users do not want to deal with heterogeneous sources. What they are interested in is accessing information in an easy manner. The borders between structured and unstructured information keep blurring: when using Google for factual questions, infoboxes might show the answer without the need to open a search result. That result might even be wrapped in a generated sentence when voice search was used, and nobody cares whether the sentence was extracted from a web page or generated from a database.</p>
      <p>On the other hand, there are good reasons why these different ways of storing information exist. Information access methods should leverage the possibilities of each while providing convenient and ideally unified interfaces. With this goal in mind, natural language interfaces emerged as a data retrieval method, leveraging one of our most flexible and intuitive means of communication.</p>
      <p>Relational databases are an essential type of information storage. To query them, users require knowledge of the domain, the query language (e.g., SQL), and the database schema. Contrarily, the vision for natural language interfaces to databases (NLIDBs) encompasses the ability of any user to interactively explore large datasets without help or extensive manual preparation work [1]. As one of the biggest challenges, the application of NLIDBs requires the means to translate natural language (NL) into SQL queries (NL2SQL); for a recent comprehensive overview of methods and open problems refer to Kim et al. [2]. However, before such NLIDBs can be used as one of many interfaces for information access (i.e., users can enter their information request using arbitrary words and get a correct answer without knowledge about the database), further research is needed.</p>
      <p>Contributions We show that current benchmarks, especially the Spider challenge [3] and the related challenges SparC [4] and CoSQL [5], are not sufficient to measure all relevant aspects and to support the emergence of ready-to-use NLIDBs. Yet, to foster research not only on NLIDBs but on systems that integrate and use them, we publish an API called UniverSQL (https://datamanagementlab.github.io/univerSQL/) to integrate submissions to the challenges into research prototypes and existing systems. Its core functionality is a wrapper implementation that allows the execution of arbitrary queries on pre- or custom-trained models. We additionally provide two sample implementations of this wrapper for existing NL2SQL translators (EditSQL [6] and IRNet [7]).</p>
      <p>The code is published under an open source license.</p>
      <p>Finally, we provide an overview of the advantages and flaws of Spider and other benchmarks and provide ideas on how the evaluation of NLIDBs could advance.</p>
      <p>We hope that this research encourages the use of NLIDBs and the further development of approaches and benchmarks. Hopefully, this will help make more information accessible to everyone, regardless of their background.</p>
      <sec id="sec-1-1">
        <title>Outline The rest of the paper is organized as follows:</title>
        <p>After briefly describing the Spider challenge and its
sib</p>
      </sec>
      <sec id="sec-1-2">
        <title>1https://datamanagementlab.github.io/univerSQL/</title>
        <p>lings in Section 2, we analyze how reproducible and us- comparable to Spider in size, complexity and databases.
able the submissions to the shared tasks are in Section 3. However, queries are arranged in user interactions,
proIn Section 4, we present our prototypical implementation viding dialogue-like context. Therefore, it is not suficient
UniverSQL that makes more of these systems usable for to just translate the current NL utterance into SQL but
research. We examine strengths, weaknesses, and pos- information from previous queries has to be taken into
acsible further developments of benchmarks in Section 5, count. Analogous to Spider, SparC features a leaderboard
before providing a brief final summary in Section 6. for variants with and without value handling.</p>
        <p>CoSQL [5] takes the challenge to the level of a real
conversational agent. It consists of both dialogues and
2. What are SPIDER, SparC and annotated SQL queries simulating real-world DB
exploCoSQL? ration scenarios. Therefore, the system has to maintain
a state. CoSQL defines several challenges, the simplest
one mainly adds further context to interpret compared
to SparC, the other ones cover generation of suitable
responses and intention detection/classification.</p>
        <p>The Spider challenge [3] has become one of the standard
evaluations for NLIDBs since its publication in 2018. So
far, it was cited 217 times and 71 submissions were made
to the shared task. The dataset aims to surpass most
existing datasets in size by at least one order of magnitude. At
the same time it covers a diverse set of simple and com- 3. How reproducible and usable
plex SQL queries. This provides the necessary basis for are the challenge submissions?
data-driven systems to translate joins, nestings etc., and
challenges them to do so to achieve good performance All three challenges (SPIDER, SparC and CoSQL)
feaon the development and test data splits. ture a public leaderboard where diferent approaches</p>
      <p>Alongside the dataset, Spider provides a shared task: Since such a dataset is expensive to create, it is not feasible to create one every time an NLIDB is applied to a new database. The authors suggest that this problem is solved by NLIDB systems capable of generalizing to new databases and performing well across domains. This idea is not entirely new: systems by, e.g., Rangel et al. [8] or Wang et al. [9] already attempted to be domain-independent in one way or another. However, Spider is the first dataset of its size, complexity and quality. The split ensures that each database occurs in exactly one set (training, development, and test). This provides a concrete task description and evaluation process, allowing accurate and comparable measurements of success.</p>
      <p>Yu et al. [3] also propose a way of categorizing SQL queries with regard to difficulty in the context of the translation task. The concept regards the number of SQL components, selections, and conditions to label a query as easy, medium, hard, or extra hard. A SQL query is estimated to be harder if it contains more SQL keywords, e.g., a query is considered to be hard if it contains nestings, the EXCEPT keyword, or three (or more) columns in the SELECT statement, three (or more) WHERE conditions, and a GROUP BY over two columns. Even more structures or keywords in one query are considered extra hard. The Spider shared task encourages the submission of models, which then show up in the leaderboard. There are two variants: the original task does not check value accuracy, but there is also a leaderboard for systems that handle/predict values (not just queries with placeholders).</p>
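      <p>To make the categorization more tangible, the following sketch contrasts two hypothetical queries in the spirit of the rubric above (our own examples, not taken from the official Spider evaluation code):</p>
      <preformat>
# Hypothetical queries (our own examples) illustrating the rubric above.
# Easy: a single selection, no joins, nestings or grouping.
easy_query = "SELECT name FROM singer WHERE age = 25"

# Hard/extra hard: nesting via EXCEPT, three SELECT columns and three
# WHERE conditions -- exactly the structures the rubric counts.
hard_query = (
    "SELECT name, country, age FROM singer "
    "WHERE age > 20 AND country = 'France' AND name LIKE 'A%' "
    "EXCEPT "
    "SELECT name, country, age FROM singer WHERE singer_id IN "
    "(SELECT singer_id FROM concert_singer)"
)
      </preformat>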
      <p>SparC [4] is the multi-turn variant of Spider. It deals with cross-domain semantic parsing in context and is comparable to Spider in size, complexity and databases. However, queries are arranged in user interactions, providing dialogue-like context. Therefore, it is not sufficient to just translate the current NL utterance into SQL; information from previous queries has to be taken into account. Analogous to Spider, SparC features a leaderboard for variants with and without value handling.</p>
      <p>CoSQL [5] takes the challenge to the level of a real conversational agent. It consists of both dialogues and annotated SQL queries simulating real-world DB exploration scenarios. Therefore, the system has to maintain a state. CoSQL defines several challenges: the simplest one mainly adds further context to interpret compared to SparC, the other ones cover the generation of suitable responses and intention detection/classification.</p>
    </sec>
    <sec id="sec-3">
      <title>3. How reproducible and usable are the challenge submissions?</title>
      <p>All three challenges (SPIDER, SparC and CoSQL) feature a public leaderboard where different approaches and their scores on the public development set as well as the unpublished test set are listed. In this section, we investigate the state of the submissions, particularly with regard to how reproducible they are and whether they can be used outside of the exact task. An overview of our analysis can be found in Tables 1 and 2 (as of June 2021). We will quickly interpret those results.</p>
      <p>SPIDER The leaderboard for the primary Spider task (without value handling) featured 62 entries in June 2021. Some of them are only small variations of the same system; nevertheless, this boils down to 51 different approaches. Yet, only little more than half (36 or 58%) of those approaches are published in some way; the remaining approaches are anonymous or state only the names of authors or institutions (so far). For 25 submissions, a link to code is provided, yet some repositories are empty or the link is invalid. In total, 20 approaches (32%) have at least some code that could be used as a starting point for reproduction. Unfortunately, this is not evenly distributed: only for two of the top ten current submissions (and for four of the top twenty) code is provided. Two approaches deserve special mention: Shi et al. [10] provide a Jupyter Notebook for the translation of user-specified queries on custom data (https://github.com/awslabs/gap-text2sql/blob/main/rat-sql-gap/notebook.ipynb), and the code of Lin et al. [11] allows interaction with pretrained checkpoints through a command line interface.</p>
      <p>SPIDER (with Value Prediction) This variation of the task (additionally covering the value handling necessary for translating real NL queries) unfortunately received substantially fewer submissions (seven entries for five approaches, all but one with a publication). Four approaches provide code (only two of six publications).</p>
      <p>SparC Although this challenge was published just nine months after the Spider challenge, it received considerably fewer submissions so far. The leaderboard for the variant without value handling has 17 entries for 15 different approaches. For less than half of them (47%) publications are referenced, for 24% there is code, and three submissions provide pre-trained models for download. For the variant with value prediction, there are only two entries, out of which only one references a publication and no code is provided at all. We therefore did not include this variant in Tables 1 and 2.</p>
      <p>CoSQL At the time of writing, the challenge has been public for around 20 months. There were only baseline implementations or entries without publications for two of the three variants; only one entry included value handling. The main task received ten submissions by eight different approaches with a publication ratio of 90%. For half of the approaches there is code, but in only two cases can checkpoints be downloaded, and there is no preparation at all for the use of the models outside the evaluation scripts.</p>
      <p>Overall, we have to conclude that the reproducibility of the approaches submitted to the leaderboards of all challenges is at best mediocre. This is in line with problems of the community and especially of research in computer science, where reproducibility is still a challenge. ACM conferences try to tackle this through reproducibility challenges and badges in the ACM Digital Library (https://www.acm.org/publications/policies/artifact-review-and-badging-current). Yet, publishing code and artifacts that allow others to redo the experiments is still optional.</p>
      <p>While it is surely not feasible to change the whole publishing and reviewing process at once, we think that shared tasks are a good place to start. Of course it is fine that submissions are anonymous until the approach is reviewed and published. But we advocate that once names are revealed, it should also be necessary to reference publication and code. Authors of a challenge set the requirements for submissions to be included in a leaderboard—and they should take advantage of that. Moreover, it should be honored when authors of an approach or research prototype invest the extra time to make it directly usable for others and their research.</p>
      <p>A very good example (from a slightly different domain) is SentenceBERT [12]. Although it is an implementation accompanying a research paper, it is extremely easy to use: install via pip, import, specify which model to use. The installation scripts will install dependencies and the system will download required files/checkpoints, making it possible to build research on top of it in minutes.</p>
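      <p>For illustration, this is all it takes to obtain sentence embeddings with the sentence-transformers package that accompanies SentenceBERT (the model name below is just one example choice):</p>
      <preformat>
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# The model checkpoint is downloaded automatically on first use.
model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
embeddings = model.encode(["How many singers do we have?",
                           "Show me all concerts in 2014."])
print(embeddings.shape)  # (2, embedding dimension)
      </preformat>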
      <p>That case is already the cream of the crop; in many cases significantly less effort would already help: pinning the versions of dependencies (machine learning libraries in particular often introduce breaking changes within months), running the code on a second machine under a different username, adding an installation script that downloads required external data, or adding environment variables for configuration. Each of these steps can make it substantially easier to run foreign code (or your own after a while).</p>
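      <p>A minimal sketch of what such pinning and configuration could look like (version numbers, URL, and variable name are placeholders, not recommendations):</p>
      <preformat>
# requirements.txt -- pin exact versions so the code still runs months later
torch==1.5.0
transformers==2.11.0
nltk==3.5

# setup.sh -- download required artifacts instead of assuming they are present
wget -O checkpoints/model.pt https://example.org/model.pt

# configure paths via environment variables instead of hard-coding them
export UNIVERSQL_MODEL_DIR=/path/to/checkpoints
      </preformat>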
      <p>It is not about providing perfectly fast and robust industry-grade software for production use—that is something (academic) researchers usually cannot accomplish and also should not spend their time on—but about allowing quick prototypical usage to decide whether it is feasible to use an approach in research and perhaps worth investing time in improving it. We therefore argue that shared tasks like Spider should require this in the future for submissions to their leaderboards, and we find it a great pity that most of the current submissions are difficult to reproduce and even more difficult to utilize for further research.</p>
    </sec>
    <sec id="sec-2">
      <title>4. Does it translate?</title>
      <p>As shown in the last section, in June 2021 there were 86 submissions in total for Spider and SparC. If one wants to build a system on top of them, one currently has to pick one of the best performing approaches from the leaderboard, obtain the code, install the dependencies, download pre-trained models (if any), and then find a way to run the code not on the benchmark data but on individual natural language queries. There has to be a better way.</p>
      <p>The Spider and SparC challenges do not enforce a certain architecture (i.e., their aim is to foster research on all kinds of approaches to solve the task and not to tie it down to, e.g., hyper-parameter optimization for a fixed architecture). This has the downside of making it even harder to use the resulting approaches in other applications. As a community service, we therefore provide a simple API implementation called UniverSQL that can be used in prototypes for information access, i.e., ones that use NLIDBs (and maybe other components) but do not focus on implementing them. The idea is that this API can be used as a unified interface to NLIDBs regardless of their architecture. This allows researchers to concentrate on their task—and to make use of approaches that would otherwise be difficult to use.</p>
      <p>UniverSQL is a small Python application that serves as a translation server. The API allows unified access to the most important functionalities (select a database, select a translator, do the actual translation) and to some convenience and debugging functions like logging. It can be used for individual translations but also for (context-preserving) multi-turn interactions as in the SparC challenge. An overview of the available endpoints can be seen in Figure 1.</p>
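      <p>A client interaction could then look like the following sketch (the endpoint and parameter names here are illustrative stand-ins, since the actual names are defined by the API documentation, not by this paper):</p>
      <preformat>
# pip install requests
import requests

BASE = "http://localhost:5000"  # address of a running UniverSQL instance (example)

# Endpoint and parameter names below are illustrative only.
requests.post(BASE + "/database", json={"name": "concert_singer"})
requests.post(BASE + "/translator", json={"name": "editsql"})
response = requests.post(BASE + "/translate",
                         json={"utterance": "How many singers do we have?"})
print(response.json())  # e.g. {"sql": "SELECT count(*) FROM singer"}
      </preformat>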
      <p>The core of UniverSQL is a wrapper implementation that allows running arbitrary queries on pre-trained models. We provide two sample implementations of this wrapper for systems from the Spider leaderboard: EditSQL [6] and IRNet [7]. UniverSQL also includes a script to set up these two systems and to download the required dependencies and model dumps. We publish our code together with extensive documentation on how to create wrappers for other NL2SQL approaches, as sketched below, and scripts for simple setup. We hope that this itself evolves into a challenge where researchers provide such a wrapper implementation and installation script for their approach; we will therefore maintain a ready-to-use list as part of the published code (https://datamanagementlab.github.io/univerSQL/).</p>
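      <p>Conceptually, adapting a new NL2SQL system boils down to implementing a small wrapper class along these lines (method names and the loader function are illustrative; the actual interface is defined in the UniverSQL documentation):</p>
      <preformat>
class MyTranslatorWrapper:
    """Illustrative wrapper adapting an NL2SQL approach to a unified interface."""

    def __init__(self, model_dir):
        # Load the pre-trained checkpoint of the wrapped approach.
        # load_pretrained_model is a stand-in for the approach's own loader.
        self.model = load_pretrained_model(model_dir)
        self.schema = None

    def set_database(self, schema):
        # Make the target database schema known to the model.
        self.schema = schema

    def translate(self, utterance, history=None):
        # Run a single NL utterance (plus optional multi-turn context)
        # through the wrapped model and return the generated SQL string.
        return self.model.predict(utterance, self.schema, history)
      </preformat>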
    </sec>
    <sec id="sec-3">
      <title>5. What are we (still) missing?</title>
      <sec id="sec-3-1">
        <title>Modern data driven approaches would not be possible</title>
        <p>without big amounts of data, but curating and annotating
it is out of scope for many researchers. Hence, it is not
surprising that Spider and SparC, but also other datasets,
have strongly advanced research in the field. However,
we believe that further advancements are still possible:</p>
        <p>We already outlined some flaws of Spider such as a
missing focus on reproducibility. Yet, we also want to
highlight advantages like the manually annotated and
high-quality data, which deservedly currently makes it
the most important benchmark for of NL-2-SQL
translators.</p>
      </sec>
      <sec id="sec-3-2">
        <title>4https://datamanagementlab.github.io/univerSQL/</title>
        <p>In addition to Spider, there are other datasets and ap- answer as a bunch of numbers. Yet, a framework to asses
proaches for benchmarking of NLIDBs, but, like Spider, a system in respect to these kind of questions would help
they have some flaws. We will take a glance at some of to better decide on which improvements it is worth to
fothem, to outline typical problems: cus. We therefore hope that this user perspective will be</p>
        <p>The WikiSQL Benchmark by Zhong et al. [13] is a large considered more regularly in computer science research—
dataset (though smaller than Spider) that also features not as a separate field of research but an integral part to
a leaderboard. Unfortunately, it consists only of a small drive research in an direction that is suitable to support
number of unique query patterns [14] (in fact, half of humans best in whatever they want to accomplish.
the questions in the dataset are generated from one
single pattern). In particular, it contains neither joins nor
nestings. Furthermore, the NL questions are often low 6. Conclusion
quality (i.e., many are grammatically incorrect), some
do not have a proper semantic meaning and make little In this paper, we analyzed the reproducibility and
prepasense when read by humans and some NL questions do ration for use in further research of the submissions to
not have the same meaning as the associated SQL query. the Spider, SparC and CoSQL challenges. Unfortunately,
proUatcahmtahaett tarli.es[1t5o]mpueabsliusrheetdraPnasrlaaptihornasdeiBficuenltcyhb,yandia-p- wcoedefoiusnavdatihlaabtleonanlyd ffoorr eavbeonufte4w0e%rsuobfmthisesisounbsmairstsifiaocntss
viding queries into classes. The benchmark was manually like pre-trained model dumps are provided. Additionally,
curated but is quite small and covers only one table. the code is in most cases only capable to do the batch</p>
        <p>A recent paper by Gkini et al. [16] tries to benchmark translation of specific data required for the evaluation
existing translation systems. They focus on system as- scripts of the challenges but not prepared for use on other
pects like execution times or resource consumption and real world data. We therefore presented a prototypical
not on translation accuracy. However, their analysis API implementation called UniverSQL that provides a
leaves some open questions: First, their dataset which is simple interface for NL to SQL translation and boils down
unfortunately not publicly available (yet) appears to be the task of adapting an approach for individual
translaquite small, it consists only of 216 keyword-based and 241 tion to implementing a simple wrapper class. The
implenatural language queries. Second, although they cite Spi- mentation is provided as open source software. Finally,
der, they did not include the high-performing approaches we analyzed further shortcomings of Spider and other
from the leaderboard in their evaluation. Overall, this benchmarks and advocated for a stronger user
perspecapproach does not appear suficient for an evaluation tive when designing similar benchmarks in the future.
that takes the user’s perspective into account.</p>
        <p>Even if we combined all these approaches, the result Acknowledgments
would still not be the best way to evaluate NLIDBs.
Therefore, we will conclude with a brief outlook on what is We thank Janina Hartmann and Sebastian Bremser for
still needed and what would be possible in this area. their substantial contribution to the implementation.</p>
        <p>As mentioned before, it would probably boost the us- This work has been supported by the German
Fedage of the approaches if they allowed for direct/easy use. eral Ministry of Education and Research as part of the
Enforcing this is not an inherent part of a benchmark but Project Software Campus 2.0 (TUDA), Microproject
INcould be done as part of the setup of a shared task. TEXPLORE, under grant ZN 01IS17050, by the German</p>
        <p>Much more dificult but probably also even more im- Research Foundation as part of the Research Training
portant is taking the user’s perspective into account. One Group Adaptive Preparation of Information from
Heteroway to do so could be end-to-end benchmarks that do not geneous Sources (AIPHES) under grant No. GRK 1994/1,
only evaluate the translation accuracy but the real perfor- as well as the German Federal Ministry of Education and
mance in a data exploration task from input to the output Research and the State of Hesse through the National
(SparC and especially CoSQL do this to some extend). High-Performance Computing Program.
But there are many other highly interesting questions: This research and development project is/was partly
We can measure the accuracy of a system like an NLIDB, funded by the German Federal Ministry of Education
but what accuracy should we strive for? Are all errors and Research (BMBF) within the “The Future of Value
equally bad? Can a slightly wrong translation still be Creation – Research on Production, Services and Work”
suficient? What is the influence of a suboptimal transla- program (funding number 02L19C150) and managed by
tion? Will the user be satisfied by a system with 100 % the Project Management Agency Karlsruhe (PTKA). The
translation accuracy? Or do they expect something that authors are responsible for the content of this
publicacannot be accomplished even by perfectly working sys- tion.
tems? Answering such questions is hard, it can probably
not always be automated and it is dificult to frame the</p>
      </sec>
      <sec id="sec-3-3">
        <title>Javier González, A. Gelbukh, G. Sidorov, M. M. J.</title>
        <p>Rodríguez, A domain independent natural
lan[1] R. J. L. John, N. Potti, J. M. Patel, Ava: From data to guage interface to databases capable of processing
insights through conversations., in: CIDR, 2017. complex queries, in: A. Gelbukh, Á. de Albornoz,
[2] H. Kim, B.-H. So, W.-S. Han, H. Lee, Natural lan- H. Terashima-Marín (Eds.), MICAI 2005: Advances
guage to SQL: Where are we today?, Proceed- in Artificial Intelligence, Springer Berlin
Heidelings of the VLDB Endowment 13 (2020) 1737–1750. berg, Berlin, Heidelberg, 2005, pp. 833–842.
doi:10.14778/3401960.3401970. [9] Y. Wang, J. Berant, P. Liang, Building a semantic
[3] T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, parser overnight, in: Proceedings of the 53rd
AnZ. Li, J. Ma, I. Li, Q. Yao, S. Roman, Z. Zhang, nual Meeting of the Association for Computational
D. Radev, Spider: A large-scale human-labeled Linguistics and the 7th International Joint
Condataset for complex and cross-domain semantic ference on Natural Language Processing (Volume
parsing and text-to-sql task, in: Proceedings of 1: Long Papers), Association for Computational
the 2018 Conference on Empirical Methods in Nat- Linguistics, Beijing, China, 2015, pp. 1332–1342.
ural Language Processing, Association for Com- URL: https://www.aclweb.org/anthology/P15-1129.
putational Linguistics, Brussels, Belgium, 2018, pp. doi:10.3115/v1/P15-1129.</p>
        <p>3911–3921. [10] P. Shi, P. Ng, Z. Wang, H. Zhu, A. H. Li,
[4] T. Yu, R. Zhang, M. Yasunaga, Y. C. Tan, X. V. Lin, J. Wang, C. N. dos Santos, B. Xiang,
LearnS. Li, I. L. Heyang Er, B. Pang, T. Chen, E. Ji, S. Dixit, ing contextual representations for semantic
parsD. Proctor, S. Shim, V. Z. Jonathan Kraft, C. Xiong, ing with generation-augmented pre-training, 2020.
R. Socher, D. Radev, SParC: Cross-domain seman- arXiv:2012.10309.
tic parsing in context, in: Proceedings of the 57th [11] X. V. Lin, R. Socher, C. Xiong, Bridging textual and
Annual Meeting of the Association for Computa- tabular data for cross-domain text-to-sql semantic
tional Linguistics, Association for Computational parsing, 2020. arXiv:2012.12627.</p>
        <p>Linguistics, Florence, Italy, 2019, pp. 4511–4523. [12] N. Reimers, I. Gurevych, Sentence-bert: Sentence
[5] T. Yu, R. Zhang, H. Er, S. Li, E. Xue, B. Pang, X. V. embeddings using siamese bert-networks, in:
ProLin, Y. C. Tan, T. Shi, Z. Li, et al., CoSQL: A conver- ceedings of the 2019 Conference on Empirical
Methsational text-to-sql challenge towards cross-domain ods in Natural Language Processing, Association
natural language interfaces to databases, Proceed- for Computational Linguistics, 2019, pp. 3982 – 3992.
ings of the 2019 Conference on Empirical Meth- URL: https://arxiv.org/abs/1908.10084.
ods in Natural Language Processing and the 9th In- [13] V. Zhong, C. Xiong, R. Socher, Seq2sql: Generating
ternational Joint Conference on Natural Language structured queries from natural language using
reProcessing (EMNLP-IJCNLP) (2019). URL: http://dx. inforcement learning, CoRR abs/1709.00103 (2017).
doi.org/10.18653/v1/D19-1204. doi:10.18653/v1/ [14] C. Finegan-Dollak, J. K. Kummerfeld, X. Lin, K.
Rad19-1204. manathan, S. Sadasivam, R. Zhang, D. R. Radev,
Im[6] R. Zhang, T. Yu, H. Y. Er, S. Shim, E. Xue, X. V. Lin, proving text-to-sql evaluation methodology, CoRR
T. Shi, C. Xiong, R. Socher, D. Radev, Editing-based abs/1806.09029 (2018).
sql query generation for cross-domain context- [15] P. Utama, N. Weir, F. Basik, C. Binnig, U. Cetintemel,
dependent questions, in: Proceedings of the 2019 B. Hättasch, A. Ilkhechi, S. Ramaswamy, A. Usta,
Conference on Empirical Methods in Natural Lan- An end-to-end neural natural language interface
guage Processing, Hong Kong, China, 2019, pp. for databases, 2018. arXiv:1804.00401.
5338–5349. [16] O. Gkini, T. Belmpas, G. Koutrika, Y. Ioannidis,
[7] J. Guo, Z. Zhan, Y. Gao, Y. Xiao, J.-G. Lou, T. Liu, An in-depth benchmarking of text-to-sql systems,
D. Zhang, Towards complex text-to-sql in cross- in: Proceedings of the 2021 International
Conferdomain database with intermediate representation, ence on Management of Data, SIGMOD/PODS ’21,
in: Proceeding of the 57th Annual Meeting of the Association for Computing Machinery, New York,
Association for Computational Linguistics (ACL), NY, USA, 2021, p. 632–644. URL: https://doi.org/
Association for Computational Linguistics, 2019, p. 10.1145/3448016.3452836. doi:10.1145/3448016.
4524–4535. 3452836.
[8] R. A. P. Rangel, O. Joaquín Pérez, B. Juan</p>
      </sec>
    </sec>
  </body>
  <back>
    <ack>
      <title>Acknowledgments</title>
      <p>We thank Janina Hartmann and Sebastian Bremser for their substantial contribution to the implementation. This work has been supported by the German Federal Ministry of Education and Research as part of the Project Software Campus 2.0 (TUDA), Microproject INTEXPLORE, under grant ZN 01IS17050, by the German Research Foundation as part of the Research Training Group Adaptive Preparation of Information from Heterogeneous Sources (AIPHES) under grant No. GRK 1994/1, as well as by the German Federal Ministry of Education and Research and the State of Hesse through the National High-Performance Computing Program. This research and development project is/was partly funded by the German Federal Ministry of Education and Research (BMBF) within the “The Future of Value Creation – Research on Production, Services and Work” program (funding number 02L19C150) and managed by the Project Management Agency Karlsruhe (PTKA). The authors are responsible for the content of this publication.</p>
    </ack>
    <ref-list>
      <title>References</title>
      <ref id="ref-1"><mixed-citation>[1] R. J. L. John, N. Potti, J. M. Patel, Ava: From data to insights through conversations, in: CIDR, 2017.</mixed-citation></ref>
      <ref id="ref-2"><mixed-citation>[2] H. Kim, B.-H. So, W.-S. Han, H. Lee, Natural language to SQL: Where are we today?, Proceedings of the VLDB Endowment 13 (2020) 1737–1750. doi:10.14778/3401960.3401970.</mixed-citation></ref>
      <ref id="ref-3"><mixed-citation>[3] T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, Z. Zhang, D. Radev, Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 3911–3921.</mixed-citation></ref>
      <ref id="ref-4"><mixed-citation>[4] T. Yu, R. Zhang, M. Yasunaga, Y. C. Tan, X. V. Lin, S. Li, H. Er, I. Li, B. Pang, T. Chen, E. Ji, S. Dixit, D. Proctor, S. Shim, J. Kraft, V. Zhang, C. Xiong, R. Socher, D. Radev, SParC: Cross-domain semantic parsing in context, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 4511–4523.</mixed-citation></ref>
      <ref id="ref-5"><mixed-citation>[5] T. Yu, R. Zhang, H. Er, S. Li, E. Xue, B. Pang, X. V. Lin, Y. C. Tan, T. Shi, Z. Li, et al., CoSQL: A conversational text-to-SQL challenge towards cross-domain natural language interfaces to databases, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019. URL: http://dx.doi.org/10.18653/v1/D19-1204. doi:10.18653/v1/D19-1204.</mixed-citation></ref>
      <ref id="ref-6"><mixed-citation>[6] R. Zhang, T. Yu, H. Y. Er, S. Shim, E. Xue, X. V. Lin, T. Shi, C. Xiong, R. Socher, D. Radev, Editing-based SQL query generation for cross-domain context-dependent questions, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong, China, 2019, pp. 5338–5349.</mixed-citation></ref>
      <ref id="ref-7"><mixed-citation>[7] J. Guo, Z. Zhan, Y. Gao, Y. Xiao, J.-G. Lou, T. Liu, D. Zhang, Towards complex text-to-SQL in cross-domain database with intermediate representation, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Association for Computational Linguistics, 2019, pp. 4524–4535.</mixed-citation></ref>
      <ref id="ref-8"><mixed-citation>[8] R. A. P. Rangel, O. Joaquín Pérez, B. Juan Javier González, A. Gelbukh, G. Sidorov, M. M. J. Rodríguez, A domain independent natural language interface to databases capable of processing complex queries, in: A. Gelbukh, Á. de Albornoz, H. Terashima-Marín (Eds.), MICAI 2005: Advances in Artificial Intelligence, Springer Berlin Heidelberg, Berlin, Heidelberg, 2005, pp. 833–842.</mixed-citation></ref>
      <ref id="ref-9"><mixed-citation>[9] Y. Wang, J. Berant, P. Liang, Building a semantic parser overnight, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Beijing, China, 2015, pp. 1332–1342. URL: https://www.aclweb.org/anthology/P15-1129. doi:10.3115/v1/P15-1129.</mixed-citation></ref>
      <ref id="ref-10"><mixed-citation>[10] P. Shi, P. Ng, Z. Wang, H. Zhu, A. H. Li, J. Wang, C. N. dos Santos, B. Xiang, Learning contextual representations for semantic parsing with generation-augmented pre-training, 2020. arXiv:2012.10309.</mixed-citation></ref>
      <ref id="ref-11"><mixed-citation>[11] X. V. Lin, R. Socher, C. Xiong, Bridging textual and tabular data for cross-domain text-to-SQL semantic parsing, 2020. arXiv:2012.12627.</mixed-citation></ref>
      <ref id="ref-12"><mixed-citation>[12] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2019, pp. 3982–3992. URL: https://arxiv.org/abs/1908.10084.</mixed-citation></ref>
      <ref id="ref-13"><mixed-citation>[13] V. Zhong, C. Xiong, R. Socher, Seq2SQL: Generating structured queries from natural language using reinforcement learning, CoRR abs/1709.00103 (2017).</mixed-citation></ref>
      <ref id="ref-14"><mixed-citation>[14] C. Finegan-Dollak, J. K. Kummerfeld, X. Lin, K. Ramanathan, S. Sadasivam, R. Zhang, D. R. Radev, Improving text-to-SQL evaluation methodology, CoRR abs/1806.09029 (2018).</mixed-citation></ref>
      <ref id="ref-15"><mixed-citation>[15] P. Utama, N. Weir, F. Basik, C. Binnig, U. Cetintemel, B. Hättasch, A. Ilkhechi, S. Ramaswamy, A. Usta, An end-to-end neural natural language interface for databases, 2018. arXiv:1804.00401.</mixed-citation></ref>
      <ref id="ref-16"><mixed-citation>[16] O. Gkini, T. Belmpas, G. Koutrika, Y. Ioannidis, An in-depth benchmarking of text-to-SQL systems, in: Proceedings of the 2021 International Conference on Management of Data, SIGMOD/PODS ’21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 632–644. URL: https://doi.org/10.1145/3448016.3452836. doi:10.1145/3448016.3452836.</mixed-citation></ref>
    </ref-list>
  </back>
</article>