<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Netted?! How to Improve the Usefulness of Spider &amp; Co.</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Benjamin Hättasch</string-name>
          <email>benjamin.haettasch@cs.tu-darmstadt.de</email>
          <uri>https://benjaminhaettasch.de/</uri>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nadja Geisler</string-name>
          <email>nadja.geisler@cs.tu-darmstadt.de</email>
          <uri>http://www.dm.tu-darmstadt.de/</uri>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carsten Binnig</string-name>
          <email>carsten.binnig@cs.tu-darmstadt.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Technical University of Darmstadt (TU Darmstadt), Department of Computer Science</institution>, <addr-line>Hochschulstraße 10, 64289 Darmstadt</addr-line>, <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>Natural language interfaces for databases (NLIDBs) are an intuitive way to access and explore structured data. That makes challenges like Spider (Yale's semantic parsing and text-to-SQL challenge) valuable, as they produce a series of approaches for NL-to-SQL translation. However, the resulting contributions leave something to be desired. In this paper, we analyze the usefulness of the leaderboard submissions for future research. We also present a prototypical implementation called UniverSQL that makes these approaches easier to use in information access systems. We hope that this lowered barrier encourages (future) participants of these challenges to add support for actual usage of their submissions. Finally, we discuss what could be done to improve future benchmarks and shared tasks for (not only) NLIDBs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In a world where ever more data is generated, processed, and relied upon, it becomes continually more significant that data is not only accessible to a small group of people. Information can be contained in text, relational databases, knowledge graphs, and many other formats—but users do not want to deal with heterogeneous sources. What they are interested in is accessing information in an easy manner. The borders between structured and unstructured information keep blurring: when using Google for factual questions, infoboxes might show the answer without the need to open a search result. That result might even be wrapped in a generated sentence when voice search was used, and nobody cares whether the sentence was extracted from a web page or generated from a database.</p>
      <p>On the other hand, there are good reasons why these different ways of storing information exist. Information access methods should leverage the possibilities of each while providing convenient and ideally unified interfaces. With this goal in mind, natural language interfaces emerged as a data retrieval method, leveraging one of our most flexible and intuitive means of communication.</p>
      <p>Relational databases are an essential type of information storage. To query them, users require knowledge of the domain, the query language (e.g., SQL), and the database schema. Contrarily, the vision for natural language interfaces to databases (NLIDBs) encompasses the ability of any user to interactively explore large datasets without help or extensive manual preparation work [1]. As one of the biggest challenges, the application of NLIDBs requires the means to translate natural language (NL) into SQL queries (NL2SQL); for a recent comprehensive overview of methods and open problems refer to Kim et al. [2]. However, before such NLIDBs can be used as one of many interfaces for information access (i.e., users can enter their information request using arbitrary words and get a correct answer without knowledge about the database), further research is needed.</p>
      <p>Contributions We show that current benchmarks, especially the Spider challenge [3] and the related challenges SparC [4] and CoSQL [5], are not sufficient to measure all relevant aspects and to support the emergence of ready-to-use NLIDBs. Yet, to foster research not only on NLIDBs but on systems that integrate and use them, we publish an API called UniverSQL (https://datamanagementlab.github.io/univerSQL/) to integrate submissions to the challenges into research prototypes and existing systems. Its core functionality is a wrapper implementation that allows the execution of arbitrary queries on pre- or custom-trained models. We additionally provide two sample implementations of this wrapper for existing NL2SQL translators (EditSQL [6] and IRNet [7]).</p>
      <p>The code is published under an open source license.</p>
      <p>Finally, we provide an overview of the advantages and flaws of Spider and other benchmarks and provide ideas on how the evaluation of NLIDBs could advance.</p>
      <p>We hope that this research encourages the use of NLIDBs and the further development of approaches and benchmarks. Hopefully, this will help make more information accessible to everyone, regardless of their background.</p>
      <sec id="sec-1-1">
        <title>Outline The rest of the paper is organized as follows:</title>
        <p>After briefly describing the Spider challenge and its
sib</p>
      </sec>
      <sec id="sec-1-2">
        <title>1https://datamanagementlab.github.io/univerSQL/</title>
        <p>lings in Section 2, we analyze how reproducible and us- comparable to Spider in size, complexity and databases.
able the submissions to the shared tasks are in Section 3. However, queries are arranged in user interactions,
proIn Section 4, we present our prototypical implementation viding dialogue-like context. Therefore, it is not suficient
UniverSQL that makes more of these systems usable for to just translate the current NL utterance into SQL but
research. We examine strengths, weaknesses, and pos- information from previous queries has to be taken into
acsible further developments of benchmarks in Section 5, count. Analogous to Spider, SparC features a leaderboard
before providing a brief final summary in Section 6. for variants with and without value handling.</p>
        <p>CoSQL [5] takes the challenge to the level of a real
conversational agent. It consists of both dialogues and
2. What are SPIDER, SparC and annotated SQL queries simulating real-world DB
exploCoSQL? ration scenarios. Therefore, the system has to maintain
a state. CoSQL defines several challenges, the simplest
one mainly adds further context to interpret compared
to SparC, the other ones cover generation of suitable
responses and intention detection/classification.</p>
        <p>The Spider challenge [3] has become one of the standard
evaluations for NLIDBs since its publication in 2018. So
far, it was cited 217 times and 71 submissions were made
to the shared task. The dataset aims to surpass most
existing datasets in size by at least one order of magnitude. At
the same time it covers a diverse set of simple and com- 3. How reproducible and usable
plex SQL queries. This provides the necessary basis for are the challenge submissions?
data-driven systems to translate joins, nestings etc., and
challenges them to do so to achieve good performance All three challenges (SPIDER, SparC and CoSQL)
feaon the development and test data splits. ture a public leaderboard where diferent approaches</p>
      <p>Alongside the dataset, Spider provides a shared task: Since such a dataset is expensive to create, it is not feasible to create one every time an NLIDB is applied to a new database. The authors suggest that this problem is solved by NLIDB systems capable of generalizing to new databases and performing well across domains. This idea is not entirely new: systems by, e.g., Rangel et al. [8] or Wang et al. [9] already attempted to be domain-independent in one way or another. However, Spider is the first dataset of its size, complexity and quality. The split ensures that each database occurs in exactly one set (training, development, and test). This provides a concrete task description and evaluation process, allowing accurate and comparable measurements of success.</p>
      <p>Yu et al. [3] also propose a way of categorizing SQL queries with regard to difficulty in the context of the translation task. The concept regards the number of SQL components, selections, and conditions to label a query as easy, medium, hard, or extra hard. A SQL query is estimated to be harder if it contains more SQL keywords, e.g., a query is considered to be hard if it contains nestings, the EXCEPT keyword, or three (or more) columns in the SELECT statement, three (or more) WHERE conditions, and a GROUP BY over two columns. Even more structures or keywords in one query are considered extra hard. The Spider shared task encourages the submission of models, which then show up in the leaderboard. There are two variants: the original task does not check value accuracy, but there is also a leaderboard for systems that handle/predict values (not just queries with placeholders).</p>
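      <p>To make the categorization more tangible, the following sketch contrasts two hypothetical queries in the spirit of the rubric above (our own examples, not taken from the official Spider evaluation code):</p>
      <preformat>
# Hypothetical queries (our own examples) illustrating the rubric above.
# Easy: a single selection, no joins, nestings or grouping.
easy_query = "SELECT name FROM singer WHERE age = 25"

# Hard/extra hard: nesting via EXCEPT, three SELECT columns and three
# WHERE conditions -- exactly the structures the rubric counts.
hard_query = (
    "SELECT name, country, age FROM singer "
    "WHERE age > 20 AND country = 'France' AND name LIKE 'A%' "
    "EXCEPT "
    "SELECT name, country, age FROM singer WHERE singer_id IN "
    "(SELECT singer_id FROM concert_singer)"
)
      </preformat>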
      <p>SparC [4] is the multi-turn variant of Spider. It deals with cross-domain semantic parsing in context and is comparable to Spider in size, complexity and databases. However, queries are arranged in user interactions, providing dialogue-like context. Therefore, it is not sufficient to just translate the current NL utterance into SQL; information from previous queries has to be taken into account. Analogous to Spider, SparC features a leaderboard for variants with and without value handling.</p>
      <p>CoSQL [5] takes the challenge to the level of a real conversational agent. It consists of both dialogues and annotated SQL queries simulating real-world DB exploration scenarios. Therefore, the system has to maintain a state. CoSQL defines several challenges: the simplest one mainly adds further context to interpret compared to SparC, the other ones cover the generation of suitable responses and intention detection/classification.</p>
    </sec>
    <sec id="sec-3">
      <title>3. How reproducible and usable are the challenge submissions?</title>
      <p>All three challenges (SPIDER, SparC and CoSQL) feature a public leaderboard where different approaches and their scores on the public development set as well as the unpublished test set are listed. In this section, we investigate the state of the submissions, particularly with regard to how reproducible they are and whether they can be used outside of the exact task. An overview of our analysis can be found in Tables 1 and 2 (as of June 2021). We will quickly interpret those results.</p>
      <p>SPIDER The leaderboard for the primary Spider task (without value handling) featured 62 entries in June 2021. Some of them are only small variations of the same system; nevertheless, this boils down to 51 different approaches. Yet, only little more than half (36 or 58%) of those approaches are published in some way; the remaining approaches are anonymous or state only the names of authors or institutions (so far). For 25 submissions, a link to code is provided, yet some repositories are empty or the link is invalid. In total, 20 approaches (32%) have at least some code that could be used as a starting point for reproduction. Unfortunately, this is not evenly distributed: only for two of the top ten current submissions (and for four of the top twenty) code is provided. Two approaches deserve special mention: Shi et al. [10] provide a Jupyter Notebook for the translation of user-specified queries on custom data (https://github.com/awslabs/gap-text2sql/blob/main/rat-sql-gap/notebook.ipynb), and the code of Lin et al. [11] allows interaction with pretrained checkpoints through a command line interface.</p>
      <p>SPIDER (with Value Prediction) This variation of the task (additionally covering the value handling necessary for translating real NL queries) unfortunately received substantially fewer submissions (seven entries for five approaches, all but one with a publication). Four approaches provide code (only two of six publications).</p>
      <p>SparC Although this challenge was published just nine months after the Spider challenge, it received considerably fewer submissions so far. The leaderboard for the variant without value handling has 17 entries for 15 different approaches. For less than half of them (47%) publications are referenced, for 24% there is code, and three submissions provide pre-trained models for download. For the variant with value prediction, there are only two entries, out of which only one references a publication and no code is provided at all. We therefore did not include this variant in Tables 1 and 2.</p>
      <p>CoSQL At the time of writing, the challenge has been public for around 20 months. There were only baseline implementations or entries without publications for two of the three variants; only one entry included value handling. The main task received ten submissions by eight different approaches with a publication ratio of 90%. For half of the approaches there is code, but in only two cases can checkpoints be downloaded, and there is no preparation at all for the use of the models outside the evaluation scripts.</p>
      <p>Overall, we have to conclude that the reproducibility of the approaches submitted to the leaderboards of all challenges is at best mediocre. This is in line with problems of the community and especially of research in computer science, where reproducibility is still a challenge. ACM conferences try to tackle this through reproducibility challenges and badges in the ACM Digital Library (https://www.acm.org/publications/policies/artifact-review-and-badging-current). Yet, publishing code and artifacts that allow others to redo the experiments is still optional.</p>
      <p>While it is surely not feasible to change the whole publishing and reviewing process at once, we think that shared tasks are a good place to start. Of course it is fine that submissions are anonymous until the approach is reviewed and published. But we advocate that once names are revealed, it should also be necessary to reference publication and code. Authors of a challenge set the requirements for submissions to be included in a leaderboard—and they should take advantage of that. Moreover, it should be honored when authors of an approach or research prototype invest the extra time to make it directly usable for others and their research.</p>
      <p>A very good example (from a slightly different domain) is SentenceBERT [12]. Although it is an implementation accompanying a research paper, it is extremely easy to use: install via pip, import, specify which model to use. The installation scripts will install dependencies and the system will download required files/checkpoints, making it possible to build research on top of it in minutes.</p>
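      <p>For illustration, this is all it takes to obtain sentence embeddings with the sentence-transformers package that accompanies SentenceBERT (the model name below is just one example choice):</p>
      <preformat>
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# The model checkpoint is downloaded automatically on first use.
model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
embeddings = model.encode(["How many singers do we have?",
                           "Show me all concerts in 2014."])
print(embeddings.shape)  # (2, embedding dimension)
      </preformat>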
      <p>That case is already the cream of the crop; in many cases significantly less effort would already help: pinning the versions of dependencies (machine learning libraries in particular often introduce breaking changes within months), running the code on a second machine under a different username, adding an installation script that downloads required external data, or adding environment variables for configuration. Each of these steps can make it substantially easier to run foreign code (or your own after a while).</p>
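      <p>A minimal sketch of what such pinning and configuration could look like (version numbers, URL, and variable name are placeholders, not recommendations):</p>
      <preformat>
# requirements.txt -- pin exact versions so the code still runs months later
torch==1.5.0
transformers==2.11.0
nltk==3.5

# setup.sh -- download required artifacts instead of assuming they are present
wget -O checkpoints/model.pt https://example.org/model.pt

# configure paths via environment variables instead of hard-coding them
export UNIVERSQL_MODEL_DIR=/path/to/checkpoints
      </preformat>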
      <p>It is not about providing perfectly fast and robust industry-grade software for production use—that is something (academic) researchers usually cannot accomplish and also should not spend their time on—but about allowing quick prototypical usage to decide whether it is feasible to use an approach in research and perhaps worth investing time in improving it. We therefore argue that shared tasks like Spider should require this in the future for submissions to their leaderboards, and we find it a great pity that most of the current submissions are difficult to reproduce and even more difficult to utilize for further research.</p>
    </sec>
    <sec id="sec-2">
      <title>4. Does it translate?</title>
      <p>As shown in the last section, in June 2021 there were 86 submissions in total for Spider and SparC. If one wants to build a system on top of them, one currently has to pick one of the best performing approaches from the leaderboard, obtain the code, install the dependencies, download pre-trained models (if any), and then find a way to run the code not on the benchmark data but on individual natural language queries. There has to be a better way.</p>
      <p>The Spider and SparC challenges do not enforce a certain architecture (i.e., their aim is to foster research on all kinds of approaches to solve the task and not to tie it down to, e.g., hyper-parameter optimization for a fixed architecture). This has the downside of making it even harder to use the resulting approaches in other applications. As a community service, we therefore provide a simple API implementation called UniverSQL that can be used in prototypes for information access, i.e., ones that use NLIDBs (and maybe other components) but do not focus on implementing them. The idea is that this API can be used as a unified interface to NLIDBs regardless of their architecture. This allows researchers to concentrate on their task—and to make use of approaches that would otherwise be difficult to use.</p>
      <p>UniverSQL is a small Python application that serves as a translation server. The API allows unified access to the most important functionalities (select a database, select a translator, do the actual translation) and to some convenience and debugging functions like logging. It can be used for individual translations but also for (context-preserving) multi-turn interactions as in the SparC challenge. An overview of the available endpoints can be seen in Figure 1.</p>
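      <p>A client interaction could then look like the following sketch (the endpoint and parameter names here are illustrative stand-ins, since the actual names are defined by the API documentation, not by this paper):</p>
      <preformat>
# pip install requests
import requests

BASE = "http://localhost:5000"  # address of a running UniverSQL instance (example)

# Endpoint and parameter names below are illustrative only.
requests.post(BASE + "/database", json={"name": "concert_singer"})
requests.post(BASE + "/translator", json={"name": "editsql"})
response = requests.post(BASE + "/translate",
                         json={"utterance": "How many singers do we have?"})
print(response.json())  # e.g. {"sql": "SELECT count(*) FROM singer"}
      </preformat>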
      <p>The core of UniverSQL is a wrapper implementation that allows running arbitrary queries on pre-trained models. We provide two sample implementations of this wrapper for systems from the Spider leaderboard: EditSQL [6] and IRNet [7]. UniverSQL also includes a script to set up these two systems and to download the required dependencies and model dumps. We publish our code together with extensive documentation on how to create wrappers for other NL2SQL approaches, as sketched below, and scripts for simple setup. We hope that this itself evolves into a challenge where researchers provide such a wrapper implementation and installation script for their approach; we will therefore maintain a ready-to-use list as part of the published code (https://datamanagementlab.github.io/univerSQL/).</p>
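      <p>Conceptually, adapting a new NL2SQL system boils down to implementing a small wrapper class along these lines (method names and the loader function are illustrative; the actual interface is defined in the UniverSQL documentation):</p>
      <preformat>
class MyTranslatorWrapper:
    """Illustrative wrapper adapting an NL2SQL approach to a unified interface."""

    def __init__(self, model_dir):
        # Load the pre-trained checkpoint of the wrapped approach.
        # load_pretrained_model is a stand-in for the approach's own loader.
        self.model = load_pretrained_model(model_dir)
        self.schema = None

    def set_database(self, schema):
        # Make the target database schema known to the model.
        self.schema = schema

    def translate(self, utterance, history=None):
        # Run a single NL utterance (plus optional multi-turn context)
        # through the wrapped model and return the generated SQL string.
        return self.model.predict(utterance, self.schema, history)
      </preformat>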
    </sec>
    <sec id="sec-3">
      <title>5. What are we (still) missing?</title>
      <sec id="sec-3-1">
        <title>Modern data driven approaches would not be possible</title>
        <p>without big amounts of data, but curating and annotating
it is out of scope for many researchers. Hence, it is not
surprising that Spider and SparC, but also other datasets,
have strongly advanced research in the field. However,
we believe that further advancements are still possible:</p>
        <p>We already outlined some flaws of Spider such as a
missing focus on reproducibility. Yet, we also want to
highlight advantages like the manually annotated and
high-quality data, which deservedly currently makes it
the most important benchmark for of NL-2-SQL
translators.</p>
      </sec>
      <sec id="sec-3-2">
        <title>4https://datamanagementlab.github.io/univerSQL/</title>
        <p>In addition to Spider, there are other datasets and ap- answer as a bunch of numbers. Yet, a framework to asses
proaches for benchmarking of NLIDBs, but, like Spider, a system in respect to these kind of questions would help
they have some flaws. We will take a glance at some of to better decide on which improvements it is worth to
fothem, to outline typical problems: cus. We therefore hope that this user perspective will be</p>
        <p>The WikiSQL Benchmark by Zhong et al. [13] is a large considered more regularly in computer science research—
dataset (though smaller than Spider) that also features not as a separate field of research but an integral part to
a leaderboard. Unfortunately, it consists only of a small drive research in an direction that is suitable to support
number of unique query patterns [14] (in fact, half of humans best in whatever they want to accomplish.
the questions in the dataset are generated from one
single pattern). In particular, it contains neither joins nor
nestings. Furthermore, the NL questions are often low 6. Conclusion
quality (i.e., many are grammatically incorrect), some
do not have a proper semantic meaning and make little In this paper, we analyzed the reproducibility and
prepasense when read by humans and some NL questions do ration for use in further research of the submissions to
not have the same meaning as the associated SQL query. the Spider, SparC and CoSQL challenges. Unfortunately,
proUatcahmtahaett tarli.es[1t5o]mpueabsliusrheetdraPnasrlaaptihornasdeiBficuenltcyhb,yandia-p- wcoedefoiusnavdatihlaabtleonanlyd ffoorr eavbeonufte4w0e%rsuobfmthisesisounbsmairstsifiaocntss
viding queries into classes. The benchmark was manually like pre-trained model dumps are provided. Additionally,
curated but is quite small and covers only one table. the code is in most cases only capable to do the batch</p>
        <p>A recent paper by Gkini et al. [16] tries to benchmark translation of specific data required for the evaluation
existing translation systems. They focus on system as- scripts of the challenges but not prepared for use on other
pects like execution times or resource consumption and real world data. We therefore presented a prototypical
not on translation accuracy. However, their analysis API implementation called UniverSQL that provides a
leaves some open questions: First, their dataset which is simple interface for NL to SQL translation and boils down
unfortunately not publicly available (yet) appears to be the task of adapting an approach for individual
translaquite small, it consists only of 216 keyword-based and 241 tion to implementing a simple wrapper class. The
implenatural language queries. Second, although they cite Spi- mentation is provided as open source software. Finally,
der, they did not include the high-performing approaches we analyzed further shortcomings of Spider and other
from the leaderboard in their evaluation. Overall, this benchmarks and advocated for a stronger user
perspecapproach does not appear suficient for an evaluation tive when designing similar benchmarks in the future.
that takes the user’s perspective into account.</p>
        <p>Even if we combined all these approaches, the result Acknowledgments
would still not be the best way to evaluate NLIDBs.
Therefore, we will conclude with a brief outlook on what is We thank Janina Hartmann and Sebastian Bremser for
still needed and what would be possible in this area. their substantial contribution to the implementation.</p>
        <p>As mentioned before, it would probably boost the us- This work has been supported by the German
Fedage of the approaches if they allowed for direct/easy use. eral Ministry of Education and Research as part of the
Enforcing this is not an inherent part of a benchmark but Project Software Campus 2.0 (TUDA), Microproject
INcould be done as part of the setup of a shared task. TEXPLORE, under grant ZN 01IS17050, by the German</p>
        <p>Much more dificult but probably also even more im- Research Foundation as part of the Research Training
portant is taking the user’s perspective into account. One Group Adaptive Preparation of Information from
Heteroway to do so could be end-to-end benchmarks that do not geneous Sources (AIPHES) under grant No. GRK 1994/1,
only evaluate the translation accuracy but the real perfor- as well as the German Federal Ministry of Education and
mance in a data exploration task from input to the output Research and the State of Hesse through the National
(SparC and especially CoSQL do this to some extend). High-Performance Computing Program.
But there are many other highly interesting questions: This research and development project is/was partly
We can measure the accuracy of a system like an NLIDB, funded by the German Federal Ministry of Education
but what accuracy should we strive for? Are all errors and Research (BMBF) within the “The Future of Value
equally bad? Can a slightly wrong translation still be Creation – Research on Production, Services and Work”
suficient? What is the influence of a suboptimal transla- program (funding number 02L19C150) and managed by
tion? Will the user be satisfied by a system with 100 % the Project Management Agency Karlsruhe (PTKA). The
translation accuracy? Or do they expect something that authors are responsible for the content of this
publicacannot be accomplished even by perfectly working sys- tion.
tems? Answering such questions is hard, it can probably
not always be automated and it is dificult to frame the</p>
      </sec>
      <sec id="sec-3-3">
        <title>Javier González, A. Gelbukh, G. Sidorov, M. M. J.</title>
        <p>Rodríguez, A domain independent natural
lan[1] R. J. L. John, N. Potti, J. M. Patel, Ava: From data to guage interface to databases capable of processing
insights through conversations., in: CIDR, 2017. complex queries, in: A. Gelbukh, Á. de Albornoz,
[2] H. Kim, B.-H. So, W.-S. Han, H. Lee, Natural lan- H. Terashima-Marín (Eds.), MICAI 2005: Advances
guage to SQL: Where are we today?, Proceed- in Artificial Intelligence, Springer Berlin
Heidelings of the VLDB Endowment 13 (2020) 1737–1750. berg, Berlin, Heidelberg, 2005, pp. 833–842.
doi:10.14778/3401960.3401970. [9] Y. Wang, J. Berant, P. Liang, Building a semantic
[3] T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, parser overnight, in: Proceedings of the 53rd
AnZ. Li, J. Ma, I. Li, Q. Yao, S. Roman, Z. Zhang, nual Meeting of the Association for Computational
D. Radev, Spider: A large-scale human-labeled Linguistics and the 7th International Joint
Condataset for complex and cross-domain semantic ference on Natural Language Processing (Volume
parsing and text-to-sql task, in: Proceedings of 1: Long Papers), Association for Computational
the 2018 Conference on Empirical Methods in Nat- Linguistics, Beijing, China, 2015, pp. 1332–1342.
ural Language Processing, Association for Com- URL: https://www.aclweb.org/anthology/P15-1129.
putational Linguistics, Brussels, Belgium, 2018, pp. doi:10.3115/v1/P15-1129.</p>
        <p>3911–3921. [10] P. Shi, P. Ng, Z. Wang, H. Zhu, A. H. Li,
[4] T. Yu, R. Zhang, M. Yasunaga, Y. C. Tan, X. V. Lin, J. Wang, C. N. dos Santos, B. Xiang,
LearnS. Li, I. L. Heyang Er, B. Pang, T. Chen, E. Ji, S. Dixit, ing contextual representations for semantic
parsD. Proctor, S. Shim, V. Z. Jonathan Kraft, C. Xiong, ing with generation-augmented pre-training, 2020.
R. Socher, D. Radev, SParC: Cross-domain seman- arXiv:2012.10309.
tic parsing in context, in: Proceedings of the 57th [11] X. V. Lin, R. Socher, C. Xiong, Bridging textual and
Annual Meeting of the Association for Computa- tabular data for cross-domain text-to-sql semantic
tional Linguistics, Association for Computational parsing, 2020. arXiv:2012.12627.</p>
        <p>Linguistics, Florence, Italy, 2019, pp. 4511–4523. [12] N. Reimers, I. Gurevych, Sentence-bert: Sentence
[5] T. Yu, R. Zhang, H. Er, S. Li, E. Xue, B. Pang, X. V. embeddings using siamese bert-networks, in:
ProLin, Y. C. Tan, T. Shi, Z. Li, et al., CoSQL: A conver- ceedings of the 2019 Conference on Empirical
Methsational text-to-sql challenge towards cross-domain ods in Natural Language Processing, Association
natural language interfaces to databases, Proceed- for Computational Linguistics, 2019, pp. 3982 – 3992.
ings of the 2019 Conference on Empirical Meth- URL: https://arxiv.org/abs/1908.10084.
ods in Natural Language Processing and the 9th In- [13] V. Zhong, C. Xiong, R. Socher, Seq2sql: Generating
ternational Joint Conference on Natural Language structured queries from natural language using
reProcessing (EMNLP-IJCNLP) (2019). URL: http://dx. inforcement learning, CoRR abs/1709.00103 (2017).
doi.org/10.18653/v1/D19-1204. doi:10.18653/v1/ [14] C. Finegan-Dollak, J. K. Kummerfeld, X. Lin, K.
Rad19-1204. manathan, S. Sadasivam, R. Zhang, D. R. Radev,
Im[6] R. Zhang, T. Yu, H. Y. Er, S. Shim, E. Xue, X. V. Lin, proving text-to-sql evaluation methodology, CoRR
T. Shi, C. Xiong, R. Socher, D. Radev, Editing-based abs/1806.09029 (2018).
sql query generation for cross-domain context- [15] P. Utama, N. Weir, F. Basik, C. Binnig, U. Cetintemel,
dependent questions, in: Proceedings of the 2019 B. Hättasch, A. Ilkhechi, S. Ramaswamy, A. Usta,
Conference on Empirical Methods in Natural Lan- An end-to-end neural natural language interface
guage Processing, Hong Kong, China, 2019, pp. for databases, 2018. arXiv:1804.00401.
5338–5349. [16] O. Gkini, T. Belmpas, G. Koutrika, Y. Ioannidis,
[7] J. Guo, Z. Zhan, Y. Gao, Y. Xiao, J.-G. Lou, T. Liu, An in-depth benchmarking of text-to-sql systems,
D. Zhang, Towards complex text-to-sql in cross- in: Proceedings of the 2021 International
Conferdomain database with intermediate representation, ence on Management of Data, SIGMOD/PODS ’21,
in: Proceeding of the 57th Annual Meeting of the Association for Computing Machinery, New York,
Association for Computational Linguistics (ACL), NY, USA, 2021, p. 632–644. URL: https://doi.org/
Association for Computational Linguistics, 2019, p. 10.1145/3448016.3452836. doi:10.1145/3448016.
4524–4535. 3452836.
[8] R. A. P. Rangel, O. Joaquín Pérez, B. Juan</p>
      </sec>
    </sec>
  </body>
  <back>
    <ack>
      <title>Acknowledgments</title>
      <p>We thank Janina Hartmann and Sebastian Bremser for their substantial contribution to the implementation. This work has been supported by the German Federal Ministry of Education and Research as part of the Project Software Campus 2.0 (TUDA), Microproject INTEXPLORE, under grant ZN 01IS17050, by the German Research Foundation as part of the Research Training Group Adaptive Preparation of Information from Heterogeneous Sources (AIPHES) under grant No. GRK 1994/1, as well as by the German Federal Ministry of Education and Research and the State of Hesse through the National High-Performance Computing Program. This research and development project is/was partly funded by the German Federal Ministry of Education and Research (BMBF) within the “The Future of Value Creation – Research on Production, Services and Work” program (funding number 02L19C150) and managed by the Project Management Agency Karlsruhe (PTKA). The authors are responsible for the content of this publication.</p>
    </ack>
    <ref-list>
      <title>References</title>
      <ref id="ref-1"><mixed-citation>[1] R. J. L. John, N. Potti, J. M. Patel, Ava: From data to insights through conversations, in: CIDR, 2017.</mixed-citation></ref>
      <ref id="ref-2"><mixed-citation>[2] H. Kim, B.-H. So, W.-S. Han, H. Lee, Natural language to SQL: Where are we today?, Proceedings of the VLDB Endowment 13 (2020) 1737–1750. doi:10.14778/3401960.3401970.</mixed-citation></ref>
      <ref id="ref-3"><mixed-citation>[3] T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, Z. Zhang, D. Radev, Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 3911–3921.</mixed-citation></ref>
      <ref id="ref-4"><mixed-citation>[4] T. Yu, R. Zhang, M. Yasunaga, Y. C. Tan, X. V. Lin, S. Li, H. Er, I. Li, B. Pang, T. Chen, E. Ji, S. Dixit, D. Proctor, S. Shim, J. Kraft, V. Zhang, C. Xiong, R. Socher, D. Radev, SParC: Cross-domain semantic parsing in context, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 4511–4523.</mixed-citation></ref>
      <ref id="ref-5"><mixed-citation>[5] T. Yu, R. Zhang, H. Er, S. Li, E. Xue, B. Pang, X. V. Lin, Y. C. Tan, T. Shi, Z. Li, et al., CoSQL: A conversational text-to-SQL challenge towards cross-domain natural language interfaces to databases, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019. URL: http://dx.doi.org/10.18653/v1/D19-1204. doi:10.18653/v1/D19-1204.</mixed-citation></ref>
      <ref id="ref-6"><mixed-citation>[6] R. Zhang, T. Yu, H. Y. Er, S. Shim, E. Xue, X. V. Lin, T. Shi, C. Xiong, R. Socher, D. Radev, Editing-based SQL query generation for cross-domain context-dependent questions, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong, China, 2019, pp. 5338–5349.</mixed-citation></ref>
      <ref id="ref-7"><mixed-citation>[7] J. Guo, Z. Zhan, Y. Gao, Y. Xiao, J.-G. Lou, T. Liu, D. Zhang, Towards complex text-to-SQL in cross-domain database with intermediate representation, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Association for Computational Linguistics, 2019, pp. 4524–4535.</mixed-citation></ref>
      <ref id="ref-8"><mixed-citation>[8] R. A. P. Rangel, O. Joaquín Pérez, B. Juan Javier González, A. Gelbukh, G. Sidorov, M. M. J. Rodríguez, A domain independent natural language interface to databases capable of processing complex queries, in: A. Gelbukh, Á. de Albornoz, H. Terashima-Marín (Eds.), MICAI 2005: Advances in Artificial Intelligence, Springer Berlin Heidelberg, Berlin, Heidelberg, 2005, pp. 833–842.</mixed-citation></ref>
      <ref id="ref-9"><mixed-citation>[9] Y. Wang, J. Berant, P. Liang, Building a semantic parser overnight, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Beijing, China, 2015, pp. 1332–1342. URL: https://www.aclweb.org/anthology/P15-1129. doi:10.3115/v1/P15-1129.</mixed-citation></ref>
      <ref id="ref-10"><mixed-citation>[10] P. Shi, P. Ng, Z. Wang, H. Zhu, A. H. Li, J. Wang, C. N. dos Santos, B. Xiang, Learning contextual representations for semantic parsing with generation-augmented pre-training, 2020. arXiv:2012.10309.</mixed-citation></ref>
      <ref id="ref-11"><mixed-citation>[11] X. V. Lin, R. Socher, C. Xiong, Bridging textual and tabular data for cross-domain text-to-SQL semantic parsing, 2020. arXiv:2012.12627.</mixed-citation></ref>
      <ref id="ref-12"><mixed-citation>[12] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2019, pp. 3982–3992. URL: https://arxiv.org/abs/1908.10084.</mixed-citation></ref>
      <ref id="ref-13"><mixed-citation>[13] V. Zhong, C. Xiong, R. Socher, Seq2SQL: Generating structured queries from natural language using reinforcement learning, CoRR abs/1709.00103 (2017).</mixed-citation></ref>
      <ref id="ref-14"><mixed-citation>[14] C. Finegan-Dollak, J. K. Kummerfeld, X. Lin, K. Ramanathan, S. Sadasivam, R. Zhang, D. R. Radev, Improving text-to-SQL evaluation methodology, CoRR abs/1806.09029 (2018).</mixed-citation></ref>
      <ref id="ref-15"><mixed-citation>[15] P. Utama, N. Weir, F. Basik, C. Binnig, U. Cetintemel, B. Hättasch, A. Ilkhechi, S. Ramaswamy, A. Usta, An end-to-end neural natural language interface for databases, 2018. arXiv:1804.00401.</mixed-citation></ref>
      <ref id="ref-16"><mixed-citation>[16] O. Gkini, T. Belmpas, G. Koutrika, Y. Ioannidis, An in-depth benchmarking of text-to-SQL systems, in: Proceedings of the 2021 International Conference on Management of Data, SIGMOD/PODS ’21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 632–644. URL: https://doi.org/10.1145/3448016.3452836. doi:10.1145/3448016.3452836.</mixed-citation></ref>
    </ref-list>
  </back>
</article>