Netted?! How to Improve the Usefulness of Spider & Co.

Benjamin Hättasch¹, Nadja Geisler¹ and Carsten Binnig
Technical University of Darmstadt (TU Darmstadt), Department of Computer Science, Hochschulstraße 10, 64289 Darmstadt, Germany
¹ Both authors contributed equally to this research.

DESIRES 2021 – 2nd International Conference on Design of Experimental Search & Information REtrieval Systems, September 15–18, 2021, Padua, Italy
benjamin.haettasch@cs.tu-darmstadt.de (B. Hättasch); nadja.geisler@cs.tu-darmstadt.de (N. Geisler); carsten.binnig@cs.tu-darmstadt.de (C. Binnig)

Abstract
Natural language interfaces for databases (NLIDBs) are an intuitive way to access and explore structured data. That makes challenges like Spider (Yale's semantic parsing and text-to-SQL challenge) valuable, as they produce a series of approaches for NL-to-SQL translation. However, the resulting contributions leave something to be desired. In this paper, we analyze how useful the submissions to the leaderboard are for future research. We also present a prototypical implementation called UniverSQL that makes these approaches easier to use in information access systems. We hope that this lowered barrier encourages (future) participants of these challenges to add support for actual usage of their submissions. Finally, we discuss what could be done to improve future benchmarks and shared tasks for (not only) NLIDBs.

1. Introduction

In a world where ever more data is generated, processed, and relied upon, it becomes continually more significant that data is not only accessible to a small group of people. Information can be contained in text, relational databases, knowledge graphs, and many other formats—but users do not want to deal with heterogeneous sources. What they are interested in is accessing information in an easy manner. The borders between structured and unstructured information keep blurring: when using Google for factual questions, infoboxes might show the answer without the need to open a search result. That result might even be wrapped in a generated sentence when voice search was used, and nobody cares whether the sentence was extracted from a web page or generated from a database.

On the other hand, there are good reasons why these different ways of storing information exist. Information access methods should leverage the possibilities of each while providing convenient and ideally unified interfaces. With this goal in mind, natural language interfaces emerged as a data retrieval method, leveraging one of our most flexible and intuitive means of communication.

Relational databases are an essential type of information storage. To query them, users require knowledge of the domain, the query language (e.g., SQL), and the database schema. Contrarily, the vision for natural language interfaces to databases (NLIDBs) encompasses the ability of any user to interactively explore large datasets without help or extensive manual preparation work [1]. As one of the biggest challenges, the application of NLIDBs requires the means to translate natural language (NL) into SQL queries (NL2SQL); for a recent comprehensive overview of methods and open problems refer to Kim et al. [2]. However, before such NLIDBs can be used as one of many interfaces for information access (i.e., users can enter their information request using arbitrary words and get a correct answer without knowledge about the database), further research is needed.

Contributions  We show that current benchmarks, especially the Spider challenge [3] and the related challenges SparC [4] and CoSQL [5], are not sufficient to measure all relevant aspects and to support the emergence of ready-to-use NLIDBs. Yet, to foster research not only on NLIDBs but on systems that integrate and use them, we publish an API called UniverSQL¹ to integrate submissions to the challenges into research prototypes and existing systems. Its core functionality is a wrapper implementation to allow the execution of arbitrary queries on pre- or custom-trained models. We additionally provide two sample implementations of this wrapper for existing NL2SQL translators (EditSQL [6] and IRNet [7]). The code is published under an open source license. Finally, we provide an overview of the advantages and flaws of Spider and other benchmarks and provide ideas on how the evaluation of NLIDBs could advance.

We hope that this research encourages the use of NLIDBs and further development of approaches and benchmarks. Hopefully, this will help make more information accessible to everyone, regardless of their background.

¹ https://datamanagementlab.github.io/univerSQL/

Outline  The rest of the paper is organized as follows: After briefly describing the Spider challenge and its siblings in Section 2, we analyze how reproducible and usable the submissions to the shared tasks are in Section 3. In Section 4, we present our prototypical implementation UniverSQL that makes more of these systems usable for research. We examine strengths, weaknesses, and possible further developments of benchmarks in Section 5, before providing a brief final summary in Section 6.
2. What are Spider, SparC and CoSQL?

The Spider challenge [3] has become one of the standard evaluations for NLIDBs since its publication in 2018. So far, it has been cited 217 times and 71 submissions were made to the shared task. The dataset aims to surpass most existing datasets in size by at least one order of magnitude. At the same time, it covers a diverse set of simple and complex SQL queries. This provides the necessary basis for data-driven systems to translate joins, nestings etc., and challenges them to do so to achieve good performance on the development and test data splits.

Alongside the dataset, Spider provides a shared task: Since such a dataset is expensive to create, it is not feasible to create one every time an NLIDB is applied to a new database. The authors suggest that this problem is solved by NLIDB systems capable of generalizing to new databases and performing well across domains. This idea is not entirely new: systems by, e.g., Rangel et al. [8] or Wang et al. [9] already attempted to be domain-independent in one way or another. However, Spider is the first dataset of its size, complexity and quality. The split ensures that each database occurs in exactly one set (training, development, and test). This provides a concrete task description and evaluation process, allowing accurate and comparable measurements of success.

Yu et al. [3] also propose a way of categorizing SQL queries with regard to difficulty in the context of the translation task. The concept considers the number of SQL components, selections, and conditions to label a query as easy, medium, hard, or extra hard. A SQL query is estimated to be harder if it contains more SQL keywords; e.g., a query is considered to be hard if it contains nestings, the EXCEPT keyword, three (or more) columns in the SELECT statement, three (or more) WHERE conditions, or a GROUP BY over two columns. Even more structures or keywords in one query make it extra hard.
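To make the flavor of this categorization concrete, the sketch below labels a query by counting such indicators. It is a deliberately simplified illustration of the idea in Python; the authoritative rules and thresholds are those implemented in the official Spider evaluation script, not the ones chosen here.

```python
import re

# Keywords whose presence is treated as a sign of a harder query in the
# categorization described above (set operations, nestings, ...).
HARD_KEYWORDS = ("EXCEPT", "INTERSECT", "UNION")


def estimate_difficulty(sql: str) -> str:
    """Roughly label a SQL query as easy/medium/hard/extra hard (illustration only)."""
    s = sql.upper()
    nested = s.count("SELECT") > 1                      # sub-queries / nestings
    set_ops = any(k in s for k in HARD_KEYWORDS)        # EXCEPT, INTERSECT, UNION
    select_cols = s.split("FROM")[0].count(",") + 1     # columns in the (first) SELECT
    where_conds = len(re.findall(r"\bAND\b|\bOR\b", s)) + (1 if "WHERE" in s else 0)
    group_cols = s.split("GROUP BY")[1].count(",") + 1 if "GROUP BY" in s else 0

    hard_signals = sum([nested, set_ops, select_cols >= 3, where_conds >= 3, group_cols >= 2])
    if hard_signals >= 2:
        return "extra hard"
    if hard_signals == 1:
        return "hard"
    if where_conds > 1 or group_cols == 1 or "JOIN" in s:
        return "medium"
    return "easy"


print(estimate_difficulty("SELECT name FROM singer WHERE age > 30"))
print(estimate_difficulty("SELECT name FROM singer WHERE age > (SELECT AVG(age) FROM singer)"))
```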
The Spider shared task encourages the submission of models to show up in the leaderboard. There are two variants: the original task does not check value accuracy, but there is also a leaderboard for systems that handle/predict values (not just queries with placeholders).

SparC [4] is the multi-turn variant of Spider. It deals with cross-domain semantic parsing in context and is comparable to Spider in size, complexity and databases. However, queries are arranged in user interactions, providing dialogue-like context. Therefore, it is not sufficient to just translate the current NL utterance into SQL; information from previous queries has to be taken into account. Analogous to Spider, SparC features a leaderboard for variants with and without value handling.

CoSQL [5] takes the challenge to the level of a real conversational agent. It consists of both dialogues and annotated SQL queries simulating real-world DB exploration scenarios. Therefore, the system has to maintain a state. CoSQL defines several challenges: the simplest one mainly adds further context to interpret compared to SparC; the other ones cover the generation of suitable responses and intention detection/classification.

3. How reproducible and usable are the challenge submissions?

All three challenges (Spider, SparC and CoSQL) feature a public leaderboard where different approaches and their scores on the public development set as well as on the unpublished test set are listed. In this section, we investigate the state of the submissions, particularly with regard to how reproducible they are and whether they can be used outside of the exact task. An overview of our analysis can be found in Tables 1 and 2 (as of June 2021). We will briefly interpret those results.

Table 1
Analysis of the leaderboard entries for Spider (with (+v) and without (-v) value prediction), SparC & CoSQL. We checked how many different approaches are presented, how many of them reference a publication, and how often there is code to at least try to reproduce the approach.

                    Spider (-v)   Spider (+v)   SparC      CoSQL
Entries             62            7             17         10
Diff. appr.         51            5             15         8
- Publications      36 (58 %)     6 (86 %)      8 (47 %)   9 (90 %)
- Code              20 (32 %)     4 (57 %)      4 (24 %)   5 (50 %)

Table 2
Analysis of the available repositories for the different challenges. We report whether the repositories are empty or contain code, whether checkpoints/pre-trained models are provided for download, and whether the usage of the approach on own data/tables is in some way prepared.

                    Spider (-v)   Spider (+v)   SparC      CoSQL
Repositories        15            2             4          5
- Empty?            2 (13 %)      0 (0 %)       0 (0 %)    1 (20 %)
- Code?             13 (87 %)     2 (100 %)     3 (75 %)   4 (80 %)
- Checkpoints       9 (60 %)      2 (100 %)     3 (75 %)   2 (40 %)
- Own data?         2 (13 %)      0 (0 %)       0 (0 %)    0 (0 %)

Spider  The leaderboard for the primary Spider task (without value handling) featured 62 entries in June 2021. Some of them are only small variations of the same system; nevertheless, this boils down to 51 different approaches. Yet, only a little more than half (36 or 58 %) of those approaches are published in some way; the remaining approaches are anonymous or contain only the names of authors or institutions (so far). For 25 submissions, a link to code is provided, yet some repositories are empty or the link is invalid. In total, 20 approaches (32 %) have at least some code that could be used as a starting point for reproduction. Unfortunately, this is not evenly distributed: code is provided for only two of the top ten current submissions (and for four of the top twenty).

Two approaches deserve special mention: Shi et al. [10] provide a Jupyter notebook for the translation of user-specified queries on custom data² and the code of Lin et al. [11] allows interaction with pretrained checkpoints through a command line interface.

² https://github.com/awslabs/gap-text2sql/blob/main/rat-sql-gap/notebook.ipynb
Spider (with Value Prediction)  This variation of the task (additionally covering the value handling necessary for translating real NL queries) unfortunately received substantially fewer submissions (seven entries for five approaches, all but one with a publication). Four approaches provide code (only two of the six publications).

SparC  Although this challenge was published just nine months after the Spider challenge, it has received considerably fewer submissions so far. The leaderboard for the variant without value handling has 17 entries for 15 different approaches. For less than half of them (47 %), publications are referenced; for 24 % there is code; and three submissions provide pre-trained models for download. For the variant with value prediction, there are only two entries, out of which only one references a publication, and no code is provided at all. We therefore did not include this variant in Tables 1 and 2.

CoSQL  At the time of writing, the challenge had been public for around 20 months. For two of the three variants, there were only baseline implementations or entries without publications; only one entry included value handling. The main task received ten submissions by eight different approaches with a publication ratio of 90 %. For half of the approaches there is code, but in only two cases can checkpoints be downloaded, and there is no preparation at all for the use of the models outside the evaluation scripts.

Overall, we have to conclude that the reproducibility of the approaches submitted to the leaderboards of all challenges is at best mediocre. This is in line with problems of the community and especially of research in computer science, where reproducibility is still a challenge. ACM conferences try to tackle this through reproducibility challenges and badges in the ACM Digital Library.³ Yet, publishing code and artifacts that allow others to redo the experiments is still optional.

³ https://www.acm.org/publications/policies/artifact-review-and-badging-current

While it is surely not feasible to change the whole publishing and reviewing process at once, we think that shared tasks are a good place to start. Of course it is fine that submissions are anonymous until the approach has been reviewed and published. But we advocate that once names are revealed, it should also be necessary to reference publication and code. The authors of a challenge set the requirements for submissions to be included in a leaderboard—and they should take advantage of that.

Moreover, it should be honored when the authors of an approach or research prototype invest the extra time to make it directly usable for others and their research. A very good example (from a slightly different domain) is SentenceBERT [12]. Although it is an implementation accompanying a research paper, it is extremely easy to use: install via pip, import, specify which model to use. The installation scripts will install dependencies and the system will download required files/checkpoints, making it possible to build research on top of it in minutes.
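To illustrate how low this barrier is in practice, the snippet below is roughly all that is needed to go from installation to embeddings with the sentence-transformers package (the model name is just one of the publicly offered pre-trained checkpoints, chosen here as an example).

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# First use downloads the pre-trained checkpoint and caches it locally.
model = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = model.encode([
    "How many singers are older than 30?",
    "Show me all concerts in 2014.",
])
print(embeddings.shape)  # (2, embedding dimension)
```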
That case is already the cream of the crop; in many cases, significantly less effort would help: pinning the versions of dependencies (especially machine learning libraries often introduce breaking changes within months), running the code on a second machine under a different username, adding an installation script to download required external data, or adding environment variables for configuration. Each of these steps can make it substantially easier to run foreign code (or your own after a while). This is not about providing perfectly fast and robust industry-grade software for production use—that is something (academic) researchers usually cannot accomplish and also should not spend their time on—but about allowing quick prototypical usage to decide whether it is feasible to use an approach in research and maybe invest time in improving it. We therefore argue that shared tasks like Spider should require this in the future for submissions to their leaderboards, and find it a great pity that most of the current submissions are difficult to reproduce and even more difficult to utilize for further research.
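None of these measures requires sophisticated tooling. As a purely illustrative sketch (the environment variable names, default paths, and URL are placeholders and not taken from any of the systems discussed here), a few lines like the following already replace a "download checkpoint X and edit path Y" instruction in a README:

```python
import os
import urllib.request
from pathlib import Path

# Configuration via environment variables with defaults, so the same code runs
# unchanged on another machine or under another username. Both the URL and the
# path below are placeholders.
CHECKPOINT_URL = os.environ.get("NL2SQL_CHECKPOINT_URL", "https://example.org/model.pt")
CHECKPOINT_PATH = Path(os.environ.get("NL2SQL_CHECKPOINT_PATH", "checkpoints/model.pt"))


def ensure_checkpoint() -> Path:
    """Download the pre-trained model dump once, if it is not present yet."""
    if not CHECKPOINT_PATH.exists():
        CHECKPOINT_PATH.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(CHECKPOINT_URL, CHECKPOINT_PATH)
    return CHECKPOINT_PATH
```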
4. Does it translate?

As shown in the last section, in June 2021 there were 86 submissions in total for Spider and SparC. If one wants to build a system on top of them, one currently has to pick one of the best-performing approaches from the leaderboard, obtain the code, install dependencies, download pre-trained models (if any), and then find a way to run the code not on the benchmark data but on individual natural language queries. There has to be a better way.

The Spider and SparC challenges do not enforce a certain architecture (i.e., their aim is to foster research on all kinds of approaches to solve the task and not tie it down to, e.g., a hyper-parameter optimization for a fixed architecture). This has the downside of making it even harder to use the resulting approaches in other applications. As a community service, we therefore provide a simple API implementation called UniverSQL that can be used in prototypes for information access, i.e., ones that use NLIDBs (and maybe other components) but do not focus on implementing them. The idea is that this API can be used as a unified interface to NLIDBs regardless of their architecture. This allows researchers to concentrate on their task—and allows them to make use of approaches that would otherwise be difficult to use.

UniverSQL is a small Python application that serves as a translation server. The API allows unified access to the most important functionalities (select a database, select a translator, do the actual translation) and to some convenience and debugging functions like logging. It can be used for individual translations but also for (context-preserving) multi-turn interactions as in the SparC challenge. An overview of the available endpoints can be seen in Figure 1.

Figure 1: Endpoints of the UniverSQL API
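Since the figure is only reproduced here by its caption, the following hypothetical client snippet conveys the intended style of interaction with such a translation server; the base URL, endpoint paths, and JSON fields are assumptions made for illustration and not necessarily the exact routes UniverSQL exposes.

```python
import requests

# Assumed address of a locally running UniverSQL server.
BASE = "http://localhost:5000"

# Hypothetical endpoint names and payloads -- consult the UniverSQL
# documentation for the actual routes shown in Figure 1.
requests.post(f"{BASE}/translator", json={"name": "editsql"})        # select a translator
requests.post(f"{BASE}/database", json={"db_id": "concert_singer"})  # select a database

resp = requests.post(f"{BASE}/translate",
                     json={"utterance": "How many singers do we have?"})
print(resp.json())  # e.g. {"sql": "SELECT count(*) FROM singer"}

# Multi-turn use (as in SparC) would send follow-up utterances in the same session,
# so that the server can take previous queries into account.
```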
The core of UniverSQL is a wrapper implementation to allow running arbitrary queries on pre-trained models. We provide two sample implementations of this wrapper for systems from the Spider leaderboard: EditSQL [6] and IRNet [7]. UniverSQL also includes a script to set up these two systems and to download the required dependencies and model dumps. We publish our code together with extensive documentation on how to create wrappers for other NL2SQL approaches, as well as scripts for simple setup. We hope that this itself evolves into a challenge where researchers provide such a wrapper implementation and installation script for their approach, and we will therefore maintain a ready-to-use list as part of the published code.⁴

⁴ https://datamanagementlab.github.io/univerSQL/
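To give an impression of how little such a wrapper has to provide, here is a hypothetical skeleton; the class and method names are ours and merely illustrate the pattern of hiding a pre-trained model behind a uniform translation call, not the exact interface shipped with UniverSQL.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class NL2SQLWrapper(ABC):
    """Uniform facade around one pre-trained NL2SQL model (illustrative skeleton)."""

    @abstractmethod
    def load(self, checkpoint_path: str, db_schema_path: str) -> None:
        """Load the model dump and the schema of the target database."""

    @abstractmethod
    def translate(self, utterance: str, history: Optional[List[str]] = None) -> str:
        """Translate one NL utterance (plus optional dialogue context) into SQL."""


class EditSQLWrapper(NL2SQLWrapper):
    """Sketch of a concrete wrapper; a real one would call into the EditSQL code base."""

    def load(self, checkpoint_path: str, db_schema_path: str) -> None:
        ...  # set up the original EditSQL repository and restore the checkpoint

    def translate(self, utterance: str, history: Optional[List[str]] = None) -> str:
        ...  # run EditSQL inference here and return the predicted SQL string
        return "SELECT ..."
```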
5. What are we (still) missing?

Modern data-driven approaches would not be possible without large amounts of data, but curating and annotating it is out of scope for many researchers. Hence, it is not surprising that Spider and SparC, but also other datasets, have strongly advanced research in the field. However, we believe that further advancements are still possible.

We already outlined some flaws of Spider, such as the missing focus on reproducibility. Yet, we also want to highlight advantages like the manually annotated, high-quality data, which deservedly makes it currently the most important benchmark for NL2SQL translators.

In addition to Spider, there are other datasets and approaches for benchmarking NLIDBs, but, like Spider, they have some flaws. We will take a glance at some of them to outline typical problems:

The WikiSQL benchmark by Zhong et al. [13] is a large dataset (though smaller than Spider) that also features a leaderboard. Unfortunately, it consists of only a small number of unique query patterns [14] (in fact, half of the questions in the dataset are generated from one single pattern). In particular, it contains neither joins nor nestings. Furthermore, the NL questions are often of low quality (i.e., many are grammatically incorrect), some do not have a proper semantic meaning and make little sense when read by humans, and some NL questions do not have the same meaning as the associated SQL query.

Utama et al. [15] published ParaphraseBench, an approach that tries to measure translation difficulty by dividing queries into classes. The benchmark was manually curated but is quite small and covers only one table.

A recent paper by Gkini et al. [16] tries to benchmark existing translation systems. They focus on system aspects like execution times or resource consumption and not on translation accuracy. However, their analysis leaves some open questions: First, their dataset, which is unfortunately not publicly available (yet), appears to be quite small; it consists of only 216 keyword-based and 241 natural language queries. Second, although they cite Spider, they did not include the high-performing approaches from the leaderboard in their evaluation. Overall, this approach does not appear sufficient for an evaluation that takes the user's perspective into account.

Even if we combined all these approaches, the result would still not be the best way to evaluate NLIDBs. Therefore, we will conclude with a brief outlook on what is still needed and what would be possible in this area.

As mentioned before, it would probably boost the usage of the approaches if they allowed for direct/easy use. Enforcing this is not an inherent part of a benchmark but could be done as part of the setup of a shared task.

Much more difficult, but probably also even more important, is taking the user's perspective into account. One way to do so could be end-to-end benchmarks that do not only evaluate the translation accuracy but the real performance in a data exploration task from input to output (SparC and especially CoSQL do this to some extent). But there are many other highly interesting questions: We can measure the accuracy of a system like an NLIDB, but what accuracy should we strive for? Are all errors equally bad? Can a slightly wrong translation still be sufficient? What is the influence of a suboptimal translation? Will the user be satisfied by a system with 100 % translation accuracy? Or do they expect something that cannot be accomplished even by perfectly working systems? Answering such questions is hard; it can probably not always be automated, and it is difficult to frame the answer as a bunch of numbers. Yet, a framework to assess a system with respect to these kinds of questions would help to better decide which improvements are worth focusing on. We therefore hope that this user perspective will be considered more regularly in computer science research—not as a separate field of research but as an integral part that drives research in a direction suitable to support humans best in whatever they want to accomplish.

6. Conclusion

In this paper, we analyzed the reproducibility of the submissions to the Spider, SparC and CoSQL challenges and how well they are prepared for use in further research. Unfortunately, we found that code is available for only about 40 % of the submissions, and artifacts like pre-trained model dumps are provided for even fewer. Additionally, the code is in most cases only capable of the batch translation of the specific data required by the evaluation scripts of the challenges, but not prepared for use on other real-world data. We therefore presented a prototypical API implementation called UniverSQL that provides a simple interface for NL-to-SQL translation and boils the task of adapting an approach for individual translation down to implementing a simple wrapper class. The implementation is provided as open source software. Finally, we analyzed further shortcomings of Spider and other benchmarks and advocated for a stronger user perspective when designing similar benchmarks in the future.

Acknowledgments

We thank Janina Hartmann and Sebastian Bremser for their substantial contribution to the implementation.

This work has been supported by the German Federal Ministry of Education and Research as part of the project Software Campus 2.0 (TUDA), microproject INTEXPLORE, under grant ZN 01IS17050, by the German Research Foundation as part of the Research Training Group Adaptive Preparation of Information from Heterogeneous Sources (AIPHES) under grant No. GRK 1994/1, as well as the German Federal Ministry of Education and Research and the State of Hesse through the National High-Performance Computing Program.

This research and development project is/was partly funded by the German Federal Ministry of Education and Research (BMBF) within the "The Future of Value Creation – Research on Production, Services and Work" program (funding number 02L19C150) and managed by the Project Management Agency Karlsruhe (PTKA). The authors are responsible for the content of this publication.

References

[1] R. J. L. John, N. Potti, J. M. Patel, Ava: From data to insights through conversations, in: CIDR, 2017.
[2] H. Kim, B.-H. So, W.-S. Han, H. Lee, Natural language to SQL: Where are we today?, Proceedings of the VLDB Endowment 13 (2020) 1737–1750. doi:10.14778/3401960.3401970.
[3] T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, Z. Zhang, D. Radev, Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 3911–3921.
[4] T. Yu, R. Zhang, M. Yasunaga, Y. C. Tan, X. V. Lin, S. Li, H. Er, I. Li, B. Pang, T. Chen, E. Ji, S. Dixit, D. Proctor, S. Shim, J. Kraft, V. Zhang, C. Xiong, R. Socher, D. Radev, SParC: Cross-domain semantic parsing in context, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 4511–4523.
[5] T. Yu, R. Zhang, H. Er, S. Li, E. Xue, B. Pang, X. V. Lin, Y. C. Tan, T. Shi, Z. Li, et al., CoSQL: A conversational text-to-SQL challenge towards cross-domain natural language interfaces to databases, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019. URL: http://dx.doi.org/10.18653/v1/D19-1204. doi:10.18653/v1/D19-1204.
[6] R. Zhang, T. Yu, H. Y. Er, S. Shim, E. Xue, X. V. Lin, T. Shi, C. Xiong, R. Socher, D. Radev, Editing-based SQL query generation for cross-domain context-dependent questions, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong, China, 2019, pp. 5338–5349.
[7] J. Guo, Z. Zhan, Y. Gao, Y. Xiao, J.-G. Lou, T. Liu, D. Zhang, Towards complex text-to-SQL in cross-domain database with intermediate representation, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Association for Computational Linguistics, 2019, pp. 4524–4535.
[8] R. A. P. Rangel, O. Joaquín Pérez, B. Juan Javier González, A. Gelbukh, G. Sidorov, M. M. J. Rodríguez, A domain independent natural language interface to databases capable of processing complex queries, in: A. Gelbukh, Á. de Albornoz, H. Terashima-Marín (Eds.), MICAI 2005: Advances in Artificial Intelligence, Springer Berlin Heidelberg, Berlin, Heidelberg, 2005, pp. 833–842.
[9] Y. Wang, J. Berant, P. Liang, Building a semantic parser overnight, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Beijing, China, 2015, pp. 1332–1342. URL: https://www.aclweb.org/anthology/P15-1129. doi:10.3115/v1/P15-1129.
[10] P. Shi, P. Ng, Z. Wang, H. Zhu, A. H. Li, J. Wang, C. N. dos Santos, B. Xiang, Learning contextual representations for semantic parsing with generation-augmented pre-training, 2020. arXiv:2012.10309.
[11] X. V. Lin, R. Socher, C. Xiong, Bridging textual and tabular data for cross-domain text-to-SQL semantic parsing, 2020. arXiv:2012.12627.
[12] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2019, pp. 3982–3992. URL: https://arxiv.org/abs/1908.10084.
[13] V. Zhong, C. Xiong, R. Socher, Seq2SQL: Generating structured queries from natural language using reinforcement learning, CoRR abs/1709.00103 (2017).
[14] C. Finegan-Dollak, J. K. Kummerfeld, X. Lin, K. Ramanathan, S. Sadasivam, R. Zhang, D. R. Radev, Improving text-to-SQL evaluation methodology, CoRR abs/1806.09029 (2018).
[15] P. Utama, N. Weir, F. Basik, C. Binnig, U. Cetintemel, B. Hättasch, A. Ilkhechi, S. Ramaswamy, A. Usta, An end-to-end neural natural language interface for databases, 2018. arXiv:1804.00401.
[16] O. Gkini, T. Belmpas, G. Koutrika, Y. Ioannidis, An in-depth benchmarking of text-to-SQL systems, in: Proceedings of the 2021 International Conference on Management of Data, SIGMOD/PODS '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 632–644. URL: https://doi.org/10.1145/3448016.3452836. doi:10.1145/3448016.3452836.