<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>WikiDBs: A Corpus Of Relational Databases From Wikidata</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Liane Vogel</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carsten Binnig</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DFKI Darmstadt</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Technical University of Darmstadt</institution>
          ,
          <addr-line>Darmstadt</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In recent years, deep learning on tabular data, also known as tabular representation learning, has gained growing interest. However, representation learning for relational databases with multiple tables is still an under-explored area, which might be due to the lack of openly available resources. Therefore, we introduce WikiDBs, a novel open-source corpus of 10,000 relational databases. Each database consists of multiple tables that are connected by foreign keys. The dataset is based on Wikidata and aims to follow the characteristics of real-world databases. In this paper, we describe the dataset and the method for creating it. We also conduct preliminary experiments on the tasks of imputing missing values and predicting column and table names in the databases.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The Importance of Representation Learning. While text and images often dominate the field of representation learning, considerable progress has recently also been made on other modalities such as tabular data [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. This is important since a non-negligible amount of data is expressed in tabular form, in particular enterprise data [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. For individual tables, several approaches have been developed to solve downstream tasks such as entity matching or missing value imputation. Several large-scale datasets, such as GitTables [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and WikiTables [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], provide the necessary training data, as data availability is essential for the development of proficient deep learning models.
      </p>
      <p>
        Missing Large Corpora for Relational Databases. For relational databases with multiple tables that are linked with foreign keys, however, there is a lack of both large openly available training data and deep neural network architectures that can incorporate the context of multiple related tables. Collecting large corpora of relational data is non-trivial: due to the sensitivity of data stored in relational databases, real-world enterprise databases are typically kept private and are not accessible to the representation learning community, resulting in a lack of openly available databases.
      </p>
      <p>
        The Need for Real-World Data. As a consequence, in the field of database research, it is common to use synthetic databases such as the datasets in the TPC benchmarks [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. This may be sufficient for testing database internals, but for representation learning on relational databases, which requires a large number of different databases, the existing benchmarks are too few and not diverse enough in terms of the domains they cover. In addition, automatic data generation is not a valid option, as current methods are only able to generate numerical and categorical data, but not the meaningful text contained in real-world databases. According to [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], a significant part of the data in databases is saved as text, so in order to have realistic training data, it is important to use databases that contain not only numerical and categorical, but also textual data.
      </p>
      <p>
        Towards a New Corpus of Relational Databases. We aim to support research on representation learning for relational data by creating a new, large-scale resource for tabular representation learning on relational databases. Hereby, our goal is to have realistic data that is not synthetically generated. While a few real-world relational databases are openly available, such as the Internet Movie Database (IMDb) or the MIMIC database [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], no large corpus containing many relational databases exists. Therefore, we present an approach that uses the Wikidata knowledge base [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] as the basis for deriving a large corpus of relational databases. Along with this paper, we are releasing a new, open-access dataset called WikiDBs — a corpus of 10,000 relational databases extracted from Wikidata covering a wide spectrum of diverse domains.
      </p>
      <p>
        Initial Results Using the Corpus. In this paper, we compare the characteristics of our corpus to statistics available for real-world relational databases to justify the design of our corpus. Furthermore, to showcase that the corpus can be used to learn representations that are informed by multiple tables in a relational database, we follow an approach presented in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. In this work, we introduced the vision of new models for representation learning on relational databases. Here, we demonstrate first experiments of such a model which is trained on our new WikiDBs dataset.
      </p>
      <p>
        Contributions of this Work. To summarize, this paper makes the following contributions: (1) We introduce a novel method of extracting multi-table relational databases from Wikidata. (2) We release a first large-scale corpus of relational data and derive important statistics which we compare to available characteristics of real-world relational databases. (3) We show first experimental results on our dataset for the tasks of missing value imputation as well as table and column name prediction.
      </p>
      <p>
        Joint Workshops at 49th International Conference on Very Large Data Bases (VLDBW’23) — TaDA’23: Tabular Data Analysis Workshop, August 28 - September 1, 2023, Vancouver, Canada. Corresponding author: liane.vogel@cs.tu-darmstadt.de (L. Vogel); carsten.binnig@cs.tu-darmstadt.de (C. Binnig). ORCID: 0000-0001-9768-8873 (L. Vogel); 0000-0002-2744-7836 (C. Binnig).
      </p>
      <p>Figure 1: Example database extracted from Wikidata, with a foreign-key relationship between two tables; the schema is stored as a JSON file.</p>
      <p>2. The WikiDBs Dataset</p>
      <p>
        Real-world enterprise data often includes a large number of columns, e.g. 18.7 on average as reported in the SQLShare corpus [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. For our corpus, we store the schema information, which includes the table structure and the foreign keys. For the schema information, we use the same format that is used by the GitSchemas dataset (shown in Figure 1, right). Furthermore, the individual table data is made available in CSV format.
      </p>
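      <p>As a minimal sketch of how such a database could be consumed, the snippet below pairs a schema object with per-table CSV content and resolves a foreign key. The exact JSON field names and file layout are our assumptions for illustration, not the actual GitSchemas format.</p>
      <p>
```python
import csv
import io

# Hypothetical schema in the spirit of the JSON schema format described in
# the paper; the field names ("tables", "foreign_keys", ...) are assumptions.
schema = {
    "tables": {
        "literary_work": {"columns": ["qid", "title", "author_qid"]},
        "author": {"columns": ["qid", "name"]},
    },
    "foreign_keys": [
        {"table": "literary_work", "column": "author_qid",
         "ref_table": "author", "ref_column": "qid"},
    ],
}

# Toy table contents, one CSV per table (WikiDBs ships table data as CSV).
csv_files = {
    "literary_work": "qid,title,author_qid\nQ208460,1984,Q3335\n",
    "author": "qid,name\nQ3335,George Orwell\n",
}

def load_table(name):
    """Parse one table's CSV content into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(csv_files[name])))

def resolve_foreign_key(row, fk):
    """Follow a foreign key from `row` to the matching row in the referenced table."""
    target = load_table(fk["ref_table"])
    return next(r for r in target if r[fk["ref_column"]] == row[fk["column"]])

fk = schema["foreign_keys"][0]
book = load_table("literary_work")[0]
author = resolve_foreign_key(book, fk)
print(author["name"])  # George Orwell
```
      </p>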
      <sec id="sec-1-1">
        <title>2.2. Analysis and Statistics</title>
      </sec>
      <sec id="sec-1-2">
        <title>2.1. Design Principles</title>
        <p>For our WikiDBs dataset, we want the characteristics to reflect the properties of real-world databases. As enterprises do not share the statistics of their databases, we have to rely on the characteristics of available public resources and model our dataset in a similar way. In Table 1, we have collected characteristics of existing public resources, such as the number of tables in a database and the average number of columns and rows per table.</p>
        <p>
          For deriving statistics, we found only two existing collections of relational databases: the Relational Learning Repository from CTU Prague [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] (which also includes TPC-H and IMDb) and the SQLShare [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] repository. However, these repositories include only a small number of relational databases. Therefore, we also include the statistics of the significantly larger datasets GitSchemas [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] — which only contains schema information — and GitTables [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] — a corpus of individual tables. For the distribution of how many tables we include per database, we follow the distribution of the GitSchemas dataset, which is based on a large number of database schemas found in public git repositories. We include on average a higher number of columns per table than e.g. the CTU Prague dataset or GitSchemas, because real-world enterprise data also often includes a large number of columns.
        </p>
        <p>
          Next, we analyze the resulting statistics of the derived corpus, which is published with this paper. Overall, as mentioned before, our dataset consists of 10,000 databases that each have between two and nine tables which are connected via foreign keys. The statistics of WikiDBs are compared to those of existing open resources in Table 1. In total, our dataset contains 42,472 tables; the median number of tables per database is 4. On average, each table has 17.9 columns and 46.3 rows. The distribution of the number of tables per database is visualized in Figure 2.
        </p>
        <p>2.3. Methodology of Construction</p>
        <p>
          In this section, we describe the procedure of how we derive relational databases based on Wikidata.
        </p>
        <p>
          Wikidata Data Format. The data in Wikidata is stored in a document-oriented database, where documents represent items that are instances of different concepts, such as artists or paintings. In this way, concepts closely resemble the notion of tables. Every item in Wikidata is associated with a unique identifier, the so-called QID. The item representing the book 1984 by George Orwell, for example, has the id Q208460. Properties of items are stored in the form of key-value pairs, where property names are saved with their corresponding value. Most important are properties (e.g. the publication date (P577)) that resemble attributes of a table row. Moreover, properties also include other information such as the related concept of an item; e.g. the book 1984 has the property instance of (P31) literary work (Q7725634).
        </p>
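        <p>As a concrete sketch of this design choice, the number of tables per database can be drawn from a categorical distribution over two to nine tables. Only the 26% mass on two-table databases is taken from the paper (Figure 2); the remaining weights below are illustrative assumptions, not the actual GitSchemas distribution.</p>
        <p>
```python
import random

TABLE_COUNTS = [2, 3, 4, 5, 6, 7, 8, 9]
# 0.26 for two tables is stated in the paper; the other weights are
# invented placeholders that merely sum to 1.0.
WEIGHTS = [0.26, 0.22, 0.17, 0.12, 0.09, 0.06, 0.05, 0.03]

def sample_table_counts(n_databases, seed=0):
    """Draw one table count per database from the categorical distribution."""
    rng = random.Random(seed)
    return [rng.choices(TABLE_COUNTS, weights=WEIGHTS)[0] for _ in range(n_databases)]

counts = sample_table_counts(10_000)
print(min(counts), max(counts))  # every database gets between 2 and 9 tables
```
        </p>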
        <p>
          Creation of a Table. The creation of a relational table from Wikidata is thus made possible by the instance of (or subclass of) relations in Wikidata. The information that the book 1984 is an instance of literary work allows us to search Wikidata for all other items that are also tagged with the information that they are an instance of literary work.
        </p>
        <p>
          Figure 2: Distribution of the number of tables over the dataset. The distribution is modeled analogously to the GitSchemas [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] dataset, e.g. 26% of the databases contain 2 tables.
        </p>
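        <p>The grouping step described above can be sketched as follows. Q208460 (the book 1984), Q7725634 (literary work), and the properties P31 and P577 are taken from the paper; the other items, QIDs and values are invented for illustration, and the item dicts are simplified stand-ins for real Wikidata JSON documents.</p>
        <p>
```python
# Group Wikidata-style items into table rows via their "instance of" (P31) value.
items = [
    {"qid": "Q208460", "label": "1984",
     "claims": {"P31": "Q7725634", "P577": "1949"}},
    {"qid": "Q999999", "label": "Some Other Novel",   # hypothetical item
     "claims": {"P31": "Q7725634"}},
    {"qid": "Q888888", "label": "Some Galaxy",        # hypothetical non-book item
     "claims": {"P31": "Q000000"}},
]

def rows_for_concept(items, concept_qid):
    """All items tagged as an instance of the given concept become rows of one table."""
    return [item for item in items if item["claims"].get("P31") == concept_qid]

literary_works = rows_for_concept(items, "Q7725634")
print([item["label"] for item in literary_works])  # ['1984', 'Some Other Novel']
```
        </p>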
        <p>A challenge in Wikidata is that every item (e.g., each book) might use a different set of properties. For example, for some books the year the book was published is available, while for others it is not. For constructing tables, we use the union of all properties. If a value is missing, we store a NULL value in the table row. To avoid constructing tables with highly sparse columns, we prune columns of a table where the fraction of NULL values is beyond a configurable threshold.2</p>
        <p>Creation of a Database. For each created table, some columns contain references to concepts that are also saved as items in Wikidata. We use those columns that contain Wikidata items to build further tables for the database. For constructing relational databases, we randomly select a concept in Wikidata as a starting point and then traverse relationships to other tables randomly. For example, the table of literary work contains a column author which is a reference to another item, which allows us to build an additional table of authors (linked via a foreign key to the table literary work) where each row contains information on an author and the columns contain e.g. their date and place of birth or nationality (Figure 1).</p>
        <p>
          Implementation Details. For constructing our corpus, we use the Wikidata JSON dump [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] as a starting point for creating the dataset. To enable efficient querying of data in the dump, we additionally build up a lookup structure which maps concepts (i.e., tables) to potential items (i.e., rows). The values of each item correspond to the rows of the tables; the properties of the values form the column headers. The lookup structure allows us to quickly navigate the content in the dump and extract data for individual tables.
        </p>
        <p>2.4. Discussion</p>
        <p>
          We clearly see this work and the corpus released with this paper only as a starting point to foster further research. While this is the very first large-scale corpus of relational databases, we believe that more work is necessary to extend the corpus. First, at the moment we provide tables of sizes which closely resemble the sizes of tables found in repositories such as on GitHub. However, real-world enterprise databases often also contain a few very large tables (e.g., the orders table of an online shop). For Wikidata, we found approximately 20 concepts such as scholarly article, galaxy or protein that have more than 400k items, enabling the creation of very large tables and databases. Furthermore, with our repository we focus on English-language content in the first version, but our method allows to easily create databases in other languages included in Wikidata.
        </p>
        <p>2For the corpus released with this paper, we prune columns with more than 20% NULL values.</p>
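        <p>The table-construction rule described above (columns as the union of all item properties, NULL filling, and pruning at the 20% threshold used for the released corpus) can be sketched as follows; the property names and values are illustrative only.</p>
        <p>
```python
def build_table(items, null_threshold=0.2):
    """Build rows over the union of all properties, then prune sparse columns."""
    columns = sorted({prop for item in items for prop in item})
    # Missing properties become NULL (None) in the corresponding row.
    rows = [{col: item.get(col) for col in columns} for item in items]
    # Prune a column when its fraction of NULL values exceeds the threshold.
    pruned = {col for col in columns
              if sum(row[col] is None for row in rows) / len(rows) > null_threshold}
    return [{col: row[col] for col in columns if col not in pruned} for row in rows]

items = [
    {"title": "Book A", "publication date": "1949", "author": "X"},
    {"title": "Book B", "publication date": "1962", "author": "Y"},
    {"title": "Book C", "publication date": "2001", "author": "Z"},
    {"title": "Book D", "author": "W"},  # publication date missing -> NULL
]
table = build_table(items)
# "publication date" is NULL in 1 of 4 rows (25%, above 20%), so it is pruned.
print(sorted(table[0]))  # ['author', 'title']
```
        </p>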
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Related Work</title>
      <p>
        In the following, we summarize related work grouped by different directions.
      </p>
      <p>
        Single-Table Repositories. So far, tabular representation learning mostly focuses on learning representations of single tables. Commonly used corpora are for example GitTables [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], WikiTables [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the Dresden Web Table Corpus (DWTC) [19] or the WDC corpora [20].
      </p>
      <p>
        Multi-Table Repositories. In order to support machine learning on multi-table relational data, [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] published the CTU Prague Relational Learning Repository in 2015. Currently, there are 83 databases included. The SQLShare corpus [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] is a query workload dataset which includes 64 databases collected from real-world users (mainly researchers and scientists) of the web service SQLShare. Both repositories are thus much smaller than the corpora with data for single tables that are commonly used for table representation learning. Finally, the GitSchemas [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] repository contains 50k database schemas based on SQL files from public GitHub repositories. The information thus provides highly relevant insight into real-world databases. However, the repository lacks the content of the databases.
      </p>
      <p>
        Datasets based on Wikidata. For the SemTab challenge [21], where tabular data is matched to knowledge graphs, tables were built using data from Wikidata. Furthermore, Wikidata has been used to build datasets for named entity classification [22] and named entity disambiguation [23], as well as complex sequential question answering [24]. Moreover, [25] verbalize knowledge graph triples from Wikidata, and [26] create alignments between Wikidata triples and Wikipedia abstracts.
      </p>
      <p>
        Finally, we hope that the corpus fosters more research on models for table representation that can take data from multiple connected tables into account. In the following, we show the results of an early version of such a model that is enabled by the WikiDBs corpus.
      </p>
      <p>3. Experiments</p>
      <p>
        In this section, we conduct initial experiments on our new WikiDBs dataset. We present results for three different tasks, namely predicting missing values, column names and table names. We model all these tasks as generative tasks rather than classification tasks in order to be able to work with unseen data. We apply the architecture introduced in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which is a combination of language models (LMs) and graph neural networks (GNNs). Similar to [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], we compare our model to RPT [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] as a baseline.
      </p>
      <p>
        Pre-Training Procedure. Following [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], we train the language model BART [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and the GNN separately. We split the 10,000 databases from our dataset into 80/10/10 percent for training, validation and testing. First, we fine-tune a pre-trained BART model from the Huggingface library [18] on single tables from our dataset for 250 epochs with an initial learning rate of 10<sup>-4</sup> and a cosine annealing schedule to reconstruct masked table names, column names and cell values. Next, for training the GNN on the databases, we use the fine-tuned BART encoder to compute node embeddings for a database and the BART decoder to convert the representation of a masked node in the GNN back into natural language text. In our experiments, we limit a database to a table and its direct neighbors. We train the GNN for 500 epochs. The checkpoint with the best accuracy on the validation set is used for evaluation, and we report the results as an average of three runs.
      </p>
      <p>
        Initial Results. The results of our initial experiments on the WikiDBs corpus are shown in Table 2. Compared to the RPT [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] baseline that is only able to work on single tables, our model achieves a higher performance for all three tasks. Incorporating the context of multiple tables of the databases increases the F1 score especially for the task of column name detection, from 48.98% for the BART model to 69.37% for our model.
      </p>
      <p>5. Conclusion &amp; Future Work</p>
      <p>
        To support representation learning on databases, we introduced our new dataset WikiDBs, which is based on data from Wikidata, and released a first corpus with 10,000 databases. In the future, we plan to extend the dataset and look into opportunities to leverage the corpus for new model architectures or for fine-tuning large language models such as GPT-based [27] models on table data.
      </p>
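      <p>The database-level 80/10/10 split used in the pre-training procedure above can be sketched as follows; shuffling with a fixed seed is our own assumption for reproducibility, not a detail stated in the paper.</p>
      <p>
```python
import random

def split_databases(db_ids, seed=42):
    """Shuffle database ids and split them 80/10/10 into train/val/test."""
    ids = list(db_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(0.8 * len(ids))
    n_val = int(0.1 * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = split_databases(range(10_000))
print(len(train), len(val), len(test))  # 8000 1000 1000
```
      </p>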
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <p>We thank Till Döhmen and Madelon Hulsebos for generously providing the table statistics from their GitSchemas dataset. This work has been supported by the BMBF and the state of Hesse as part of the NHR Program and the BMBF project KompAKI (grant number 02L19C150), as well as the HMWK cluster project 3AI. Finally, we want to thank hessian.AI and DFKI Darmstadt for their support.</p>
      <p>[18] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP 2020 - Demos, Online, November 16-20, 2020, Association for Computational Linguistics, 2020, pp. 38–45. URL: https://doi.org/10.18653/v1/2020.emnlp-demos.6.</p>
      <p>[19] J. Eberius, M. Thiele, K. Braunschweig, W. Lehner, Top-k entity augmentation using consistent set covering, SSDBM ’15, 2015. doi:10.1145/2791347.2791353.</p>
      <p>[20] O. Lehmberg, D. Ritze, R. Meusel, C. Bizer, A large public corpus of web tables containing time and context metadata, in: Proceedings of the 25th International Conference on World Wide Web, WWW 2016, Montreal, Canada, April 11-15, 2016, Companion Volume, ACM, 2016, pp. 75–76. URL: https://doi.org/10.1145/2872518.2889386.</p>
      <p>[21] E. Jiménez-Ruiz, O. Hassanzadeh, V. Efthymiou, J. Chen, K. Srinivas, Semtab 2019: Resources to benchmark tabular data to knowledge graph matching systems, in: The Semantic Web - 17th International Conference, ESWC 2020, Heraklion, Crete, Greece, May 31-June 4, 2020, Proceedings, volume 12123 of Lecture Notes in Computer Science, Springer, 2020, pp. 514–530. URL: https://doi.org/10.1007/978-3-030-49461-2_30.</p>
      <p>[22] J. Geiß, A. Spitz, M. Gertz, Neckar: A named entity classifier for wikidata, in: Language Technologies for the Challenges of the Digital Age - 27th International Conference, GSCL 2017, Berlin, Germany, September 13-14, 2017, Proceedings, volume 10713 of Lecture Notes in Computer Science, Springer, 2017, pp. 115–129. URL: https://doi.org/10.1007/978-3-319-73706-5_10.</p>
      <p>[23] A. Cetoli, M. Akbari, S. Bragaglia, A. D. O’Harney, M. Sloan, Named entity disambiguation using deep learning on graphs, CoRR abs/1810.09164 (2018). URL: http://arxiv.org/abs/1810.09164. arXiv:1810.09164.</p>
      <p>[24] A. Saha, V. Pahuja, M. M. Khapra, K. Sankaranarayanan, S. Chandar, Complex sequential question answering: Towards learning to converse over linked question answer pairs with a knowledge graph, in: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, AAAI Press, 2018, pp. 705–713. URL: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17181.</p>
      <p>[25] G. Amaral, O. Rodrigues, E. Simperl, WDV: A broad data verbalisation dataset built from wikidata, in: The Semantic Web - ISWC 2022 - 21st International Semantic Web Conference, Virtual Event, October 23-27, 2022, Proceedings, volume 13489 of Lecture Notes in Computer Science, Springer, 2022, pp. 556–574. URL: https://doi.org/10.1007/978-3-031-19433-7_32.</p>
      <p>[26] H. ElSahar, P. Vougiouklis, A. Remaci, C. Gravier, J. S. Hare, F. Laforest, E. Simperl, T-rex: A large scale alignment of natural language with knowledge base triples, in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018, European Language Resources Association (ELRA), 2018. URL: http://www.lrec-conf.org/proceedings/lrec2018/summaries/632.html.</p>
      <p>[27] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL: https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lees</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yu</surname>
          </string-name>
          <article-title>TURL: table understanding through representation learning</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>14</volume>
          (
          <year>2020</year>
          )
          <fpage>307</fpage>
          -
          <lpage>319</lpage>
          . URL: http://www.vldb.org/pvldb/vol14/p307-deng.pdf .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Badaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saeed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <article-title>Transformers for Tabular Data Representation: A Survey of Models and Applications</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>11</volume>
          (
          <year>2023</year>
          )
          <fpage>227</fpage>
          -
          <lpage>249</lpage>
          . URL: https://doi.org/10.1162/tacl_a_00544.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cahoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Savelieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Floratou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Curino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Henkel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Weimer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gustafsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wydrowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Batoukov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Deep</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Emani</surname>
          </string-name>
          ,
          <article-title>The need for tabular representation learning: An industry perspective</article-title>
          ,
          <source>in: NeurIPS 2022 First Table Representation Workshop</source>
          ,
          <year>2022</year>
          . URL: https://openreview.net/forum?id=jk4B84qmlXJ.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hulsebos</surname>
          </string-name>
          , Ç. Demiralp,
          <string-name>
            <given-names>P.</given-names>
            <surname>Groth</surname>
          </string-name>
          ,
          <article-title>Gittables: A large-scale corpus of relational tables</article-title>
          ,
          <source>arXiv preprint arXiv:2106.07258</source>
          (
          <year>2021</year>
          ). URL: https://arxiv. org/abs/2106.07258.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Bhagavatula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Noraset</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Downey</surname>
          </string-name>
          ,
          <article-title>TabEL: Entity linking in web tables</article-title>
          ,
          <source>in: The Semantic Web - ISWC 2015 - 14th International Semantic Web Conference</source>
          , Bethlehem, PA, USA, October
          <volume>11</volume>
          -
          <issue>15</issue>
          ,
          <year>2015</year>
          , Proceedings,
          Part I
          , volume
          <volume>9366</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2015</year>
          , pp.
          <fpage>425</fpage>
          -
          <lpage>441</lpage>
          . URL: https://doi.org/10.1007/978-3-319-25007-6_25.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Huppler</surname>
          </string-name>
          ,
          <article-title>The art of building a good benchmark, in: Performance Evaluation and Benchmarking</article-title>
          ,
          <source>First TPC Technology Conference, TPCTC</source>
          <year>2009</year>
          , Lyon, France,
          <source>August 24-28</source>
          ,
          <year>2009</year>
          , Revised Selected Papers, volume
          <volume>5895</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2009</year>
          , pp.
          <fpage>18</fpage>
          -
          <lpage>30</lpage>
          . URL: https://doi.org/10.1007/978-3-642-10424-4_3.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Poess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kollar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Larson</surname>
          </string-name>
          ,
          <article-title>TPC-DS, taking decision support benchmarking to the next level</article-title>
          ,
          <source>in: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, SIGMOD '02</source>
          , Association for Computing Machinery, New York, NY, USA,
          <year>2002</year>
          , pp.
          <fpage>582</fpage>
          -
          <lpage>587</lpage>
          . URL: https://doi.org/10.1145/564691.564759.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vogelsgesang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Haubenschild</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Finis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kemper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Leis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mühlbauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Then</surname>
          </string-name>
          ,
          <article-title>Get real: How benchmarks fail to represent the real world</article-title>
          ,
          <source>in: Proceedings of the 7th International Workshop on Testing Database Systems, DBTest@SIGMOD 2018, Houston, TX, USA, June 15, 2018</source>
          , ACM,
          <year>2018</year>
          , pp. 1:
          <fpage>1</fpage>
          -1:
          <lpage>6</lpage>
          . URL: https://doi.org/10.1145/3209950.3209952.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Pollard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-w. H.</given-names>
            <surname>Lehman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghassemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Moody</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Szolovits</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Anthony</given-names>
            <surname>Celi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Mark</surname>
          </string-name>
          ,
          <article-title>MIMIC-III, a freely accessible critical care database</article-title>
          ,
          <source>Scientific Data 3</source>
          (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          . URL: https://doi.org/10.1038/sdata.2016.35.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          Wikidata, https://www.wikidata.org.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Vogel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hilprecht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Binnig</surname>
          </string-name>
          ,
          <article-title>Towards foundation models for relational databases</article-title>
          [vision paper],
          <source>NeurIPS 2022 First Table Representation Workshop</source>
          (
          <year>2022</year>
          ). URL: https://openreview.net/forum?id=s1KlNOQq71_.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Döhmen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hulsebos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Beecks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schelter</surname>
          </string-name>
          ,
          <article-title>GitSchemas: A dataset for automating relational data preparation tasks</article-title>
          ,
          <source>in: 38th IEEE International Conference on Data Engineering Workshops, ICDE Workshops 2022, Kuala Lumpur, Malaysia, May 9, 2022</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>78</lpage>
          . URL: https://doi.org/10.1109/ICDEW55742.2022.00016.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Motl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Schulte</surname>
          </string-name>
          ,
          <article-title>The CTU Prague relational learning repository</article-title>
          ,
          <source>CoRR abs/1511.03086</source>
          (
          <year>2015</year>
          ). URL: http://arxiv.org/abs/1511.03086. arXiv:1511.03086.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moritz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Halperin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Howe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lazowska</surname>
          </string-name>
          ,
          <article-title>SQLShare: Results from a multi-year SQL-as-a-service experiment</article-title>
          ,
          <source>in: Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016</source>
          , ACM,
          <year>2016</year>
          , pp.
          <fpage>281</fpage>
          -
          <lpage>293</lpage>
          . URL: https://doi.org/10.1145/2882903.2882957.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          Downloaded the 'latest-all.json.gz' dump on February 22,
          <year>2023</year>
          from https://dumps.wikimedia.org/wikidatawiki/entities/.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>N.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ouzzani</surname>
          </string-name>
          ,
          <article-title>RPT: Relational pre-trained transformer is almost all you need towards democratizing data preparation</article-title>
          ,
          <source>Proc. VLDB Endow.</source>
          <volume>14</volume>
          (
          <year>2021</year>
          )
          <fpage>1254</fpage>
          -
          <lpage>1261</lpage>
          . URL: http://www.vldb.org/pvldb/vol14/p1254-tang.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghazvininejad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020</source>
          , Association for Computational Linguistics,
          <year>2020</year>
          , pp.
          <fpage>7871</fpage>
          -
          <lpage>7880</lpage>
          . URL: https://doi.org/10.18653/v1/2020.acl-main.703.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>