<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>WikiDBs: A Corpus Of Relational Databases From Wikidata</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Liane Vogel</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carsten Binnig</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DFKI Darmstadt</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Technical University of Darmstadt</institution>
          ,
          <addr-line>Darmstadt</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In recent years, deep learning on tabular data, also known as tabular representation learning, has gained growing interest. However, representation learning for relational databases with multiple tables is still an under-explored area, which might be due to the lack of openly available resources. Therefore, we introduce WikiDBs, a novel open-source corpus of 10,000 relational databases. Each database consists of multiple tables that are connected by foreign keys. The dataset is based on Wikidata and aims to follow the characteristics of real-world databases. In this paper, we describe the dataset and the method for creating it. We also conduct preliminary experiments on the tasks of imputing missing values and predicting column and table names in the databases.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The Importance of Representation Learning. While text and images often dominate the field of representation learning, considerable progress has recently also been made on other modalities such as tabular data [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. This is important since a non-negligible amount of data is expressed in tabular form, in particular enterprise data [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. For individual tables, several approaches have been developed to solve downstream tasks such as entity matching or missing value imputation. Several large-scale datasets, such as GitTables [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and WikiTables [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], provide the necessary training data, as data availability is essential for the development of proficient deep learning models.
      </p>
      <p>
        Missing Large Corpora for Relational Databases. For relational databases with multiple tables that are linked with foreign keys, however, there is a lack of both large openly available training data and deep neural network architectures that can incorporate the context of multiple related tables. Collecting large corpora of relational data is non-trivial: due to the sensitivity of data stored in relational databases, real-world enterprise databases are typically kept private and are not accessible to the representation learning community, resulting in a lack of openly available databases.
      </p>
      <p>
        The Need for Real-World Data. As a consequence, in the field of database research, it is common to use synthetic databases such as the datasets in the TPC benchmarks [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. This may be sufficient for testing database internals, but for representation learning on relational databases, which requires a large number of different databases, the existing benchmarks are too few and not diverse enough in terms of the domains they cover. In addition, automatic data generation is not a valid option, as current methods are only able to generate numerical and categorical data, but not the meaningful text contained in real-world databases. According to [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], a significant part of the data in databases is saved as text, so in order to have realistic training data, it is important to use databases that contain not only numerical and categorical, but also textual data.
      </p>
      <p>
        Towards a New Corpus of Relational Databases. We aim to support research on representation learning for relational data by creating a new, large-scale resource for tabular representation learning on relational databases. Hereby, our goal is to have realistic data that is not synthetically generated. While a few real-world relational databases are openly available, such as the Internet Movie Database (IMDb) or the MIMIC database [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], no large corpus containing many relational databases exists. Therefore, we present an approach that uses the Wikidata knowledge base [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] as the basis for deriving a large corpus of relational databases. Along with this paper, we are releasing a new, open-access dataset called WikiDBs — a corpus of 10,000 relational databases extracted from Wikidata covering a wide spectrum of diverse domains.
      </p>
      <p>
        Initial Results Using the Corpus. In this paper, we compare the characteristics of our corpus to statistics available for real-world relational databases to justify the design of our corpus. Furthermore, to showcase that the corpus can be used to learn representations that are informed by multiple tables in a relational database, we follow an approach presented in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. In this work, we introduced the vision of new models for representation learning on relational databases. Here, we demonstrate first experiments of such a model which is trained on our new WikiDBs dataset.
      </p>
      <p>
        Contributions of this Work. To summarize, this paper makes the following contributions: (1) We introduce a novel method of extracting multi-table relational databases from Wikidata. (2) We release a first large-scale corpus of relational data and derive important statistics which we compare to available characteristics of real-world relational databases. (3) We show first experimental results on our dataset for the tasks of missing value imputation as well as table and column name prediction.
      </p>
      <p>
        Joint Workshops at 49th International Conference on Very Large Data Bases (VLDBW’23) — TaDA’23: Tabular Data Analysis Workshop, August 28 - September 1, 2023, Vancouver, Canada. Corresponding author: liane.vogel@cs.tu-darmstadt.de (L. Vogel); carsten.binnig@cs.tu-darmstadt.de (C. Binnig). ORCID: 0000-0001-9768-8873 (L. Vogel); 0000-0002-2744-7836 (C. Binnig).
      </p>
      <p>Figure 1: Example database extracted from Wikidata, with a foreign-key relationship between two tables; the schema is stored as a JSON file.</p>
      <p>2. The WikiDBs Dataset</p>
      <p>
        Real-world enterprise data often includes a large number of columns, e.g. 18.7 on average as reported in the SQLShare corpus [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. For our corpus, we store the schema information, which includes the table structure and the foreign keys. For the schema information, we use the same format that is used by the GitSchemas dataset (shown in Figure 1, right). Furthermore, the individual table data is made available in CSV format.
      </p>
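      <p>As a minimal sketch of how such a database could be consumed, the snippet below pairs a schema object with per-table CSV content and resolves a foreign key. The exact JSON field names and file layout are our assumptions for illustration, not the actual GitSchemas format.</p>
      <p>
```python
import csv
import io

# Hypothetical schema in the spirit of the JSON schema format described in
# the paper; the field names ("tables", "foreign_keys", ...) are assumptions.
schema = {
    "tables": {
        "literary_work": {"columns": ["qid", "title", "author_qid"]},
        "author": {"columns": ["qid", "name"]},
    },
    "foreign_keys": [
        {"table": "literary_work", "column": "author_qid",
         "ref_table": "author", "ref_column": "qid"},
    ],
}

# Toy table contents, one CSV per table (WikiDBs ships table data as CSV).
csv_files = {
    "literary_work": "qid,title,author_qid\nQ208460,1984,Q3335\n",
    "author": "qid,name\nQ3335,George Orwell\n",
}

def load_table(name):
    """Parse one table's CSV content into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(csv_files[name])))

def resolve_foreign_key(row, fk):
    """Follow a foreign key from `row` to the matching row in the referenced table."""
    target = load_table(fk["ref_table"])
    return next(r for r in target if r[fk["ref_column"]] == row[fk["column"]])

fk = schema["foreign_keys"][0]
book = load_table("literary_work")[0]
author = resolve_foreign_key(book, fk)
print(author["name"])  # George Orwell
```
      </p>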
      <sec id="sec-1-1">
        <title>2.2. Analysis and Statistics</title>
      </sec>
      <sec id="sec-1-2">
        <title>2.1. Design Principles</title>
        <p>For our WikiDBs dataset, we want the characteristics to reflect the properties of real-world databases. As enterprises do not share the statistics of their databases, we have to rely on the characteristics of available public resources and model our dataset in a similar way. In Table 1, we have collected characteristics of existing public resources, such as the number of tables in a database and the average number of columns and rows per table.</p>
        <p>
          For deriving statistics, we found only two existing collections of relational databases: the Relational Learning Repository from CTU Prague [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] (which also includes TPC-H and IMDb) and the SQLShare [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] repository. However, these repositories include only a small number of relational databases. Therefore, we also include the statistics of the significantly larger datasets GitSchemas [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] — which only contains schema information — and GitTables [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] — a corpus of individual tables. For the distribution of how many tables we include per database, we follow the distribution of the GitSchemas dataset, which is based on a large number of database schemas found in public git repositories. We include on average a higher number of columns per table than e.g. the CTU Prague dataset or GitSchemas, because real-world enterprise data also often includes a large number of columns.
        </p>
        <p>
          Next, we analyze the resulting statistics of the derived corpus, which is published with this paper. Overall, as mentioned before, our dataset consists of 10,000 databases that each have between two and nine tables which are connected via foreign keys. The statistics of WikiDBs are compared to those of existing open resources in Table 1. In total, our dataset contains 42,472 tables; the median number of tables per database is 4. On average, each table has 17.9 columns and 46.3 rows. The distribution of the number of tables per database is visualized in Figure 2.
        </p>
        <p>2.3. Methodology of Construction</p>
        <p>
          In this section, we describe the procedure of how we derive relational databases based on Wikidata.
        </p>
        <p>
          Wikidata Data Format. The data in Wikidata is stored in a document-oriented database, where documents represent items that are instances of different concepts, such as artists or paintings. In this way, concepts closely resemble the notion of tables. Every item in Wikidata is associated with a unique identifier, the so-called QID. The item representing the book 1984 by George Orwell, for example, has the id Q208460. Properties of items are stored in the form of key-value pairs, where property names are saved with their corresponding value. Most important are properties (e.g. the publication date (P577)) that resemble attributes of a table row. Moreover, properties also include other information such as the related concept of an item; e.g. the book 1984 has the property instance of (P31) literary work (Q7725634).
        </p>
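        <p>As a concrete sketch of this design choice, the number of tables per database can be drawn from a categorical distribution over two to nine tables. Only the 26% mass on two-table databases is taken from the paper (Figure 2); the remaining weights below are illustrative assumptions, not the actual GitSchemas distribution.</p>
        <p>
```python
import random

TABLE_COUNTS = [2, 3, 4, 5, 6, 7, 8, 9]
# 0.26 for two tables is stated in the paper; the other weights are
# invented placeholders that merely sum to 1.0.
WEIGHTS = [0.26, 0.22, 0.17, 0.12, 0.09, 0.06, 0.05, 0.03]

def sample_table_counts(n_databases, seed=0):
    """Draw one table count per database from the categorical distribution."""
    rng = random.Random(seed)
    return [rng.choices(TABLE_COUNTS, weights=WEIGHTS)[0] for _ in range(n_databases)]

counts = sample_table_counts(10_000)
print(min(counts), max(counts))  # every database gets between 2 and 9 tables
```
        </p>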
        <p>
          Creation of a Table. The creation of a relational table from Wikidata is thus made possible by the instance of (or subclass of) relations in Wikidata. The information that the book 1984 is an instance of literary work allows us to search Wikidata for all other items that are also tagged with the information that they are an instance of literary work.
        </p>
        <p>
          Figure 2: Distribution of the number of tables over the dataset. The distribution is modeled analogously to the GitSchemas [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] dataset, e.g. 26% of the databases contain 2 tables.
        </p>
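        <p>The grouping step described above can be sketched as follows. Q208460 (the book 1984), Q7725634 (literary work), and the properties P31 and P577 are taken from the paper; the other items, QIDs and values are invented for illustration, and the item dicts are simplified stand-ins for real Wikidata JSON documents.</p>
        <p>
```python
# Group Wikidata-style items into table rows via their "instance of" (P31) value.
items = [
    {"qid": "Q208460", "label": "1984",
     "claims": {"P31": "Q7725634", "P577": "1949"}},
    {"qid": "Q999999", "label": "Some Other Novel",   # hypothetical item
     "claims": {"P31": "Q7725634"}},
    {"qid": "Q888888", "label": "Some Galaxy",        # hypothetical non-book item
     "claims": {"P31": "Q000000"}},
]

def rows_for_concept(items, concept_qid):
    """All items tagged as an instance of the given concept become rows of one table."""
    return [item for item in items if item["claims"].get("P31") == concept_qid]

literary_works = rows_for_concept(items, "Q7725634")
print([item["label"] for item in literary_works])  # ['1984', 'Some Other Novel']
```
        </p>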
        <p>A challenge in Wikidata is that every item (e.g., each book) might use a different set of properties. For example, for some books the year the book was published is available, while for others it is not. For constructing tables, we use the union of all properties. If a value is missing, we store a NULL value in the table row. To avoid constructing tables with highly sparse columns, we prune columns of a table where the fraction of NULL values is beyond a configurable threshold.2</p>
        <p>Creation of a Database. For each created table, some columns contain references to concepts that are also saved as items in Wikidata. We use those columns that contain Wikidata items to build further tables for the database. For constructing relational databases, we randomly select a concept in Wikidata as a starting point and then traverse relationships to other tables randomly. For example, the table of literary work contains a column author which is a reference to another item, which allows us to build an additional table of authors (linked via a foreign key to the table literary work) where each row contains information on an author and the columns contain e.g. their date and place of birth or nationality (Figure 1).</p>
        <p>
          Implementation Details. For constructing our corpus, we use the Wikidata JSON dump [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] as a starting point for creating the dataset. To enable efficient querying of data in the dump, we additionally build up a lookup structure which maps concepts (i.e., tables) to potential items (i.e., rows). The values of each item correspond to the rows of the tables; the properties of the values form the column headers. The lookup structure allows us to quickly navigate the content in the dump and extract data for individual tables.
        </p>
        <p>2.4. Discussion</p>
        <p>
          We clearly see this work and the corpus released with this paper only as a starting point to foster further research. While this is the very first large-scale corpus of relational databases, we believe that more work is necessary to extend the corpus. First, at the moment we provide tables of sizes which closely resemble the sizes of tables found in repositories such as on GitHub. However, real-world enterprise databases often also contain a few very large tables (e.g., the orders table of an online shop). For Wikidata, we found approximately 20 concepts such as scholarly article, galaxy or protein that have more than 400k items, enabling the creation of very large tables and databases. Furthermore, with our repository we focus on English-language content in the first version, but our method allows to easily create databases in other languages included in Wikidata.
        </p>
        <p>2For the corpus released with this paper, we prune columns with more than 20% NULL values.</p>
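        <p>The table-construction rule described above (columns as the union of all item properties, NULL filling, and pruning at the 20% threshold used for the released corpus) can be sketched as follows; the property names and values are illustrative only.</p>
        <p>
```python
def build_table(items, null_threshold=0.2):
    """Build rows over the union of all properties, then prune sparse columns."""
    columns = sorted({prop for item in items for prop in item})
    # Missing properties become NULL (None) in the corresponding row.
    rows = [{col: item.get(col) for col in columns} for item in items]
    # Prune a column when its fraction of NULL values exceeds the threshold.
    pruned = {col for col in columns
              if sum(row[col] is None for row in rows) / len(rows) > null_threshold}
    return [{col: row[col] for col in columns if col not in pruned} for row in rows]

items = [
    {"title": "Book A", "publication date": "1949", "author": "X"},
    {"title": "Book B", "publication date": "1962", "author": "Y"},
    {"title": "Book C", "publication date": "2001", "author": "Z"},
    {"title": "Book D", "author": "W"},  # publication date missing -> NULL
]
table = build_table(items)
# "publication date" is NULL in 1 of 4 rows (25%, above 20%), so it is pruned.
print(sorted(table[0]))  # ['author', 'title']
```
        </p>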
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Related Work</title>
      <p>
        In the following, we summarize related work grouped by different directions.
      </p>
      <p>
        Single-Table Repositories. So far, tabular representation learning mostly focuses on learning representations of single tables. Commonly used corpora are for example GitTables [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], WikiTables [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the Dresden Web Table Corpus (DWTC) [19] or the WDC corpora [20].
      </p>
      <p>
        Multi-Table Repositories. In order to support machine learning on multi-table relational data, [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] published the CTU Prague Relational Learning Repository in 2015. Currently, there are 83 databases included. The SQLShare corpus [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] is a query workload dataset which includes 64 databases collected from real-world users (mainly researchers and scientists) of the web service SQLShare. Both repositories are thus much smaller than the corpora with data for single tables that are commonly used for table representation learning. Finally, the GitSchemas [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] repository contains 50k database schemas based on SQL files from public GitHub repositories. The information thus provides highly relevant insight into real-world databases. However, the repository lacks the content of the databases.
      </p>
      <p>
        Datasets based on Wikidata. For the SemTab challenge [21], where tabular data is matched to knowledge graphs, tables were built using data from Wikidata. Furthermore, Wikidata has been used to build datasets for named entity classification [22] and named entity disambiguation [23], as well as complex sequential question answering [24]. Moreover, [25] verbalize knowledge graph triples from Wikidata, and [26] create alignments between Wikidata triples and Wikipedia abstracts.
      </p>
      <p>
        Finally, we hope that the corpus fosters more research on models for table representation that can take data from multiple connected tables into account. In the following, we show the results of an early version of such a model that is enabled by the WikiDBs corpus.
      </p>
      <p>3. Experiments</p>
      <p>
        In this section, we conduct initial experiments on our new WikiDBs dataset. We present results for three different tasks, namely predicting missing values, column names and table names. We model all these tasks as generative tasks rather than classification tasks in order to be able to work with unseen data. We apply the architecture introduced in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which is a combination of language models (LMs) and graph neural networks (GNNs). Similar to [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], we compare our model to RPT [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] as a baseline.
      </p>
      <p>
        Pre-Training Procedure. Following [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], we train the language model BART [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and the GNN separately. We split the 10,000 databases from our dataset into 80/10/10 percent for training, validation and testing. First, we fine-tune a pre-trained BART model from the Huggingface library [18] on single tables from our dataset for 250 epochs with an initial learning rate of 10<sup>-4</sup> and a cosine annealing schedule to reconstruct masked table names, column names and cell values. Next, for training the GNN on the databases, we use the fine-tuned BART encoder to compute node embeddings for a database and the BART decoder to convert the representation of a masked node in the GNN back into natural language text. In our experiments, we limit a database to a table and its direct neighbors. We train the GNN for 500 epochs. The checkpoint with the best accuracy on the validation set is used for evaluation, and we report the results as an average of three runs.
      </p>
      <p>
        Initial Results. The results of our initial experiments on the WikiDBs corpus are shown in Table 2. Compared to the RPT [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] baseline that is only able to work on single tables, our model achieves a higher performance for all three tasks. Incorporating the context of multiple tables of the databases increases the F1 score especially for the task of column name detection, from 48.98% for the BART model to 69.37% for our model.
      </p>
      <p>5. Conclusion &amp; Future Work</p>
      <p>
        To support representation learning on databases, we introduced our new dataset WikiDBs, which is based on data from Wikidata, and released a first corpus with 10,000 databases. In the future, we plan to extend the dataset and look into opportunities to leverage the corpus for new model architectures or for fine-tuning large language models such as GPT-based [27] models on table data.
      </p>
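      <p>The database-level 80/10/10 split used in the pre-training procedure above can be sketched as follows; shuffling with a fixed seed is our own assumption for reproducibility, not a detail stated in the paper.</p>
      <p>
```python
import random

def split_databases(db_ids, seed=42):
    """Shuffle database ids and split them 80/10/10 into train/val/test."""
    ids = list(db_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(0.8 * len(ids))
    n_val = int(0.1 * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = split_databases(range(10_000))
print(len(train), len(val), len(test))  # 8000 1000 1000
```
      </p>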
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <p>We thank Till Döhmen and Madelon Hulsebos for generously providing the table statistics from their GitSchemas dataset. This work has been supported by the BMBF and the state of Hesse as part of the NHR Program and the BMBF project KompAKI (grant number 02L19C150), as well as the HMWK cluster project 3AI. Finally, we want to thank hessian.AI and DFKI Darmstadt for their support.</p>
      <p>[18] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP 2020 - Demos, Online, November 16-20, 2020, Association for Computational Linguistics, 2020, pp. 38–45. URL: https://doi.org/10.18653/v1/2020.emnlp-demos.6.</p>
      <p>[19] J. Eberius, M. Thiele, K. Braunschweig, W. Lehner, Top-k entity augmentation using consistent set covering, SSDBM ’15, 2015. doi:10.1145/2791347.2791353.</p>
      <p>[20] O. Lehmberg, D. Ritze, R. Meusel, C. Bizer, A large public corpus of web tables containing time and context metadata, in: Proceedings of the 25th International Conference on World Wide Web, WWW 2016, Montreal, Canada, April 11-15, 2016, Companion Volume, ACM, 2016, pp. 75–76. URL: https://doi.org/10.1145/2872518.2889386.</p>
      <p>[21] E. Jiménez-Ruiz, O. Hassanzadeh, V. Efthymiou, J. Chen, K. Srinivas, Semtab 2019: Resources to benchmark tabular data to knowledge graph matching systems, in: The Semantic Web - 17th International Conference, ESWC 2020, Heraklion, Crete, Greece, May 31-June 4, 2020, Proceedings, volume 12123 of Lecture Notes in Computer Science, Springer, 2020, pp. 514–530. URL: https://doi.org/10.1007/978-3-030-49461-2_30.</p>
      <p>[22] J. Geiß, A. Spitz, M. Gertz, Neckar: A named entity classifier for wikidata, in: Language Technologies for the Challenges of the Digital Age - 27th International Conference, GSCL 2017, Berlin, Germany, September 13-14, 2017, Proceedings, volume 10713 of Lecture Notes in Computer Science, Springer, 2017, pp. 115–129. URL: https://doi.org/10.1007/978-3-319-73706-5_10.</p>
      <p>[23] A. Cetoli, M. Akbari, S. Bragaglia, A. D. O’Harney, M. Sloan, Named entity disambiguation using deep learning on graphs, CoRR abs/1810.09164 (2018). URL: http://arxiv.org/abs/1810.09164. arXiv:1810.09164.</p>
      <p>[24] A. Saha, V. Pahuja, M. M. Khapra, K. Sankaranarayanan, S. Chandar, Complex sequential question answering: Towards learning to converse over linked question answer pairs with a knowledge graph, in: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, AAAI Press, 2018, pp. 705–713. URL: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17181.</p>
      <p>[25] G. Amaral, O. Rodrigues, E. Simperl, WDV: A broad data verbalisation dataset built from wikidata, in: The Semantic Web - ISWC 2022 - 21st International Semantic Web Conference, Virtual Event, October 23-27, 2022, Proceedings, volume 13489 of Lecture Notes in Computer Science, Springer, 2022, pp. 556–574. URL: https://doi.org/10.1007/978-3-031-19433-7_32.</p>
      <p>[26] H. ElSahar, P. Vougiouklis, A. Remaci, C. Gravier, J. S. Hare, F. Laforest, E. Simperl, T-rex: A large scale alignment of natural language with knowledge base triples, in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018, European Language Resources Association (ELRA), 2018. URL: http://www.lrec-conf.org/proceedings/lrec2018/summaries/632.html.</p>
      <p>[27] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL: https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lees</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yu</surname>
          </string-name>
          <article-title>TURL: table understanding through representation learning</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>14</volume>
          (
          <year>2020</year>
          )
          <fpage>307</fpage>
          -
          <lpage>319</lpage>
          . URL: http://www.vldb.org/pvldb/vol14/p307-deng.pdf .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Badaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saeed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <article-title>Transformers for Tabular Data Representation: A Survey of Models and Applications</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>11</volume>
          (
          <year>2023</year>
          )
          <fpage>227</fpage>
          -
          <lpage>249</lpage>
          . URL: https://doi.org/10.1162/tacl_a_00544.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cahoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Savelieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Floratou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Curino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Henkel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Weimer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gustafsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wydrowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Batoukov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Deep</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Emani</surname>
          </string-name>
          ,
          <article-title>The need for tabular representation learning: An industry perspective</article-title>
          ,
          <source>in: NeurIPS 2022 First Table Representation Workshop</source>
          ,
          <year>2022</year>
          . URL: https://openreview.net/forum?id=jk4B84qmlXJ.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hulsebos</surname>
          </string-name>
          , Ç. Demiralp,
          <string-name>
            <given-names>P.</given-names>
            <surname>Groth</surname>
          </string-name>
          ,
          <article-title>Gittables: A large-scale corpus of relational tables</article-title>
          ,
          <source>arXiv preprint arXiv:2106.07258</source>
          (
          <year>2021</year>
          ). URL: https://arxiv. org/abs/2106.07258.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Bhagavatula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Noraset</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Downey</surname>
          </string-name>
          ,
          <article-title>TabEL: Entity linking in web tables</article-title>
          ,
          <source>in: The Semantic Web - ISWC 2015 - 14th International Semantic Web Conference</source>
          , Bethlehem, PA, USA, October
          <volume>11</volume>
          -
          <issue>15</issue>
          ,
          <year>2015</year>
          , Proceedings,
          Part I
          , volume
          <volume>9366</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2015</year>
          , pp.
          <fpage>425</fpage>
          -
          <lpage>441</lpage>
          . URL: https://doi.org/10.1007/978-3-319-25007-6_25.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Huppler</surname>
          </string-name>
          ,
          <article-title>The art of building a good benchmark, in: Performance Evaluation and Benchmarking</article-title>
          ,
          <source>First TPC Technology Conference, TPCTC</source>
          <year>2009</year>
          , Lyon, France,
          <source>August 24-28</source>
          ,
          <year>2009</year>
          , Revised Selected Papers, volume
          <volume>5895</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2009</year>
          , pp.
          <fpage>18</fpage>
          -
          <lpage>30</lpage>
          . URL: https://doi.org/10.1007/978-3-642-10424-4_3.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Poess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kollar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Larson</surname>
          </string-name>
          ,
          <article-title>TPC-DS, taking decision support benchmarking to the next level</article-title>
          ,
          <source>in: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, SIGMOD '02</source>
          , Association for Computing Machinery, New York, NY, USA,
          <year>2002</year>
          , pp.
          <fpage>582</fpage>
          -
          <lpage>587</lpage>
          . URL: https://doi.org/10.1145/564691.564759.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vogelsgesang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Haubenschild</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Finis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kemper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Leis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mühlbauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Then</surname>
          </string-name>
          ,
          <article-title>Get real: How benchmarks fail to represent the real world</article-title>
          ,
          <source>in: Proceedings of the 7th International Workshop on Testing Database Systems, DBTest@SIGMOD 2018, Houston, TX, USA, June 15, 2018</source>
          , ACM,
          <year>2018</year>
          , pp. 1:
          <fpage>1</fpage>
          -1:
          <lpage>6</lpage>
          . URL: https://doi.org/10.1145/3209950.3209952.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Pollard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-w. H.</given-names>
            <surname>Lehman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghassemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Moody</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Szolovits</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Anthony</given-names>
            <surname>Celi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Mark</surname>
          </string-name>
          ,
          <article-title>MIMIC-III, a freely accessible critical care database</article-title>
          ,
          <source>Scientific Data 3</source>
          (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          . URL: https://doi.org/10.1038/sdata.2016.35.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          Wikidata, https://www.wikidata.org.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Vogel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hilprecht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Binnig</surname>
          </string-name>
          ,
          <article-title>Towards foundation models for relational databases</article-title>
          [vision paper],
          <source>NeurIPS 2022 First Table Representation Workshop</source>
          (
          <year>2022</year>
          ). URL: https://openreview.net/forum?id=s1KlNOQq71_.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Döhmen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hulsebos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Beecks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schelter</surname>
          </string-name>
          ,
          <article-title>GitSchemas: A dataset for automating relational data preparation tasks</article-title>
          ,
          <source>in: 38th IEEE International Conference on Data Engineering Workshops, ICDE Workshops 2022, Kuala Lumpur, Malaysia, May 9, 2022</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>78</lpage>
          . URL: https://doi.org/10.1109/ICDEW55742.2022.00016.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Motl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Schulte</surname>
          </string-name>
          ,
          <article-title>The CTU Prague relational learning repository</article-title>
          ,
          <source>CoRR abs/1511.03086</source>
          (
          <year>2015</year>
          ). URL: http://arxiv.org/abs/1511.03086. arXiv:1511.03086.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moritz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Halperin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Howe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lazowska</surname>
          </string-name>
          ,
          <article-title>SQLShare: Results from a multi-year SQL-as-a-service experiment</article-title>
          ,
          <source>in: Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016</source>
          , ACM,
          <year>2016</year>
          , pp.
          <fpage>281</fpage>
          -
          <lpage>293</lpage>
          . URL: https://doi.org/10.1145/2882903.2882957.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          Downloaded the 'latest-all.json.gz' dump on February 22,
          <year>2023</year>
          from https://dumps.wikimedia.org/wikidatawiki/entities/.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>N.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ouzzani</surname>
          </string-name>
          ,
          <article-title>RPT: Relational pre-trained transformer is almost all you need towards democratizing data preparation</article-title>
          ,
          <source>Proc. VLDB Endow.</source>
          <volume>14</volume>
          (
          <year>2021</year>
          )
          <fpage>1254</fpage>
          -
          <lpage>1261</lpage>
          . URL: http://www.vldb.org/pvldb/vol14/p1254-tang.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghazvininejad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020</source>
          , Association for Computational Linguistics,
          <year>2020</year>
          , pp.
          <fpage>7871</fpage>
          -
          <lpage>7880</lpage>
          . URL: https://doi.org/10.18653/v1/2020.acl-main.703.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>