=Paper=
{{Paper
|id=Vol-3320/paper2
|storemode=property
|title=Wikary: A Dataset of N-ary Wikipedia Tables Matched to Qualified Wikidata Statements
|pdfUrl=https://ceur-ws.org/Vol-3320/paper2.pdf
|volume=Vol-3320
|authors=Igor Mazurek,Berend Wiewel,Benno Kruit
|dblpUrl=https://dblp.org/rec/conf/semweb/MazurekWK22
}}
==Wikary: A Dataset of N-ary Wikipedia Tables Matched to Qualified Wikidata Statements==
Igor Mazurek, Berend Wiewel and Benno Kruit
Vrije Universiteit Amsterdam, De Boelelaan 1105, 1081 HV Amsterdam, The Netherlands
Abstract
We introduce a dataset of almost 32,000 tables from 3 Wikipedia language versions which have been
matched to Wikidata statements with qualifiers at 98.4% precision. The tables express a diverse set of
n-ary relations which constitute a new target for semantic table interpretation research.
Keywords
Tabular Data, Knowledge Graph Matching, N-ary Relations, Qualifiers
1. Introduction and Background
Tabular data from databases, documents, or the web contain a wealth of information that
could be made more accessible, searchable, and useful by matching it with Knowledge Graphs
(KGs). However, integrating tabular data with KGs is still typically a manual process, requiring
in-depth knowledge of the KG schema and the domain of interest. Much progress has been made
in recent years, but fully automating this task remains an open challenge. In particular,
many systems have been developed that match table columns to semantic types (Column-Type
Annotation, CTA), table cells to KG entities (Cell-Entity Annotation, CEA), and pairs of table
columns to binary KG relations (Column-Property Annotation, CPA). One application of such
systems is the extraction of subject-property-object triples from each pair of columns, for
extending KGs with new factual statements. The quality of such systems may be evaluated
using a variety of benchmark datasets, with the goal of assessing performance across a range of
topical domains and controlled environments [1, 2, 3]. In order to create useful systems that
effectively match tabular data to KGs in practice, these benchmarks should therefore reflect the
diversity of tabular data as it occurs in real-world usage. Public benchmarks for this problem
are becoming increasingly realistic, incentivizing the development and evaluation of usable
systems [2].
However, much real-world tabular data does not express binary KG relations but rather
represents higher-order, n-ary relations instead [4, 5]. N-ary relations express statements
involving multiple (> 2) entities or values, which cannot be decomposed into independent,
atomic parts without compromising truthfulness, coherence, or completeness. For example,
consider the following statement:
SemTab'22: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, collocated with the 21st International Semantic Web Conference, 23-27 October 2022, Hybrid, Hangzhou, China
Corresponding author: Benno Kruit.
i.w.mazurek@student.vu.nl (I. Mazurek); b.wiewel@student.vu.nl (B. Wiewel); b.b.kruit@vu.nl (B. Kruit)
Example 1.1 (N-ary Statement)
The album Thriller by Michael Jackson reached the top-1 position in the US Billboard 200 chart
on February 26th, 1983.
This statement cannot be expressed by a single triple, but also cannot be decomposed into
independent parts that make sense on their own. Multiple statements like this can be naturally
represented in tabular form, and, indeed, real-world tables very often express n-ary relations.
As of yet, though, table-to-KG matching benchmarks have not included such n-ary tables as
they are not covered by the CPA task framework. To reflect the diversity of real-world data,
this class of tables should be considered in table interpretation research.
In the popular Wikidata KG [6], n-ary relations are modeled using qualifiers [7]. Qualifiers
extend simple statements with additional context information for the claim and may be
represented in RDF in a straightforward way using blank nodes. In this way, complex n-ary claims
such as Example 1.1 above may be represented:
Example 1.2 (N-ary Statement as RDF in Turtle syntax)
wd:Q44320 p:P2291 [ # Thriller (album)
ps:P2291 wd:Q188819; # charted in: US Billboard 200
pq:P585 "1983"^^xsd:gYear; # point in time: 1983
pq:P1352 "1" # ranking: 1
] .
Qualifiers in Wikidata are most often used to represent temporal scopes of statements and are
therefore important from a data modeling perspective.
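To make this representation concrete, the qualified statement of Example 1.2 can also be retrieved from the live KG. The snippet below is a minimal illustration (not part of our extraction pipeline) that queries the public Wikidata SPARQL endpoint with the SPARQLWrapper Python library; the entity and property identifiers are those from the example above, and the user-agent string is an arbitrary placeholder.

```python
# Retrieve the qualified "charted in" statements of Thriller (Q44320).
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX p:   <http://www.wikidata.org/prop/>
PREFIX ps:  <http://www.wikidata.org/prop/statement/>
PREFIX pq:  <http://www.wikidata.org/prop/qualifier/>

SELECT ?chart ?rank ?date WHERE {
  wd:Q44320 p:P2291 ?stmt .     # Thriller (album): "charted in" statement node
  ?stmt ps:P2291 ?chart ;       # main value: the chart
        pq:P1352 ?rank ;        # qualifier: ranking
        pq:P585  ?date .        # qualifier: point in time
}
"""

endpoint = SPARQLWrapper("https://query.wikidata.org/sparql",
                         agent="wikary-example/0.1")
endpoint.setQuery(QUERY)
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["chart"]["value"], row["rank"]["value"], row["date"]["value"])
```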
Because tables in Wikipedia articles express information about well-known entities that are
also described by Wikidata, they form a prime candidate for studying n-ary tables in a controlled
environment. By only changing the structure of the table, while keeping a tight alignment to a
broad-coverage KG, we hope to contribute insights that may generalize to situations in which
the entities might not be covered by a KG. Such low-coverage scenarios may occur for tables
from other sources such as the web[5], CSVs [8], or relational databases [9].
Contribution Our goal is to encourage the study of the entire variety of web tables encountered
in practice while maintaining a grounding in well-studied semantic models. In this paper,
we therefore introduce a dataset of almost 32,000 tables from 3 Wikipedia language versions
which have been matched to Wikidata statements with qualifiers. The large scale allows for
analyzing the diversity of representations of n-ary statements in practice, and by sourcing
tables in multiple languages, we aim to diversify the topical coverage of the tables. The dataset
is publicly available on Zenodo (https://doi.org/10.5281/zenodo.7025005).
2. Dataset creation
The creation process of the dataset can be split into three parts: scraping tables, joining the
tables with Wikidata, and filtering high-confidence matches. Finally, we estimate the quality of
the dataset using an annotation interface to label a subset of the data manually.
Figure 1: (a) Example of a row from a Wikipedia table that expresses an n-ary relation. (b) Wikidata statement that expresses the same information.
Lang      | Database version  | Titles  | Tables | Pages w/ tables
Simple EN | all nopic 2022-03 | 274296  | 35047  | 18194
NL        | all nopic 2021-11 | 2853121 | 582319 | 243893
PL        | all nopic 2022-05 | 2034836 | 537366 | 199791

Table 1: Statistical characteristics of each Wikipedia version
Scraping data The Wikimedia Foundation provides two different types of static database
dumps of its projects, including Wikipedia. One is the original wikitext markup as edited by
contributors, and the other is a static HTML export suitable for self-hosting (provided using the
Kiwix/Openzim toolchain, http://www.kiwix.org). Because wikitext allows for complex nesting
of templates, we opted to use the HTML representation to extract the final tables as they appear
rendered in articles. This ensures that all table elements that are viewable by readers are also
available for extraction.
We compare three languages: Simple English, Dutch, and Polish. The choice was based on our
linguistic competencies, since the later annotation of tables requires an understanding of the
given language. Key statistical characteristics of each Wikipedia version, together with the dump
versions, are shown in Table 1. For each Wikipedia version considered, we scraped all HTML
tables of the "wikitable" class, which is used to indicate content tables in articles. During this
step, we also extracted additional data for each table: page title, table index, section title, table
caption, and the list of hyperlinks per row. The page title and table index uniquely identify the
table.
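As a simplified sketch of this per-article extraction, the snippet below parses one rendered article with BeautifulSoup and collects the "wikitable"-class tables together with the metadata listed above. The function and field names are illustrative; the released pipeline operates on the full HTML dumps and may differ in details.

```python
from bs4 import BeautifulSoup

def extract_tables(page_title: str, html: str):
    """Extract all 'wikitable'-class tables from one rendered article."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for index, table in enumerate(soup.find_all("table", class_="wikitable")):
        # Nearest preceding section heading, if any.
        heading = table.find_previous(["h2", "h3"])
        caption = table.find("caption")
        rows = []
        for tr in table.find_all("tr"):
            cells = [td.get_text(" ", strip=True) for td in tr.find_all(["th", "td"])]
            links = [a["href"] for a in tr.find_all("a", href=True)]
            rows.append({"cells": cells, "links": links})
        records.append({
            "page_title": page_title,
            "table_index": index,  # page title + table index identify the table
            "section": heading.get_text(strip=True) if heading else None,
            "caption": caption.get_text(strip=True) if caption else None,
            "rows": rows,
        })
    return records
```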
Merge with Wikidata statements Each row in every table is assessed individually, so for the
sake of clarity our example covers only one row. We focus on the row shown in Figure 1a,
which is present on Tim Allen's page in the Filmography section. The next step is converting
the hyperlinks of each row to Wikidata entities, as well as looking up the Wikidata entity
associated with the page itself. The page entity provides additional information about the table
that is necessary to understand it completely. Matching hyperlinks to Wikidata identifiers is
performed using an index based on all redirects and mappings of article titles to Wikidata IDs.
The next step is creating permutations of all pairs of entities included in each row and the
page entity. These are merged with the collection of all Wikidata statements that have qualifiers.
This merge can be considered a database-style join, where the keys are the pair of identifiers
from the Wikipedia table and the subject and object from Wikidata. Figure 1b shows a
Wikidata statement that expresses the same information as the row in the example table.
The set of Wikidata statements can be seen as a database table of more than 15.5 million
tuples with the following attributes: subject, property, object, qualifier property, and
qualifier value. From the perspective of record matching, this merge operation may be seen as
a blocking step that efficiently produces a large number of matching candidates.
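The following is a minimal sketch of this blocking join using pandas. The column names and the shape of the `rows` and `qualified` DataFrames are illustrative assumptions rather than the exact layout of the released scripts.

```python
from itertools import permutations

import pandas as pd

def candidate_pairs(page_qid, linked_qids):
    """All ordered pairs of Wikidata entities in a row, including the page entity."""
    entities = set(linked_qids) | {page_qid}
    return list(permutations(entities, 2))

def blocking_join(rows: pd.DataFrame, qualified: pd.DataFrame) -> pd.DataFrame:
    """Database-style join of per-row entity pairs with qualified statements.

    `rows` has one record per (page, table index, row index) with a 'pairs'
    column of (subject, object) tuples; `qualified` has columns
    ['subject', 'property', 'object', 'qualifier_property', 'qualifier_value'].
    The inner join on (subject, object) acts as the blocking step that yields
    the candidate matches filtered further by the matching functions.
    """
    exploded = rows.explode("pairs").dropna(subset=["pairs"])
    exploded[["subject", "object"]] = pd.DataFrame(
        exploded["pairs"].tolist(), index=exploded.index
    )
    return exploded.merge(qualified, on=["subject", "object"], how="inner")
```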
Finding matches The last step of the implementation is finding matches, in order to keep only
table rows that likely express n-ary relations. We distinguish three types of matches, all of
which are based on the qualifier value of a corresponding Wikidata statement. The qualifier
value must not be equal to the subject or object, and must occur in a different cell. Note that the
presence of a match does not guarantee that the table expresses the same information as the
matched Wikidata statement, nor does it guarantee that the table expresses n-ary relations.
Our goal is to find a large number of tables that may express n-ary facts with high likelihood,
and these matching approaches are designed for high precision. We use the following
matching functions (a simplified sketch in code follows the list):
1. Wikidata identifier match In this type of match, the qualifier value is a Wikidata entity. We
look up the Wikidata entities of the hyperlinked pages and compare them with the qualifier
value; if the values are the same, we have a match.
2. Year cell match This match applies when the qualifier value is provided as a date; instead of
using the entire date, we only extract the year from it. We then look for a cell in the row that
contains exactly that year as text and nothing more. Moreover, this cell may not be the subject
or object of the matched statement.
3. Within cell year match In this more lenient version of the above match, we again use the
year extracted from the qualifier value, but here we try to find it anywhere inside a cell. This
means that all matches of the previous type are also included in this one. The trade-off leads to
higher recall, since the year in a cell may be combined with additional information that the
previous match type would not detect. Although it also means lower precision, because the
number indicating the year could be part of an unrelated string, our evaluation shows that the
loss in precision is limited.
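Below is a simplified sketch of these three matching rules, assuming that for each candidate we know the row's cell texts, the indices of the cells holding the matched subject and object, and the set of Wikidata entities hyperlinked in the row; the exact implementation is in the released scripts and may differ in details.

```python
import re

def extract_year(value):
    """Return the four-digit year of a date-like qualifier value, if any."""
    m = re.search(r"\b(\d{4})\b", str(value))
    return m.group(1) if m else None

def identifier_match(qualifier_value, row_entities):
    """1. Wikidata identifier match: the qualifier value is itself an entity
    hyperlinked elsewhere in the row."""
    return qualifier_value in row_entities

def year_cell_match(qualifier_value, cells, subject_cell, object_cell):
    """2. Year cell match: some cell (other than subject/object) contains
    exactly the year of a date-valued qualifier and nothing more."""
    year = extract_year(qualifier_value)
    if year is None:
        return False
    return any(cell.strip() == year
               for i, cell in enumerate(cells)
               if i not in (subject_cell, object_cell))

def within_cell_year_match(qualifier_value, cells, subject_cell, object_cell):
    """3. Within cell year match: the year may occur anywhere inside a cell,
    possibly next to other text (higher recall, slightly lower precision)."""
    year = extract_year(qualifier_value)
    if year is None:
        return False
    return any(year in cell
               for i, cell in enumerate(cells)
               if i not in (subject_cell, object_cell))
```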
Quality Evaluation In order to estimate the quality of the extracted tables, we created an
environment for annotating tables. To make this process efficient, annotators need convenient
tools that avoid unnecessary actions such as opening each table's HTML one at a time or
finding the table on Wikipedia by hand. Our annotation interface uses the PigeonXT Python
library (https://github.com/dennisbakhuis/pigeonXT) and runs entirely within a Jupyter
Notebook, and is therefore integrated with our extraction approach. As shown in Figure 2, it
displays the content of the table along with the subject, property, and object for a randomly
sampled row and relation pair that was merged with a Wikidata statement. This labeling
interface will be released along with the dataset, to facilitate the creation and assessment of
new versions of this dataset.

Figure 2: Widget used for annotations. The interface shows the full HTML table with all original
markup and the matching row highlighted. Above it, the matching Wikidata statement is shown with
links to check the identity and background information of matching entities, along with the option to
inspect the table in its original context.
Moreover, during the evaluation process, we discarded rows containing more than 100
hyperlinks, as manual review showed that these were the result of incorrect table formatting.
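For illustration, an annotation loop of this kind can be set up in a notebook roughly as follows. The field names on each match, the label set, and the rendered HTML are illustrative assumptions, and the keyword arguments of PigeonXT's annotate may vary between library versions; the released tool adds links to Wikidata and to the original article.

```python
from IPython.display import HTML, display
from pigeonXT import annotate

def show_match(match):
    # Render the matching Wikidata statement above the full table HTML
    # (the real widget also highlights the matched row).
    display(HTML(match["statement_html"] + match["table_html"]))

# In practice `matches` is a sample of candidates from the blocking join;
# a single dummy item illustrates the expected fields here.
matches = [{
    "statement_html": "<p>Q44320 - P2291 - Q188819 (P1352: 1, P585: 1983)</p>",
    "table_html": "<table class='wikitable'><tr><td>Example row</td></tr></table>",
}]

annotations = annotate(matches, options=["correct", "incorrect"],
                       display_fn=show_match)
```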
3. Statistics & Analysis
Table 2 displays key statistics during the process of finding matches. The middle group of
columns presents the number of tables and pages where at least one hyperlink was matched to
a Wikidata page. As a percentage of the total dataset, this number is quite consistent across all
languages. The highest share of tables that were successfully merged with Wikidata statements
is in Simple English Wikipedia, followed by Polish, while Dutch has the lowest share at 15%.
Intriguingly, the number of tables per page differs remarkably: 1.23 in Simple English compared
to 1.85 in the Polish version.
The number of matches is shown in Table 3. While Simple English and Dutch Wikipedia look
similar in terms of identifier matches, at 7% and 5.5%, the Polish version has only 2.1%.
          | Dataset          | At least one hyperlink matched      | Merged with Wikidata statements
          |                  | to a Wikidata page                  | with qualifiers
Lang      | Tables   Pages   | Tables   Pages    % of dataset      | Tables   Pages   % of dataset
Simple EN | 35047    18194   | 23809    14935    68%               | 10023    8155    28.6%
NL        | 582319   243893  | 360778   153628   62%               | 87584    51867   15%
PL        | 537366   199791  | 366795   149858   68.3%             | 114694   62196   21.3%

Table 2: Number of tables and pages at distinct points of the implementation
Match type / Lang     |         | Simple EN | NL    | PL
Wikidata identifier   | Pages   | 605       | 4094  | 2015
match                 | Tables  | 706       | 4790  | 2420
                      | Rows    | 2212      | 14230 | 6370
                      | Matches | 6895      | 37382 | 13978
Year cell             | Pages   | 53        | 904   | 1329
match                 | Tables  | 58        | 1055  | 1438
                      | Rows    | 176       | 3208  | 7930
                      | Matches | 270       | 4987  | 25318
Within cell           | Pages   | 217       | 14445 | 4773
year match            | Tables  | 292       | 7485  | 10459
                      | Rows    | 740       | 41646 | 28798
                      | Matches | 1109      | 50551 | 58504

Table 3: Number of matches found
The year cell match rate is alike in the Dutch and Polish versions, while Simple English has
only half of the matches percentage-wise, specifically 0.6%. The within cell year match shows
different results in all of the Wikipedias: Dutch has the highest percentage at 16%, Polish has
9.1%, and Simple English only 2.9%.
Due to the low number of annotations, we calculated evaluation metrics on results merged over
both the different detection methods and the Wikipedia versions, resulting in 750 annotated
matches. A table was classified as n-ary if one of the three types of matches occurred; this
results in a precision of 98.4%. Starting from all possible matches returned by the blocking step,
the matching step filtered n-ary tables with a recall of 23.3%.
4. Maintenance, Availability and Use-cases
The dataset is available on Zenodo (https://doi.org/10.5281/zenodo.7025005). Our aim is to
maintain and expand the dataset to more languages, including the full English Wikipedia.
Additionally, we aim to incorporate any user feedback with regard to possible noise and
formatting errors, to improve usability.
Lang                         | Simple EN       | NL              | PL
Total n-ary tables           | 964             | 18674           | 12355
N-ary table characteristics  | Rows    Columns | Rows    Columns | Rows    Columns
mean                         | 31.59   9.83    | 19.16   5.59    | 27.70   18.40
std                          | 50.02   99.34   | 43.64   3.34    | 65.48   74.39
25%                          | 6       3       | 8       4       | 8       4
50%                          | 16      4       | 13      4       | 13      5
75%                          | 37.25   6       | 22      6       | 25      8
max                          | 626     2511    | 4574    49      | 1009    701

Table 4: Number of n-ary tables found
We published the full dataset creation pipeline as a set of Python scripts
(https://github.com/igormazurek/wikary/), to facilitate the creation of custom datasets based
on our approach, including those based on up-to-date Wikipedia dumps. Additionally, we also
included the annotation tool used for quality evaluation.
Use-cases The main use-case of our dataset is as a resource for n-ary information extraction.
It can be used to train or benchmark n-ary table interpretation systems within the Wikipedia
domain, or to analyze general structures of n-ary tables (such as functional dependencies [5])
for out-of-domain applications.
As an exemplar task, we have already used this dataset in preliminary research on distinguishing
binary and n-ary tables by statistical classification. We extracted surface-level features (such
as column types) from known binary tables and from these n-ary tables, and were able to train
classifiers that distinguish them with acceptable performance. Such classifiers may also be used
as a pre-processing filter when extracting binary facts from tables, so as to reduce false-positive
CPA predictions.
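As an illustration of this use-case, a binary-vs-n-ary classifier could be set up with scikit-learn roughly as follows. The feature names, the placeholder `labeled_tables` records, and the choice of a random forest are illustrative assumptions, not the exact setup of our preliminary study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def surface_features(table):
    """Illustrative surface-level features; not the exact feature set we used."""
    return [
        table["n_rows"],
        table["n_cols"],
        table["links_per_row"],        # average number of hyperlinks per row
        table["frac_numeric_cols"],    # share of columns parsing as numbers
        table["frac_date_cols"],       # share of columns that look like dates
    ]

# `labeled_tables` stands in for feature records extracted from known binary
# (label 0) and matched n-ary (label 1) tables; the values here are made up.
labeled_tables = [
    {"n_rows": 12, "n_cols": 3, "links_per_row": 1.0,
     "frac_numeric_cols": 0.3, "frac_date_cols": 0.0, "label": 0},
    {"n_rows": 25, "n_cols": 6, "links_per_row": 2.4,
     "frac_numeric_cols": 0.5, "frac_date_cols": 0.2, "label": 1},
] * 20  # repeated only so that cross-validation has enough samples

X = np.array([surface_features(t) for t in labeled_tables])
y = np.array([t["label"] for t in labeled_tables])

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, y, cv=5, scoring="f1").mean())
```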
5. Conclusion
We introduce a dataset of almost 32,000 tables from 3 Wikipedia language versions which have
been matched to Wikidata statements with qualifiers at 98.4% precision. The tables express a
diverse set of n-ary relations which constitute a new target for semantic table interpretation
research.
References
[1] E. Jiménez-Ruiz, O. Hassanzadeh, V. Efthymiou, J. Chen, K. Srinivas, V. Cutrona, Results of
semtab 2020, in: CEUR Workshop Proceedings, volume 2775, 2020, pp. 1–8.
[2] V. Cutrona, J. Chen, V. Efthymiou, O. Hassanzadeh, E. Jiménez-Ruiz, J. Sequeda, K. Srinivas,
N. Abdelmageed, M. Hulsebos, D. Oliveira, et al., Results of semtab 2021, Proceedings of the
Semantic Web Challenge on Tabular Data to Knowledge Graph Matching 3103 (2022) 1–12.
[3] E. Jiménez-Ruiz, O. Hassanzadeh, V. Efthymiou, J. Chen, K. Srinivas, Semtab 2019: Resources
to benchmark tabular data to knowledge graph matching systems, in: European Semantic
Web Conference, Springer, 2020, pp. 514–530.
[4] B. Kruit, P. Boncz, J. Urbani, Extracting n-ary facts from wikipedia table clusters, in:
Proceedings of the 29th ACM International Conference on Information & Knowledge
Management, 2020, pp. 655–664.
[5] O. Lehmberg, C. Bizer, Profiling the semantics of n-ary web table data, in: Proceedings of
the International Workshop on Semantic Big Data, 2019, pp. 1–6.
[6] D. Vrandečić, M. Krötzsch, Wikidata: a free collaborative knowledge base, Communications
of the ACM 57 (2014) 78–85.
[7] F. Erxleben, M. Günther, M. Krötzsch, J. Mendez, D. Vrandečić, Introducing wikidata to
the linked data web, Lecture Notes in Computer Science (including subseries Lecture
Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 8796 (2014) 50–65.
doi:10.1007/978-3-319-11964-9.
[8] M. Hulsebos, Ç. Demiralp, P. Groth, Gittables: A large-scale corpus of relational tables,
arXiv preprint arXiv:2106.07258 (2021). URL: https://arxiv.org/abs/2106.07258.
[9] T. Döhmen, M. Hulsebos, C. Beecks, S. Schelter, Gitschemas: A dataset for automating
relational data preparation tasks, in: 2022 IEEE 38th International Conference on Data
Engineering Workshops (ICDEW), IEEE, 2022, pp. 74–78.