Wikary: A Dataset of N-ary Wikipedia Tables Matched to Qualified Wikidata Statements

Igor Mazurek, Berend Wiewel and Benno Kruit (corresponding author)
Vrije Universiteit Amsterdam, De Boelelaan 1105, 1081 HV Amsterdam, The Netherlands
i.w.mazurek@student.vu.nl (I. Mazurek); b.wiewel@student.vu.nl (B. Wiewel); b.b.kruit@vu.nl (B. Kruit)

SemTab'22: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, co-located with the 21st International Semantic Web Conference, 23-27 October 2022, Hybrid, Hangzhou, China

Abstract
We introduce a dataset of almost 32,000 tables from three Wikipedia language versions which have been matched to Wikidata statements with qualifiers at 98.4% precision. The tables express a diverse set of n-ary relations which constitute a new target for semantic table interpretation research.

Keywords: Tabular Data, Knowledge Graph Matching, N-ary Relations, Qualifiers

1. Introduction and Background

Tabular data from databases, documents, or the web contains a wealth of information that could be made more accessible, searchable, and useful by matching it with Knowledge Graphs (KGs). However, integrating tabular data with KGs is still typically a manual process, requiring in-depth knowledge of the KG schema and the domain of interest. Much progress has been made in recent years, but automating this task remains an open challenge. In particular, many systems have been developed that match table columns to semantic types (Column-Type Annotation, CTA), table cells to KG entities (Cell-Entity Annotation, CEA), and pairs of table columns to binary KG relations (Column-Property Annotation, CPA). One application of such systems is the extraction of subject-property-object triples from each pair of columns, for extending KGs with new factual statements. The quality of such systems may be evaluated using a variety of benchmark datasets, with the goal of assessing performance across topical domains and controlled environments [1, 2, 3]. To create useful systems that effectively match tabular data to KGs in practice, these benchmarks should reflect the diversity of tabular data as it occurs in real-world usage. Public benchmarks for this problem are becoming increasingly realistic, incentivizing the development and evaluation of usable systems [2].

However, much real-world tabular data does not express binary KG relations but rather higher-order, n-ary relations [4, 5]. N-ary relations express statements involving multiple (> 2) entities or values, which cannot be decomposed into independent, atomic parts without compromising truthfulness, coherence, or completeness. For example, consider the following statement:

Example 1.1 (N-ary Statement). The album Thriller by Michael Jackson reached the top-1 position in the US Billboard 200 chart on February 26th, 1983.

This statement cannot be expressed by a single triple, but also cannot be decomposed into independent parts that make sense on their own. Multiple statements like this can be naturally represented in tabular form, and, indeed, real-world tables very often express n-ary relations.
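To make this concrete, the short sketch below (our own illustration; the second chart entry is purely hypothetical) shows how a naive decomposition of such statements into independent binary triples loses the association between chart, ranking, and date.

```python
# Two qualified chart entries for the same album, kept as n-ary records
# (the second entry is purely hypothetical, for illustration only).
nary_facts = [
    {"album": "Thriller", "chart": "US Billboard 200", "rank": 1, "date": "1983-02-26"},
    {"album": "Thriller", "chart": "Chart X (hypothetical)", "rank": 7, "date": "1984-05-01"},
]

# A naive decomposition into independent binary triples:
binary_triples = [
    (fact["album"], prop, value)
    for fact in nary_facts
    for prop, value in [
        ("charted in", fact["chart"]),
        ("ranking", fact["rank"]),
        ("point in time", fact["date"]),
    ]
]

# From binary_triples alone it is no longer possible to tell which ranking or
# date belongs to which chart: the statement has lost its coherence.
print(binary_triples)
```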
So far, however, table-to-KG matching benchmarks have not included such n-ary tables, as they are not covered by the CPA task framework. To reflect the diversity of real-world data, this class of tables should be considered in table interpretation research.

In the popular Wikidata KG [6], n-ary relations are modeled using qualifiers [7]. Qualifiers extend simple statements with additional context information for the claim and may be represented in RDF in a straightforward way using blank nodes. This way, complex n-ary claims such as Example 1.1 above may be represented:

Example 1.2 (N-ary Statement as RDF in Turtle syntax)

    wd:Q44320 p:P2291 [             # Thriller (album)
        ps:P2291 wd:Q188819 ;       # charted in: US Billboard 200
        pq:P585 "1983"^^xsd:gYear ; # point in time: 1983
        pq:P1352 "1"                # ranking: 1
    ] .

Qualifiers in Wikidata are most often used to represent temporal scopes of statements and are therefore important from a data modeling perspective. Because tables in Wikipedia articles express information about well-known entities that are also described by Wikidata, they form a prime candidate for studying n-ary tables in a controlled environment. By varying only the structure of the tables, while keeping a tight alignment to a broad-coverage KG, we hope to contribute insights that generalize to situations in which the entities are not covered by a KG. Such low-coverage scenarios may occur for tables from other sources such as the web [5], CSVs [8], or relational databases [9].

Contribution
Our goal is to encourage the study of the entire variety of web tables encountered in practice while maintaining a grounding in well-studied semantic models. In this paper, we therefore introduce a dataset of almost 32,000 tables from three Wikipedia language versions which have been matched to Wikidata statements with qualifiers. The large scale allows for the analysis of the diversity of representation of n-ary statements in practice, and by sourcing tables in multiple languages, we aim to diversify the topical coverage of the tables. The dataset is publicly available on Zenodo (https://doi.org/10.5281/zenodo.7025005).

2. Dataset creation

The creation process of the dataset can be split into three parts: scraping tables, joining the tables with Wikidata, and filtering high-confidence matches. Finally, we estimate the quality of the dataset using an annotation interface to label a subset of the data manually.

Figure 1: (a) Example of a row from a Wikipedia table that expresses an n-ary relation. (b) Wikidata statement that expresses the same information.

Table 1: Statistical characteristics of each Wikipedia version

| Lang      | Database version  | Titles  | Tables | Pages w/ tables |
|-----------|-------------------|---------|--------|-----------------|
| Simple EN | all nopic 2022-03 | 274296  | 35047  | 18194           |
| NL        | all nopic 2021-11 | 2853121 | 582319 | 243893          |
| PL        | all nopic 2022-05 | 2034836 | 537366 | 199791          |

Scraping data
The Wikimedia Foundation provides two different types of static database dumps of its projects, including Wikipedia. One is the original wikitext markup as edited by contributors, and the other is a static HTML export suitable for self-hosting (provided using the Kiwix/OpenZIM toolchain, http://www.kiwix.org). Because wikitext allows for complex nesting of templates, we opted to use the HTML representation and extract the final tables as they appear rendered in articles. This ensures that all table elements that are viewable by readers are also available for extraction. We decided to compare three language versions: Simple English, Dutch, and Polish. This choice was based on our own linguistic competencies, since the later annotation step requires an understanding of the given language. Key statistical characteristics of each Wikipedia version, together with the dump versions, are shown in Table 1. For each Wikipedia version considered, we scraped all HTML tables of the "wikitable" class, which is used to indicate content tables in articles. During this step, we also extracted additional data for each table: page title, table index, section title, table caption, and the list of hyperlinks per row. The page title and table index allow us to uniquely identify the table.
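As a rough sketch of what this extraction step can look like (our own illustration rather than the released pipeline, using BeautifulSoup and hypothetical field names), consider:

```python
# Minimal sketch: extract "wikitable"-class tables and their per-row hyperlinks
# from the rendered HTML of one article in the static dump.
from bs4 import BeautifulSoup

def extract_tables(page_html: str, page_title: str):
    soup = BeautifulSoup(page_html, "html.parser")
    records = []
    for table_index, table in enumerate(soup.select("table.wikitable")):
        # Nearest preceding section heading and optional <caption>, if present
        heading = table.find_previous(["h2", "h3"])
        caption = table.find("caption")
        rows = []
        for tr in table.find_all("tr"):
            cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
            links = [a["href"] for a in tr.find_all("a", href=True)]
            rows.append({"cells": cells, "links": links})
        records.append({
            "page_title": page_title,
            "table_index": table_index,  # (page_title, table_index) identifies the table
            "section_title": heading.get_text(strip=True) if heading else None,
            "caption": caption.get_text(strip=True) if caption else None,
            "rows": rows,
        })
    return records
```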
Merge with Wikidata statements
Each row in every table is assessed individually, so for the sake of clarity our example covers only one row. We will focus on the row shown in Figure 1a, which is present on Tim Allen's page in the Filmography section. The next step is converting the hyperlinks of each row into Wikidata entities, as well as looking up the Wikidata entity associated with the page itself. The page entity provides additional information about the table that is necessary to understand it completely. Matching hyperlinks to Wikidata identifiers is performed using an index based on all redirects and mappings of article titles to Wikidata IDs. We then create all permutations of pairs over the entities included in each row and the page entity. These pairs are merged with the collection of all Wikidata statements that have qualifiers. This merge can be considered a database-style join, where the keys are the pair of identifiers from the Wikipedia table on one side and the subject and object from Wikidata on the other. Figure 1b shows a Wikidata statement that expresses the same information as the row in the example table. The set of Wikidata statements can be seen as a database table consisting of more than 15.5 million tuples, with the following information: subject, property, object, qualifier property, and qualifier value. From the perspective of record matching, this merge operation may be seen as a blocking step that efficiently produces a large number of matching candidates.
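A minimal sketch of this blocking join, reusing the Thriller example from Section 1 (the column names and toy data frames are our own simplification, not the released pipeline), could look as follows:

```python
import itertools
import pandas as pd

def candidate_pairs(row_id: str, row_entities: list, page_entity: str) -> pd.DataFrame:
    # All ordered pairs over the row's hyperlink entities and the page entity
    entities = set(row_entities) | {page_entity}
    return pd.DataFrame(
        [(row_id, s, o) for s, o in itertools.permutations(entities, 2)],
        columns=["row_id", "subject", "object"],
    )

# Toy stand-in for the >15.5 million qualified statements (one tuple per qualifier)
statements = pd.DataFrame(
    [
        ("Q44320", "P2291", "Q188819", "P585", "1983"),  # Thriller charted in Billboard 200; point in time
        ("Q44320", "P2291", "Q188819", "P1352", "1"),    # ... ranking
    ],
    columns=["subject", "property", "object", "qualifier_property", "qualifier_value"],
)

# Candidates from one (hypothetical) scraped row on the Thriller article page
pairs = candidate_pairs("Thriller_t1_r0", ["Q188819"], page_entity="Q44320")

# Database-style join on (subject, object): the blocking step that yields a large
# pool of (row, statement) matching candidates for the next filtering step.
candidates = pairs.merge(statements, on=["subject", "object"], how="inner")
print(candidates)
```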
Finding matches
The last step of the implementation is finding matches, in order to keep only table rows that likely express n-ary relations. We distinguish three types of matches; all of them are based on the qualifier value of a corresponding Wikidata statement. The qualifier value may not be equal to the subject or object, and must be found in a different cell. Note that the presence of a match does not guarantee that the table expresses the same information as the matched Wikidata statement, nor does it guarantee that the table expresses n-ary relations. Our goal is to find a large number of tables that may express n-ary facts with high likelihood, and these matching approaches are designed for high precision. We use the following matching functions:

1. Wikidata identifier match. In this type of match, the qualifier value is itself a Wikidata entity. We look up the Wikidata entities of the hyperlinked pages in the row and compare them to the qualifier value; if the values are equal, we have a match.
2. Year cell match. This match applies when the qualifier value is a date; instead of using the entire date, we extract only the year from it. We then look for a cell in the row whose text consists of exactly that year and nothing more. Moreover, this cell may not be the subject or object of the matched statement.
3. Within-cell year match. In this more lenient version of the previous match, we again use the year extracted from the qualifier value, but try to find it anywhere within a cell, so all matches of the previous type are also included in this one. This trade-off leads to higher recall, as the year in a cell may be combined with additional information that would not be detected by the previous type of matching. Although it also means lower precision, because the number indicating the year could be part of a larger number or an unrelated string, our evaluation has shown that the loss in precision is limited.
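The sketch below reflects our reading of these three matching functions; the row and statement structures are hypothetical simplifications rather than the data model of the released pipeline.

```python
import re

def _other_cells(row, statement):
    # Cells that are not the subject or object cell of the matched statement
    return [c for c in row["cells"]
            if c["index"] not in (statement["subject_cell"], statement["object_cell"])]

def extract_year(value):
    match = re.search(r"\b(\d{4})\b", str(value))
    return match.group(1) if match else None

def identifier_match(row, statement):
    # 1. Wikidata identifier match: the qualifier value is an entity and equals
    #    the entity of a hyperlink in some other cell of the row.
    return any(statement["qualifier_value"] in c["entity_ids"] for c in _other_cells(row, statement))

def year_cell_match(row, statement):
    # 2. Year cell match: the qualifier value is a date; some other cell contains
    #    exactly that year and nothing more.
    year = extract_year(statement["qualifier_value"])
    return year is not None and any(c["text"].strip() == year for c in _other_cells(row, statement))

def within_cell_year_match(row, statement):
    # 3. Within-cell year match: more lenient; the year may occur anywhere in a
    #    cell (higher recall, slightly lower precision).
    year = extract_year(statement["qualifier_value"])
    return year is not None and any(year in c["text"] for c in _other_cells(row, statement))

# Continuing the Thriller example: the qualifier "point in time: 1983" against a
# cell reading "February 26, 1983" triggers only the within-cell match.
row = {"cells": [
    {"index": 0, "text": "US Billboard 200", "entity_ids": ["Q188819"]},
    {"index": 1, "text": "1", "entity_ids": []},
    {"index": 2, "text": "February 26, 1983", "entity_ids": []},
]}
statement = {"subject_cell": None, "object_cell": 0, "qualifier_value": "1983"}
print(year_cell_match(row, statement), within_cell_year_match(row, statement))  # False True
```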
Quality Evaluation
In order to estimate the quality of the extracted tables, we created an environment for annotating tables. To facilitate this process, annotators need convenient tools to save time on unnecessary actions, such as opening each table's HTML one at a time or independently finding the table on Wikipedia. Our annotation interface makes use of the PigeonXT Python library (https://github.com/dennisbakhuis/pigeonXT) and runs entirely within a Jupyter Notebook, and is therefore integrated with our extraction approach. As shown in Figure 2, it displays the content of the table along with the subject, property, and object of a randomly selected row-and-relation pair merged with a Wikidata statement. This labeling interface will be released along with the dataset, to facilitate the creation and assessment of new versions of this dataset. Moreover, during the evaluation process, we discarded rows containing more than 100 hyperlinks, as manual review showed that these were the result of incorrect table formatting.

Figure 2: Widget used for annotations. The interface shows the full HTML table with all original markup and the matching row highlighted. Above it, the matching Wikidata statement is shown with links to check the identity and background information of matching entities, along with the option to inspect the table in its original context.

3. Statistics & Analysis

Table 2 displays key statistics collected during the process of finding matches. The middle columns of the table present the number of tables in which at least one hyperlink was matched to a Wikidata page; as a percentage of the total dataset, this number is quite consistent across all languages. The highest proportion of tables successfully merged with qualified Wikidata statements is found in Simple English Wikipedia (28.6%), followed by Polish (21.3%), while Dutch has the lowest share at 15%. Intriguingly, the number of tables per page differs remarkably: 1.23 in Simple English compared to 1.85 in the Polish version.

Table 2: Number of tables and pages at distinct points of the implementation

| Lang      | Tables (dataset) | Pages (dataset) | Tables (≥1 link matched) | Pages (≥1 link matched) | % of dataset | Tables (merged w/ qualified statements) | Pages (merged w/ qualified statements) | % of dataset |
|-----------|------------------|-----------------|--------------------------|-------------------------|--------------|-----------------------------------------|----------------------------------------|--------------|
| Simple EN | 35047            | 18194           | 23809                    | 14935                   | 68%          | 10023                                   | 8155                                   | 28.6%        |
| NL        | 582319           | 243893          | 360778                   | 153628                  | 62%          | 87584                                   | 51867                                  | 15%          |
| PL        | 537366           | 199791          | 366795                   | 149858                  | 68.3%        | 114694                                  | 62196                                  | 21.3%        |

The number of matches is shown in Table 3. While Simple English and Dutch Wikipedia look similar in terms of identifier matches, with 7% and 5.5% respectively, the Polish version has only 2.1%. The year cell match behaves similarly in the Dutch and Polish versions, while Simple English has only about half of the matches percentage-wise, specifically 0.6%. The within-cell year match shows different results in all of the Wikipedias: Dutch has the highest percentage, equal to 16%, Polish has 9.1%, and Simple English only 2.9%.

Table 3: Number of matches found

| Match type                | Count   | Simple EN | NL    | PL    |
|---------------------------|---------|-----------|-------|-------|
| Wikidata identifier match | Pages   | 605       | 4094  | 2015  |
|                           | Tables  | 706       | 4790  | 2420  |
|                           | Rows    | 2212      | 14230 | 6370  |
|                           | Matches | 6895      | 37382 | 13978 |
| Year cell match           | Pages   | 53        | 904   | 1329  |
|                           | Tables  | 58        | 1055  | 1438  |
|                           | Rows    | 176       | 3208  | 7930  |
|                           | Matches | 270       | 4987  | 25318 |
| Within-cell year match    | Pages   | 217       | 14445 | 4773  |
|                           | Tables  | 292       | 7485  | 10459 |
|                           | Rows    | 740       | 41646 | 28798 |
|                           | Matches | 1109      | 50551 | 58504 |

Due to the low number of annotations, we calculated evaluation metrics on results merged across both detection methods and Wikipedia versions, which resulted in 750 annotated matches. A table was classified as n-ary if one of the three types of matches occurred; this results in a precision of 98.4%. Starting from all possible matches returned by the blocking step, the matching step filtered n-ary tables with a recall of 23.3%.

4. Maintenance, Availability and Use-cases

The dataset is available on Zenodo (https://doi.org/10.5281/zenodo.7025005). Our aim is to maintain and expand the dataset in more languages, including the full English Wikipedia. Additionally, we aim to incorporate any user feedback with regard to possible noise and formatting errors, to improve usability. We published the full dataset creation pipeline as a set of Python scripts (https://github.com/igormazurek/wikary/), to facilitate the creation of custom datasets based on our approach, including those based on up-to-date Wikipedia dumps. We also included the annotation tool used for quality evaluation.

Table 4: Number of n-ary tables found

| N-ary table characteristics | Simple EN (Rows) | Simple EN (Columns) | NL (Rows) | NL (Columns) | PL (Rows) | PL (Columns) |
|-----------------------------|------------------|---------------------|-----------|--------------|-----------|--------------|
| Total n-ary tables          | 964              |                     | 18674     |              | 12355     |              |
| mean                        | 31.59            | 9.83                | 19.16     | 5.59         | 27.70     | 18.40        |
| std                         | 50.02            | 99.34               | 43.64     | 3.34         | 65.48     | 74.39        |
| 25%                         | 6                | 3                   | 8         | 4            | 8         | 4            |
| 50%                         | 16               | 4                   | 13        | 4            | 13        | 5            |
| 75%                         | 37.25            | 6                   | 22        | 6            | 25        | 8            |
| max                         | 626              | 2511                | 4574      | 49           | 1009      | 701          |

Use-cases
The main use-case of our dataset is as a resource for n-ary information extraction. It can be used to train or benchmark n-ary table interpretation systems within the Wikipedia domain, or for analyzing general structures of n-ary tables (such as functional dependencies [5]) for out-of-domain applications. As an exemplar task, we have already used this dataset in preliminary research on distinguishing binary and n-ary tables by statistical classification. We extracted surface-level features (such as column types) from known binary tables and from these n-ary tables, and were able to train classifiers that distinguished them with acceptable performance. Such classifiers may also be used as a pre-processing filter when extracting binary facts from tables, so as to reduce false-positive CPA predictions.
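To illustrate what such a classifier can look like, the sketch below uses scikit-learn with a handful of hypothetical surface-level features; it is our own illustration, not the classifier from our preliminary experiments.

```python
# Each table is reduced to a small feature vector over a simplified representation:
# table["rows"] is a list of rows, each a list of cell strings.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

def table_features(table):
    """Hypothetical surface-level features: table shape and a rough column-type signal."""
    cells = [cell for row in table["rows"] for cell in row]
    n_numeric = sum(cell.replace(".", "", 1).isdigit() for cell in cells)
    return [
        len(table["rows"]),                              # number of rows
        len(table["rows"][0]) if table["rows"] else 0,   # number of columns
        n_numeric / max(len(cells), 1),                  # fraction of numeric cells
    ]

def train_classifier(tables, labels):
    # labels: 1 for n-ary tables (e.g. from Wikary), 0 for known binary tables
    X = [table_features(t) for t in tables]
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test)))
    return clf
```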
5. Conclusion

We introduce a dataset of almost 32,000 tables from three Wikipedia language versions which have been matched to Wikidata statements with qualifiers at 98.4% precision. The tables express a diverse set of n-ary relations which constitute a new target for semantic table interpretation research.

References

[1] E. Jiménez-Ruiz, O. Hassanzadeh, V. Efthymiou, J. Chen, K. Srinivas, V. Cutrona, Results of SemTab 2020, in: CEUR Workshop Proceedings, volume 2775, 2020, pp. 1–8.
[2] V. Cutrona, J. Chen, V. Efthymiou, O. Hassanzadeh, E. Jiménez-Ruiz, J. Sequeda, K. Srinivas, N. Abdelmageed, M. Hulsebos, D. Oliveira, et al., Results of SemTab 2021, Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching 3103 (2022) 1–12.
[3] E. Jiménez-Ruiz, O. Hassanzadeh, V. Efthymiou, J. Chen, K. Srinivas, SemTab 2019: Resources to benchmark tabular data to knowledge graph matching systems, in: European Semantic Web Conference, Springer, 2020, pp. 514–530.
[4] B. Kruit, P. Boncz, J. Urbani, Extracting n-ary facts from Wikipedia table clusters, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 655–664.
[5] O. Lehmberg, C. Bizer, Profiling the semantics of n-ary web table data, in: Proceedings of the International Workshop on Semantic Big Data, 2019, pp. 1–6.
[6] D. Vrandečić, M. Krötzsch, Wikidata: a free collaborative knowledge base, Communications of the ACM 57 (2014) 78–85.
[7] F. Erxleben, M. Günther, M. Krötzsch, J. Mendez, D. Vrandečić, Introducing Wikidata to the linked data web, Lecture Notes in Computer Science 8796 (2014) 50–65. doi:10.1007/978-3-319-11964-9.
[8] M. Hulsebos, Ç. Demiralp, P. Groth, GitTables: A large-scale corpus of relational tables, arXiv preprint arXiv:2106.07258 (2021). URL: https://arxiv.org/abs/2106.07258.
[9] T. Döhmen, M. Hulsebos, C. Beecks, S. Schelter, GitSchemas: A dataset for automating relational data preparation tasks, in: 2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW), IEEE, 2022, pp. 74–78.