
CEA CTA CPA SemTab

MammoTab: a giant and comprehensive dataset for Semantic Table Interpretation

Mattia Marzocchi (m.marzocchi@campus.unimib.it)

Marco Cremaschi (marco.cremaschi@unimib.it)

Riccardo Pozzi (riccardo.pozzi@unimib.it)

Roberto Avogadro (roberto.avogadro@unimib.it)

Matteo Palmonari (matteo.palmonari@unimib.it)

University of Milan - Bicocca, viale Sarca 336, Edificio U14, 20126, Milan, Italy

2019

In this paper, we present MammoTab, a dataset composed of 1M Wikipedia tables extracted from over 20M Wikipedia pages and annotated through Wikidata. The lack of datasets of this kind in the state of the art makes MammoTab a good resource for testing and training Semantic Table Interpretation approaches. The dataset has been designed to cover several key challenges, such as disambiguation, homonymy, and NIL-mentions. The dataset has been evaluated using MTab, one of the best approaches of the SemTab challenge.

Keywords: Semantic Table Interpretation, Tabular Data, SemTab Challenge, Knowledge Graph

1. Introduction

international challenge1, now in its fourth edition. The challenge consists of different rounds in which groups of tables with different features and levels of difficulty have to be annotated. The increased interest in STI has led to the construction of several datasets (gold standards) in the last decade. As described later, these datasets often cover only part of the characteristics of Web tables (e.g., they contain small tables with few mentions that are easy to annotate semantically [4]).

For this reason, we created a new dataset, MammoTab, composed of 980 254 tables extracted from 21 149 260 Wikipedia pages and annotated through Wikidata. The number and the different features of the tables make MammoTab a good resource for testing and/or training STI approaches. In particular, because of its size, MammoTab is a useful tool for training data-hungry models, which require a vast amount of data (e.g., entity linking systems based on large language models).

The rest of the paper is organised as follows. Section 2 presents a brief analysis of the state-of-the-art datasets used in the context of STI. Subsequently, MammoTab is described (Section 3), its characteristics are listed (Section 3.1), and the pipeline used for its implementation is presented (Section 3.2). Section 4 reports an evaluation carried out with a state-of-the-art STI approach.

2. Datasets

Although several approaches deal with semantic annotations on tabular data, there are few Gold Standards (GSs) for assessing the quality of these annotations. The main ones are T2Dv2, Limaye, MusicBrainz, IMDB, Taheriyan, Tough Tables and SemTab. Table 1 shows statistics for the GSs.

An excellent STI approach must consider and adequately balance the different features of a table (or a set of tables). The annotation involves several key challenges: i) disambiguation: the classes of the entities described in a table are not known in advance, and those entities may correspond to more than one class in the KG; ii) homonymy: different entities may share the same name and class; iii) matching: the mention in the table may be syntactically different from the label of the entity in a KG (e.g., because of acronyms, aliases and typos); iv) NIL-mentions: the approach must also consider strings that refer to entities for which a representation has not yet been created within the KG, namely NIL-mentions; v) literal and named-entity: in a table, there can be columns that contain named-entity mentions (NE-column) and columns containing strings (L-column); vi) missing context: it is often easier to extract the context from textual documents than from tables, because tables offer less content to process. For instance, the header (i.e., the first row of a table), which usually contains descriptive attributes for the columns, may or may not be present; vii) amount of data: the approach must handle both large tables with many rows and columns and tables with very few mentions; viii) different domains: the tables within a set can belong to very general or very specific domains. MammoTab has been designed to cover all these cases, making it a resource for evaluating or training STI approaches.

The dataset is made up of tables automatically extracted from Wikipedia. Some pipelines for extracting tables from Wikipedia have been presented in the state of the art. Among these, TabEL [10] proposes an STI approach and a dataset (WikiTable corpus) composed of 1.6M tables from the November 2013 XML dump of English Wikipedia. The dataset focuses only on the CEA task, using YAGO as the reference KG. It should be noted that the WikiTable corpus is outdated and the code has not been made available2. Another paper [11] proposes a dataset composed of 670 171 tables extracted from the WikiTable corpus, to which the Wikipedia page title, section title, and table caption have been added to obtain a more comprehensive description. However, neither the dataset nor the source code for its generation has been released.

3. The MammoTab Dataset

The annotations inside MammoTab are based on Wikidata v. 20220520 and are provided following the structure used by the SemTab challenge. Each table is stored in its own CSV file, and each line corresponds to a table row. The target columns for annotation and the CTA and CEA annotations are also saved in CSV files. A JSON document has also been created for each Wikipedia page, with additional information about the tables extracted from that page (see Listing 1). We released MammoTab on Zenodo3 following the FAIR Guiding Principles4. The dataset is released under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) licence5. MammoTab contains a variety of tables that make it possible to evaluate approaches against the challenges listed previously: the dataset includes tables containing i) entities that are hard to disambiguate (e.g., table id LBJJ1WGD - reactor Clinton)6, ii) cases of homonymy (e.g., table id MRBWAAOA - soccer player Michael Jordan)7, iii) aliases (e.g., table id JCND1XGG - Tom Riddle alias of Lord Voldemort)8, and iv) NIL-mentions (e.g., table id MSTBGKPR - KKOP-LP Wildcat Broad. Inc)9.

2 websail-fe.cs.northwestern.edu/TabEL/
3 zenodo.org/deposit/7014472
4 www.nature.com/articles/sdata201618
5 creativecommons.org/licenses/by-sa/4.0/
6 en.wikipedia.org/wiki/List_of_cancelled_nuclear_reactors_in_the_United_States

3.1. Dataset Profile

The MammoTab tables were extracted from 21 149 260 Wikipedia pages using the XML dump10. In these pages 2 803 424 tables were detected. Among these, the tables with at least three links in the same column have been stored for a total amount of 980 254 tables. Some dataset statistics are reported in Table 2.
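The storing criterion (at least three links in the same column) can be illustrated with a short Python sketch. This is not the code from the MammoTab repository: the regular expression, the function names, and the sample rows are assumptions made for illustration, and cells are assumed to hold raw wikitext.

```python
import re

# Assumption: a cell is a mention only if the whole cell is one wikilink,
# e.g. "[[Island Records]]" or "[[Music download|Digital download]]".
FULL_LINK = re.compile(r"^\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]$")


def link_target(cell):
    """Return the link target if the entire cell is a single wikilink, else None."""
    match = FULL_LINK.match(cell.strip())
    return match.group(1).strip() if match else None


def max_linked_cells_per_column(rows):
    """Count fully linked cells per column and return the largest count."""
    counts = {}
    for row in rows:
        for col, cell in enumerate(row):
            if link_target(cell) is not None:
                counts[col] = counts.get(col, 0) + 1
    return max(counts.values(), default=0)


def keep_table(rows, min_links=3):
    """Keep a table only if some column has at least `min_links` fully linked cells."""
    return max_linked_cells_per_column(rows) >= min_links


# Hypothetical table rows (raw wikitext cells), made up for this example.
sample_rows = [
    ["United States", "[[Music download|Digital download]]", "[[Island Records]]"],
    ["Canada", "[[Music download|Digital download]]", "[[Island Records]]"],
    ["Italy", "[[Music download]]", "Island Records (see [[Universal Music Group]])"],
]
kept = keep_table(sample_rows)
```

In this example the second column has three fully linked cells, so the table is kept. The third column has only two, because a cell with text around the link is not counted as a certain mention.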

3.2. Implementation

The dataset was built through a pipeline implemented as a set of Python scripts, which are available in a Git repository11.

The pipeline consists of 10 steps: i) dump processing: each Wikipedia XML dump file (we used the multiple bz2 stream dumps for easier parallelisation) is parsed using BeautifulSoup12 to extract each page; ii) tables identification: pages are scanned to find those that contain at least one table (wikitext syntax: | class=wikitable |); iii) tables normalisation and cleaning: each cell is normalised and cleaned up using custom and wikitextparser13 functions. Elements such as subscripts, superscripts, elements of the wikitext syntax, images, links to Wikipedia help and project pages, and links to external pages are removed; iv) tables analysis: each table is analysed to check for cells that contain links to Wikipedia pages (wikitext syntax: [[link]]). A cell is considered a mention of an entity only if the entire cell is a link. The remaining cells, containing multiple links or additional words around the link, may also be mentions of entities, but we consider them uncertain and mark them as UNKNOWN; v) tables storing: tables that have at least three fully linked cells in a column are stored; vi) table header and table caption detection: the table header (wikitext syntax: !header) and the table caption, if any, are stored and added to the current table; vii) column analysis: each column is analysed and classified as a Literal column (L-column) if it contains datatype values (e.g., strings, numbers, dates, such as 4808, 10/04/1983), or as a Named-Entity column (NE-column) if it contains links to Wikipedia pages; viii) entity linking - CEA: for each Wikipedia link, the related Wikidata entity is extracted; ix) column annotation - CTA: column types are set by choosing the most specific entity class (according to Wikidata subclass relationships) that is shared by most of the column rows. For columns with fewer than 5 rows, all cells must be instances of that class, while, for bigger columns, at least 60% of the cells must be instances of that class; x) NIL-identification: we mark as NIL the cells containing Wikipedia red links14, which are links referring to a page that does not exist.

7 en.wikipedia.org/wiki/USA_Today_All-USA_high_school_football_team
8 en.wikipedia.org/wiki/Christian_Coulson
9 en.wikipedia.org/wiki/List_of_radio_stations_in_Nebraska
10 dumps.wikimedia.org/enwiki/20220720/
11 bitbucket.org/disco_unimib/mammotab/
12 www.crummy.com/software/BeautifulSoup/
13 github.com/5j9/wikitextparser
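The threshold rule of the CTA step (step ix) can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation. It assumes each cell comes with a flat list of class identifiers and ignores the search for the most specific class along Wikidata subclass relationships. It also assumes that the share is computed over all cells of the column. The function and parameter names are hypothetical.

```python
from collections import Counter


def column_type(cell_types, small_column=5, min_share=0.6):
    """Pick a column type from per-cell class lists.

    cell_types: one list of class ids per cell (an empty list for unlinked cells).
    Returns (class_id, share) if the most frequent class passes the threshold,
    otherwise None.
    """
    n_cells = len(cell_types)
    if n_cells == 0:
        return None
    # Count each class at most once per cell.
    counts = Counter(t for types in cell_types for t in set(types))
    if not counts:
        return None
    best_type, best_count = counts.most_common(1)[0]
    share = best_count / n_cells
    # Small columns need full agreement; larger ones need at least `min_share`.
    required = 1.0 if n_cells < small_column else min_share
    return (best_type, share) if share >= required else None


# Hypothetical column: 6 of 7 cells are instances of Q81941037.
mostly_typed = [["Q81941037"]] * 6 + [[]]
# Hypothetical column: only 2 of 7 cells are instances of Q18127.
rarely_typed = [["Q18127"]] * 2 + [[]] * 5

accepted = column_type(mostly_typed)
rejected = column_type(rarely_typed)
```

Under these assumptions the first column is annotated with Q81941037 (share 6/7, above the 60% threshold), while the second column gets no type (share 2/7).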

Listing 1 shows an example of the JSON document used to store the results of the process described above for the Wikipedia page about “As Long as You Love Me (Justin Bieber song)”15.

Listing 1: JSON document with the information relating to a Wikipedia page that contains at least one table.

    {"wiki_id": '36115735',
     "title": 'As Long as You Love Me (Justin Bieber song)',
     "tables": {
      "XXI7BFMW": {
       "caption": 'Promotional release dates for "As Long as You Love Me"',
       "header": [['Region', 'Date', 'Format', 'Label']],
       "link": [['', '', '', ''],
                ['', '', 'Music_download', 'Island_Records'],
                ['', '', 'Music_download', 'Island_Records'], ...],
       "text": [['Region', 'Date', 'Format', 'Label'],
                ['United States', 'June 11, 2012', 'Digital Download', 'Island Records'],
                ['Canada', 'June 11, 2012', 'Digital Download', 'Island Records'], ...],
       "target_col": [2],
       "entity": [['', '', '', ''],
                  ['', '', 'Q6473564', 'Q190585'],
                  ['', '', 'Q6473564', 'Q190585'], ...],
       "types": [[[], [], [], []],
                 [[], [], ['Q81941037'], ['Q18127']],
                 [[], [], ['Q81941037'], ['Q18127']], ...],
       "col_types": [[], [], [['Q81941037', 0.8571428571428571]], [['Q18127', 0.2857142857142857]]],
       "col_type_perfect": ['', '', 'Q81941037', '']}}}
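As an illustration of how such a page document could be consumed, the following Python sketch flattens the "entity" matrix of every table into CEA-style tuples. It assumes the layout suggested by Listing 1: a top-level "tables" object keyed by table id, each holding an "entity" matrix with empty strings for unannotated cells. The function name and the small sample document are hypothetical, and row and column indices are simply positions in the matrix.

```python
def cea_rows(page_doc):
    """Yield (table_id, row, col, qid) for every annotated cell in a page document."""
    for table_id, table in page_doc["tables"].items():
        for r, row in enumerate(table["entity"]):
            for c, qid in enumerate(row):
                if qid:  # skip cells without a linked Wikidata entity
                    yield (table_id, r, c, qid)


# Hypothetical page document mirroring the fields shown in Listing 1.
page_doc = {
    "wiki_id": "36115735",
    "tables": {
        "XXI7BFMW": {
            "entity": [
                ["", "", "", ""],
                ["", "", "Q6473564", "Q190585"],
                ["", "", "Q6473564", "Q190585"],
            ],
        }
    },
}

annotations = list(cea_rows(page_doc))
```

Each resulting tuple identifies one annotated cell, which is close in spirit to the per-cell rows used for CEA annotations. The exact column order of the released CSV files is not assumed here.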

A re-run of the Python scripts, simply pointing to a new Wikipedia XML dump file to process16, makes it possible to obtain a new version of the dataset.

4. Evaluation

The experiments in this section aim to demonstrate how MammoTab helps identify the weaknesses of an STI approach, with particular reference to the key challenges reported in Section 2. The MTab [12] approach was considered since it won several editions of the SemTab challenge. A sample of 5 000 tables without NIL-mentions was selected because of the limitations of the free MTab API17. Table 3 reports the results obtained by the approach.

The scores obtained by MTab on MammoTab are lower than those it obtains on the other datasets. A substantial decrease in the F1-Score for the CTA task can also be noted. The results show that, in its current version, MammoTab is a valuable resource for testing STI approaches, which need sophisticated mechanisms that take many semantic aspects into account. However, as done in other datasets [9], it is possible to add some noise (i.e., misspelt or fake mentions) to increase the complexity of the annotation task.
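For readers unfamiliar with the metric, the following Python sketch shows how precision, recall and F1-Score are commonly computed for cell-level annotations. It assumes the usual definitions (correct annotations over submitted annotations, and over target annotations). The exact scorer behind Table 3 is not specified here, and the function name and toy data are hypothetical.

```python
def f1_report(gold, predicted):
    """Compute (precision, recall, f1) for cell annotations.

    gold, predicted: dicts mapping (table_id, row, col) to a Wikidata id.
    """
    correct = sum(1 for key, qid in predicted.items() if gold.get(key) == qid)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: 4 target cells, 3 submitted annotations, 2 of them correct.
gold = {
    ("T1", 1, 2): "Q6473564",
    ("T1", 1, 3): "Q190585",
    ("T1", 2, 2): "Q6473564",
    ("T1", 2, 3): "Q190585",
}
predicted = {
    ("T1", 1, 2): "Q6473564",
    ("T1", 1, 3): "Q190585",
    ("T1", 2, 2): "Q42",
}

precision, recall, f1 = f1_report(gold, predicted)
```

With this toy input, precision is 2/3, recall is 1/2, and the F1-Score is their harmonic mean, 4/7.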

14 en.wikipedia.org/wiki/Wikipedia:Red_link
15 en.wikipedia.org/wiki/As_Long_as_You_Love_Me_(Justin_Bieber_song)
16 dumps.wikimedia.org/backup-index.html
17 mtab.app/mtab/docs

[1] Neumaier, Umbrich, J. X. Parreira, Polleres, Multi-level semantic labelling of numerical values, in: The Semantic Web - ISWC 2016, Springer International Publishing, Cham, 2016, pp. 428-445.

[2] Kejriwal, C. A. Knoblock, Szekely, Knowledge graphs: Fundamentals, techniques, and applications, MIT Press, 2021.

[3] Cutrona, Chen, Efthymiou, Hassanzadeh, Jimenez-Ruiz, Sequeda, Srinivas, Abdelmageed, Hulsebos, Oliveira, Pesquita, Results of semtab 2021, in: 20th International Semantic Web Conference, volume 3103, CEUR Workshop Proceedings, 2022, pp. 1-12.

[4] Zhang, E. Meij, Balog, Reinanda, Novel entity discovery from web tables, in: Proceedings of The Web Conference 2020, WWW '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 1298-1308.

[5] Ritze, Bizer, Matching web tables to dbpedia - a feature utility study, in: Proceedings of the 20th International Conference on Extending Database Technology, EDBT 2017, Venice, Italy, March 21-24, 2017, OpenProceedings, Konstanz, 2017, pp. 210-221.

[6] Limaye, Sarawagi, Chakrabarti, Annotating and searching web tables using entities, types and relationships, Proc. VLDB Endow. 3 (2010) 1338-1347.

[7] Zhang, Effective and efficient semantic table interpretation using tableminer+, Semantic Web 8 (2017) 921-957.

[8] Taheriyan, C. A. Knoblock, Szekely, J. L. Ambite, Leveraging linked data to discover semantic relations within data sources, in: The Semantic Web - ISWC, 2016, pp. 549-565.

[9] Cutrona, Bianchi, Jimenez-Ruiz, Palmonari, Tough tables: Carefully evaluating entity linking for tabular data, in: The Semantic Web - ISWC 2020, Lecture Notes in Computer Science, Springer International Publishing, 2020, pp. 328-343.

[10] C. S. Bhagavatula, Noraset, Downey, Tabel: Entity linking in web tables, in: M. Arenas, O. Corcho, E. Simperl, M. Strohmaier, M. d'Aquin, K. Srinivas, P. Groth, M. Dumontier, J. Heflin, K. Thirunarayan, S. Staab (Eds.), The Semantic Web - ISWC 2015, Springer International Publishing, Cham, 2015, pp. 425-441.

[11] Deng, Sun, Lees, Wu, Yu, Turl: Table understanding through representation learning, SIGMOD Rec. 51 (2022) 33-40.

[12] Nguyen, I. Yamada, Kertkeidkachorn, Ichise, Takeda, Semtab 2021: Tabular data annotation with mtab tool, in: SemTab@ISWC, 2021, pp. 92-101.