=Paper=
{{Paper
|id=Vol-3320/paper3
|storemode=property
|title=MammoTab: A Giant and Comprehensive Dataset for Semantic Table Interpretation
|pdfUrl=https://ceur-ws.org/Vol-3320/paper3.pdf
|volume=Vol-3320
|authors=Mattia Marzocchi,Marco Cremaschi,Riccardo Pozzi,Roberto Avogadro,Matteo Palmonari
|dblpUrl=https://dblp.org/rec/conf/semweb/MarzocchiCPAP22
}}
==MammoTab: A Giant and Comprehensive Dataset for Semantic Table Interpretation==
Mattia Marzocchi1, Marco Cremaschi1, Riccardo Pozzi1, Roberto Avogadro1 and Matteo Palmonari1
1 University of Milano-Bicocca, viale Sarca 336, Edificio U14, 20126, Milan, Italy
Abstract
In this paper, we present MammoTab, a dataset composed of 1M Wikipedia tables extracted from over
20M Wikipedia pages and annotated through Wikidata. The lack of such datasets in the state of the
art makes MammoTab a valuable resource for testing and training Semantic Table Interpretation
approaches. The dataset has been designed to cover several key challenges, such as disambiguation,
homonymy, and NIL-mentions. The dataset has been evaluated using MTab, one of the best-performing
approaches in the SemTab challenge.
Keywords
Semantic Table Interpretation, Tabular Data, SemTab Challenge, Knowledge Graph
1. Introduction
A vast amount of information is provided as structured data on the Web in tables, and this quantity
has grown over the years. This increase can be linked to the uptake of the Open Data movement,
whose purpose is to make a large number of tabular data sources freely available, addressing
a wide range of domains, such as finance, mobility, tourism, sports, or cultural heritage [1].
The massive availability of tabular data on the Web makes Web tables a valuable source to
consider for data mining. For instance, these tables can be employed for data integration tasks
or to construct and extend Knowledge Graphs (KGs) [2]. In this field, the table-to-KG matching
problem, also referred to as Semantic Table Interpretation (STI), is the process of adding
semantic meaning to a table by mapping its elements (i.e., cells/mentions, columns, rows) to
semantic tags (i.e., entities, classes, properties) from KGs (e.g., Wikidata, DBpedia). This process
is typically broken down into the following tasks: (i) cell/mentions to Knowledge Graph (KG)
entity matching (CEA task), (ii) column to KG class matching (CTA task), and (iii) column pair
to KG property matching (CPA task) [3]. Over the last decade, table-to-KG matching has attracted
much attention in the research community [3]. This interest is also attested by the introduction
of the “SemTab: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching”
SemTab 2022: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching
m.marzocchi@campus.unimib.it (M. Marzocchi); marco.cremaschi@unimib.it (M. Cremaschi);
riccardo.pozzi@unimib.it (R. Pozzi); roberto.avogadro@unimib.it (R. Avogadro);
matteo.palmonari@unimib.it (M. Palmonari)
ORCID: 0000-0003-0855-0245 (M. Marzocchi); 0000-0001-7840-6228 (M. Cremaschi); 0000-0002-4954-3837 (R. Pozzi);
0000-0001-8074-7793 (R. Avogadro); 0000-0002-1801-5118 (M. Palmonari)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073
international challenge1, now in its 4th edition. The challenge consists of different rounds in
which groups of tables with different features and levels of difficulty have to be annotated. The
increased interest in STI has led to the construction of several gold-standard datasets in the
last decade. As described in Section 2, these datasets often cover only a subset of the
characteristics of Web tables (e.g., small tables with few mentions that are easy to annotate
semantically [4]).
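To make the three STI tasks concrete, the following toy example (hand-crafted for illustration; only the Wikidata identifiers are real, the table itself is invented) sketches the annotations an STI system would produce for a small two-column table:

```python
# Toy illustration of the CEA, CTA, and CPA tasks on an invented table.
table = [
    ["City", "Country"],   # header row
    ["Paris", "France"],
    ["Berlin", "Germany"],
]

# CEA: cell/mention -> KG entity, keyed by (row, column)
cea = {
    (1, 0): "Q90",    # Paris
    (1, 1): "Q142",   # France
    (2, 0): "Q64",    # Berlin
    (2, 1): "Q183",   # Germany
}

# CTA: column -> KG class
cta = {0: "Q515", 1: "Q6256"}   # Q515 = city, Q6256 = country

# CPA: column pair -> KG property
cpa = {(0, 1): "P17"}           # P17 = "country"

assert cea[(2, 0)] == "Q64" and cta[0] == "Q515" and cpa[(0, 1)] == "P17"
```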
Consequently, we created a new dataset, MammoTab, composed of 980 254 tables extracted
from 21 149 260 Wikipedia pages and annotated through Wikidata. The number and variety of
its tables make MammoTab a good resource for testing and/or training STI approaches. In
particular, because of its size, MammoTab is a useful tool to train data-hungry models, which
require a vast amount of data (e.g., entity linking systems based on large language models).
The rest of the paper is organised as follows. Section 2 presents a brief analysis of
state-of-the-art datasets used in the context of STI. Subsequently, MammoTab is
described (Section 3), its characteristics are listed (Section 3.1), and the pipeline used for its
implementation is presented (Section 3.2). The evaluation through a state-of-the-art STI
approach is presented in Section 4.
2. Datasets
Although several approaches deal with semantic annotations on tabular data, there are few
Gold Standards (GSs) for assessing the quality of these annotations. The main ones are T2Dv2,
Limaye, MusicBrainz, IMDB, Taheriyan, Tough Tables, and SemTab. Table 1 shows statistics for
the GSs.
Table 1
Statistics for the most common datasets. ‘-’ indicates unknown.

GS                    | Tables | Columns | Rows    | Classes | Entities  | Predicates | KG
T2Dv2 [5]             | 234    | 1 157   | 27 996  | 39      | -         | 154        | DBpedia
Limaye [6]            | 6 522  | -       | -       | 747     | 142 737   | 90         | Wikipedia, Yago
LimayeAll [7]         | 6 310  | 28 547  | 135 978 | -       | 227 046   | -          | Freebase
Limaye200 [7]         | 200    | 903     | 4 144   | 615     | -         | 361        | Freebase
MusicBrainz [7]       | 1 406  | 9 842   | -       | 9 842   | 93 266    | 7 030      | Freebase
IMDB [7]              | 7 416  | 7 416   | -       | 7 416   | 92 321    | -          | Freebase
Taheriyan [8]         | 29     | 2 467   | 16 006  | -       | -         | -          | Schema
Tough Tables (2T) [9] | 180    | 802     | 194 438 | 540     | 667 244   | 0          | Wikidata, DBpedia
SemTab2019 R1         | 64     | 320     | 9 088   | 120     | 8 418     | 116        | DBpedia
SemTab2019 R2         | 11 924 | 59 620  | 298 100 | 14 780  | 463 796   | 6 762      | DBpedia
SemTab2019 R3         | 2 161  | 10 805  | 153 431 | 5 752   | 406 827   | 7 575      | DBpedia
SemTab2019 R4         | 817    | 3 268   | 51 471  | 1 732   | 107 352   | 2 747      | DBpedia
SemTab2020 R1         | 34 294 | 170 068 | 249 329 | 135 773 | 985 109   | 135 773    | Wikidata
SemTab2020 R2         | 12 173 | 55 951  | 84 896  | 43 752  | 283 446   | 43 752     | Wikidata
SemTab2020 R3         | 62 614 | 229 321 | 396 903 | 166 632 | 768 324   | 166 632    | Wikidata
SemTab2020 R4         | 22 387 | 79 552  | 670 335 | 32 461  | 1 662 164 | 56 475     | Wikidata
SemTab2021 R1         | 180    | 802     | 194 438 | 539     | 667 243   | 56 475     | Wikidata
SemTab2021 R2         | 1 750  | 5 589   | 29 280  | 2 190   | 47 439    | 3 835      | Wikidata
SemTab2021 R3         | 7 207  | 17 902  | 58 949  | 7 206   | 58 948    | 10 694     | Wikidata
1 www.cs.ox.ac.uk/isg/challenges/sem-tab/
An excellent STI approach must consider and adequately balance the different features of
a table (or a set of tables). The annotation involves several key challenges: i) disambiguation:
the class of the entities described in a table is not known in advance, and those entities may
correspond to more than one class in the KG; ii) homonymy: this issue is related to the presence
of different entities with the same name and class; iii) matching: the mention in the table may
be syntactically different from the label of the entity in the KG (e.g., due to acronyms, aliases,
and typos); iv) NIL-mentions: the approach must also consider strings that refer to entities
for which a representation has not yet been created within the KG, namely NIL-mentions; v)
literal and named-entity columns: a table can contain columns with named-entity mentions
(NE-columns) and columns containing literal strings (L-columns); vi) missing context: it is often
easier to extract context from textual documents than from tables due to the limited amount of
content to be processed. For instance, the header (i.e., the first row of a table), which usually
contains descriptive attributes for the columns, may or may not be present; vii) amount of data:
the approach must handle large tables with many rows and columns as well as tables with very
few mentions; viii) different domains: the tables within a set can belong to very general or very
specific domains. MammoTab has been designed to cover all these cases, making it a resource
for evaluating or training STI approaches.
The dataset is made up of tables automatically extracted from Wikipedia. Several pipelines
for extracting tables from Wikipedia have been presented in the literature. Among these,
TabEL [10] proposes an STI approach and a dataset (the WikiTable corpus) composed of 1.6M
tables from the November 2013 XML dump of English Wikipedia. The dataset focuses only on
the CEA task, using YAGO as the reference KG. It should be noted that the WikiTable corpus is
outdated and the code has not been made available2. Another paper [11] proposes a dataset
composed of 670 171 tables extracted from the WikiTable corpus, to which the Wikipedia page
title, section title, and table caption have been added to obtain a more comprehensive
description. However, the dataset has not been released, nor has the source code for its
generation.
3. The MammoTab Dataset
The annotations inside MammoTab are based on Wikidata v. 20220520 and are provided
following the structure used by the SemTab challenge. Each table is stored in one CSV file,
and each line corresponds to a table row. The target columns for annotation and the CTA and
CEA annotations are saved in separate CSV files. A JSON document has also been created for
each Wikipedia page with additional information about the tables extracted from that page (see
Listing 1). We released MammoTab on Zenodo3 following the FAIR Guiding Principles4. The
dataset is released under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) licence5.
MammoTab contains a variety of tables that make it possible to evaluate approaches against
the challenges previously listed: the dataset includes tables containing i) entities that are hard
to disambiguate (e.g., table id LBJJ1WGD - reactor Clinton)6, ii) cases of homonymy (e.g., table
id MRBWAAOA - soccer player
2 websail-fe.cs.northwestern.edu/TabEL/
3 zenodo.org/deposit/7014472
4 www.nature.com/articles/sdata201618
5 creativecommons.org/licenses/by-sa/4.0/
6 en.wikipedia.org/wiki/List_of_cancelled_nuclear_reactors_in_the_United_States
Michael Jordan)7 , iii) aliases (table id JCND1XGG - Tom Riddle alias of Lord Voldemort)8 , and
iv) NIL-mentions (e.g., table id MSTBGKPR - KKOP-LP Wildcat Broad. Inc)9 .
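Assuming the SemTab-style CSV conventions described above (the exact file names and column order used here are illustrative assumptions, not taken from the release), a table and its CEA annotations could be loaded along these lines:

```python
import csv

def load_table(path):
    """Load one MammoTab table: each line of the CSV file is one table row."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))

def load_cea(path):
    """Load CEA annotations, assumed to follow the SemTab layout:
    one (table id, row index, column index, entity id) record per line."""
    annotations = {}
    with open(path, newline="", encoding="utf-8") as f:
        for table_id, row, col, entity in csv.reader(f):
            annotations[(table_id, int(row), int(col))] = entity
    return annotations
```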
3.1. Dataset Profile
The MammoTab tables were extracted from 21 149 260 Wikipedia pages using the XML dump10.
In these pages, 2 803 424 tables were detected. Among these, tables with at least three links
in the same column were retained, for a total of 980 254 tables. Some dataset statistics are
reported in Table 2.
Table 2
Overall statistics of the MammoTab dataset: the total number of Tables, Columns, and Rows; the
minimum, maximum, and average number of columns and rows per table. Linked Cells refers to
the number of cells linked to an entity, Typed Cols refers to the number of columns with a known class
associated, and NILs refers to the number of NIL-mentions.

Tables  | Columns (total / min / max / avg) | Rows (total / min / max / avg)  | Linked Cells | Typed Cols | NILs
980 254 | 5 638 191 / 1 / 500+ / 5.75       | 23 376 498 / 3 / 14 436 / 23.85 | 28 446 720   | 2 001 902  | 4 686 457
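The per-table averages reported in Table 2 can be cross-checked directly from the totals:

```python
# Sanity check of the averages in Table 2 (numbers copied from the table).
tables = 980_254
total_columns = 5_638_191
total_rows = 23_376_498

assert round(total_columns / tables, 2) == 5.75
assert round(total_rows / tables, 2) == 23.85
```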
3.2. Implementation
The dataset was built through a pipeline implemented as a set of Python scripts, which are
available in a Git repository11 .
The pipeline consists of 10 steps: i) dump processing: each Wikipedia XML dump file (we
used the multiple bz2 stream dumps for easier parallelisation) is parsed using BeautifulSoup12
to extract each page; ii) table identification: pages are scanned to find those that contain at least
one table (wikitext syntax: | class=wikitable |); iii) table normalisation and cleaning: each cell
is normalised and cleaned up using custom and wikitextparser13 functions. Elements such as
subscripts, superscripts, elements of the wikitext syntax, images, links to Wikipedia help and
project pages, and links to external pages are removed; iv) table analysis: each table is analysed
to check for cells that contain links to Wikipedia pages (wikitext syntax: [[link]]). A cell
is considered a mention of an entity only if the entire cell is a link. The remaining cells,
containing multiple links or additional words around the link, may also be mentions of entities,
but we consider them uncertain and mark them as UNKNOWN; v) table storing: tables that
have at least three fully linked cells in a column are stored; vi) table header and caption
detection: the table header (wikitext syntax: !header) and the table caption, if any, are stored
and added to the current table; vii) column analysis: each column is analysed and classified as
a Literal column (L-column) for datatype values (e.g., strings, numbers, dates, such as 4808, 10/04/1983),
7 en.wikipedia.org/wiki/USA_Today_All-USA_high_school_football_team
8 en.wikipedia.org/wiki/Christian_Coulson
9 en.wikipedia.org/wiki/List_of_radio_stations_in_Nebraska
10 dumps.wikimedia.org/enwiki/20220720/
11 bitbucket.org/disco_unimib/mammotab/
12 www.crummy.com/software/BeautifulSoup/
13 github.com/5j9/wikitextparser
or a Named-Entity column (NE-column) if it contains links to Wikipedia pages; viii) entity linking
- CEA: for each Wikipedia link, the related Wikidata entity is extracted; ix) column annotation -
CTA: column types are set by choosing the most specific entity class (according to Wikidata
subclass relationships) that is shared by most of the column rows. For columns with fewer than 5
rows, all cells must be instances of that class, while, for larger columns, at least 60% of the cells
must be instances of that class; x) NIL-identification: we mark as NIL the cells containing Wikipedia
red links14, i.e., links referring to a page that does not exist.
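The type-selection rule of step ix) can be sketched as follows. This is a simplified illustration that counts candidate classes directly and omits the traversal of Wikidata subclass relationships used to prefer the most specific class; `pick_column_type` is a hypothetical helper, not part of the released scripts:

```python
from collections import Counter

def pick_column_type(cell_classes, threshold=0.60):
    """Choose a column type from per-cell candidate class sets.

    For columns with fewer than 5 rows, every cell must be an instance of
    the chosen class; for larger columns, a class covering at least
    `threshold` of the cells suffices. Ties are broken by coverage.
    """
    n = len(cell_classes)
    counts = Counter(c for classes in cell_classes for c in set(classes))
    required = n if n < 5 else threshold * n
    winners = [c for c, k in counts.items() if k >= required]
    return max(winners, key=counts.get) if winners else None

# 6 cells, 5 typed Q5 ("human"): 5/6 > 60%, so Q5 is selected.
col = [{"Q5"}, {"Q5"}, {"Q5"}, {"Q5"}, {"Q5"}, {"Q4167410"}]
assert pick_column_type(col) == "Q5"
```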
Listing 1 shows an example of the JSON document used to manage the results of the process
described above for the Wikipedia page about “As Long as You Love Me (Justin Bieber song)”15.
Listing 1: JSON document with the information relating to a Wikipedia page that contains at
least one table.
{"wiki_id": "36115735",
 "title": "As Long as You Love Me (Justin Bieber song)",
 "tables": {
  "XXI7BFMW": {
   "caption": "Promotional release dates for \"As Long as You Love Me\"",
   "header": [["Region", "Date", "Format", "Label"]],
   "link": [["", "", "", ""],
            ["", "", "Music_download", "Island_Records"],
            ["", "", "Music_download", "Island_Records"],
            ...],
   "text": [["Region", "Date", "Format", "Label"],
            ["United States", "June 11, 2012", "Digital Download", "Island Records"],
            ["Canada", "June 11, 2012", "Digital Download", "Island Records"],
            ...],
   "target_col": [2],
   "entity": [["", "", "", ""],
              ["", "", "Q6473564", "Q190585"],
              ["", "", "Q6473564", "Q190585"],
              ...],
   "types": [[[], [], [], []],
             [[], [], ["Q81941037"], ["Q18127"]],
             [[], [], ["Q81941037"], ["Q18127"]],
             ...],
   "col_types": [[], [], [["Q81941037", 0.8571428571428571]], [["Q18127", 0.2857142857142857]]],
   "col_type_perfect": ["", "", "Q81941037", ""]}}}
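The per-page JSON document can be flattened back into SemTab-style CEA records. A minimal sketch, assuming a (table id, row, column, entity) output layout; `json_to_cea` is a hypothetical helper, not part of the released scripts:

```python
def json_to_cea(page):
    """Flatten the 'entity' matrix of each table in a page document into
    (table_id, row, col, qid) CEA records, skipping unlinked (empty) cells."""
    records = []
    for table_id, table in page["tables"].items():
        for r, row in enumerate(table["entity"]):
            for c, qid in enumerate(row):
                if qid:
                    records.append((table_id, r, c, qid))
    return records

# Tiny excerpt of the page document shown in Listing 1.
page = {"tables": {"XXI7BFMW": {"entity": [["", "", "", ""],
                                           ["", "", "Q6473564", "Q190585"]]}}}
assert json_to_cea(page) == [("XXI7BFMW", 1, 2, "Q6473564"),
                             ("XXI7BFMW", 1, 3, "Q190585")]
```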
Re-running the Python scripts, simply pointing them at a new Wikipedia XML dump file to
process16, yields a new version of the dataset.
4. Evaluation
The experiments in this section aim to demonstrate how MammoTab can be used to identify
the weaknesses of an STI approach, with particular reference to the key challenges reported in
Section 2. The MTab [12] approach was considered since it won several editions of the SemTab
challenge. A sample of 5 000 tables without NIL-mentions was selected due to the limitations of
the free MTab17 API. Table 3 reports the results obtained by the approach.
The scores obtained by MTab on MammoTab are lower than on the other datasets. A
substantial decrease in the F1-Score for the CTA task can also be noted. The results show that,
in its current version, MammoTab is a valuable resource for testing STI approaches, which
must be characterised by sophisticated mechanisms that consider many semantic aspects.
Moreover, as done for other datasets [9], it is possible to add noise (i.e., misspelt or
fake mentions) to further increase the complexity of the annotation task.
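For reference, the micro-averaged F1-Score used in such comparisons can be computed from gold and predicted CEA annotations along these lines (a simplified sketch, not the official SemTab scorer):

```python
def f1_score(gold, predicted):
    """Micro-averaged F1 over CEA annotations, where both arguments are
    {(table, row, col): entity} dictionaries."""
    correct = sum(1 for key, ent in predicted.items() if gold.get(key) == ent)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# One of two cells annotated correctly -> precision = recall = F1 = 0.5.
gold = {("t1", 1, 0): "Q90", ("t1", 2, 0): "Q64"}
pred = {("t1", 1, 0): "Q90", ("t1", 2, 0): "Q142"}
assert f1_score(gold, pred) == 0.5
```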
14 en.wikipedia.org/wiki/Wikipedia:Red_link
15 en.wikipedia.org/wiki/As_Long_as_You_Love_Me_(Justin_Bieber_song)
16 dumps.wikimedia.org/backup-index.html
17 mtab.app/mtab/docs
Table 3
Results (F1-Score) obtained by the MTab approach on different datasets.

MTab [12] on  | CEA   | CTA   | CPA
SemTab2019 R4 | 0.983 | -     | 0.832
SemTab2020 R4 | 0.907 | 0.993 | 0.997
SemTab2020 2T | 0.907 | 0.728 | -
SemTab2021 R3 | 0.968 | 0.984 | 0.993
MammoTab      | 0.853 | 0.659 | -
References
[1] S. Neumaier, J. Umbrich, J. X. Parreira, A. Polleres, Multi-level semantic labelling of
numerical values, in: The Semantic Web – ISWC 2016, Springer International Publishing,
Cham, 2016, pp. 428–445.
[2] M. Kejriwal, C. A. Knoblock, P. Szekely, Knowledge graphs: Fundamentals, techniques,
and applications, MIT Press, 2021.
[3] V. Cutrona, J. Chen, V. Efthymiou, O. Hassanzadeh, E. Jimenez-Ruiz, J. Sequeda, K. Srinivas,
N. Abdelmageed, M. Hulsebos, D. Oliveira, C. Pesquita, Results of semtab 2021, in: 20th
International Semantic Web Conference, volume 3103, CEUR Workshop Proceedings, 2022,
pp. 1–12.
[4] S. Zhang, E. Meij, K. Balog, R. Reinanda, Novel entity discovery from web tables, in: Pro-
ceedings of The Web Conference 2020, WWW ’20, Association for Computing Machinery,
New York, NY, USA, 2020, p. 1298–1308.
[5] D. Ritze, C. Bizer, Matching web tables to dbpedia - a feature utility study, in: Proceedings
of the 20th International Conference on Extending Database Technology, EDBT 2017,
Venice, Italy, March 21-24, 2017, OpenProceedings, Konstanz, 2017, pp. 210–221.
[6] G. Limaye, S. Sarawagi, S. Chakrabarti, Annotating and searching web tables using entities,
types and relationships, Proc. VLDB Endow. 3 (2010) 1338–1347.
[7] Z. Zhang, Effective and efficient semantic table interpretation using tableminer+, Semantic
Web 8 (2017) 921–957.
[8] M. Taheriyan, C. A. Knoblock, P. Szekely, J. L. Ambite, Leveraging linked data to discover
semantic relations within data sources, in: The Semantic Web – ISWC, 2016, pp. 549–565.
[9] V. Cutrona, F. Bianchi, E. Jimenez-Ruiz, M. Palmonari, Tough tables: Carefully evaluating
entity linking for tabular data, in: The Semantic Web - ISWC 2020, Lecture Notes in
Computer Science, Springer International Publishing, 2020, pp. 328–343.
[10] C. S. Bhagavatula, T. Noraset, D. Downey, Tabel: Entity linking in web tables, in: M. Arenas,
O. Corcho, E. Simperl, M. Strohmaier, M. d’Aquin, K. Srinivas, P. Groth, M. Dumontier,
J. Heflin, K. Thirunarayan, K. Thirunarayan, S. Staab (Eds.), The Semantic Web - ISWC
2015, Springer International Publishing, Cham, 2015, pp. 425–441.
[11] X. Deng, H. Sun, A. Lees, Y. Wu, C. Yu, Turl: Table understanding through representation
learning, SIGMOD Rec. 51 (2022) 33–40.
[12] P. Nguyen, I. Yamada, N. Kertkeidkachorn, R. Ichise, H. Takeda, SemTab 2021: Tabular data
annotation with MTab tool, in: SemTab@ISWC, 2021, pp. 92–101.