SOTAB: The WDC Schema.org Table Annotation Benchmark

Keti Korini*, Ralph Peeters and Christian Bizer
Data and Web Science Group, University of Mannheim, Mannheim, Germany
kkorini@uni-mannheim.de (K. Korini); ralph.peeters@uni-mannheim.de (R. Peeters); christian.bizer@uni-mannheim.de (C. Bizer)
ORCID: 0000-0002-2158-0070 (K. Korini); 0000-0003-3174-2616 (R. Peeters); 0000-0003-2367-0237 (C. Bizer)
SemTab@ISWC 2022, October 23–27, 2022, Hangzhou, China (Virtual)
* Corresponding author.

Abstract
Understanding the semantics of table elements is a prerequisite for many data integration and data discovery tasks. Table annotation is the task of labeling table elements with terms from a given vocabulary. This paper presents the WDC Schema.org Table Annotation Benchmark (SOTAB) for comparing the performance of table annotation systems. SOTAB covers the column type annotation (CTA) and column property annotation (CPA) tasks. SOTAB provides ∼50,000 annotated tables for each of the tasks containing Schema.org data from different websites. The tables cover 17 different types of entities such as movie, event, local business, recipe, job posting, or product. The tables stem from the WDC Schema.org Table Corpus, which was created by extracting Schema.org annotations from the Common Crawl. Consequently, the labels used for annotating columns in SOTAB are part of the Schema.org vocabulary. The benchmark covers 91 types for CTA and 176 properties for CPA, distributed across textual, numerical and date/time columns. The tables are split into fixed training, validation and test sets. The test sets are further divided into subsets focusing on specific challenges, such as columns with missing values or different value formats, in order to allow a more fine-grained comparison of annotation systems. The evaluation of SOTAB using Doduo and TURL shows that the benchmark is difficult to solve for current state-of-the-art systems.

Keywords
Table Annotation, Column Type Annotation, Column Property Annotation, Schema.org

1. Introduction

Tables containing structured data are widely used on the Web. Understanding the semantics of tables is useful for a variety of data integration and data discovery tasks such as knowledge base augmentation [1] or dataset search [2]. Table annotation is the task of annotating a table with terms from a given vocabulary, knowledge graph, or database schema. Table annotation includes tasks such as Column Type Annotation (CTA) and Column Property Annotation (CPA). CTA is the annotation of table columns with the type of the entities contained in a column. CPA refers to the annotation of pairs of table columns with labels that indicate the relationship between the main column of the table and another column. Figure 1 shows an example of a table describing hotels with CTA labels shown above the table and CPA labels below.

Figure 1: An example table from SOTAB showcasing CTA and CPA labels

This paper presents the WDC Schema.org Table Annotation Benchmark (SOTAB) for comparing the performance of table annotation systems on the CTA and CPA tasks. The CTA dataset consists of 59,548 tables covering 17 Schema.org [3] types, in which 162,351 columns are annotated using 91 Schema.org types and properties.
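Before summarizing the CPA dataset, the following minimal sketch makes the two annotation tasks concrete using the hotel example from Figure 1. All column values, the chosen labels and the in-memory layout are invented for illustration and do not reflect SOTAB's actual distribution format.

```python
# Illustrative only: an annotated table in the spirit of Figure 1.
# Values, labels and this dictionary layout are invented; SOTAB ships its
# tables and label files in its own format (see the project page).
hotel_table = {
    "columns": [
        ["Hotel Adler", "Grand Plaza"],            # main (subject) column
        ["+49 621 1234567", "+1 212 555 0100"],
        ["Mannheim", "New York"],
    ],
    # CTA: one label (a Schema.org type) per annotated column
    "cta_labels": {0: "Hotel/name", 1: "telephone", 2: "addressLocality"},
    # CPA: one label (a Schema.org property) per (main column, other column) pair
    "cpa_labels": {(0, 1): "telephone", (0, 2): "addressLocality"},
}

# A table annotation system receives the column values (plus, optionally, the
# rest of the table as context) and has to predict exactly these labels.
for col_idx, values in enumerate(hotel_table["columns"]):
    print("CTA:", col_idx, values, "->", hotel_table["cta_labels"][col_idx])
for (main_col, other_col), prop in hotel_table["cpa_labels"].items():
    print("CPA:", (main_col, other_col), "->", prop)
```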
The CPA dataset consists of 48,379 tables in which 174,998 column pairs are annotated using 176 Schema.org properties. The tables used in the benchmark originate from the WDC Schema.org Table Corpus, which was created by extracting Schema.org annotations from the December 2020 version of the Common Crawl. Each table in the corpus contains all entities of a specific Schema.org type that are provided by a specific host, for example all movies annotated on imdb.com. The columns of the table are the attributes that the host uses to describe the entities. Overall, the SOTAB tables contain data gathered from 74,215 different hosts, which makes the benchmark data quite heterogeneous.

2. Related Work

This section provides an overview of existing benchmarks for evaluating table annotation systems and compares them to SOTAB in Table 1. The ToughTables, HardTables, BioDiv and GitTables datasets are used by the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab), a benchmark competition for table annotation systems that takes place every year as part of the International Semantic Web Conference [4]. The ToughTables (2T) [5] dataset is divided into easily solvable tables and tougher tables that are harder to predict; its tables are annotated for CTA. The HardTables [6] dataset was generated by issuing SPARQL queries against Wikidata in order to create tables that resemble tables found on the Web. The GitTables [7] corpus consists of 1M tables from GitHub annotated with DBpedia and Schema.org [3] classes and properties; a subset of this corpus was annotated for CTA. The tables of the BioDiv [8] dataset belong to the biodiversity domain and contain numerical data and abbreviations in their column values. Further datasets for benchmarking table annotation systems include: T2Dv2 [9], which consists of Web tables annotated with DBpedia properties; WikiTables-TURL [10], which consists of Wikipedia tables annotated using terms from Freebase [11]; and REDTab [12], which offers tables from the music and literature domains that are manually annotated for CPA.

Table 1: Overview of existing CTA and CPA benchmarks. The Labels columns report the number of unique labels used to annotate table columns. The KG/VOC column names the vocabulary or knowledge graph used for annotation: DBpedia (DBP), WikiData (WD), Schema.org (SCH) or Freebase (FB).

Benchmark      | Tables  | Median Rows | Median Cols | KG/VOC  | CTA Columns | CTA Labels | CPA Columns | CPA Labels
T2Dv2 [9]      | 779     | 4           | 18          | DBP     | -           | -          | 670         | 119
HardTables [6] | 8,957   | 7           | 2           | WD      | 9,398       | 2,235      | 14,531      | 472
GitTables [7]  | 1,101   | 25          | 11          | SCH/DBP | 721/2,533   | 59/122     | -           | -
2T [5]         | 180     | 89          | 4           | DBP/WD  | 540         | 39/276     | -           | -
BioDiv [8]     | 50      | 99          | 17          | WD      | 614         | 92         | -           | -
WikiTable [10] | 580,171 | 8           | 5           | FB      | 654,670     | 255        | 67,201      | 121
REDTab [12]    | 9,149   | 5           | 18          | -       | -           | -          | 22,236      | 23
SOTAB (ours)   | 107,927 | 42          | 8           | SCH     | 162,351     | 91         | 174,998     | 176

3. Creation of the SOTAB Benchmark

This section describes the selection of the SOTAB tables from the WDC Schema.org Table Corpus, the assignment of labels to the tables, as well as the selection of challenging columns into specific subsets of the test set. The WDC Schema.org Table Corpus was created using Schema.org data that was extracted from the December 2020 version of the Web Data Commons Microdata and JSON-LD corpus. The Schema.org Table Corpus consists of 4.2 million relational tables covering 43 Schema.org types. Each table in the corpus contains the descriptions of all entities of a specific Schema.org type that were extracted for a specific host, e.g. all movie records from imdb.com or product records from ebay.com.
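The following sketch illustrates this grouping of extracted records into one table per host and Schema.org type. It is a conceptual illustration only: the records and field names are invented, and the actual WDC pipeline involves additional processing and cleansing steps, as described next.

```python
from collections import defaultdict

# Invented example records: each record is assumed to be a flat dict of
# Schema.org properties plus the host and Schema.org type it came from.
records = [
    {"host": "imdb.com", "type": "Movie", "name": "Inception", "datePublished": "2010-07-16"},
    {"host": "imdb.com", "type": "Movie", "name": "Arrival", "datePublished": "2016-11-11"},
    {"host": "ebay.com", "type": "Product", "name": "USB-C cable", "price": "3.99"},
]

def group_into_tables(records):
    """Collect all records of one Schema.org type from one host into a single table."""
    tables = defaultdict(list)
    for rec in records:
        key = (rec["host"], rec["type"])
        # The Schema.org properties used by the host become the table's columns.
        tables[key].append({k: v for k, v in rec.items() if k not in ("host", "type")})
    return dict(tables)

for (host, schema_type), rows in group_into_tables(records).items():
    columns = sorted({column for row in rows for column in row})
    print(f"{host} / {schema_type}: {len(rows)} rows, columns = {columns}")
```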
All extracted entities of one Schema.org type are collected per host and subsequently passed through a pipeline of processing and cleansing steps. For more details about the Schema.org Table Corpus, we refer the reader to the project website (http://webdatacommons.org/structureddata/schemaorgtables/).

Table Selection. We begin building SOTAB with a language identification phase that filters out non-English rows from the tables of the Schema.org Table Corpus. For this purpose, we use the fastText language identification model [13, 14] and keep only rows for which the model is at least 50% confident that the language is English. Finally, we remove all tables with fewer than 10 remaining rows and fewer than 3 columns.

Label Generation. As the WDC Schema.org Table Corpus uses Schema.org properties as column headers, these properties can be directly used as labels for the CPA task. We derive the CTA label for a column from its CPA label using the Schema.org vocabulary definition, which specifies the types that are allowed as values of the property. In cases where the vocabulary definition allows multiple types, we manually select the most appropriate type. Lastly, we add some CTA annotations that are not included in the Schema.org vocabulary, such as IdentifierAT, MusicArtistAT and Museum/name. The purpose of these labels is to provide more fine-grained annotations instead of, for example, simply name, so that an annotation system needs a deeper understanding of the column semantics to select the correct annotation. After assigning the labels, another filtering step is performed based on label frequency: we only keep columns whose CPA and CTA labels are used at least 50 times.

Test Sets for Specific Challenges. In addition to the full test sets for both tasks, we provide subsets of the test sets that measure how well systems can handle specific annotation challenges. We provide test sets for the following challenges: (i) Missing Values: columns with a value density between 10 and 70 percent, which are thus harder to predict; (ii) Format Heterogeneity: columns whose values are represented using different value formats, such as date or weight columns; and (iii) Corner Cases: columns that are difficult to annotate because their values are very similar to the values of other columns. Examples of corner cases include startDate versus endDate and currenciesAccepted versus priceCurrency.

Table 2: Statistics of the SOTAB Datasets

            | Training Large | Training Small | Validation | Test (Full) | Test (MV) | Test (FH) | Test (CC) | Test (RC)
CTA Tables  | 46,790         | 11,517         | 5,732      | 7,026       | 1,369     | 475       | 1,754     | 4,563
CTA Columns | 130,471        | 33,004         | 16,840     | 15,040      | 2,808     | 619       | 3,015     | 8,598
CPA Tables  | 37,128         | 9,435          | 4,771      | 6,480       | 1,479     | 1,101     | 2,129     | 5,024
CPA Columns | 134,425        | 33,643         | 17,417     | 23,156      | 4,032     | 1,593     | 3,492     | 14,039

4. Profiling the SOTAB Benchmark

The selection and labeling phase results in 59,548 tables annotated for the CTA task and 48,379 tables annotated for the CPA task, which cover 17 Schema.org classes such as Person, Event or JobPosting. The selected table columns include three data types: textual values, numerical values and DateTime values. All CPA tables include a main (subject) column in the first column position, which is paired with the other columns for the CPA task. In total, 162,351 columns in the CTA tables are annotated using 91 CTA labels such as Organization, Date and Offer, while 174,998 main-column/column pairs in the CPA tables are annotated using 176 CPA labels such as price, datePublished and productID.
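As described in the label generation step in Section 3, the CTA label of a column is derived from its CPA label via the Schema.org definition of the types allowed as values of the property. The sketch below illustrates this lookup with a small, hand-coded excerpt; the entries, including the manually chosen type where Schema.org allows several, are illustrative and do not represent SOTAB's full mapping.

```python
# Illustrative excerpt: maps a Schema.org property (the CPA label of a column)
# to the type used as the column's CTA label. Where Schema.org allows several
# expected types, the manually selected one is stored directly. The entries
# below are examples, not SOTAB's actual label list.
EXPECTED_TYPE = {
    "datePublished": "Date",
    "startDate": "Date",           # schema.org allows Date or DateTime
    "address": "PostalAddress",    # schema.org allows PostalAddress or Text
    "priceCurrency": "Text",
}

def cta_label_for(cpa_label: str) -> str:
    """Derive a column's CTA label from its CPA label (its Schema.org property)."""
    if cpa_label not in EXPECTED_TYPE:
        raise KeyError(f"no expected value type recorded for property {cpa_label!r}")
    return EXPECTED_TYPE[cpa_label]

print(cta_label_for("datePublished"))  # -> Date
```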
Detailed statistics about the number of tables per Schema.org class as well as the number of columns per label are provided on the SOTAB project page (http://webdatacommons.org/structureddata/sotab/). We split the CTA and CPA tables with a ratio of 80:10:10 into fixed training, validation and test sets, using multi-label stratification to include examples of all labels in every set. We further split the training set with the same method to obtain the Small training set, which allows comparing how methods perform when trained on fewer examples. Furthermore, we provide subsets of the test set columns for specific annotation challenges. These subsets are created by grouping the columns in the test set according to the challenges described in Section 3: test columns that have missing values (Test MV), test columns that are corner cases (Test CC), test columns that include values in different formats (Test FH) and test columns that are chosen randomly (Test RC). Statistics of all splits are given in Table 2.

5. Benchmark Evaluation

We evaluate the difficulty of SOTAB using three supervised methods. The first is a simple Random Forest classifier using TF-IDF-weighted words as features. The second is TURL [10], which is pre-trained on a large corpus of Wikipedia tables using a combination of the Masked Language Model objective of BERT and a Masked Entity Recovery objective introduced by the authors. The last is Doduo [15], which uses multi-task learning to fine-tune BERT simultaneously for the CTA and CPA tasks. We fine-tune TURL for 50 epochs and Doduo for 30 epochs using a learning rate of 5e-5. We report the micro-F1 score on the test sets for CTA and CPA in Table 3. For both the CTA and CPA tasks, there is a significant difference in the F1 scores among the methods, with the Transformer-based methods achieving at least 17 percentage points more than the Random Forest. This can be an indicator that incorporating table context helps the models make better decisions for both tasks. The corner-case columns (CC) and the columns with missing values (MV) are the subsets on which both Transformer-based methods make the most prediction errors in both tasks. Finally, training with the smaller set leads to 3-8 percentage points lower scores on both tasks. The F1 scores show that SOTAB is challenging for all methods.

Table 3: SOTAB results (micro-F1) for Random Forest (RF), TURL and Doduo (DO) using the large and small training sets

     | CTA Large         | CTA Small         | CPA Large         | CPA Small
     | RF    TURL  DO    | RF    TURL  DO    | RF    TURL  DO    | RF    TURL  DO
Full | 58.58 78.96 84.82 | 55.57 72.16 76.27 | 44.80 72.93 79.96 | 41.44 66.30 75.38
MV   | 60.47 73.14 83.28 | 58.04 66.98 69.55 | 44.59 66.24 78.27 | 42.06 59.97 74.34
CC   | 55.15 73.59 78.03 | 51.97 68.19 74.00 | 38.00 62.54 71.24 | 35.39 57.87 66.73
FH   | 64.78 90.14 92.98 | 62.35 87.88 85.95 | 45.69 77.15 83.50 | 43.69 69.86 77.38
RC   | 58.72 81.93 87.12 | 55.53 74.11 81.82 | 46.46 76.96 82.25 | 42.51 69.81 77.64

6. Conclusion and Availability

This paper introduced the WDC SOTAB benchmark. The aim of SOTAB is to complement the set of publicly available table annotation benchmarks with a CTA and CPA benchmark that covers various entity types of general interest, e.g. products, local businesses, and job postings, and that provides training data from many independent data sources for these types in order to reflect the full heterogeneity of the values that are used to describe entities. The WDC SOTAB benchmark is available for public download on the project page. The code that was used for the creation of the benchmark is provided on GitHub (https://github.com/wbsg-uni-mannheim/wdc-sotab).
References

[1] D. Ritze, O. Lehmberg, Y. Oulabi, C. Bizer, Profiling the Potential of Web Tables for Augmenting Cross-domain Knowledge Bases, in: Proceedings of the 25th International Conference on World Wide Web, 2016, pp. 251–261.
[2] A. Chapman, E. Simperl, L. Koesten, G. Konstantinidis, L.-D. Ibáñez, et al., Dataset search: A survey, The VLDB Journal 29 (2020) 251–272.
[3] R. V. Guha, D. Brickley, S. Macbeth, Schema.org: Evolution of structured data on the web, Communications of the ACM 59 (2016) 44–51.
[4] V. Cutrona, J. Chen, V. Efthymiou, O. Hassanzadeh, E. Jimenez-Ruiz, et al., Results of SemTab 2021, in: Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, volume 3103, CEUR-WS, 2022, pp. 1–12.
[5] V. Cutrona, F. Bianchi, E. Jiménez-Ruiz, M. Palmonari, Tough Tables: Carefully Evaluating Entity Linking for Tabular Data, in: Proceedings of the 19th International Semantic Web Conference, 2020, pp. 328–343.
[6] E. Jiménez-Ruiz, O. Hassanzadeh, V. Efthymiou, J. Chen, K. Srinivas, SemTab 2019: Resources to Benchmark Tabular Data to Knowledge Graph Matching Systems, in: The Semantic Web, Springer International Publishing, 2020, pp. 514–530.
[7] M. Hulsebos, Ç. Demiralp, P. Groth, GitTables: A Large-Scale Corpus of Relational Tables, arXiv preprint arXiv:2106.07258 (2022).
[8] N. Abdelmageed, S. Schindler, B. König-Ries, BiodivTab: A Table Annotation Benchmark based on Biodiversity Research Data, in: Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, volume 3103, CEUR-WS, 2021, pp. 13–18.
[9] D. Ritze, C. Bizer, Matching Web Tables To DBpedia - A Feature Utility Study, in: Proceedings of the 20th International Conference on Extending Database Technology, 2017, pp. 210–221.
[10] X. Deng, H. Sun, A. Lees, Y. Wu, C. Yu, TURL: Table understanding through representation learning, Proceedings of the VLDB Endowment 14 (2020) 307–319.
[11] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, J. Taylor, Freebase: a collaboratively created graph database for structuring human knowledge, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2008, pp. 1247–1250.
[12] S. Singh, A. F. Aji, G. Singh, C. Christodoulopoulos, A relation extraction dataset for knowledge extraction from web tables, in: Proceedings of the 29th International Conference on Computational Linguistics, 2022, pp. 2319–2327.
[13] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of tricks for efficient text classification, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, 2017, pp. 427–431.
[14] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, et al., FastText.zip: Compressing text classification models, arXiv preprint arXiv:1612.03651 (2016).
[15] Y. Suhara, J. Li, Y. Li, D. Zhang, Ç. Demiralp, et al., Annotating columns with pre-trained language models, in: Proceedings of the 2022 International Conference on Management of Data, 2022, pp. 1493–1503.