Results of SemTab 2020

Ernesto Jiménez-Ruiz (1,2), Oktie Hassanzadeh (3), Vasilis Efthymiou (4), Jiaoyan Chen (5), Kavitha Srinivas (3), and Vincenzo Cutrona (6)

1 City, University of London, UK. ernesto.jimenez-ruiz@city.ac.uk
2 SIRIUS, University of Oslo, Norway. ernestoj@uio.no
3 IBM Research, USA. hassanzadeh@us.ibm.com, kavitha.srinivas@ibm.com
4 ICS-FORTH, Greece. vefthym@ics.forth.gr
5 University of Oxford, UK. jiaoyan.chen@cs.ox.ac.uk
6 Università degli Studi di Milano - Bicocca, Italy. vincenzo.cutrona@unimib.it

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. SemTab 2020 was the second edition of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, successfully collocated with the 19th International Semantic Web Conference (ISWC) and the 15th Ontology Matching (OM) Workshop. SemTab provides a common framework to conduct a systematic evaluation of state-of-the-art systems.

Keywords: Tabular data · Knowledge Graphs · Matching · Semantic Table Interpretation

1 Motivation

Tabular data in the form of CSV files is the common input format in a data analytics pipeline. However, a lack of understanding of the semantic structure and meaning of the content may hinder the data analytics process. Gaining this semantic understanding is therefore very valuable for data integration, data cleaning, data mining, machine learning and knowledge discovery tasks. For example, understanding what the data represents can help assess which transformations are appropriate (see, e.g., the AIDA project: https://www.turing.ac.uk/research/research-projects/artificial-intelligence-data-analytics-aida). Tables on the Web may also be the source of highly valuable data. The addition of semantic information to Web tables may enhance a wide range of applications, such as web search, question answering, and knowledge base (KB) construction.

Tabular data to Knowledge Graph (KG) matching is the process of assigning semantic tags from Knowledge Graphs (e.g., Wikidata or DBpedia) to the elements of a table. This task, however, is often difficult in practice because metadata (e.g., table and column names) is missing, incomplete or ambiguous. Tabular data to KG matching tasks typically include (i) cell to KG entity matching (CEA task), (ii) column to KG class matching (CTA task), and (iii) column pair to KG property matching (CPA task); a small illustrative example of the three tasks is given in Section 2.

There existed several approaches that aim at addressing one or several of the above tasks, as well as datasets with ground truths that can serve as benchmarks (e.g., [14, 13]). Despite this significant amount of work, there was a lack of a common framework to conduct a systematic evaluation of state-of-the-art systems. The creation of SemTab (http://www.cs.ox.ac.uk/isg/challenges/sem-tab/) [11] aimed at filling this gap and becoming the reference challenge in this community, in the same way the OAEI (http://oaei.ontologymatching.org/) is for the Ontology Matching community (http://ontologymatching.org/).

2 The Challenge

The SemTab 2020 challenge started on May 26 and closed on October 27. The target KG in this edition was Wikidata [18]:

– Wikidata Dump (April 24, 2020): https://doi.org/10.5281/zenodo.4282941

SemTab 2020 was organised into four evaluation rounds, where we aimed at testing datasets of increasing difficulty.
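Before describing the rounds in more detail, the following minimal sketch illustrates the three matching tasks on a toy table about cities, using Wikidata as the target KG. The Python dictionary representation, the row/column indexing and the chosen property namespace are illustrative assumptions rather than the official SemTab submission format; the Wikidata identifiers themselves (e.g., Q585 for Oslo, P17 for "country") are real.

```python
# Illustrative only: a toy table and the kind of annotations targeted by the
# CEA, CTA and CPA tasks, using Wikidata identifiers. The data structures
# below are assumptions for the sake of the example, not the SemTab format.

toy_table = {
    "header": ["City", "Country", "Population"],
    "rows": [
        ["Oslo", "Norway", "697010"],
        ["Athens", "Greece", "664046"],
    ],
}

# CEA: cell -> KG entity, keyed by (row index, column index).
cea_annotations = {
    (0, 0): "http://www.wikidata.org/entity/Q585",   # Oslo
    (0, 1): "http://www.wikidata.org/entity/Q20",    # Norway
    (1, 0): "http://www.wikidata.org/entity/Q1524",  # Athens
    (1, 1): "http://www.wikidata.org/entity/Q41",    # Greece
}

# CTA: column -> KG class, keyed by column index.
cta_annotations = {
    0: "http://www.wikidata.org/entity/Q515",   # city
    1: "http://www.wikidata.org/entity/Q6256",  # country
}

# CPA: column pair -> KG property, keyed by (subject column, object column);
# the prop/direct namespace is shown for illustration.
cpa_annotations = {
    (0, 1): "http://www.wikidata.org/prop/direct/P17",    # country
    (0, 2): "http://www.wikidata.org/prop/direct/P1082",  # population
}

if __name__ == "__main__":
    for (row, col), entity in cea_annotations.items():
        print(f"Cell {toy_table['rows'][row][col]!r} -> {entity}")
```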
Rounds 1-3 were run with the support of AIcrowd (https://www.aicrowd.com/), which provided an automatic evaluation of the submitted solutions, and relied on an automatic dataset generator [11]. Round 4 was a blind round (i.e., no evaluation of submissions via AIcrowd) combining: (i) an automatically generated (AG) dataset as in previous rounds, and (ii) the Tough Tables (2T) dataset for CEA and CTA [7]. Table 1 provides a summary of the statistics of the datasets used in each round.

Table 1: Statistics of the datasets in each SemTab 2020 round.

                 Automatically Generated (AG)               Tough Tables (2T)
                 Round 1    Round 2    Round 3    Round 4    Round 4
  Tables #       34,295     12,173     62,614     22,207     180
  Avg. Rows #    7.3        6.9        6.3        21         1,080
  Avg. Cols #    4.9        4.6        3.6        3.5        4.5

Both the AG datasets and the 2T dataset are available on Zenodo [9, 6]:

– AG datasets: https://doi.org/10.5281/zenodo.4282879
– 2T dataset: https://doi.org/10.5281/zenodo.4246370

Table 2 shows the participation per round. We had a total of 28 different systems (at least 28 different AIcrowd submissions) participating across the four rounds; however, only 11 systems produced results in 3 or more rounds, and we identify these as the SemTab 2020 core participants (all of them also submitted a system paper to the challenge): MTab4Wikidata [15], LinkingPark [3], DAGOBAH [10], bbw [16], JenTab [1], MantisTable SE [4], AMALGAM [8], SSL [12], LexMa [17], Kepler-aSI [2], and TeamTR [19].

Table 2: Participation in the SemTab 2020 challenge. Outliers (F-score < 0.3): † 3 systems, ‡ 8 systems, § 1 system.

                 Round 1    Round 2    Round 3    Round 4-AG    Round 4-2T
  Total 2019     17         11         9          8             -
  Total 2020     18         16         18         10            9
  CEA            10         10         9          9             9
  CTA            15         13†        16‡        9§            8
  CPA            9§         11         8          7             -

2.1 Evaluation measures

Systems were requested to submit a single solution for each of the provided targets in each of the tasks. For example, a target cell in the CEA task is to be annotated with a single entity in the target KG. The evaluation measures for CEA and CPA are the standard Precision, Recall and F1-score, as defined in Equation 1:

  P = \frac{|\text{Correct Annotations}|}{|\text{System Annotations}|}, \quad
  R = \frac{|\text{Correct Annotations}|}{|\text{Target Annotations}|}, \quad
  F_1 = \frac{2 \times P \times R}{P + R}    (1)

where target annotations refer to the target cells for CEA and the target column pairs for CPA. An annotation is regarded as correct if it is contained in the ground truth. Note that one target cell or column pair may have multiple annotations in the ground truth.

For the evaluation of CTA, we used approximations of Precision and Recall, adapting their numerators to give partial credit to columns annotated with one of the ancestors or descendants of the ground truth (GT) classes. In that sense, we define the correctness score cscore of a CTA annotation α as

  cscore(\alpha) =
  \begin{cases}
    0.8^{d(\alpha)}, & \text{if } \alpha \text{ is in the GT, or an ancestor of the GT},\\
    0.7^{d(\alpha)}, & \text{if } \alpha \text{ is a descendant of the GT},\\
    0, & \text{otherwise;}
  \end{cases}    (2)

where d(α) is the shortest distance to one of the ground truth classes. For example, d(α) = 0 if α is a class in the ground truth, and d(α) = 2 if α is a grandparent of a class in the ground truth. In the former case, the score of α will be 1, while in the latter it will be 0.64. Then, our approximations of Precision (AP), Recall (AR), and F1-score (AF1) for the evaluation of CTA are computed as follows:

  AP = \frac{\sum_{\alpha} cscore(\alpha)}{|\text{System Annotations}|}, \quad
  AR = \frac{\sum_{\alpha} cscore(\alpha)}{|\text{Target Annotations}|}, \quad
  AF_1 = \frac{2 \times AP \times AR}{AP + AR}    (3)
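As a concrete reference for Equations (1)-(3), the following Python sketch computes the CEA/CPA measures and the approximate CTA measures from plain dictionaries of annotations. It is a minimal illustration, not the official SemTab scorer: it assumes that each ground-truth target may carry several acceptable annotations (as noted above) and that the distance d(α) and the ancestor/descendant relation are supplied by the caller.

```python
# Minimal sketch of the SemTab 2020 evaluation measures (not the official scorer).
# `system` maps each annotated target (e.g., a cell or column id) to the single
# submitted annotation; `ground_truth` maps each target to the set of acceptable
# annotations.

def precision_recall_f1(system, ground_truth):
    """Standard Precision, Recall and F1 for the CEA and CPA tasks (Equation 1)."""
    correct = sum(1 for target, ann in system.items()
                  if ann in ground_truth.get(target, set()))
    p = correct / len(system) if system else 0.0
    r = correct / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1


def cscore(distance, relation):
    """Correctness score of a CTA annotation (Equation 2).

    `distance` is the shortest distance d(alpha) to a ground-truth class and
    `relation` is one of "gt", "ancestor", "descendant" or "none"; both are
    assumed to be derived from the KG class hierarchy by the caller.
    """
    if relation in ("gt", "ancestor"):
        return 0.8 ** distance  # a GT class has distance 0, hence a score of 1.0
    if relation == "descendant":
        return 0.7 ** distance
    return 0.0


def cta_scores(system, ground_truth, hierarchy_fn):
    """Approximate AP, AR and AF1 for the CTA task (Equation 3).

    `hierarchy_fn(annotation, gt_classes)` is assumed to return the pair
    (d(alpha), relation) for the submitted class annotation.
    """
    total = 0.0
    for target, annotation in system.items():
        distance, relation = hierarchy_fn(annotation, ground_truth.get(target, set()))
        total += cscore(distance, relation)
    ap = total / len(system) if system else 0.0
    ar = total / len(ground_truth) if ground_truth else 0.0
    af1 = 2 * ap * ar / (ap + ar) if ap + ar > 0 else 0.0
    return ap, ar, af1


if __name__ == "__main__":
    gt = {"cell-1": {"Q585"}, "cell-2": {"Q20"}}
    submission = {"cell-1": "Q585"}             # one annotated target out of two
    print(precision_recall_f1(submission, gt))  # -> (1.0, 0.5, 0.666...)
```

Computing d(α) and the ancestor/descendant relation requires the class hierarchy of the target KG (here Wikidata), which is why the sketch leaves that lookup to the caller.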
2.2 Results

Table 3 shows the average F1-score for the top-10 systems after discarding outliers (i.e., systems or submissions with very low performance). The additional complexity introduced by the Tough Tables dataset, compared to the previous rounds, is clearly visible in the average results. Note that Round 4 was blind.

Table 3: Average F1-score for the Top-10 systems, discarding outliers.

                 Automatically Generated (AG)               Tough Tables (2T)
                 Round 1    Round 2    Round 3    Round 4    Round 4
  CEA            0.93       0.95       0.94       0.92       0.54
  CTA            0.83       0.93       0.94       0.92       0.59
  CPA            0.93       0.97       0.93       0.96       -

CEA task. Figure 1a shows the results for the CEA task over the AG datasets. The dataset in each round aimed at introducing new challenges, which is reflected to some extent in the results of MantisTable, LinkingPark, LexMa and SSL. Nevertheless, the overall results for the AG datasets were very positive. In contrast, as shown in Figure 1b, the performance over the 2T dataset drops significantly: only three systems (MTab4Wikidata, bbw and LinkingPark) managed to maintain an F1-score above 0.8. The performance of the LexMa system on the 2T dataset is also worth mentioning: it ranked 4th there, while it was among the lowest-ranked systems on the AG datasets.

Fig. 1: Results in the CEA task with the Automatically Generated (AG) and Tough Tables (2T) Datasets. (a) AG Datasets: F1-score per round (R1-R4); (b) Average AG vs 2T. Systems: MTab4Wikidata, LinkingPark, MantisTable, SSL, DAGOBAH, JenTab, bbw, AMALGAM, LexMa.

CTA task. As shown in Figure 2, the results in the CTA task are relatively similar to those in the CEA task: the average performance on the AG datasets is very good, but the F1-score on the 2T dataset is dramatically reduced for all the systems (see Figure 2b). It is worth emphasising a general improvement from Round 1 to Round 2 (see Figure 2a).

Fig. 2: Results in the CTA task with the Automatically Generated (AG) and Tough Tables (2T) Datasets. (a) AG Datasets: F1-score per round (R1-R4); (b) Average AG vs 2T. Systems: MTab4Wikidata, LinkingPark, MantisTable, SSL, DAGOBAH, JenTab, bbw, AMALGAM.

CPA task. Figure 3 summarises the results for the CPA task over the AG datasets. Note that, currently, the 2T dataset is only available for the CEA and CTA tasks. It is worth mentioning the improvement of systems like JenTab and bbw in Round 4. MTab4Wikidata, LinkingPark and DAGOBAH maintained a constant performance during the last rounds.

Fig. 3: Results in the CPA task with the Automatically Generated (AG) Datasets: F1-score per round (R1-R4). Systems: MTab4Wikidata, LinkingPark, MantisTable, SSL, DAGOBAH, JenTab, bbw, TeamTR.

2.3 Prizes

IBM Research (https://www.research.ibm.com/) sponsored the prizes for the best systems in the challenge. This sponsorship was important not only for the challenge awards, but also because it shows a strong interest from industry.

– 1st Prize: MTab4Wikidata, which was the top system in all tasks and the least impacted by the 2T dataset.
– 2nd Prize: LinkingPark, which had a very good and constant performance, just below MTab4Wikidata.
– 3rd Prize: DAGOBAH and bbw. DAGOBAH had overall very positive results, apart from the CEA task in the 2T dataset. On the other hand, bbw had an outstanding performance in the last CEA round.
3 Lessons Learned and Future Work

As in SemTab 2019 [11], the experience of SemTab 2020 has been successful and has served to increase the community interest in the semantic enrichment of tabular data. Both on the organisation side and the participation side, we aim at preparing a new edition of the SemTab challenge in 2021. Next, we summarise the ideas for future editions of the challenge that were discussed during the International Semantic Web Conference.

Data shifting. Several participants preferred the option of having a fixed target KG, given as a data dump, instead of using the SPARQL endpoints and related services to access the latest version of the KG. This concern was especially important as Wikidata is continuously updated. This, however, may be challenging as it may require "locally" storing and processing a large amount of data, and it may hinder the participation of systems that rely on online lookup services (e.g., DBpedia Lookup), since the exposed KG may differ from the fixed target KG.

Blind evaluation. Some participants proposed to have both a public and a private leaderboard in AIcrowd to cover the rounds with a blind evaluation. It is unclear if AIcrowd provides this service, but systems like STILTool [5] may support this functionality.

Systems as services. To improve reproducibility, we are considering requesting participants to submit their systems as services, following a pre-defined API, in at least one of the SemTab rounds. This could be achieved by offering the system as a (Web) service or by submitting a Docker image of the system; a minimal sketch of what such a service could look like is given at the end of this section.

The user as a metric. In addition to the standard evaluation measures, it was proposed to include tasks that allow measuring productivity from a user's point of view. For example, a user may be more interested in a system that is easy to set up and run than in a sophisticated system that provides better results but requires a non-trivial effort to install or execute.

Complexity of annotations. In future editions, SemTab should also consider datasets that involve more complex annotations, to reflect more realistic use cases.

Domain-specific datasets. It was also proposed to use domain-specific datasets and KGs (e.g., biomedical) as targets in future editions, in addition to cross-domain KGs like DBpedia and Wikidata and related datasets.
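To illustrate the "systems as services" idea discussed above, the sketch below shows one possible shape of such a service, implemented with Flask. The endpoint name, request fields and response format are purely hypothetical, since SemTab has not defined this API; the sketch only shows how an annotation system could be wrapped as a web service and shipped, for example, as a Docker image.

```python
# Hypothetical sketch of a "system as a service" wrapper (no such API has been
# defined by SemTab). It assumes an existing annotate_table() function that
# implements the participant's own CEA/CTA/CPA logic.
from flask import Flask, jsonify, request

app = Flask(__name__)


def annotate_table(table):
    """Placeholder for the participant's annotation system."""
    # A real system would return CEA/CTA/CPA annotations for `table`.
    return {"cea": [], "cta": [], "cpa": []}


@app.route("/annotate", methods=["POST"])
def annotate():
    # The request body is assumed to carry the table as a list of rows.
    table = request.get_json(force=True).get("rows", [])
    return jsonify(annotate_table(table))


if __name__ == "__main__":
    # A Docker image for the service would simply run this entry point.
    app.run(host="0.0.0.0", port=8080)
```

Submitting a container that exposes such an endpoint would let the organisers re-run all systems under identical conditions, which is the reproducibility goal mentioned above.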
Acknowledgements

We would like to thank the challenge participants, the ISWC & OM organisers, the AIcrowd team, and our sponsors (SIRIUS and IBM Research), who played a key role in the success of SemTab. Special mention is due to Federico Bianchi and Matteo Palmonari, who contributed to the creation of the Tough Tables dataset. This work was also supported by the AIDA project (Alan Turing Institute), the SIRIUS Centre for Scalable Data Access (Research Council of Norway), Samsung Research UK, Siemens AG, and the EPSRC projects AnaLOG, OASIS and UK FIRES.

References

1. N. Abdelmageed and S. Schindler. JenTab: Matching Tabular Data to Knowledge Graphs. In Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab). CEUR-WS.org, 2020.
2. W. Baazouzi, M. Kachroudi, and S. Faiz. Kepler-aSI: Kepler as a Semantic Interpreter. In Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab). CEUR-WS.org, 2020.
3. S. Chen, A. Karaoglu, C. Negreanu, T. Ma, J.-G. Yao, J. Williams, A. Gordon, and C.-Y. Lin. LinkingPark: An integrated approach for Semantic Table Interpretation. In Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab). CEUR-WS.org, 2020.
4. M. Cremaschi, R. Avogadro, A. Barazzetti, and D. Chieregato. MantisTable SE: an Efficient Approach for the Semantic Table Interpretation. In Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab). CEUR-WS.org, 2020.
5. M. Cremaschi, A. Siano, R. Avogadro, E. Jiménez-Ruiz, and A. Maurino. STILTool: A Semantic Table Interpretation evaLuation Tool. In ESWC 2020 Satellite Events, pages 61–66, 2020.
6. V. Cutrona, F. Bianchi, E. Jiménez-Ruiz, and M. Palmonari. Tough Tables: Carefully Benchmarking Semantic Table Annotators [Data set]. https://doi.org/10.5281/zenodo.3840646, 2020.
7. V. Cutrona, F. Bianchi, E. Jiménez-Ruiz, and M. Palmonari. Tough tables: Carefully evaluating entity linking for tabular data. In 19th International Semantic Web Conference, pages 328–343, 2020.
8. G. Diallo and R. Azzi. AMALGAM: making tabular dataset explicit with knowledge graph. In Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab). CEUR-WS.org, 2020.
9. O. Hassanzadeh, V. Efthymiou, J. Chen, E. Jiménez-Ruiz, and K. Srinivas. SemTab 2020: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching Data Sets. https://doi.org/10.5281/zenodo.4282879, 2020.
10. V.-P. Huynh, J. Liu, Y. Chabot, T. Labbé, P. Monnin, and R. Troncy. DAGOBAH: Enhanced Scoring Algorithms for Scalable Annotations of Tabular Data. In Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab). CEUR-WS.org, 2020.
11. E. Jiménez-Ruiz, O. Hassanzadeh, V. Efthymiou, J. Chen, and K. Srinivas. SemTab 2019: Resources to Benchmark Tabular Data to Knowledge Graph Matching Systems. In The Semantic Web: ESWC 2020. Springer International Publishing, 2020.
12. D. Kim, H. Park, J. K. Lee, and W. Kim. Generating conceptual subgraph from tabular data for knowledge graph matching. In Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab). CEUR-WS.org, 2020.
13. O. Lehmberg, D. Ritze, R. Meusel, and C. Bizer. A large public corpus of web tables containing time and context metadata. In WWW, 2016.
14. G. Limaye, S. Sarawagi, and S. Chakrabarti. Annotating and searching web tables using entities, types and relationships. VLDB Endowment, 3(1-2):1338–1347, 2010.
15. P. Nguyen, I. Yamada, N. Kertkeidkachorn, R. Ichise, and H. Takeda. MTab4Wikidata at the SemTab 2020: Tabular Data Annotation with Wikidata. In Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab). CEUR-WS.org, 2020.
16. R. Shigapov, P. Zumstein, J. Kamlah, L. Oberländer, J. Mechnich, and I. Schumm. bbw: Matching CSV to Wikidata via Meta-lookup. In Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab). CEUR-WS.org, 2020.
17. S. Tyagi and E. Jiménez-Ruiz. LexMa: Tabular Data to Knowledge Graph Matching using Lexical Techniques. In Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab). CEUR-WS.org, 2020.
18. D. Vrandecic and M. Krötzsch. Wikidata: a free collaborative knowledge base. Commun. ACM, 57(10):78–85, 2014.
19. S. Yumusak. Knowledge graph matching with inter-service information transfer. In Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab). CEUR-WS.org, 2020.