<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>way to generate an STI benchmark for your domain</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nora Abdelmageed</string-name>
          <email>nora.abdelmageed@uni-jena.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ernesto Jiménez-Ruiz</string-name>
          <email>ernesto.jimenez-ruiz@city.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oktie Hassanzadeh</string-name>
          <email>hassanzadeh@us.ibm.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Birgitta König-Ries</string-name>
          <email>birgitta.koenig-ries@uni-jena.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>IBM Research</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>City, University of London</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Heinz Nixdorf Chair for Distributed Information Systems, Friedrich Schiller University Jena</institution>
          ,
          <addr-line>Jena</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Oslo</institution>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Tabular data, often found in CSV files, is essential for data analytics workflows. Understanding this data in a semantic context, known as Semantic Table Interpretation (STI), is critical but challenging due to issues like label ambiguity. Consequently, STI has garnered significant attention in recent years. To evaluate STI systems effectively, robust benchmarks are needed. Most existing large-scale benchmarks originate from general domain sources and emphasize ambiguity, whereas domain-specific benchmarks tend to be smaller. This paper presents KG2Tables, a framework designed to create large-scale domain-specific benchmarks from a Knowledge Graph (KG). KG2Tables utilizes the internal hierarchy of relevant KG concepts and their properties. As a proof of concept, we have developed extensive datasets in the food, biodiversity, and biomedical domains. One of these datasets was used in the ISWC 2023 SemTab challenge, and the rest have been integrated into SemTab 2024.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Semantic Table Interpretation (STI) has recently witnessed increasing attention from the
community [1]. The goal of this process is to map individual table components, e.g., columns and
cells, to entities and classes from a target Knowledge Graph (KG) such as Wikidata [2], or [...]</p>
      <p>[Figure 1 (a): Horizontal Relational Table.]</p>
      <p>[...] challenge. We believe this is due to domain-specific challenges that general-purpose systems
are ill-equipped to handle or that require extensive tuning or training data.</p>
      <p>State-of-the-art STI tasks propose ways to annotate tabular data semantically and, thus,
facilitate a potential transformation into a KG. We summarize them as follows: 1) Cell Entity
Annotation (CEA) links a table cell value to a KG entity. 2) Column Type
Annotation (CTA) assigns a semantic type to an entire column. 3) Column Property Annotation
(CPA) connects a column pair (subject-object) to a semantic property from the KG. 4) Row
Annotation (RA) maps an entire row to a KG entity, differing from CEA in that the subject column
might be missing. 5) Topic Detection (TD) classifies the table into a topic, such as a semantic
class. Figure 1 gives an overview of the five most common STI tasks for two table types: Horizontal
Relational Tables in Figure 1 (a) and Entity Tables in Figure 1 (b). The former lists a set of
entities row-wise. The latter represents a single entity with a list of its properties. The solutions
shown use Wikidata as the target KG.</p>
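      <p>To make the task definitions above concrete, the following minimal Python sketch shows one
possible way to hold the five kinds of annotations for a toy horizontal table. The record layout, the
field names, the cell values, and the Q_*/P_* identifiers are hypothetical placeholders chosen for
illustration; they are not the file format used by KG2Tables or SemTab.</p>
      <preformat>
```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# A toy horizontal relational table: one entity per row.
# Cell values are illustrative only.
TABLE: List[List[str]] = [
    ["Gouda", "Netherlands"],
    ["Feta", "Greece"],
]


@dataclass
class Annotations:
    """Hypothetical container for the five STI task annotations of one table."""
    cea: Dict[Tuple[int, int], str]  # (row, col) -> KG entity
    cta: Dict[int, str]              # col -> KG class (semantic type)
    cpa: Dict[Tuple[int, int], str]  # (subject col, object col) -> KG property
    ra: Dict[int, str]               # row -> KG entity
    td: str                          # whole table -> topic (e.g., a class)


# Placeholder identifiers (Q_*/P_*) stand in for real KG IDs.
annotations = Annotations(
    cea={(0, 0): "Q_GOUDA", (0, 1): "Q_NETHERLANDS",
         (1, 0): "Q_FETA", (1, 1): "Q_GREECE"},
    cta={0: "Q_CHEESE", 1: "Q_COUNTRY"},
    cpa={(0, 1): "P_COUNTRY_OF_ORIGIN"},
    ra={0: "Q_GOUDA", 1: "Q_FETA"},
    td="Q_CHEESE",
)


def count_annotations(a: Annotations) -> Dict[str, int]:
    """Return the number of annotations per task."""
    return {"CEA": len(a.cea), "CTA": len(a.cta), "CPA": len(a.cpa),
            "RA": len(a.ra), "TD": 1}


print(count_annotations(annotations))
```
      </preformat>
      <p>In this toy table, the RA annotation of each row coincides with the CEA annotation of its
subject cell; RA remains applicable when the subject column is missing.</p>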
      <p>In this paper, we introduce KG2Tables<sup>2</sup>, an STI benchmark generator that constructs both
horizontal relational tables and entity tables, given a list of domain-specific concepts from
Wikidata. The Zenodo dump<sup>3</sup> provides the code we refer to in this paper and lists the six
benchmarks we created using KG2Tables from three domains: food, biodiversity, and biomedical.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Methodology</title>
      <p>KG2Tables accepts a list of related domain concepts in a CSV file and constructs a tree structure
for these concepts. In Wikidata, domain concepts may form a graph rather than a tree, since the KG
allows it. However, KG2Tables processes each relevant concept only once. We construct the respective
tree structure using the internal hierarchy of the input concepts. For example, in Wikidata, we
included all instances and subclasses via wdt:P31 (instance of) and wdt:P279 (subclass
of). We use the term “Children” to generalize related instances or subclasses. This tree structure
will be different in DBpedia, where the internal hierarchy is determined via rdf:type
only. We applied a deduplication step since the overall instances and subclasses may overlap.
Such overlap may also occur across different levels of the tree.</p>
      <p><sup>2</sup> https://github.com/fusion-jena/KG2Tables</p>
      <p><sup>3</sup> https://zenodo.org/records/10285835</p>
      <p>[Figure 2. Panel labels: Children; (1) Create Horizontal Tables; (2) Create Entity Tables; (3) Refine Tables; (4) Format Benchmark (gt, targets, tables).]</p>
      <p>Figure 2 depicts the approach we developed to construct domain-specific benchmarks. It
starts with the children of the domain concepts at the current level of the recursion and
consists of four steps. (1) Create Horizontal Tables and (2) Create Raw Entity Tables:
we constructed both types of tables based on the properties of the current children; these
tables contain the solutions of all STI tasks. We applied several operations to these properties,
e.g., union, intersection, or random selection, to create different versions of the same table.
(3) Refine Tables: we revised the collected data and applied several steps to construct the
final tables, e.g., anonymizing column names. (4) Format Benchmark: we separated tables
from solutions and targets to create a complete set of STI tasks. Targets indicate what to solve
in terms of column and row IDs, while solutions include the ground truth data of these targets.</p>
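      <p>The hierarchy traversal and the property operations described in this section can be sketched
as follows. This is a minimal illustration under stated assumptions, not the KG2Tables implementation:
the in-memory CHILDREN and PROPERTIES dictionaries stand in for the results of KG queries (e.g., over
wdt:P31 and wdt:P279), and all identifiers and function names (build_tree, select_columns) are
hypothetical.</p>
      <preformat>
```python
import random
from collections import deque
from typing import Dict, List, Set

# Hypothetical stand-in for KG lookups: concept -> instances/subclasses.
# "C" is reachable from both "A" and "B", so the input is a graph, not a tree.
CHILDREN: Dict[str, List[str]] = {
    "root": ["A", "B"],
    "A": ["C"],
    "B": ["C", "D"],
}

# Hypothetical property sets of some children.
PROPERTIES: Dict[str, Set[str]] = {
    "C": {"p1", "p2", "p3"},
    "D": {"p2", "p3", "p4"},
}


def build_tree(roots: List[str]) -> Dict[str, List[str]]:
    """Breadth-first traversal that processes each concept only once."""
    seen: Set[str] = set(roots)
    tree: Dict[str, List[str]] = {}
    queue = deque(roots)
    while queue:
        concept = queue.popleft()
        tree[concept] = []
        for child in CHILDREN.get(concept, []):
            if child in seen:  # deduplicate within and across levels
                continue
            seen.add(child)
            tree[concept].append(child)
            queue.append(child)
    return tree


def select_columns(children: List[str], mode: str, k: int = 2,
                   seed: int = 0) -> List[str]:
    """Derive table columns from the children's properties."""
    prop_sets = [PROPERTIES.get(c, set()) for c in children]
    if not prop_sets:
        return []
    union = set().union(*prop_sets)
    if mode == "union":
        return sorted(union)
    if mode == "intersection":
        return sorted(set.intersection(*prop_sets))
    if mode == "random":
        pool = sorted(union)
        return sorted(random.Random(seed).sample(pool, min(k, len(pool))))
    raise ValueError(f"unknown mode: {mode}")


tree = build_tree(["root"])
print(tree)
print(select_columns(["C", "D"], "union"))
print(select_columns(["C", "D"], "intersection"))
```
      </preformat>
      <p>In this sketch, "C" is attached only to the first parent that reaches it, which is one simple
way to turn a graph into a tree; the three modes of select_columns yield different column sets, and
thus different versions of the same table.</p>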
    </sec>
    <sec id="sec-4">
      <title>3. Conclusions &amp; Remarks</title>
      <p>In this paper, we presented KG2Tables, a code generator that creates domain-specific tabular
data benchmarks for Semantic Table Interpretation (STI) tasks. It uses the internal hierarchy of
related concepts in a target Knowledge Graph (KG) to generate two types of tables: horizontal
and entity tables. KG2Tables addresses five common STI tasks and was tested in three domains:
Food, Biodiversity, and Biomedical. While our examples use Wikidata, KG2Tables is adaptable
to any KG through SPARQL query modifications. KG2Tables accepts and parses a list of given
domain concepts, thus validating and ensuring the domain specificity of the resulting dataset
using either a data-driven approach or a reuse of existing domain-specific classes.</p>
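      <p>As a hedged illustration of what adapting the SPARQL queries to another KG could look like, the
sketch below keeps one query template per KG for retrieving the children of a concept. The template
texts, the dictionary, and the function name are assumptions made for this example and are not taken
from the KG2Tables code; the prefixes (wd:, wdt:, rdf:, dbo:) are assumed to be predefined by the
respective endpoint, and the concept identifiers are example inputs only.</p>
      <preformat>
```python
from typing import Dict

# Hypothetical per-KG query templates for retrieving the "children" of a concept.
# Wikidata: instance of (wdt:P31) and subclass of (wdt:P279).
# DBpedia: the hierarchy is read from rdf:type only.
# Doubled braces are literal braces for str.format.
CHILDREN_QUERIES: Dict[str, str] = {
    "wikidata": (
        "SELECT DISTINCT ?child WHERE {{ "
        "{{ ?child wdt:P31 {concept} }} UNION {{ ?child wdt:P279 {concept} }} "
        "}}"
    ),
    "dbpedia": "SELECT DISTINCT ?child WHERE {{ ?child rdf:type {concept} }}",
}


def children_query(kg: str, concept: str) -> str:
    """Fill the template of the chosen KG with a prefixed concept identifier."""
    if kg not in CHILDREN_QUERIES:
        raise KeyError(f"no query template registered for KG: {kg}")
    return CHILDREN_QUERIES[kg].format(concept=concept)


# Example concept identifiers; any prefixed name accepted by the endpoint works.
print(children_query("wikidata", "wd:Q2095"))
print(children_query("dbpedia", "dbo:Food"))
```
      </preformat>
      <p>Under this design, supporting an additional KG amounts to registering one more template, while
the rest of the pipeline stays unchanged.</p>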
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wanders</surname>
          </string-name>
          ,
          <article-title>Repurposing and probabilistic integration of data, SIKS dissertation series</article-title>
          , Universiteit Twente,
          <year>2016</year>
          . ISBN: 978-90-365-4110-7, number: 2016-24.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Vrandečić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          ,
          <article-title>Wikidata: A free collaborative knowledgebase</article-title>
          ,
          <source>Commun. ACM</source>
          <volume>57</volume>
          (
          <year>2014</year>
          )
          <fpage>78</fpage>
          -
          <lpage>85</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1145/2629489</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kobilarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ives</surname>
          </string-name>
          ,
          <article-title>DBpedia: A nucleus for a web of open data</article-title>
          ,
          <source>in: The Semantic Web: 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, Busan, Korea, November 11-15, 2007, Proceedings</source>
          , Springer,
          <year>2007</year>
          , pp.
          <fpage>722</fpage>
          -
          <lpage>735</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Jiménez-Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hassanzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Efthymiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Srinivas</surname>
          </string-name>
          ,
          <article-title>SemTab 2019: Resources to benchmark tabular data to knowledge graph matching systems</article-title>
          ,
          <source>in: The Semantic Web - 17th International Conference, ESWC 2020, Heraklion, Crete, Greece, May 31-June 4, 2020, Proceedings</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>514</fpage>
          -
          <lpage>530</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Jiménez-Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hassanzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Efthymiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Srinivas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Cutrona</surname>
          </string-name>
          ,
          <article-title>Results of SemTab 2020</article-title>
          ,
          <source>in: Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2020) co-located with the 19th International Semantic Web Conference (ISWC 2020), Virtual conference (originally planned to be in Athens, Greece), November 5, 2020</source>
          , volume
          <volume>2775</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V.</given-names>
            <surname>Cutrona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Efthymiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hassanzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Jiménez-Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sequeda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Srinivas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Abdelmageed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hulsebos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pesquita</surname>
          </string-name>
          ,
          <article-title>Results of SemTab 2021</article-title>
          ,
          <source>in: Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching co-located with the 20th International Semantic Web Conference (ISWC 2021), Virtual conference, October 27, 2021</source>
          , volume
          <volume>3103</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Abdelmageed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Cutrona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Efthymiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hassanzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hulsebos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Jiménez-Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sequeda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Srinivas</surname>
          </string-name>
          ,
          <article-title>Results of SemTab 2022</article-title>
          ,
          <source>in: Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching co-located with the 21st International Semantic Web Conference (ISWC 2022)</source>
          , volume
          <volume>3320</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>O.</given-names>
            <surname>Hassanzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Abdelmageed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Cutrona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Efthymiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hulsebos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Jiménez-Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khatiwada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Korini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kruit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sequeda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Srinivas</surname>
          </string-name>
          ,
          <article-title>Results of SemTab 2023</article-title>
          ,
          <source>in: Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching co-located with the 22nd International Semantic Web Conference (ISWC 2023)</source>
          , volume
          <volume>3557</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>V.</given-names>
            <surname>Cutrona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bianchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Jiménez-Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Palmonari</surname>
          </string-name>
          ,
          <article-title>Tough Tables: Carefully Evaluating Entity Linking for Tabular Data</article-title>
          ,
          <source>in: 19th International Semantic Web Conference (ISWC)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>328</fpage>
          -
          <lpage>343</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Abdelmageed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schindler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>König-Ries</surname>
          </string-name>
          ,
          <article-title>BiodivTab: A table annotation benchmark based on biodiversity research data</article-title>
          ,
          <source>in: Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching co-located with the 20th International Semantic Web Conference (ISWC 2021), Virtual conference, October 27, 2021</source>
          , volume
          <volume>3103</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2021</year>
          , pp.
          <fpage>13</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>