<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Binh Vu</string-name>
          <email>binhvu@isi.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Craig A. Knoblock</string-name>
          <email>knoblock@isi.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fandel Lin</string-name>
          <email>fandel.lin@usc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>SemTab 2024, Semantic Description, Semantic Table Interpretation</institution>
          ,
          <addr-line>Knowledge Graphs, Semantic Web, Data Integration</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>USC Information Sciences Institute</institution>
          ,
          <addr-line>Marina del Rey, CA 90292</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <fpage>11</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>There is an enormous number of tables available on the Web. However, it is difficult to automatically use the tables in data analytic pipelines because of the lack of semantic understanding of their structure and meaning. To address this problem, our approach, GRAMS+, automatically creates semantic descriptions of tables using distant supervision. SemTab is an annual challenge that provides a diverse set of benchmarks for systems that match tabular data with knowledge graphs. In this paper, we present the results of GRAMS+ at SemTab 2024 in the Accuracy Track. The results show that GRAMS+ is scalable and achieves competitive performance in the tasks in which we participated.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Matching tabular data to an ontology or a knowledge graph is an essential problem in Data Integration.
The task is to annotate types of columns in the tables using classes of the target ontology and relations
between columns using the ontology properties. We developed a novel approach, GRAMS+ [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
addressing this problem using distant supervision. The approach leverages the fact that some data in a
table will often overlap with data in a knowledge graph (KG), which can be used to discover candidate
types and relationships in the table. Then, the approach uses two neural networks (NN) trained with a
labeled dataset generated automatically from Wikipedia tables to predict the final column types and
relationships.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. The SemTab Challenge</title>
      <p>The SemTab 2024 challenge consists of several tracks ranging from semantic table interpretation to
dataset assessment and contributions. We focus on the Accuracy Track, which is relevant to our
approach. This track contains four matching tasks: (1) the Cell Entity Annotation (CEA) matches a cell
to a KG entity, (2) the Column Type Annotation (CTA) assigns a KG class to a column, (3) the Column
Property Annotation (CPA) assigns a KG property to the relationship between two columns, and (4)
Topic Detection (TD) assigns a KG class to a table. Figure 1 shows an example table annotation.</p>
      <p>There are two types of tables in this track: horizontal tables (or relational tables) and entity tables.
A horizontal table is a grid where each row represents an entity and each column shares the same
semantic type (e.g., Figure 1). An entity table describes a single entity, where each row contains a
property of that entity.</p>
      <p>
        Finally, the standard micro precision, recall, and F1 are used to measure the performance of the
participating systems [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
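      <p>As a minimal sketch of how these micro-averaged metrics can be computed, the Python snippet below counts correct annotations over all targets at once. The function name <monospace>micro_prf</monospace>, the dictionary-based input format, and the toy identifiers are illustrative assumptions; the snippet does not reproduce the official SemTab evaluator, which may, for example, accept several correct answers per target.</p>
      <preformat><![CDATA[
```python
from typing import Dict, Hashable, Tuple


def micro_prf(gold: Dict[Hashable, str], pred: Dict[Hashable, str]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over all annotation targets.

    `gold` and `pred` map a target, e.g. a (table, column) pair, to one KG identifier.
    """
    # A prediction is correct when the target exists in the gold set with the same label.
    correct = sum(1 for target, label in pred.items() if gold.get(target) == label)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Toy data (assumed): three gold targets, two submitted annotations, one correct.
gold = {("t1", 0): "A", ("t1", 1): "B", ("t2", 0): "C"}
pred = {("t1", 0): "A", ("t1", 1): "X"}
precision, recall, f1 = micro_prf(gold, pred)
```
]]></preformat>
      <p>In this sketch, precision is computed over the submitted annotations and recall over the gold annotations, so a system that skips targets is penalized only in recall.</p>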
    </sec>
    <sec id="sec-4">
      <title>3. GRAMS+ Approach</title>
      <p>
        We generate the labeled dataset by leveraging the hyperlinks inside the Wikipedia tables to find
corresponding Wikidata entities and predict columns’ relationships based on the linked entities. We
remove context-inconsistent hyperlinks by first automatically assigning a type to each column based
on the most common type of its entities. Then, we employ a blocklist to remove all links in a column
if the column header is incompatible with the predicted column type. The blocklist is constructed
by manually verifying headers that appeared in multiple predicted types. As our approach is detailed
in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the remainder of this section provides a brief overview of each component in GRAMS+, along
with any changes to fit the SemTab 2024 challenge.
      </p>
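      <p>The link-filtering step described above can be sketched as follows. This is an illustrative simplification, not the GRAMS+ implementation: it assumes each linked entity has a single type, and the blocklist entry, the function names, and the entity identifiers are hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from typing import List, Optional, Set, Tuple

# Hypothetical blocklist of (column header, predicted type) pairs judged incompatible.
BLOCKLIST: Set[Tuple[str, str]] = {("Year", "human")}


def majority_type(entity_types: List[Optional[str]]) -> Optional[str]:
    """Return the most common type among the linked entities of a column."""
    counts = Counter(t for t in entity_types if t is not None)
    return counts.most_common(1)[0][0] if counts else None


def filter_column_links(
    header: str, links: List[Optional[str]], entity_types: List[Optional[str]]
) -> List[Optional[str]]:
    """Drop every link in a column whose header is blocklisted for its majority type."""
    column_type = majority_type(entity_types)
    if column_type is not None and (header, column_type) in BLOCKLIST:
        return [None] * len(links)
    return links


# Toy column (assumed data): two linked cells and one unlinked cell.
links = ["E1", "E2", None]
types = ["human", "human", None]
kept = filter_column_links("Director", links, types)
dropped = filter_column_links("Year", links, types)
```
]]></preformat>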
      <sec id="sec-4-1">
        <title>3.1. Entity Linking</title>
        <p>Following typical entity linking (EL) systems, our EL approach consists of three main steps: (1) detect
the entity columns, which contain the cells that will be linked; (2) retrieve candidate entities for each cell;
and (3) compute the candidates’ likelihood.</p>
        <p>For step 1, we directly use the target entity columns provided in SemTab’s datasets instead of running
the entity detection. To retrieve candidate entities, GRAMS+ combines multiple search strategies, such
as the public Wikidata Search API, keyword search using ElasticSearch, and fuzzy search using
SymSpell. Given the huge number of tables in the Wikidata Tables dataset in Round 2 (78,745 tables),
we cannot use the public Wikidata API for search and use only the latter two strategies.</p>
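        <p>One simple way to combine several retrieval strategies is to take the union of their results and keep the best score per entity, as sketched below. How GRAMS+ actually merges the results is not described here, so the merging rule, the assumption that scores are comparable across retrievers, and the toy in-memory stand-ins for keyword and fuzzy search are all hypothetical; no ElasticSearch or SymSpell calls are shown.</p>
        <preformat><![CDATA[
```python
from typing import Callable, Dict, Iterable, List, Tuple

# A retriever maps a cell string to (entity id, retrieval score) pairs.
Retriever = Callable[[str], Iterable[Tuple[str, float]]]


def retrieve_candidates(
    cell: str, retrievers: List[Retriever], limit: int = 100
) -> List[Tuple[str, float]]:
    """Union the results of several retrievers, keeping the best score per entity."""
    best: Dict[str, float] = {}
    for retrieve in retrievers:
        for entity_id, score in retrieve(cell):
            if score > best.get(entity_id, float("-inf")):
                best[entity_id] = score
    # Highest score first; ties broken by entity id for determinism.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Toy stand-ins for a keyword index and a fuzzy matcher (hypothetical data).
def keyword_search(cell: str) -> List[Tuple[str, float]]:
    index = {"paris": [("E1", 0.9), ("E2", 0.4)]}
    return index.get(cell.lower(), [])


def fuzzy_search(cell: str) -> List[Tuple[str, float]]:
    index = {"paris": [("E1", 0.7), ("E3", 0.5)]}
    return index.get(cell.lower(), [])


candidates = retrieve_candidates("Paris", [keyword_search, fuzzy_search])
```
]]></preformat>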
        <p>To compute the candidates’ likelihood, we use a two-hidden-layer perceptron with ReLU activations.
It is trained using the auto-labeled dataset with the following groups of features:</p>
        <p>Surface Features include four string similarity functions between a cell and an entity name:
Levenshtein, Jaro-Winkler, Monge-Elkan, and Generic Jaccard.</p>
        <p>
          Entity-context Similarity Features capture the coherence between a candidate and the surrounding
context of a cell. GRAMS+ uses two context similarity features: the weighted dot product of the embeddings
of the column header and the candidate description, and the number of cells matched with the candidate’s
properties divided by a large constant representing the maximum number of columns in a table (e.g., 20) for
rescaling. The embeddings are computed from a Sentence Transformer model [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]<sup>1</sup>, and the weights of
embedding dimensions are learnable parameters. Note that GRAMS+ trains two entity linking models
for tables with and without headers. Because tables from the SemTab datasets do not have column
headers, GRAMS+ uses the model trained on tables without headers.
        </p>
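        <p>Entity Prior Features bias the predictions toward popular entities. Currently, we use the normalized
log PageRank of a candidate as the prior feature. The normalized log PageRank of an entity <italic>e</italic> is calculated
as follows:</p>
        <p>
          <disp-formula>
            <tex-math><![CDATA[\frac{\log(\mathrm{pagerank}(e)) - \min_{e' \in \mathcal{E}} \log(\mathrm{pagerank}(e'))}{\max_{e' \in \mathcal{E}} \log(\mathrm{pagerank}(e')) - \min_{e' \in \mathcal{E}} \log(\mathrm{pagerank}(e'))}]]></tex-math>
          </disp-formula>
where ℰ is the set of entities in the KG and pagerank(<italic>e</italic>) is the PageRank of an entity <italic>e</italic>.</p>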
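        <p>The prior feature is a min-max normalization of the log PageRank over all entities in the KG. A small sketch, assuming the PageRank scores are available as a dictionary with strictly positive values (the function name, the handling of the degenerate case, and the toy scores are assumptions):</p>
        <preformat><![CDATA[
```python
import math
from typing import Dict


def normalized_log_pagerank(pagerank: Dict[str, float]) -> Dict[str, float]:
    """Min-max normalize log(pagerank) over all entities, mapping it to [0, 1]."""
    logs = {entity: math.log(score) for entity, score in pagerank.items()}
    lo, hi = min(logs.values()), max(logs.values())
    if hi == lo:
        # Degenerate case (an assumption): all entities are equally popular.
        return {entity: 0.0 for entity in logs}
    return {entity: (value - lo) / (hi - lo) for entity, value in logs.items()}


# Toy PageRank scores spaced evenly on a log scale.
prior = normalized_log_pagerank({"E1": 1e-6, "E2": 1e-4, "E3": 1e-2})
```
]]></preformat>
        <p>Taking the logarithm before rescaling keeps the few very popular entities from compressing all other entities toward zero.</p>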
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Column Type Prediction</title>
        <p>
          To predict the type of a column, we use a greedy algorithm that first selects the type with the highest
score from the set of types directly found in the candidate entities of a column. Then, it iteratively
refines the prediction by replacing it with an ancestor type within a given distance of the directly found types if
the score difference is larger than a specific threshold, increasing the distance until it reaches the maximum
chosen distance (max_distance). The score of a type is computed by summing, over the cells of the column, the
maximum likelihood of the candidate entities of that type and then dividing by the number of rows. We use the
same threshold (0.1) and maximum distance (max_distance = 2) as in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
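        <p>The sketch below gives one possible reading of this greedy procedure; it is not the GRAMS+ implementation. It assumes that a candidate entity counts as an instance of every ancestor of its direct types within the current distance, that the type hierarchy is given as a child-to-parents dictionary, and that the challenger at each distance is the highest-scoring reachable type. All names and the toy data are hypothetical.</p>
        <preformat><![CDATA[
```python
from typing import Dict, List, Optional, Set, Tuple

# One list per cell; each candidate entity is (likelihood, set of direct types).
Cell = List[Tuple[float, Set[str]]]


def ancestors_within(type_id: str, parents: Dict[str, Set[str]], distance: int) -> Set[str]:
    """Types reachable from type_id in at most `distance` upward steps, itself included."""
    seen, frontier = {type_id}, {type_id}
    for _ in range(distance):
        frontier = {p for t in frontier for p in parents.get(t, set())} - seen
        seen |= frontier
    return seen


def type_score(type_id: str, cells: List[Cell], parents: Dict[str, Set[str]], distance: int) -> float:
    """Sum over cells of the best likelihood among candidates of the type, divided by the row count."""
    total = 0.0
    for candidates in cells:
        matching = [
            likelihood
            for likelihood, direct_types in candidates
            if any(type_id in ancestors_within(t, parents, distance) for t in direct_types)
        ]
        total += max(matching, default=0.0)
    return total / len(cells)


def predict_column_type(
    cells: List[Cell],
    parents: Dict[str, Set[str]],
    threshold: float = 0.1,
    max_distance: int = 2,
) -> Optional[str]:
    direct = {t for candidates in cells for _, types in candidates for t in types}
    if not direct:
        return None
    # Step 1: best type among those directly found in the candidate entities.
    best = max(sorted(direct), key=lambda t: type_score(t, cells, parents, 0))
    best_score = type_score(best, cells, parents, 0)
    # Step 2: widen the search to ancestors, one step at a time.
    for distance in range(1, max_distance + 1):
        reachable: Set[str] = set()
        for t in direct:
            reachable |= ancestors_within(t, parents, distance)
        challenger = max(sorted(reachable), key=lambda t: type_score(t, cells, parents, distance))
        challenger_score = type_score(challenger, cells, parents, distance)
        if challenger_score - best_score > threshold:
            best, best_score = challenger, challenger_score
    return best


# Toy column: half the cells are cities, half are towns; both share one parent type.
parents = {"city": {"human settlement"}, "town": {"human settlement"}}
cells = [[(0.9, {"city"})], [(0.8, {"city"})], [(0.9, {"town"})], [(0.7, {"town"})]]
predicted = predict_column_type(cells, parents)
```
]]></preformat>
        <p>In the toy column, neither direct type covers more than half of the rows, so the shared parent type wins once the search is widened by one step; with a higher threshold the direct type would be kept.</p>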
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Column Relationship Prediction</title>
        <p>To predict the relationship of a column, GRAMS+ first constructs a candidate graph containing potential
relationships between columns. Then, GRAMS+ uses a classifier to predict the likelihood of each link
in the graph. As the SemTab challenge provides pairs of target columns for predictions, we directly use
the most likely relationships between target columns as the final predictions.</p>
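        <p>Selecting the final predictions for the given target column pairs then reduces to an argmax over the scored candidate links. A minimal sketch, assuming each candidate link is a tuple of source column, target column, property, and predicted likelihood (this data layout, the function name, and the example scores are hypothetical):</p>
        <preformat><![CDATA[
```python
from typing import Dict, List, Optional, Tuple

# Candidate link: (source column, target column, property id, predicted likelihood).
Link = Tuple[int, int, str, float]
Pair = Tuple[int, int]


def select_relationships(links: List[Link], targets: List[Pair]) -> Dict[Pair, Optional[str]]:
    """For each target column pair, keep the property of its most likely candidate link."""
    best: Dict[Pair, Tuple[float, str]] = {}
    for source, target, prop, likelihood in links:
        key = (source, target)
        if key not in best or likelihood > best[key][0]:
            best[key] = (likelihood, prop)
    # Pairs without any candidate link get no prediction.
    return {pair: best[pair][1] if pair in best else None for pair in targets}


# Toy candidate graph with illustrative Wikidata property ids and made-up scores.
links = [(0, 1, "P17", 0.82), (0, 1, "P131", 0.35), (0, 2, "P1082", 0.64)]
predictions = select_relationships(links, [(0, 1), (0, 2), (0, 3)])
```
]]></preformat>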
        <p>The classifier employed to predict the likelihood of links is also a two-hidden-layer perceptron with
ReLU activations. It is trained on the auto-labeled dataset with features such as the relative frequency of
discovering the link from the top K entities, the average link likelihood, the relative frequency of finding
contradicting information between the table data and KG data, and whether there is a many-to-many
relationship between the source and target of the link.</p>
        <p><sup>1</sup> We use the pretrained all-mpnet-base-v2 model.</p>
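        <p>For illustration, the forward pass of a two-hidden-layer perceptron with ReLU activations can be written in a few lines of plain Python. The layer sizes, the hand-set weights, the sigmoid output, and the four-feature input (frequency among top K entities, average link likelihood, contradiction frequency, many-to-many flag) are assumptions made for this sketch; the trained GRAMS+ parameters are not reproduced here.</p>
        <preformat><![CDATA[
```python
import math
from typing import Dict, List, Sequence

Matrix = List[List[float]]


def dense(x: Sequence[float], weights: Matrix, bias: Sequence[float]) -> List[float]:
    """Affine layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: Sequence[float]) -> List[float]:
    return [max(0.0, v) for v in x]


def link_likelihood(features: Sequence[float], params: Dict[str, list]) -> float:
    """Two hidden ReLU layers followed by a sigmoid output (output choice is an assumption)."""
    h1 = relu(dense(features, params["w1"], params["b1"]))
    h2 = relu(dense(h1, params["w2"], params["b2"]))
    (logit,) = dense(h2, params["w3"], params["b3"])
    return 1.0 / (1.0 + math.exp(-logit))


# Hand-set toy parameters, NOT trained weights.
params = {
    "w1": [[1.0, 1.0, -1.0, 0.0], [0.0, 0.0, 1.0, 1.0]],
    "b1": [0.0, 0.0],
    "w2": [[1.0, -1.0], [0.5, 0.5]],
    "b2": [0.0, 0.0],
    "w3": [[2.0, -1.0]],
    "b3": [0.0],
}

# Features: [frequency in top K, average likelihood, contradiction frequency, many-to-many flag].
strong = link_likelihood([0.9, 0.8, 0.0, 0.0], params)
weak = link_likelihood([0.2, 0.1, 0.6, 1.0], params)
```
]]></preformat>
        <p>With these toy weights, a link that is frequently discovered and rarely contradicted receives a higher likelihood than one with contradicting evidence and a many-to-many relationship.</p>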
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. SemTab 2024 Results</title>
    </sec>
    <sec id="sec-6">
      <title>5. Related Work</title>
      <p>
        Table Understanding is an essential problem in Data Integration and has attracted many studies over
the years. A comprehensive discussion of work related to GRAMS+ can be found in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In this section, we briefly
discuss work related to GRAMS+ in the setting of the SemTab challenge.
      </p>
      <p>
        Most systems participating in SemTab, including GRAMS+, exploit the existing knowledge in
a KG. Typically, they first identify KG entities in a table (CEA) and match the properties of entities
with values in the table to find column types (CTA) and relationships between columns (CPA). The best
performing systems in SemTab such as MTab [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], DAGOBAH [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and others such as KGCode-Tab [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
LinkingPark [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], BBW [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], TorchicTab-Heuristic [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and SemTex [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] improve various aspects of the
pipeline, such as candidate entity retrieval and the scoring functions that rank the matched results, or repeat
the pipeline several times or until reaching equilibrium. These systems often rely on
hand-crafted scoring functions, while GRAMS+ uses distant supervision to learn to classify correct
entities and column relationships. Moreover, GRAMS+ tackles a general setting where we need n-ary
relationships to correctly model data in the tables.
      </p>
      <p>SemTab 2023 and 2024 also include other tasks, such as Table Topic Detection and Matching
Table Metadata to KG. These are not the focus problems of GRAMS+, and we leave them for future
work.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion</title>
      <p>This paper presents the results of GRAMS+, a distantly supervised approach for annotating column types
and relationships of tables, for the SemTab 2024 Accuracy Track. GRAMS+ achieves rank 1 for datasets
on which it was evaluated.</p>
      <p>In future work, we plan to improve the performance of GRAMS+ by jointly predicting column types
and relationships. We also plan to extend GRAMS+ to leverage table context, metadata, and modeling
instructions to support tables whose data do not overlap with a target knowledge graph.</p>
      <p>This material is based upon research supported by the Defense Advanced Research Projects Agency
(DARPA) under Agreement No. HR00112390132 and Contract No. 140D0423C0093. Any opinions,
findings and conclusions or recommendations expressed in this material are those of the authors and
do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA); or its
Contracting Agent, the U.S. Department of the Interior, Interior Business Center, Acquisition Services
Directorate, Division V.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Vu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Knoblock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Shbita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Exploiting distant supervision to learn semantic descriptions of tables with overlapping data</article-title>
          , in:
          <source>The Semantic Web-ISWC 2024: 23rd International Semantic Web Conference, ISWC 2024, November 11-15, 2024, Proceedings 20</source>
          , Springer International Publishing,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>O.</given-names>
            <surname>Hassanzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Abdelmageed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Efthymiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Cutrona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hulsebos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Jiménez-Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khatiwada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Korini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kruit</surname>
          </string-name>
          , et al.,
          <article-title>Results of SemTab 2023</article-title>
          , in:
          <source>CEUR Workshop Proceedings</source>
          , volume
          <volume>3557</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-BERT: Sentence embeddings using siamese BERT-Networks</article-title>
          (
          <year>2019</year>
          ). arXiv:1908.10084.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , I. Yamada,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kertkeidkachorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ichise</surname>
          </string-name>
          , H. Takeda,
          <article-title>SemTab 2021: Tabular data annotation with MTab tool</article-title>
          , http://ceur-ws.org/Vol-3103/paper8.pdf. Accessed: 2023-10-6.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.-P.</given-names>
            <surname>Huynh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chabot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Labbé</surname>
          </string-name>
          , J. Liu,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <article-title>From heuristics to language models: A journey through the universe of semantic table interpretation with DAGOBAH</article-title>
          ,
          <source>Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab)</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , G. Zhang, C. Jiang,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>KGCODE-Tab results for SemTab 2022</article-title>
          , https://ceur-ws.org/Vol-3320/paper5.pdf. Accessed: 2023-10-6.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karaoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Negreanu</surname>
          </string-name>
          , T. Ma, J.-G. Yao,
          <string-name>
            <given-names>J.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gordon</surname>
          </string-name>
          , C.-Y. Lin,
          <article-title>LinkingPark: An integrated approach for semantic table interpretation</article-title>
          , http://ceur-ws.org/Vol-2775/paper7.pdf. Accessed: 2023-10-6.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Shigapov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zumstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamlah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Oberlander</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mechnich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Schumm</surname>
          </string-name>
          ,
          <article-title>bbw: Matching CSV to Wikidata via meta-lookup</article-title>
          , https://madoc.bib.uni-mannheim.de/57386/3/paper2.pdf. Accessed: 2023-10-6.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>I.</given-names>
            <surname>Dasoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <article-title>TorchicTab: Semantic Table Annotation with Wikidata and Language Models</article-title>
          , in:
          <source>CEUR Workshop Proceedings</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E. G.</given-names>
            <surname>Henriksen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Khorsid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Nielsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Stück</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Sørensen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelgrin</surname>
          </string-name>
          ,
          <article-title>Semtex: A hybrid approach for semantic table interpretation</article-title>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>