Results of GRAMS+ at SemTab 2024

Results of GRAMS+ at SemTab 2024 BinhVu USC Information Sciences Institute

Marina del Rey 90292 CA USA

CraigAKnoblock USC Information Sciences Institute

Marina del Rey 90292 CA USA

FandelLin USC Information Sciences Institute

Marina del Rey 90292 CA USA

Results of GRAMS+ at SemTab 2024 1613-0073 458B70D436E92D611F8C79D67CDB7300 GROBID - A machine learning software for extracting information from scholarly documents SemTab 2024 Semantic Description Semantic Table Interpretation Knowledge Graphs Semantic Web Data Integration

There is an enormous number of tables available on the Web. However, it is difficult to automatically use the tables in data analytic pipelines because of the lack of semantic understanding of their structure and meaning. To address this problem, our approach, GRAMS+, automatically creates semantic descriptions of tables using distant supervision. SemTab is an annual challenge that provides a diverse set of benchmarks for systems that match tabular data with knowledge graphs. In this paper, we present the results of GRAMS+ at SemTab 2024 in the Accuracy Track. The results show that GRAMS+ is scalable and achieves competitive performance in the tasks in which we participated.

Introduction

Matching tabular data to an ontology or a knowledge graph is an essential problem in Data Integration. The task is to annotate types of columns in the tables using classes of the target ontology and relations between columns using the ontology properties. We developed a novel approach, GRAMS+ [1], addressing this problem using distant supervision. The approach leverages the fact that some data in a table will often overlap with data in a knowledge graph (KG), which can be used to discover candidate types and relationships in the table. Then, the approach uses two neural networks (NN) trained with a labeled dataset generated automatically from Wikipedia tables to predict the final column types and relationships.

The Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab) is an annual challenge with the goal of providing benchmarks and evaluations of existing solutions to this problem. In this paper, we present the results of GRAMS+ at the SemTab 2024 challenge focusing on the Accuracy Track. Our approach successfully annotates a very large number of tables and achieves first place on the tasks in which we participated.

The SemTab Challenge

The SemTab 2024 challenge consists of several tracks ranging from semantic table interpretation to dataset assessment and contributions. We focus on the Accuracy Track, which is relevant to our approach. This track contains four matching tasks: (1) the Cell Entity Annotation (CEA) matches a cell to a KG entity, (2) the Column Type Annotation (CTA) assigns a KG class to a column, (3) the Column Property Annotation (CPA) assigns a KG property to the relationship between two columns, and (4) Topic Detection (TD) assigns a KG class to a table. Figure 1 shows an example table annotation.

There are two types of tables in this track: horizontal tables (or relational tables) and entity tables. A horizontal table is a grid where each row represents an entity and each column shares the same semantic type (e.g., Figure 1). An entity table describes a single entity, where each row contains a property of that entity. Finally, the standard micro precision, recall, and F 1 are used to measure the performance of the participating systems [2].

GRAMS+ Approach

Figure 2 shows the overall approach of GRAMS+. It starts by finding KG entities that are mentioned in a table. Then, we use a neural network (NN) to compute the scores of candidate entities of each table cell. The NN model is trained with a labeled dataset automatically generated from Wikipedia tables. Using the discovered candidates and their scores, we predict column types (CTA) and column relationships (CPA). We generate the labeled dataset by leveraging the hyperlinks inside the Wikipedia tables to find corresponding Wikidata entities and predict columns' relation-ships based on the linked entities. We remove context-inconsistent hyperlinks by first automatically assigning a type to each column based on the most common type of its entities. Then, we employ a blocklist to remove all links in a column if the column header is incompatible with the predicted column types. The blocklist is constructed by manually verifying headers that appeared in multiple predicted types. As our approach is detailed in [1], the remainder of this section provides a brief overview of each component in GRAMS+, along with any changes to fit the SemTab 2024 challenge.

Entity Linking

Following typical entity linking (EL) systems, our EL approach consists of three main steps: (1) detect the entity columns, which are the cells that will be linked; (2) retrieve candidate entities for each cell; and (3) compute the candidates' likelihood.

For step 1, we directly use the target entity columns provided in SemTab's datasets instead of running the entity detection. To retrieve candidate entities, GRAMS+ combines multiple search strategies such as using public Wikidata Search API, keyword search using ElasticSearch, and fuzzy search using SymSpell. Given the huge number of tables in the Wikidata Tables dataset in Round 2 (78,745 tables), we cannot use the public Wikidata API to search and only use the two later strategies.

To compute the candidates' likelihood, we use a two-hidden-layer perceptron with RELU activations. It is trained using the auto-label dataset with the following groups of features:

Surface Features include four string similarity functions between a cell and an entity name: Levenshtein, Jaro-Winkler, Monge Elkan, and Generic Jaccard.

Entity-context Similarity Features capture the coherence between a candidate and the surrounding context of a cell. GRAMS+ uses two context similarity features: the weighted dot product of the column header and the candidate description, and the number of cells matched with the candidate's property divided by a large constant representing the maximum number of columns in a table (e.g., 20) for rescaling. The embeddings are computed from a Sentence Transformer model [3] 1 , and the weights of embedding dimensions are learnable parameters. Note that GRAMS+ trains two entity linking models for tables with and without headers. Because tables from the SemTab datasets do not have column headers, GRAMS+ uses the model trained on tables without headers.

Entity Prior Features bias the predictions toward popular entities. Currently, we use the normalized log page rank of a candidate as the prior feature. The normalized log page rank of an entity 𝑒 is calculated as follows:

log(pagerank(𝑒)) − min 𝑒 ′ ∈ℰ log(pagerank(𝑒 ′ )) max 𝑒 ′ ∈ℰ log(pagerank(𝑒 ′ )) − min 𝑒 ′ ∈ℰ log(pagerank(𝑒 ′ ))

where ℰ is the set of entities in KG, pagerank(𝑒) is the pagerank of an entity 𝑒.

Column Type Prediction

To predict the type of a column, we use a greedy algorithm that first selects the type with the highest score from the set of types directly found in the candidate entities of a column. Then, it iteratively refines the prediction by replacing it with an ancestor type within 𝑑 distance of the directed types if the score difference is larger than a specific threshold 𝛿 until 𝑑 reaches the maximum chosen distance (max_distance). The score of a type is computed by summing the maximum likelihood of the candidate entities of the type for each cell and then dividing by the number of rows. We use the same threshold (𝛿 = 0.1) and maximum distance (max_distance = 2) as in [1].

Column Relationship Prediction

To predict the relationship of a column, GRAMS+ first constructs a candidate graph containing potential relationships between columns. Then, GRAMS+ uses a classifier to predict the likelihood of each link in the graph. As the SemTab challenge provides pairs of target columns for predictions, we directly use the most likely relationships between target columns as the final predictions.

The classifier employed to predict the likelihood of links is also a two-hidden-layer perceptron with RELU activations. It is trained on the auto-label dataset with features such as the relative frequency of discovering the link from top K entities, the average link likelihood, the relative frequency of finding contradicting information between the table data and KG data, and whether there is a many-to-many relationship between the source and target of the link.

SemTab 2024 Results

Table 1 reports the performance of GRAMS+ on the Wikidata Tables datasets. We cannot run GRAMS+ on the tBiodiv and tBiomed datasets because the values of the subject columns', which contain the main entities, were anonymized. Since the names are changed, these datasets focus on a different aspect of the problem, which is identifying the anonymous entities. This is not the focus of GRAMS+, and we leave it for future work.

At the time of writing the paper, GRAMS+ achieves first place among the participants on the Wikidata Tables datasets. The two datasets, in total, have approximately 109,000 tables. This shows that GRAMS+ is scalable and can handle a large number of tables.

Related Work

Table Understanding is an essential problem in Data Integration and has attracted many studies over the years. A comprehensive related work to GRAMS+ can be found in [1]. In this section, we briefly discuss work related to GRAMS+ in the setting of the SemTab challenge.

Most systems participating in the SemTab, including GRAMS+, exploit the existing knowledge in a KG. Typically, they first identify KG entities in a table (CEA) and match the properties of entities with values in the table to find column types (CTA) and relationships between columns (CPA). The best performing systems in SemTab such as MTab [4], DAGOBAH [5], and others such as KGCode-Tab [6], LinkingPark [7], BBW [8], TorchicTab-Heuristic [9], and SemTex [10] improve various aspects of the pipeline such as candidate entity retrieval, scoring functions to rank the matched results, or repeat the pipeline several times or until reaching equilibrium. Compared to GRAMS+, they often rely on hand-crafted scoring functions, while GRAMS+ uses distant supervision to learn to classify correct entities and column relationships. Moreover, GRAMS+ tackles a general setting where we need n-ary relationships to correctly model data in the tables.

The SemTab 2023 and 2024 also include other tasks, such as Table Topic Detection and Matching Table Metadata to KG. These are not the focus problems of GRAMS+, and we leave them for future work.

Conclusion

This paper presents the results of GRAMS+, a distant supervised approach for annotating column types and relationships of tables, for the SemTab 2024 Accuracy Track. GRAMS+ achieves rank 1 for datasets on which it was evaluated.

In future work, we plan to improve the performance of GRAMS+ by jointly predicting column types and relationships. We also plan to extend GRAMS+ to leverage table context, metadata, and modeling instructions to support tables without overlapping data to a target knowledge graph.

SemTab'24: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching 2024, co-located with the 23rd International Semantic Web Conference (ISWC), November 11-15, 2024, Baltimore, USA Envelope binhvu@isi.edu (B. Vu); knoblock@isi.edu (C. A. Knoblock); fandel.lin@usc.edu (F. Lin) GLOBE https://binh-vu.github.io/ (B. Vu) Orcid 0000-0001-5808-9288 (B. Vu); 0000-0002-6371-4807 (C. A. Knoblock); 0000-0001-7024-2476 (F. Lin)

Figure 1 :1Figure 1: An example of a table with annotation

Figure 2 :2Figure 2: Overall approach of GRAMS+

Table 11Performance of GRAMS+ on CPA and CTA tasks. Precision and F 1 scores are reported in percentageDatasetCPACTAF 1PrecisionRecallRankF 1PrecisionRecallRankWikidata Tables round 189.898.882.30192.992.992.91Wikidata Tables round 289.999.282.19195.695.695.61

We use the pretrained all-mpnet-base-v2 model.

Acknowledgements

This material is based upon research supported by the Defense Advanced Research Projects Agency (DARPA) under Agreement No. HR00112390132 and Contract No. 140D0423C0093. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA); or its Contracting Agent, the U.S. Department of the Interior, Interior Business Center, Acquisition Services Directorate, Division V.

Exploiting distant supervision to learn semantic descriptions of tables with overlapping data BVu CAKnoblock BShbita FLin The Semantic Web-ISWC 2024: 23th International Semantic Web Conference, ISWC 2024 Springer International Publishing November 11-15, 2024. 2024 Proceedings 20 OHassanzadeh NAbdelmageed VEfthymiou JChen VCutrona MHulsebos EJiménez-Ruiz AKhatiwada KKorini BKruit CEUR Workshop Proceedings 2023 3557 Results of semtab 2023 NReimers IGurevych arXiv:1908.10084 Sentence-BERT: Sentence embeddings using siamese BERT-Networks 2019 PNguyen IYamada NKertkeidkachorn RIchise HTakeda Tabular data annotation with MTab tool 2021. 2023-10-6 SemTab From heuristics to language models: A journey through the universe of semantic table interpretation with DAGOBAH V.-PHuynh YChabot TLabbé JLiu RTroncy Semantic Web Challenge on Tabular Data to Knowledge Graph Matching SemTab 2022 XLi SWang WZhou GZhang CJiang THong PWang KGCODE-Tab results for SemTab 2022 2023-10-6 SChen AKaraoglu CNegreanu TMa J.-GYao JWilliams AGordon C.-YLin LinkingPark: An integrated approach for semantic table interpretation 2023-10-6 RShigapov PZumstein JKamlah LOberlander JMechnich ISchumm bbw: Matching CSV to wikidata via meta-lookup 2023-10-6 TorchicTab: Semantic Table Annotation with Wikidata and Language Models IDasoulas DYang XDuan ADimou CEUR Workshop Proceedings 2023 EGHenriksen AMKhorsid ENielsen AMStück ASSørensen OPelgrin Semtex: A hybrid approach for semantic table interpretation 2023