<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Leveraging GPT Models For Semantic Table Annotation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jean Petit Bikim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carick Atezong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Azanzi Jiomekong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Allard Oelen</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gollam Rabby</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jennifer D'Souza</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sören Auer</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Yaounde I</institution>
          ,
          <addr-line>Yaounde</addr-line>
          ,
          <country country="CM">Cameroon</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>L3S Research Center, Leibniz University Hannover</institution>
          ,
          <addr-line>Hanover</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>TIB - Leibniz Information Centre for Science and Technology</institution>
          ,
          <addr-line>Hannover</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper outlines our contribution to the Accuracy Track and the Semantic Table Interpretation (STI) &amp; Large Language Models (LLMs) track of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab). Our approach involves using LLMs to address the various tasks presented in the challenge. Specifically, we employed zero-shot and few-shot prompting techniques for most of the tasks, which facilitated the LLMs' ability to interpret and annotate tabular data with minimal prior training. For the Column Property Annotation (CPA) task, we took a different approach by applying a set of predefined rules, tailored to the structure of each dataset. Our method achieved notable results, with an F1-score exceeding 0.92, demonstrating the effectiveness of LLMs in tackling the SemTab challenge. These results suggest that LLMs hold significant capabilities as a robust solution for semantic table annotation and knowledge graph matching, highlighting their potential to advance the field of semantic web technologies.</p>
      </abstract>
      <kwd-group>
<kwd>Tabular Data</kwd>
        <kwd>Semantic Table Annotation</kwd>
        <kwd>Semantic Table Interpretation</kwd>
        <kwd>Knowledge Graph</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>SemTab</kwd>
        <kwd>Prompt Engineering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Applying GPT-3 for Semantic Table Annotation</title>
      <p>This section details the methodology we employed during the SemTab’24 challenge to address the
various tasks set by the organizers. The challenge involved multiple stages, each with distinct objectives
requiring customized strategies. In Section 2.1, we present a comprehensive overview of the SemTab’24
challenge, outlining its goals, structure, and key requirements. Following that, Section 2.2 delves into the
specific approach we implemented to tackle the challenge’s diverse tasks, including data processing,
LLM selection, and performance optimization. Each component of our approach was carefully designed
to align with the challenge’s demands while maximizing accuracy and efficiency. Overall, our strategy
reflects a combination of innovative techniques and established methods, ensuring robust results across
all tasks.</p>
      <sec id="sec-2-1">
        <title>2.1. Overview of the Challenge</title>
<p>The SemTab challenge [6], as described by the organizers, focuses on benchmarking datasets and
systems for semantic table annotation. The primary goal of this challenge is to assess and improve the
capabilities of automated systems in interpreting and annotating structured data, such as tables, by
linking them to relevant knowledge graphs (KGs). The SemTab challenge serves as an important platform for evaluating
advancements in semantic technologies and encouraging the development of novel approaches to
table annotation. Participants are required to apply their techniques across diverse tasks and datasets,
reflecting real-world scenarios. By setting standardized evaluation metrics and promoting reproducible
results, the SemTab challenge plays a crucial role in advancing the field of semantic data annotation.</p>
        <sec id="sec-2-1-1">
          <title>2.1.1. SemTab Challenge Tracks</title>
          <p>This year, the SemTab challenge introduced five distinct tracks, each designed to focus on specific aspects
of table annotation: the STI &amp; LLMs track, the accuracy track, the dataset track, the metadata-to-KG
track, and the IsGold? track. The STI &amp; LLMs track, alongside the accuracy track, includes a series
of critical tasks that highlight key table annotation processes, as illustrated in Fig. 1. The main tasks
within these tracks are as follows:
• Column Entity Annotation (CEA): This task involves linking the elements in a table’s cells to
their corresponding entities in a KG. For example, in Fig. 1, the entity "Kelso Township" in Table
(a) is matched to the QID "Q6386554" in Wikidata.
• Column Type Annotation (CTA): This task requires identifying the most specific semantic
type to be assigned to a column in the table. For instance, in Table (a) of Fig. 1, the Wikidata
entity type for "Kelso Township" and "Ohio Township" has the QID "Q17201685" (township of
Indiana).
• Column Property Annotation (CPA): The objective here is to determine the property within
the KG that links two columns in a table. For example, in Table (a) of Fig. 1, the Wikidata property
that connects columns col0 and col1 is P2044 (elevation above sea level).
• Table Topic Detection (TD): This task focuses on assigning an overarching semantic type to an
entire table by identifying its primary subject within the KG. For instance, the Wikidata entity
that describes the topic of Table (b) in Fig. 1 has the QID Q16823610 (Blue Christmas).
• Row Annotation (RA): In this task, participants must link entire rows in the table to the
corresponding entities in the KG. For example, the first row of Table (c) in Fig. 1 has the Wikidata
QID "Q26689963".</p>
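        <p>To make these expected outputs concrete, the Fig. 1 examples quoted above can be gathered into a small sketch; the dictionary layout is our own illustration, not the challenge's submission format:</p>

```python
# Illustrative outputs for the five SemTab tasks, using the Fig. 1
# examples from the text. The layout is a sketch, not the official
# submission format of the challenge.
annotations = {
    # CEA: a table cell is linked to a Wikidata entity
    "CEA": {("table_a", "row 1", "col0"): "Q6386554"},   # "Kelso Township"
    # CTA: a column is assigned its most specific semantic type
    "CTA": {("table_a", "col0"): "Q17201685"},           # township of Indiana
    # CPA: a pair of columns is linked by a property
    "CPA": {("table_a", "col0", "col1"): "P2044"},       # elevation above sea level
    # TD: a whole table is assigned a topic entity
    "TD": {"table_b": "Q16823610"},                      # Blue Christmas
    # RA: a whole row is linked to an entity
    "RA": {("table_c", "row 1"): "Q26689963"},
}
```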
          <p>These tasks, while diverse, collectively assess the robustness and flexibility of participating systems in
accurately interpreting and annotating tabular data. Each track is designed to target different challenges
faced in real-world applications, ensuring that systems are tested comprehensively across a wide range
of scenarios.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>2.1.2. SemTab Datasets</title>
          <p>Our focus on semantic table annotation led us to benchmark various datasets from the SemTab challenges
published since 2019², allowing us to establish a system that adapts to different datasets.</p>
          <p>Table 1 provides a detailed overview of the datasets we employed for the CEA task. The datasets
vary in size, complexity, and domain coverage, offering a comprehensive range of challenges for
CEA systems. The datasets tfood [7] (entity, horizontal) and WikidataTableR1 from the 2023 edition,
along with Semantic_annotation³ (a dataset automatically constructed from 15,000 entities on Wikidata
retrieved through API queries and their descriptions as context), were primarily used before the
challenge. They served as the foundation for our various experiments and also enriched our training
data during the actual challenge phase. Additionally, the training data contained in tbiomed [8], tbiodiv
[9] and SuperSemtab24 [10] were used to further enhance our models.</p>
          <p>For the CTA, CPA, RA, and TD tasks, we used the datasets proposed by the challenge organizers
for the 2024 edition. These datasets cover a diverse range of domains and tasks, which allows for a
more comprehensive evaluation of different semantic table annotation techniques. Table 2 summarizes
the statistics of these datasets, indicating the number of valid and test data for each task. Each dataset
provides both validation and test sets to ensure rigorous evaluation and to facilitate fine-tuning during
the development process.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Fine-tuning GPT-3 for LLM and Accuracy Track</title>
        <p>In this experiment, we focused on leveraging the capabilities of the GPT-3 model, which contains
175 billion parameters, for addressing various semantic table annotation tasks. Fine-tuning LLMs like
GPT-3 can be approached in two main ways: probing and prompt engineering. Probing involves deeper
adjustments of the LLM's weights for task-specific learning, while prompt engineering optimizes the
input format to guide the model's responses. For our experiments, we primarily relied on prompt
engineering techniques.
² https://orkg.org/comparison/R642266
³ https://huggingface.co/datasets/yvelos/semantic_annotation</p>
        <p>[Table 1: overview of the CEA datasets, covering WikidataTableR1, tfood Entity, tfood Horizontal, Semantic_annotation, tbiodiv entity and horizontal, tbiomed entity and horizontal, and SuperSemtab24, with the year and statistics of each.]</p>
        <p>Specifically, few-shot prompting was employed to address the CEA task within the accuracy track, as
well as the task in the LLM track. Few-shot prompting allows the model to learn patterns from a small
set of examples provided during inference. On the other hand, we adopted zero-shot prompting for the
CTA, RA, and TD tasks. Zero-shot prompting does not require any training examples; instead, it relies
solely on the LLMs pre-trained knowledge to interpret the prompts. To facilitate these approaches,
the datasets were structured such that the different SemTab tasks could be effectively interpreted and
solved by GPT-3.</p>
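        <p>The two prompting regimes can be sketched as follows; the prompt wording is our own illustration under assumed templates, not the exact prompts used in the challenge:</p>

```python
def zero_shot_prompt(instruction, context, item):
    # Zero-shot: only the instruction and the item to annotate;
    # the model relies entirely on its pre-trained knowledge.
    return f"{instruction}\nTable context: {context}\nInput: {item}\nAnswer:"

def few_shot_prompt(instruction, examples, context, item):
    # Few-shot: a handful of solved (input, answer) pairs precede the
    # query, letting the model pick up the pattern at inference time.
    shots = "\n".join(f"Input: {i}\nAnswer: {a}" for i, a in examples)
    return f"{instruction}\n{shots}\nTable context: {context}\nInput: {item}\nAnswer:"

# CEA framed as few-shot, reusing the Fig. 1 example from the text
prompt = few_shot_prompt(
    "Link the table cell to its Wikidata entity.",
    [("Kelso Township", "Q6386554")],
    "US townships and their elevation",
    "Ohio Township",
)
```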
        <p>For the CPA task, instead of using GPT-3, we used a symbolic rule-based method. The CPA task often
requires precise identification of relationships between table columns, which can be more effectively
handled by deterministic rules. This hybrid strategy allowed us to exploit the strengths of both LLMs
and symbolic methods.</p>
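        <p>A minimal sketch of such a rule-based CPA component is shown below; the rules and properties are hypothetical stand-ins, since the actual rule set was tailored to each dataset and is not reproduced here:</p>

```python
import re

# Hypothetical, ordered rules: each pairs a predicate on the object
# column's values with a Wikidata property. A real rule set would be
# tuned per dataset.
CPA_RULES = [
    (lambda vals: all(re.fullmatch(r"\d{4}", v) for v in vals), "P571"),             # year-like: inception
    (lambda vals: all(re.fullmatch(r"\d+(\.\d+)?\s*m?", v) for v in vals), "P2044"), # numeric/metres: elevation
]

def annotate_cpa(object_column, default=None):
    """Return the property linking the subject column to object_column."""
    for matches, prop in CPA_RULES:
        if matches(object_column):
            return prop
    return default
```

The rule ordering is itself a design choice: a four-digit elevation value would hit the year rule first, which illustrates the kind of dataset-specific tailoring the text refers to.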
        <p>The architecture used in this experiment is illustrated in Fig. 2. It involves several key modules, each
serving a specific function in the overall system:
• Pre-processing Module: This module takes as input a set of tables and applies various
cleaning operations such as removing blank spaces, stripping HTML tags, and eliminating special
characters. An example of how a cell is processed through this module is shown in Fig. 3.
• table2vect Module: The table2vect module, as described by Algorithm 1, processes the cleaned
dataset and generates task-specific vectors for CEA, CTA, CPA, RA, and TD tasks. These vectors
are structured based on the requirements of each annotation task. Fig. 4 shows an example of the
table2vect process.
• Table Dataset Module: This module accepts a vector as input, along with a target file if provided,
and then maps the vector elements to their corresponding targets. The output is a new table that
represents our dataset.
• Prompt Generation Modules (ceaPrompt, ctaPrompt, raPrompt, tdPrompt): These
modules transform the rows of Table dataset into a set of questions and answers tailored for each
task. For example, in the CEA task, a table cell and its context are framed as a question, while
the corresponding entity serves as the answer. Examples of these question-answer pairs are
embedded in Fig. 6.
• Fine-Tuning Base GPT model: The generated questions and answers are used to fine-tune
GPT-3 or GPT-4, ensuring that the model can accurately perform the semantic annotation tasks
across different datasets.</p>
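        <p>The cleaning operations of the pre-processing module can be sketched as below; the exact rules are not published, so the regular expressions are our assumptions:</p>

```python
import re

def clean_cell(raw):
    """Cleaning steps of the pre-processing module: strip HTML tags,
    eliminate special characters, and remove redundant blank spaces."""
    # \x3c and \x3e are hex escapes for the angle brackets of HTML tags
    text = re.sub(r"\x3c[^\x3e]+\x3e", " ", raw)
    text = re.sub(r"[^\w\s.,'-]", " ", text)   # eliminate special characters
    text = re.sub(r"\s+", " ", text)           # collapse blank spaces
    return text.strip()
```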
        <p>This modular architecture allows for a flexible and scalable approach to semantic table annotation,
enabling the system to adapt to different tasks by simply modifying the input prompts and vectors.
While GPT-3 handles most of the annotation tasks, the use of a rule-based approach for CPA underscores
the importance of integrating symbolic reasoning in cases where relationship extraction is critical.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Annotating the Test Data Using the Fine-tuned Model</title>
        <p>After fine-tuning the GPT-3 model for semantic table annotation tasks, the resulting model was employed
to annotate the test data. The annotation workflow closely follows the first three steps of the
fine-tuning process, as outlined in Fig. 2. This process is structured to handle the different annotation tasks
efficiently by leveraging the pre-processing pipeline and vector generation approach discussed earlier.</p>
        <p>The annotation process begins with inputting the set of tables to be annotated. These tables go
through a pre-processing phase, which involves removing irrelevant characters, normalizing formats,
and cleaning the data to ensure consistency. Following this, the table2vect algorithm is applied to
convert the tables into a set of task-specific vectors. These vectors capture the essential elements needed
for annotation, such as table cells and their context. However, for all tasks, the vectors include a
URI cell that is initially left blank. This placeholder will be populated with the correct URI during the
inference stage, using the fine-tuned GPT-3.</p>
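        <p>As a sketch, with illustrative field names rather than the system's actual schema, such a vector and its inference-time completion might look like:</p>

```python
# A CEA-style vector: context fields plus a URI slot left blank,
# to be filled by the fine-tuned model at inference time.
cea_vector = {
    "table": "table_a",
    "row": 1,
    "column": "col0",
    "cell": "Kelso Township",
    "context": ["col1: 339"],  # neighbouring cells give the model context
    "uri": "",                 # placeholder populated during inference
}

def fill_uri(vector, predicted_qid):
    # The inference step writes the model's prediction into the blank slot.
    completed = dict(vector)
    completed["uri"] = predicted_qid
    return completed
```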
        <p>The fine-tuned LLM, when performing inference, processes the task-specific prompts generated
from these vectors and fills in the blank spaces with the corresponding URIs or semantic labels. For
example, in the CEA task, the model identifies the most relevant entity from a knowledge graph, while
in the CTA task, it assigns the appropriate semantic type. The transformation from vectors to answers
is handled seamlessly by GPT-3, which was trained on similar tasks during fine-tuning.</p>
        <p>It is important to note that while GPT-3 was primarily used for tasks such as CEA, CTA, RA, and TD,
the CPA task required a different approach. The CPA task involves determining the property that links
two columns in a table, a challenge that often benefits from deterministic logic rather than generative
language models. Therefore, a rule-based method was applied to solve this task, as illustrated in Fig. 7.
This rule-based approach relies on predefined relationships and patterns in the data, making it highly
effective for capturing the structured nature of properties in knowledge graphs.</p>
        <p>By integrating both the generative power of GPT-3 for complex annotation tasks and symbolic
methods for rule-based tasks, this hybrid architecture ensures a robust and adaptable annotation
pipeline. The resulting annotated datasets maintain high accuracy across all tracks, leveraging the
strengths of both AI-driven models and traditional symbolic techniques.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>This section presents the evaluation results for the SemTab’24 challenge, focusing on both the STI &amp;
LLMs track (see Section 3.1) and the accuracy track (see Section 3.2). The outcomes are discussed in
detail, highlighting the strengths and limitations observed during the testing phase.</p>
      <sec id="sec-3-1">
        <title>3.1. LLMs Track</title>
        <p>In the LLMs track, we fine-tuned GPT-3 as outlined in Section 2. GPT-3 was also evaluated on the
test data by the challenge organizers. Table 3 presents the results, focusing on the CEA task.</p>
        <p>The results in Table 3 demonstrate the LLM's ability to perform entity annotation tasks with high
accuracy. The fine-tuned LLM achieved an F1-score of 0.899 for the CEA task, which aligns closely
with its precision score, indicating a balanced performance. The success in this track can be attributed
to effective few-shot prompting and careful data pre-processing, which allowed the LLM to grasp the
complex semantic relationships present in the tables.</p>
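        <p>For reference, the F1-score used throughout is the harmonic mean of precision P and recall R:</p>

```latex
F_1 = \frac{2 \cdot P \cdot R}{P + R}
```

        <p>Since the harmonic mean equals its arguments only when they coincide, an F1-score of 0.899 that closely matches the precision implies a recall of roughly the same magnitude, consistent with the balanced performance noted above.</p>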
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Accuracy Track</title>
        <p>For the accuracy track, the results cover a broader range of tasks, including CEA, CTA, CPA, RA, and
TD, across multiple datasets. The results are summarized in Table 4.</p>
        <p>During the challenge, the fine-tuned model was primarily evaluated on the following datasets and
tasks:
• CEA: WikidataTableR1, tbiodiv Entity, tbiodiv Horizontal, tbiomed Entity, tbiomed Horizontal
• CTA: WikidataTableR1, tbiodiv Horizontal, tbiomed Horizontal</p>
        <p>The results indicate that the model performed well on the CEA task, particularly for the tbiodiv
Entity and tbiomed Entity datasets, achieving an F1-score above 0.92. The tbiodiv Horizontal dataset,
with its unique table structure, saw a slightly lower performance, with an F1-score of 0.74. This
decline is likely due to the complexity introduced by the horizontal orientation of the data, which poses
challenges in capturing relationships between entities.</p>
        <p>For the CTA task, the model delivered strong results with an F1-score greater than 0.7 for the
WikidataTableR1 and tbiomed Horizontal datasets, while scoring 0.648 for tbiodiv Horizontal. The TD
task showed a range of F1-scores, from 0.78 for tbiodiv Horizontal to 0.621 for tbiomed Horizontal,
reflecting the varying difficulty levels of semantic topic detection across datasets.</p>
        <p>The RA task produced a high F1-score of 0.719 for tbiodiv Horizontal but a lower F1-score of
0.411 for tbiomed Horizontal. The disparity in performance for these tasks can be attributed to the
limited availability of high-quality training data, which likely hindered the model’s ability to generalize
effectively.</p>
        <p>Lastly, the CPA task suffered from incomplete test data runs, particularly for the WikidataTableR1
dataset, where only 80% of the test data was covered. The incomplete data coverage explains the lower
F1-score, as the model had less data to work with, leading to reduced precision and recall.</p>
        <p>Overall, while the results show promising performance in several areas, they also highlight the
challenges posed by diverse table structures, limited training data, and incomplete test coverage.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>This paper presented an exploration of utilizing GPT-3 for addressing the SemTab challenge, which
involves a series of complex tasks related to entity annotation and classification. To approach this,
we employed the base GPT-3 model and refined its capabilities through both few-shot and zero-shot
prompting techniques. The model demonstrated promising performance when applied to the complete
dataset, achieving commendable results across various tasks. Specifically, for the CEA task, we observed
an impressive F1-score exceeding 0.92 when the model was tested on the tbiodiv Entity and tbiomed
Entity datasets. This indicates a high level of accuracy and reliability in the model’s ability to correctly
annotate entities within these datasets. However, for other tasks such as CTA and TD, the F1-score
ranged between 0.6 and 0.8. This variability in performance can be attributed to the limited size of
the training data, which constrained the model’s ability to fully generalize and optimize its predictions
across these tasks. Moving forward, future work will focus on completing the remaining annotations
that were not finalized before the deadline of this study. Once these annotations are completed, the
results will be submitted to the SemTab challenge organizers for formal evaluation. This subsequent
evaluation will provide further insights into the model’s performance and its applicability to similar
challenges in the field.
</p>
      <p>[5] S. Schulhoff et al., The prompt report: A systematic survey of prompting techniques, arXiv (2024). URL: https://arxiv.org/abs/2406.06608.
[6] O. Hassanzadeh, Semantic tabular data annotation to knowledge graph matching, in: SemTab challenge, 2024. URL: https://sem-tab-challenge.github.io/2024/.
[7] N. Abdelmageed, tfood: Semantic table annotations benchmark for food domain, Zenodo (2023). URL: https://zenodo.org/records/10048187.
[8] N. Abdelmageed, tbiomed: Semantic table annotations benchmark for biomedical domain, Zenodo (2024). URL: https://zenodo.org/records/10996334.
[9] N. Abdelmageed, tbiodiv: Semantic table annotations benchmark for biodiversity domain, Zenodo (2024). URL: https://zenodo.org/records/10996688.
[10] M. Cremaschi, SemTab 24: Semantic table annotations benchmark for LLM-based approaches, Zenodo (2024). URL: https://zenodo.org/records/11031987.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G. C.</given-names>
            <surname>Azanzi Jiomekong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hippolyte</given-names>
            <surname>Tapamo</surname>
          </string-name>
          ,
          <article-title>An ontology for tuberculosis surveillance system</article-title>
          ,
          <source>SpringerLink</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y. L. Ruizhe</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <article-title>f-kgqa: A fuzzy question answering system for knowledge graphs</article-title>
          ,
          <source>ScienceDirect</source>
          (
          <year>2024</year>
          ). URL: https://www.sciencedirect.com/science/article/abs/pii/S016501142400263X.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jiomekong</surname>
          </string-name>
          ,
          <article-title>Towards an approach based on knowledge graph refinement for tabular data to knowledge graph matching</article-title>
          ,
          <source>CEUR-WS</source>
          Vol-
          <volume>3320</volume>
          (
          <year>2022</year>
          ). URL: https://ceur-ws.org/Vol-3320/paper12.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Paulheim</surname>
          </string-name>
          ,
          <article-title>Knowledge graph refinement: A survey of approaches and evaluation methods</article-title>
          ,
          <source>Semant. Web</source>
          <volume>8</volume>
          (
          <year>2017</year>
          )
          <fpage>489</fpage>
          -
          <lpage>508</lpage>
          . URL: http://dx.doi.org/10.3233/SW-160218.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>