<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Matching using LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dylan Li Tin Yue</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ernesto Jimenez-Ruiz</string-name>
          <email>ernesto.jimenez-ruiz@city.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>City St George's, University of London</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <fpage>11</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper investigates the use of a Large Language Model (LLM) to match tabular data with knowledge graphs. The system participated in the STI vs. LLMs 2024 SemTab Track, which prompts a model to perform the cell entity annotation (CEA) task. The study covers the processes from data cleaning and matching to its execution in the cloud, while relying on a Lookup API to generate a list of candidates. This project not only contributes to the understanding of the applications of Large Language Models in tabular data annotations but also lays the groundwork for future research in the field.</p>
      </abstract>
      <kwd-group>
        <kwd>Tabular Data Annotation</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Large Language Model</kwd>
        <kwd>SemTab Challenge</kwd>
        <kwd>Entity Matching</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Over the past few decades, the use of data has increased dramatically, and a large portion of this
data is structured as tabular data. This increase is largely due to the growing trend of Open Data
publication, which makes data increasingly accessible to the public.</p>
      <p>However, the knowledge extracted from this data has not grown in proportion to the exponential
increase in data volume. This is mostly because the tools and expertise needed in this area are limited,
and the datasets involved are usually large and complex. Tabular data is widely available and accessible on
the internet and in data silos and data lakes. These tables are crucial, as they are in high demand for activities
such as data analytics, data mining and data integration.</p>
      <p>
        However, this form of data often lacks the contextual information that users and
machines require to interpret it properly. To fully exploit its potential, it is necessary to understand its semantic
structure and underlying meaning. Even though Knowledge Graphs [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] address this issue by providing
insights into the meaning of the data, the annotation process is still mostly manual.
      </p>
      <p>With the rise of ChatGPT, Large Language Models (LLMs) have gained popularity and are increasingly
changing important aspects of our lives. This type of Artificial Intelligence (AI) is currently trained
on the order of trillions of tokens. As a result, it appears to recognize and understand human text
and can act as a large source of (parametric) knowledge. The value and benefits of LLMs are being widely
acknowledged in multiple tasks, which leads to the following research question:</p>
      <p>“To what extent can LLMs be applied in the context of matching tabular data to knowledge graphs?”</p>
      <p>The tabular format is one of the most popular ways for organizations to store data.</p>
      <p>
        Tabular data to Knowledge Graph (KG) matching is the process of mapping elements of a table to their
corresponding semantic tags within a KG such as Wikidata [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and DBpedia [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The SemTab challenge
[
        <xref ref-type="bibr" rid="ref4">4, 5</xref>
        ] has contributed to the systematic evaluation of systems tackling this task, also known as
Semantic Table Interpretation (STI).
      </p>
      <p>Tables are often of poor data quality because of incomplete or missing metadata. Understanding their
semantic meaning and context can therefore be difficult when metadata such as proper table titles,
relationships between the different data elements or column names are unavailable. Another challenge
is that cell values can be noisy, containing typos, abbreviations and ambiguous names that cannot
be matched with the KG. Beyond these data quality issues, further challenges arise within the
semantic matching process, since relationships between columns are not known and table columns
can represent a more specific or general concept. Moreover, as knowledge is constantly and quickly
evolving in the current world, KGs may not always contain the most up-to-date information needed.</p>
      <p>We have targeted the STI vs. LLMs SemTab 2024 track, which requires the use of only LLMs to
perform the CEA task, i.e., matching a data cell to a KG entity. Round 1 of the challenge uses the
SuperSemTab 24 dataset, which has been Automatically Generated (AG). The test set of this dataset
includes 74,837 cells to be annotated from 4,044 CSV files, each containing an average of five rows of
data. The dataset used for Round 2 is MammoTab 24 [6], which was extracted from 21,149,260 Wikipedia
pages. Unlike Round 1, it consists of 2,500 tables for the training set and 500 tables for the test set.
Both SuperSemTab and MammoTab target Wikidata as the KG.</p>
      <p>Related Work. Based on a recent survey on the semantic interpretation of tabular data [7],
more than 85 systems have been proposed to tackle the tabular data to KG matching problem since
2007. Recently, there has been an increase in approaches relying on the features of pre-trained and large
language models, such as Doduo [8], TURL [9], DAGOBAH SL 2022 [10], TorchicTab [11], Korini and Bizer
[12], TableGPT [13], and TableLlama [14]. In the SemTab 2024 challenge [5], only three systems
participated in the STI vs. LLMs track: CitySTI (ours), TSOTSA [15], and Kepler-aSI [16].</p>
    </sec>
    <sec id="sec-2">
      <title>2. The CitySTI System</title>
      <p>CitySTI combines state-of-the-art LLMs with natural language processing (NLP) techniques to perform
SemTab’s CEA task. It not only matches the data with the appropriate entities but also
performs data cleaning. Figure 1 provides a general overview of the different components of CitySTI,
which are summarised as follows.</p>
      <p>Data cleaning. This component aims at cleaning the input tabular data from noise and correcting
any misspelled words. Figure 2 shows the prompts used to guide the LLM in this process. Note that we
used different prompts in each round of the STI vs. LLMs SemTab 2024 track.
Candidate generator (lookup). The candidates are extracted via the lookup service provided by
the target KG. Given an input query (e.g., the text value of a cell), the lookup service extracts candidates
(partially) matching the query. CitySTI implements an API to access the lookup services of different
KGs and extracts the top-5 candidates for each query (see Figure 3).</p>
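      <p>As an illustration of the candidate-generation step, the sketch below queries Wikidata’s public lookup service (the wbsearchentities endpoint of the MediaWiki API) for the top-5 candidates of a cell value. This is a minimal sketch under our own assumptions; the function names are ours and this is not the exact CitySTI implementation.</p>

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def build_lookup_params(query: str, limit: int = 5) -> dict:
    """Parameters for Wikidata's entity search (lookup) service."""
    return {
        "action": "wbsearchentities",
        "search": query,
        "language": "en",
        "limit": limit,
        "format": "json",
    }

def parse_candidates(payload: dict, limit: int = 5) -> list:
    """Keep only the fields that are later fed into the matching prompt."""
    return [
        {"id": c["id"],
         "label": c.get("label", ""),
         "description": c.get("description", "")}
        for c in payload.get("search", [])[:limit]
    ]

def lookup_candidates(query: str, limit: int = 5) -> list:
    """Return up to `limit` candidate entities for a cell value."""
    resp = requests.get(WIKIDATA_API,
                        params=build_lookup_params(query, limit),
                        timeout=10)
    resp.raise_for_status()
    return parse_candidates(resp.json(), limit)
```

      <p>Each candidate carries an entity ID, label and short description, which is the information the matching prompt needs to disambiguate between entities sharing a label.</p>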
      <p>Matching. This component associates the data cell with its appropriate KG entity. It takes as input
the top-5 candidates from the lookup component and communicates with an LLM to identify the
best choice. The set of candidates is fed into the matching prompt along with the cleaned CSV data
to perform the matching (see Figure 2). If only one candidate is retrieved from the lookup, it is
automatically assigned to its cell without any need for a matching prompt. An entity cache was
also implemented, since values to be annotated often occur repeatedly. It uses a dictionary that
stores the pools of candidates to avoid repeatedly retrieving the KG entity for the same cell value in the
same table. The cache is automatically cleared every 15 seconds to prevent memory overload.</p>
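      <p>The single-candidate shortcut and the periodically cleared entity cache can be sketched as follows. This is an illustrative sketch: the ask_llm callable stands in for the prompt-based matching call and its signature is hypothetical, not the actual CitySTI code.</p>

```python
import time

CACHE_TTL_SECONDS = 15  # the cache is flushed periodically to bound memory use

class EntityCache:
    """Per-table entity cache so repeated cell values are not matched again;
    cleared every CACHE_TTL_SECONDS."""
    def __init__(self, ttl: float = CACHE_TTL_SECONDS):
        self.ttl = ttl
        self._store: dict = {}
        self._last_clear = time.monotonic()

    def get(self, table_id: str, cell: str):
        self._maybe_clear()
        return self._store.get((table_id, cell))

    def put(self, table_id: str, cell: str, entity):
        self._maybe_clear()
        self._store[(table_id, cell)] = entity

    def _maybe_clear(self):
        if time.monotonic() - self._last_clear >= self.ttl:
            self._store.clear()
            self._last_clear = time.monotonic()

def match_cell(table_id, cell, candidates, cache, ask_llm):
    """Pick the best KG entity for a cell.

    `ask_llm(cell, candidates)` is a placeholder for the prompt-based
    matching call (hypothetical signature)."""
    cached = cache.get(table_id, cell)
    if cached is not None:
        return cached
    if len(candidates) == 1:      # single candidate: assign it directly,
        entity = candidates[0]    # no matching prompt is needed
    else:
        entity = ask_llm(cell, candidates)
    cache.put(table_id, cell, entity)
    return entity
```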
      <p>The matching component implements two different approaches, tailored to the specific dataset in
each round of the STI vs. LLMs SemTab 2024 track, as described next. Code for both approaches
is available in this GitHub repository: https://github.com/dylanlty/CitySTI-2024
LLM component - approach round 1. In this approach, the system loops through all the CSV files
in a folder and annotates each data cell, skipping numeric and date values. The system then
feeds the entire table to the GPT-4o-mini model to improve its relevance and accuracy when performing
the data cleaning and matching.</p>
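      <p>The round-1 traversal can be sketched as below. The skip rules for numeric and date-like cells are illustrative guesses at the kind of filtering described, not the exact CitySTI heuristics.</p>

```python
import csv
import re
from pathlib import Path

# A loose date pattern (e.g. 12/05/2020 or 2020-05-12); illustrative only.
DATE_RE = re.compile(r"^\d{1,4}[-/.]\d{1,2}[-/.]\d{1,4}$")

def is_skippable(value: str) -> bool:
    """Numeric and date-like cells are skipped: they are not
    annotated against the KG."""
    v = value.strip()
    if not v:
        return True
    try:
        float(v.replace(",", ""))   # plain numeric value (e.g. "1,000")
        return True
    except ValueError:
        pass
    return bool(DATE_RE.match(v))

def cells_to_annotate(folder: str):
    """Yield (file, row, col, value) for every annotatable cell of every
    CSV file in `folder` (illustrative version of the round-1 loop)."""
    for path in sorted(Path(folder).glob("*.csv")):
        with open(path, newline="", encoding="utf-8") as f:
            for r, row in enumerate(csv.reader(f)):
                for c, value in enumerate(row):
                    if not is_skippable(value):
                        yield path.name, r, c, value
```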
      <p>All the components were run on a Virtual Machine (VM) instance in Google Cloud Platform’s
Compute Engine. Google Cloud Platform was used to process the large volumes of data. This was
essential because matching large volumes of data cells to a KG is very demanding in time and resources.
Before creating a VM instance through the Compute Engine component, it was properly configured to avoid
performance bottlenecks. Since a large amount of memory may be required (e.g., to store the entities
in the dictionary), 8 GB of memory was chosen. The N2 machine series was selected, as it
works well for the kind of task required, where the system only annotates data cells and
performs read-write operations. Additionally, a 10 GB SSD was allocated to store all the input data
and the necessary packages.</p>
      <p>During the implementation of the system, open-source models on HuggingFace such as
Llama-3-8B-Instruct were explored and run on RunPod. The performance of these models did not match that of
those currently provided by Google’s Gemini or OpenAI’s GPT models. They also proved costly to operate
and significantly more expensive than the APIs offered by the other major AI providers (see Section 3).
LLM component - approach round 2. The second round of the challenge introduced some changes
to address issues faced in the previous round and because the MammoTab 24 dataset is more
challenging in terms of the size of the tables. The first change was that the system now reads and
annotates based on the target file (i.e., cells for which there is ground truth) given by the SemTab
challenge instead of processing all potential cells.</p>
      <p>Another change involved processing and providing context in batches instead of
the entire table, together with the use of the Gemini-1.5-flash model. This was done to address the token
usage limits set by the LLM provider. Unlike in Round 1, the datasets are loaded into a
Pandas DataFrame, cleaned in batches of 15 rows at a time, and written to a temporary file. This file is
then sliced into different row ranges to provide the model with some context when matching. The number of rows
provided varies from 4 to 6, depending on the position of the data cell.</p>
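      <p>The round-2 batching and context slicing can be sketched as below with Pandas. The window bounds (two rows before and three after the target cell) are illustrative values consistent with the 4-to-6-row range described, not the exact CitySTI parameters.</p>

```python
import pandas as pd

BATCH_SIZE = 15  # rows cleaned per request (round-2 setting)

def batches(df: pd.DataFrame, size: int = BATCH_SIZE):
    """Split the table into batches of `size` rows for cleaning,
    so each request stays within the provider's token limits."""
    for start in range(0, len(df), size):
        yield df.iloc[start:start + size]

def context_window(df: pd.DataFrame, row: int,
                   before: int = 2, after: int = 3) -> pd.DataFrame:
    """Slice a few rows around the target cell's row to give the model
    local context when matching. The window shrinks near the table edges,
    so the number of rows provided depends on the cell's position."""
    start = max(0, row - before)
    end = min(len(df), row + after + 1)
    return df.iloc[start:end]
```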
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions, Challenges, and Future Work</title>
      <p>This paper presented CitySTI, a system that annotates tabular data with KG entities. The approach
prompts an LLM to clean the data and match the KG entities. The results are satisfactory considering
this is our first participation in the SemTab challenge and our first foray into the realm of LLMs.
Since CitySTI relies only on prompting, it can be further improved and may achieve greater
performance if fine-tuning is implemented. This would remove the
need for large prompts to achieve the desired output and reduce the number of tokens needed. The
prompt structure can be further optimized to enable the system to process several data cells in a single
prompt, which would significantly reduce the number of requests sent. Another aspect of the
system that can be improved is its fault tolerance, i.e., its ability to recover from errors. It should
be able to continue where it left off, which would save a significant
amount of time compared to manually removing or adding files.</p>
      <p>Limitations. While many LLMs are free and open source, OpenAI and Google charge for API access
to their models. The main reason is that these advanced AI models are expensive to maintain and operate,
so they come with usage restrictions that add extra challenges to the task. These restrictions can come
in the form of rate or token limits. Even though Google provides a free tier for its Gemini API, the rate
limit is insufficient to annotate several thousand data cells.</p>
      <p>Additionally, the lookup search for entities on Wikidata is not very flexible, as it is unable to recognize
typos or incorporate context words related to the searched term the way a modern search engine does. The
accuracy could have been significantly improved if the lookup search were more advanced. In the near
future, we plan to explore alternative APIs to access the target KG.</p>
      <p>[5] O. Hassanzadeh, N. Abdelmageed, M. Cremaschi, V. Cutrona, F. D’Adda, V. Efthymiou, B. Kruit, E. Lobo, N. Mihindukulasooriya, N. H. Pham, Results of SemTab 2024, in: SemTab’24: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching 2024, co-located with the 23rd International Semantic Web Conference (ISWC), 2024.</p>
      <p>[6] M. Marzocchi, M. Cremaschi, R. Pozzi, R. Avogadro, M. Palmonari, MammoTab: a giant and comprehensive dataset for Semantic Table Interpretation, in: Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, SemTab 2022, co-located with the 21st International Semantic Web Conference (ISWC), CEUR-WS.org, 2022.</p>
      <p>[7] M. Cremaschi, B. Spahiu, M. Palmonari, E. Jimenez-Ruiz, Survey on Semantic Interpretation of Tabular Data: Challenges and Directions, arXiv preprint arXiv:2411.11891 (2024).</p>
      <p>[8] Y. Suhara, J. Li, Y. Li, D. Zhang, Ç. Demiralp, C. Chen, W.-C. Tan, Annotating columns with pretrained language models, in: Proceedings of the 2022 International Conference on Management of Data, SIGMOD ’22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 1493–1503. URL: https://doi.org/10.1145/3514221.3517906. doi:10.1145/3514221.3517906.</p>
      <p>[9] X. Deng, H. Sun, A. Lees, Y. Wu, C. Yu, TURL: table understanding through representation learning, Proc. VLDB Endow. 14 (2020) 307–319. URL: http://www.vldb.org/pvldb/vol14/p307-deng.pdf. doi:10.5555/3430915.3442430.</p>
      <p>[10] V.-P. Huynh, Y. Chabot, T. Labbé, J. Liu, R. Troncy, From Heuristics to Language Models: A Journey Through the Universe of Semantic Table Interpretation with DAGOBAH, in: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab), CEUR-WS.org, 2022.</p>
      <p>[11] I. Dasoulas, D. Yang, X. Duan, A. Dimou, TorchicTab: Semantic Table Annotation with Wikidata and Language Models, in: SemTab’23: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching 2023, co-located with the 22nd International Semantic Web Conference (ISWC), 2023.</p>
      <p>[12] K. Korini, C. Bizer, Column Type Annotation using ChatGPT, in: Joint Proceedings of Workshops at the 49th International Conference on Very Large Data Bases (VLDB 2023), Vancouver, Canada, August 28 - September 1, 2023, volume 3462 of CEUR Workshop Proceedings, CEUR-WS.org, 2023. URL: https://ceur-ws.org/Vol-3462/TADA1.pdf.</p>
      <p>[13] L. Zha, J. Zhou, L. Li, R. Wang, Q. Huang, S. Yang, J. Yuan, C. Su, X. Li, A. Su, T. Zhang, C. Zhou, K. Shou, M. Wang, W. Zhu, G. Lu, C. Ye, Y. Ye, W. Ye, Y. Zhang, X. Deng, J. Xu, H. Wang, G. Chen, J. Zhao, TableGPT: Towards unifying tables, nature language and commands into one GPT, CoRR abs/2307.08674 (2023). URL: https://doi.org/10.48550/arXiv.2307.08674. doi:10.48550/arXiv.2307.08674. arXiv:2307.08674.</p>
      <p>[14] T. Zhang, X. Yue, Y. Li, H. Sun, TableLlama: Towards open large generalist models for tables, CoRR abs/2311.09206 (2023). URL: https://doi.org/10.48550/arXiv.2311.09206. doi:10.48550/arXiv.2311.09206. arXiv:2311.09206.</p>
      <p>[15] J. P. Bikim, C. Atezong, A. Jiomekong, A. Oelen, G. Rabby, J. D’Souza, S. Auer, Leveraging GPT Models For Semantic Table Annotation, in: SemTab’24: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching 2024, co-located with the 23rd International Semantic Web Conference (ISWC), 2024.</p>
      <p>[16] W. Baazouzi, M. Kachroudi, S. Faiz, Kepler-aSI: Semantic Annotation for Tabular Data, in: SemTab’24: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching 2024, co-located with the 23rd International Semantic Web Conference (ISWC), 2024.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          ,
          <article-title>Knowledge Graphs 2021: A Data Odyssey</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>14</volume>
          (
          <year>2021</year>
          )
          <fpage>3233</fpage>
          -
          <lpage>3238</lpage>
          . URL: http://www.vldb.org/pvldb/vol14/p3233-weikum.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Vrandecic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          ,
          <article-title>Wikidata: a free collaborative knowledge base</article-title>
          ,
          <source>Commun. ACM</source>
          <volume>57</volume>
          (
          <year>2014</year>
          )
          <fpage>78</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kobilarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ives</surname>
          </string-name>
          ,
          <article-title>DBpedia: A Nucleus for a Web of Open Data</article-title>
          , in: The Semantic Web, Springer Berlin Heidelberg,
          <year>2007</year>
          , pp.
          <fpage>722</fpage>
          -
          <lpage>735</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Jimenez-Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hassanzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Efthymiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Srinivas</surname>
          </string-name>
          ,
          <article-title>SemTab 2019: Resources to Benchmark Tabular Data to Knowledge Graph Matching Systems</article-title>
          , in: The Semantic Web: ESWC, Springer International Publishing,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>