<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Matching using LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dylan Li Tin Yue</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ernesto Jimenez-Ruiz</string-name>
          <email>ernesto.jimenez-ruiz@city.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>City St George's, University of London</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <fpage>11</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper investigates the use of a Large Language Model (LLM) to match tabular data with knowledge graphs. The system participated in the STI vs. LLMs 2024 SemTab Track, which prompts a model to perform the cell entity annotation (CEA) task. The study covers the processes from data cleaning and matching to its execution in the cloud, while relying on a Lookup API to generate a list of candidates. This project not only contributes to the understanding of the applications of Large Language Models in tabular data annotations but also lays the groundwork for future research in the field.</p>
      </abstract>
      <kwd-group>
        <kwd>Tabular Data Annotation</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Large Language Model</kwd>
        <kwd>SemTab Challenge</kwd>
        <kwd>Entity Matching</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Over the past few decades, the use of data has increased dramatically, and a large portion of this
data is structured as tabular data. This increase is largely due to the growing trend of Open Data
publication, which makes data increasingly accessible to the public.</p>
      <p>However, the knowledge extracted from this data has not grown in proportion to the exponential
increase in data volume. This is mostly because the tools and expertise needed in this area are limited,
and the datasets involved are usually large and complex. Tabular data is widely available and accessible on
the internet and in data silos and data lakes. These tables are crucial, as they are in high demand for activities
such as data analytics, data mining and data integration.</p>
      <p>
        However, this form of data often lacks the contextual information that users and
machines require to interpret it properly. To fully exploit its potential, it is necessary to understand its semantic
structure and underlying meaning. Even though Knowledge Graphs [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] address this issue by providing
insights into the meaning of the data, the annotation process is still mostly manual.
      </p>
      <p>With the rise of ChatGPT, Large Language Models (LLMs) have gained popularity and are increasingly
changing important aspects of our lives. This type of Artificial Intelligence (AI) is currently trained
on the order of trillions of tokens. As a result, it appears to recognize and understand human text
and can act as a large source of (parametric) knowledge. The value and benefits of LLMs are being widely
acknowledged in multiple tasks, which leads to the following research question:</p>
      <p>“To what extent can LLMs be applied in the context of matching tabular data to knowledge graphs?”</p>
      <p>The tabular format is one of the most popular ways for organizations to store data.</p>
      <p>
        Tabular data to Knowledge Graph (KG) matching is the process of mapping elements of a table to their
corresponding semantic tags within a KG such as Wikidata [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and DBpedia [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The SemTab challenge
[
        <xref ref-type="bibr" rid="ref4">4, 5</xref>
        ] has contributed to the systematic evaluation of systems tackling this task, also known as
Semantic Table Interpretation (STI).
      </p>
      <p>Tables are often of poor data quality because of incomplete or missing metadata. Understanding their
semantic meaning and context can therefore be difficult when metadata such as proper table titles,
relationships between the different data elements or column names are unavailable. Another challenge
is that cell values can be noisy, containing typos, abbreviations and ambiguous names that cannot
be matched with the KG. Beyond these data quality issues, further challenges arise within the
semantic matching process, since relationships between columns are not known and table columns
can represent a more specific or general concept. Moreover, as knowledge is constantly and quickly
evolving in the current world, KGs may not always contain the most up-to-date information needed.</p>
      <p>We have targeted the STI vs. LLMs SemTab 2024 track, which requires the use of only LLMs to
perform the CEA task, i.e., matching a data cell to a KG entity. Round 1 of the challenge uses the
SuperSemTab 24 dataset, which has been Automatically Generated (AG). The test set of this dataset
includes 74,837 cells to be annotated from 4,044 CSV files, each containing an average of five rows of
data. The dataset used for Round 2 is MammoTab 24 [6], which was extracted from 21,149,260 Wikipedia
pages. Unlike Round 1, it consists of 2,500 tables for the training set and 500 tables for the test set.
Both SuperSemTab and MammoTab target Wikidata as the KG.</p>
      <p>Related Work. Based on a recent survey on the semantic interpretation of tabular data [7],
more than 85 systems have been proposed to tackle the tabular data to KG matching problem since
2007. Recently, there has been an increase in approaches relying on the features of pre-trained and large
language models, such as Doduo [8], TURL [9], DAGOBAH SL 2022 [10], TorchicTab [11], Korini and Bizer
[12], TableGPT [13], and TableLlama [14]. In the SemTab 2024 challenge [5], only three systems
participated in the STI vs. LLMs track: CitySTI (ours), TSOTSA [15], and Kepler-aSI [16].</p>
    </sec>
    <sec id="sec-2">
      <title>2. The CitySTI System</title>
      <p>CitySTI combines state-of-the-art LLMs with natural language processing (NLP) techniques to perform
SemTab’s CEA task. It not only matches the data with the appropriate entities but also
performs data cleaning. Figure 1 provides a general overview of the different components of CitySTI,
which are summarised as follows.</p>
      <p>Data cleaning. This component aims at cleaning the input tabular data from noise and correcting
any misspelled words. Figure 2 shows the prompts used to guide the LLM in this process. Note that we
used different prompts in each round of the STI vs. LLMs SemTab 2024 track.
Candidate generator (lookup). The candidates are extracted via the lookup service provided by
the target KG. Given an input query (e.g., the text value of a cell), the lookup service extracts candidates
(partially) matching the query. CitySTI implements an API to access the lookup services of different
KGs and extracts the top-5 candidates for each query (see Figure 3).</p>
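      <p>As an illustration of the candidate-generation step, the sketch below queries Wikidata’s public lookup service (the wbsearchentities endpoint of the MediaWiki API) for the top-5 candidates of a cell value. This is a minimal sketch under our own assumptions; the function names are ours and this is not the exact CitySTI implementation.</p>

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def build_lookup_params(query: str, limit: int = 5) -> dict:
    """Parameters for Wikidata's entity search (lookup) service."""
    return {
        "action": "wbsearchentities",
        "search": query,
        "language": "en",
        "limit": limit,
        "format": "json",
    }

def parse_candidates(payload: dict, limit: int = 5) -> list:
    """Keep only the fields that are later fed into the matching prompt."""
    return [
        {"id": c["id"],
         "label": c.get("label", ""),
         "description": c.get("description", "")}
        for c in payload.get("search", [])[:limit]
    ]

def lookup_candidates(query: str, limit: int = 5) -> list:
    """Return up to `limit` candidate entities for a cell value."""
    resp = requests.get(WIKIDATA_API,
                        params=build_lookup_params(query, limit),
                        timeout=10)
    resp.raise_for_status()
    return parse_candidates(resp.json(), limit)
```

      <p>Each candidate carries an entity ID, label and short description, which is the information the matching prompt needs to disambiguate between entities sharing a label.</p>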
      <p>Matching. This component associates the data cell with its appropriate KG entity. It takes as input
the top-5 candidates from the lookup component and communicates with an LLM to identify the
best choice. The set of candidates is fed into the matching prompt along with the cleaned CSV data
to perform the matching (see Figure 2). If only one candidate is retrieved from the lookup, it is
automatically assigned to its cell without any need for a matching prompt. An entity cache was
also implemented, since values to be annotated often occur repeatedly. It uses a dictionary that
stores the pools of candidates to avoid repeatedly retrieving the KG entity for the same cell value in the
same table. The cache is automatically cleared every 15 seconds to prevent memory overload.</p>
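      <p>The single-candidate shortcut and the periodically cleared entity cache can be sketched as follows. This is an illustrative sketch: the ask_llm callable stands in for the prompt-based matching call and its signature is hypothetical, not the actual CitySTI code.</p>

```python
import time

CACHE_TTL_SECONDS = 15  # the cache is flushed periodically to bound memory use

class EntityCache:
    """Per-table entity cache so repeated cell values are not matched again;
    cleared every CACHE_TTL_SECONDS."""
    def __init__(self, ttl: float = CACHE_TTL_SECONDS):
        self.ttl = ttl
        self._store: dict = {}
        self._last_clear = time.monotonic()

    def get(self, table_id: str, cell: str):
        self._maybe_clear()
        return self._store.get((table_id, cell))

    def put(self, table_id: str, cell: str, entity):
        self._maybe_clear()
        self._store[(table_id, cell)] = entity

    def _maybe_clear(self):
        if time.monotonic() - self._last_clear >= self.ttl:
            self._store.clear()
            self._last_clear = time.monotonic()

def match_cell(table_id, cell, candidates, cache, ask_llm):
    """Pick the best KG entity for a cell.

    `ask_llm(cell, candidates)` is a placeholder for the prompt-based
    matching call (hypothetical signature)."""
    cached = cache.get(table_id, cell)
    if cached is not None:
        return cached
    if len(candidates) == 1:      # single candidate: assign it directly,
        entity = candidates[0]    # no matching prompt is needed
    else:
        entity = ask_llm(cell, candidates)
    cache.put(table_id, cell, entity)
    return entity
```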
      <p>The matching component implements two different approaches, tailored to the specific dataset in
each round of the STI vs. LLMs SemTab 2024 track, as described next. Code for both approaches
is available in this GitHub repository: https://github.com/dylanlty/CitySTI-2024
LLM component - approach round 1. In this approach, the system loops through all the CSV files
in a folder and annotates each data cell, skipping numeric and date values. The system then
feeds the entire table to the GPT-4o-mini model to improve its relevance and accuracy when performing
the data cleaning and matching.</p>
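      <p>The round-1 traversal can be sketched as below. The skip rules for numeric and date-like cells are illustrative guesses at the kind of filtering described, not the exact CitySTI heuristics.</p>

```python
import csv
import re
from pathlib import Path

# A loose date pattern (e.g. 12/05/2020 or 2020-05-12); illustrative only.
DATE_RE = re.compile(r"^\d{1,4}[-/.]\d{1,2}[-/.]\d{1,4}$")

def is_skippable(value: str) -> bool:
    """Numeric and date-like cells are skipped: they are not
    annotated against the KG."""
    v = value.strip()
    if not v:
        return True
    try:
        float(v.replace(",", ""))   # plain numeric value (e.g. "1,000")
        return True
    except ValueError:
        pass
    return bool(DATE_RE.match(v))

def cells_to_annotate(folder: str):
    """Yield (file, row, col, value) for every annotatable cell of every
    CSV file in `folder` (illustrative version of the round-1 loop)."""
    for path in sorted(Path(folder).glob("*.csv")):
        with open(path, newline="", encoding="utf-8") as f:
            for r, row in enumerate(csv.reader(f)):
                for c, value in enumerate(row):
                    if not is_skippable(value):
                        yield path.name, r, c, value
```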
      <p>All the components were run on a Virtual Machine (VM) instance in Google Cloud Platform’s
Compute Engine. Google Cloud Platform was used to process the large volumes of data. This was
essential because matching large volumes of data cells to a KG is very demanding in time and resources.
Before creating a VM instance through the Compute Engine component, it was properly configured to avoid
performance bottlenecks. Since a large amount of memory may be required (e.g., to store the entities
in the dictionary), 8 GB of memory was chosen. The N2 machine series was selected, as it
works well for the kind of task required, where the system only annotates data cells and
performs read-write operations. Additionally, a 10 GB SSD was allocated to store all the input data
and the necessary packages.</p>
      <p>During the implementation of the system, open-source models on HuggingFace such as
Llama-3-8B-Instruct were explored and run on RunPod. The performance of these models did not match that of
those currently provided by Google’s Gemini or OpenAI’s GPT models. They also proved costly to operate
and significantly more expensive than the APIs offered by the other major AI providers (see Section 3).
LLM component - approach round 2. The second round of the challenge introduced some changes
to address issues faced in the previous round and because the MammoTab 24 dataset is more
challenging in terms of the size of the tables. The first change was that the system now reads and
annotates based on the target file (i.e., cells for which there is ground truth) given by the SemTab
challenge instead of processing all potential cells.</p>
      <p>Another change involved processing and providing context in batches instead of
the entire table, together with the use of the Gemini-1.5-flash model. This was done to address the token
usage limits set by the LLM provider. Unlike in Round 1, the datasets are loaded into a
Pandas DataFrame, cleaned in batches of 15 rows at a time, and written to a temporary file. This file is
then sliced into different row ranges to provide the model with some context when matching. The number of rows
provided varies from 4 to 6, depending on the position of the data cell.</p>
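      <p>The round-2 batching and context slicing can be sketched as below with Pandas. The window bounds (two rows before and three after the target cell) are illustrative values consistent with the 4-to-6-row range described, not the exact CitySTI parameters.</p>

```python
import pandas as pd

BATCH_SIZE = 15  # rows cleaned per request (round-2 setting)

def batches(df: pd.DataFrame, size: int = BATCH_SIZE):
    """Split the table into batches of `size` rows for cleaning,
    so each request stays within the provider's token limits."""
    for start in range(0, len(df), size):
        yield df.iloc[start:start + size]

def context_window(df: pd.DataFrame, row: int,
                   before: int = 2, after: int = 3) -> pd.DataFrame:
    """Slice a few rows around the target cell's row to give the model
    local context when matching. The window shrinks near the table edges,
    so the number of rows provided depends on the cell's position."""
    start = max(0, row - before)
    end = min(len(df), row + after + 1)
    return df.iloc[start:end]
```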
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions, Challenges, and Future Work</title>
      <p>This paper presented CitySTI, a system that annotates tabular data with KG entities. The approach
prompts an LLM to clean the data and match the KG entities. The results are satisfactory considering
this is our first participation in the SemTab challenge and our first foray into the realm of LLMs.
Since CitySTI relies only on prompting, it can be further improved and may achieve greater
performance if fine-tuning is implemented. This would remove the
need for large prompts to achieve the desired output and reduce the number of tokens needed. The
prompt structure can be further optimized to enable the system to process several data cells in a single
prompt, which would significantly reduce the number of requests sent. Another aspect of the
system that can be improved is its fault tolerance, i.e., its ability to recover from errors. It should
be able to continue where it left off, which would save a significant
amount of time compared to manually removing or adding files.</p>
      <p>Limitations. While many LLMs are free and open source, OpenAI and Google charge for API access
to their models. The main reason is that these advanced AI models are expensive to maintain and operate,
so they come with usage restrictions that add extra challenges to the task. These restrictions can come
in the form of rate or token limits. Even though Google provides a free tier for its Gemini API, the rate
limit is insufficient to annotate several thousand data cells.</p>
      <p>Additionally, the lookup search for entities on Wikidata is not very flexible, as it is unable to recognize
typos or incorporate context words related to the searched term the way a modern search engine does. The
accuracy could have been significantly improved if the lookup search were more advanced. In the near
future, we plan to explore alternative APIs to access the target KG.</p>
      <p>[5] O. Hassanzadeh, N. Abdelmageed, M. Cremaschi, V. Cutrona, F. D’Adda, V. Efthymiou, B. Kruit, E. Lobo, N. Mihindukulasooriya, N. H. Pham, Results of SemTab 2024, in: SemTab’24: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching 2024, co-located with the 23rd International Semantic Web Conference (ISWC), 2024.</p>
      <p>[6] M. Marzocchi, M. Cremaschi, R. Pozzi, R. Avogadro, M. Palmonari, MammoTab: a giant and comprehensive dataset for Semantic Table Interpretation, in: Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, SemTab 2022, co-located with the 21st International Semantic Web Conference (ISWC), CEUR-WS.org, 2022.</p>
      <p>[7] M. Cremaschi, B. Spahiu, M. Palmonari, E. Jimenez-Ruiz, Survey on Semantic Interpretation of Tabular Data: Challenges and Directions, arXiv preprint arXiv:2411.11891 (2024).</p>
      <p>[8] Y. Suhara, J. Li, Y. Li, D. Zhang, Ç. Demiralp, C. Chen, W.-C. Tan, Annotating columns with pretrained language models, in: Proceedings of the 2022 International Conference on Management of Data, SIGMOD ’22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 1493–1503. URL: https://doi.org/10.1145/3514221.3517906. doi:10.1145/3514221.3517906.</p>
      <p>[9] X. Deng, H. Sun, A. Lees, Y. Wu, C. Yu, TURL: table understanding through representation learning, Proc. VLDB Endow. 14 (2020) 307–319. URL: http://www.vldb.org/pvldb/vol14/p307-deng.pdf. doi:10.5555/3430915.3442430.</p>
      <p>[10] V.-P. Huynh, Y. Chabot, T. Labbé, J. Liu, R. Troncy, From Heuristics to Language Models: A Journey Through the Universe of Semantic Table Interpretation with DAGOBAH, in: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab), CEUR-WS.org, 2022.</p>
      <p>[11] I. Dasoulas, D. Yang, X. Duan, A. Dimou, TorchicTab: Semantic Table Annotation with Wikidata and Language Models, in: SemTab’23: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching 2023, co-located with the 22nd International Semantic Web Conference (ISWC), 2023.</p>
      <p>[12] K. Korini, C. Bizer, Column Type Annotation using ChatGPT, in: Joint Proceedings of Workshops at the 49th International Conference on Very Large Data Bases (VLDB 2023), Vancouver, Canada, August 28 - September 1, 2023, volume 3462 of CEUR Workshop Proceedings, CEUR-WS.org, 2023. URL: https://ceur-ws.org/Vol-3462/TADA1.pdf.</p>
      <p>[13] L. Zha, J. Zhou, L. Li, R. Wang, Q. Huang, S. Yang, J. Yuan, C. Su, X. Li, A. Su, T. Zhang, C. Zhou, K. Shou, M. Wang, W. Zhu, G. Lu, C. Ye, Y. Ye, W. Ye, Y. Zhang, X. Deng, J. Xu, H. Wang, G. Chen, J. Zhao, TableGPT: Towards unifying tables, nature language and commands into one GPT, CoRR abs/2307.08674 (2023). URL: https://doi.org/10.48550/arXiv.2307.08674. doi:10.48550/arXiv.2307.08674. arXiv:2307.08674.</p>
      <p>[14] T. Zhang, X. Yue, Y. Li, H. Sun, TableLlama: Towards open large generalist models for tables, CoRR abs/2311.09206 (2023). URL: https://doi.org/10.48550/arXiv.2311.09206. doi:10.48550/arXiv.2311.09206. arXiv:2311.09206.</p>
      <p>[15] J. P. Bikim, C. Atezong, A. Jiomekong, A. Oelen, G. Rabby, J. D’Souza, S. Auer, Leveraging GPT Models For Semantic Table Annotation, in: SemTab’24: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching 2024, co-located with the 23rd International Semantic Web Conference (ISWC), 2024.</p>
      <p>[16] W. Baazouzi, M. Kachroudi, S. Faiz, Kepler-aSI: Semantic Annotation for Tabular Data, in: SemTab’24: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching 2024, co-located with the 23rd International Semantic Web Conference (ISWC), 2024.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          ,
          <article-title>Knowledge Graphs 2021: A Data Odyssey</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>14</volume>
          (
          <year>2021</year>
          )
          <fpage>3233</fpage>
          -
          <lpage>3238</lpage>
          . URL: http://www.vldb.org/pvldb/vol14/p3233-weikum.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Vrandecic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          ,
          <article-title>Wikidata: a free collaborative knowledge base</article-title>
          ,
          <source>Commun. ACM</source>
          <volume>57</volume>
          (
          <year>2014</year>
          )
          <fpage>78</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kobilarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ives</surname>
          </string-name>
          ,
          <article-title>DBpedia: A Nucleus for a Web of Open Data</article-title>
          , in: The Semantic Web, Springer Berlin Heidelberg,
          <year>2007</year>
          , pp.
          <fpage>722</fpage>
          -
          <lpage>735</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Jimenez-Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hassanzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Efthymiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Srinivas</surname>
          </string-name>
          ,
          <article-title>SemTab 2019: Resources to Benchmark Tabular Data to Knowledge Graph Matching Systems</article-title>
          , in: The Semantic Web: ESWC, Springer International Publishing,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>