=Paper=
{{Paper
|id=Vol-3889/paper4
|storemode=property
|title=Scalable Table-to-Knowledge Graph Matching from Metadata using LLMs
|pdfUrl=https://ceur-ws.org/Vol-3889/paper4.pdf
|volume=Vol-3889
|authors=Nathan Vandemoortele,Bram Steenwinckel,Sofie Van Hoecke,Femke Ongenae
|dblpUrl=https://dblp.org/rec/conf/semtab/VandermoorteleS24
}}
==Scalable Table-to-Knowledge Graph Matching from Metadata using LLMs==
Scalable Table-to-Knowledge Graph Matching from
Metadata using LLMs
Nathan Vandemoortele1,∗ , Bram Steenwinckel1 , Sofie Van Hoecke1 & Femke Ongenae1
1
Ghent University - imec, Technologiepark-Zwijnaarde 122, Gent, 9052, Belgium
Abstract
Addressing the challenge of interoperability when integrating and interpreting large datasets from
diverse sources is essential for businesses aiming to make informed, data-driven decisions. Therefore, the
2024 Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab) focuses on using
metadata, such as column names, to map tables to semantic concepts within standardized vocabularies
or Knowledge Graphs (KGs). The challenge involves mapping two datasets, one to the DBpedia ontology
and the other to a custom vocabulary.
Our approach begins with applying Retrieval-Augmented Generation (RAG) for a broad search of relevant
matches, ensuring scalability. We then refine these matches using a Large Language Model (LLM) with
Chain-of-Thought (CoT) prompting and Self-Consistency (SC). Finally, we combine the results using
Reciprocal Rank Fusion (RRF) to obtain a final ranking of the matches.
This method achieves hit rates of 62% (top 1) and 82% (top 5) for the first dataset, and 84% (top 1) and 98%
(top 5) for the second. The LLM’s strong semantic understanding and extensive knowledge base provide
significant advantages over traditional human labeling, which is often laborious and time-consuming.
Furthermore, the LLM’s zero-shot capability removes the need for additional task-specific training data,
making this solution applicable across various domains.
Despite limitations like computational costs and the need for well-defined concepts in the ontology or
vocabulary, our approach remains cost-effective compared to extensive human labeling. Moreover, we
leave room to trade performance for scalability if needed, pending further research.
Keywords
data linkage, knowledge graphs, large language models, retrieval augmented generation, zero-shot
learning, metadata
1. Introduction
In today’s data-driven world, many organizations and companies frequently face the challenge
of integrating and interpreting large datasets from various sources to streamline their business
operations and their ability to make data-driven decisions [1]. For example, consider a multina-
tional corporation that has recently acquired several smaller companies, each with a different
data management system. Merging such heterogeneous data sources while maintaining the
intrinsic semantic meaning of the available information is a difficult task for humans.
SemTab’24: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching 2024, co-located with the 23rd
International Semantic Web Conference (ISWC), November 11-15, 2024, Baltimore, USA
∗
Corresponding author.
Envelope-Open nathan.vandemoortele@ugent.be (N. Vandemoortele); bram.steenwinckel@ugent.be (B. Steenwinckel);
sofie.vanhoecke@ugent.be (S. Van Hoecke); femke.ongenae@ugent.be (F. Ongenae)
Orcid 0009-0009-0118-3519 (N. Vandemoortele); 0000-0002-3488-2334 (B. Steenwinckel); 0000-0002-7865-6793 (S. Van
Hoecke); 0000-0003-2529-5477 (F. Ongenae)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR
ceur-ws.org
Workshop ISSN 1613-0073
Proceedings
table_name": ”≈", "table_columns": ["RANG", "Museum",
"Stadt", "Facebook-Fans"]
Vocabulary
table_name: Museum
…
table_columns: RANG Name Stadt Fans locationName
location
table_cells:
Unavailable locationCity
city
livingPlace
Figure 1: Matching table metadata only, i.e., table and column names, to a vocabulary or ontology
without any access to table data. In this example, the column “Stadt” of the “Museum” table is matched
to the “locationCity” property within the given vocabulary.
One possible solution to this problem involves matching tables to standardized vocabularies or
Knowledge Graphs (KGs) [2, 3]. This process, known as semantic table linking or table matching,
aims to align data within tables with predefined concepts. The “Semantic Web Challenge on
Tabular Data to Knowledge Graph Matching (SemTab)1 ” has been hosted for several years at
confidentia
the International Semantic Web Conference (ISWC) to stimulate the creation of such automated
semantic annotation systems. The goal is to automate the assignment of a DBpedia or Wikidata
entity to a whole column based on table data, assign a DBpedia or Wikidata entity to different
cells, infer relations between different columns where possible, or assign a KG class to an entire
table. One of the objectives of this year’s challenge2 is to map columns to semantic concepts in
a vocabulary or KG based solely on the information in the header of the columns, i.e. column
and table names. The objective is illustrated in Figure 1.
This automated mapping presents considerable challenges, as the true meaning of the table
relies on interpreting the data within the cells.
• For instance, a column named “ID” or “Value” provides little to no information about its
semantic meaning to help linking it to a concept in a KG.
• Many column names are ambiguous because they can have multiple potential meanings
depending on the context. For example, a column named “Date” could refer to, amongst
others, a birth date, transaction date, or event date.
• Tables can employ different naming conventions, languages, abbreviations, or terminolo-
gies. For instance, one dataset might use “CustID” while another uses “CustomerID” to
represent the same concept.
• As organizations are dynamic, new tables can be added and existing ones modified. This
dynamic nature requires continuous updates and maintenance of data structures, which
is resource-intensive if done manually.
• As organizations accumulate new tables, the scalability of semantic mapping becomes a
concern. Mapping metadata to vocabularies or KGs for potentially thousands of tables is
not humanly feasible, necessitating automated and scalable solutions.
We hypothesize that Large Language Models (LLMs) offer a promising solution to the chal-
lenges of semantic mapping when relying solely on metadata. Even in the absence of actual
1
https://www.cs.ox.ac.uk/isg/challenges/sem-tab/
2
https://sem-tab-challenge.github.io/2024/tracks/metadata-to-kg-track.html
data content (e.g., due to restricted access), LLMs can leverage their intrinsic knowledge and
reasoning capabilities to tackle automated mapping problems. In this paper, we investigate
the zero-shot capabilities of GPT-4o3 , a state-of-the-art LLM at the time of writing, for the
“Table Metadata to KG” task in the 2024 “Semantic Web Challenge on Tabular Data to Knowl-
edge Graph Matching” challenge. Zero-shot learning refers to the ability to transfer to an
unseen problem without new task-specific training data. To remain cost-effective, we adapt
an advanced Retrieval Augmented Generation (RAG) [4] solution that retrieves only the most
relevant information within the vocabulary or KG. This ensures that the generated responses
are contextually accurate while maintaining scalability. In this paper, we present the following
key contributions:
• While RAG solutions exist in the context of ontology matching, we are the first to
incorporate and apply (LLM-based) rerankers in this context.
• Our main innovation comprises a novel completion stage, where we uniquely combine
two prompting techniques, Chain-of-Thought and Self-Consistency, with Reciprocal
Rerank Fusion to find the best mappings.
• Our methodology efficiently scales to handle large and diverse datasets in a zero-shot
manner.
The remainder of this paper is as follows. We first discuss related work in Section 2. Next,
we present the table datasets and vocabularies provided by the SemTab organizers in Section 3,
followed by a detailed explanation of our methodology in Section 4. In Section 5, we present
the results. We conclude the paper in Section 6 and suggest additional paths for future work.
2. Related Work
Tasks regarding table topic annotation and column type annotation are closely related to the
general problem of semantic table understanding [5]. The SemTab challenge [6] has always been
formulated as an unsupervised table annotation problem. Systems like MTab [7], CSV2KG [8],
and DAGOBAH [9], which participated in the SemTab challenge in previous years, tackle three
main tasks in KG matching: creating cell entity annotations (CEA), column type annotation
(CTA), and paired column property annotations (CPA). These pipelines typically begin by
linking individual cell contents, such as the various column entries, to ontology entities, and
then predicting the most likely column type based on these linkages. To do so, these systems
need available cell values to be mapped to KG entities before column types can be identified,
which is not applicable in this context as here the actual table data is not available.
Given that we can only rely on metadata information for the task addressed in this paper, the
related work methodologies that are applicable align more closely with ontology alignment
techniques. These can range from traditional string similarity methods [10] to semantic similar-
ity measures utilizing resources like WordNet [11]. WordNet is, however, designed for general
purposes and may not include specialized terms or jargon used in specific, custom vocabularies.
More recent work includes vector representations for semantic similarity, using, e.g., word2vec
3
https://platform.openai.com/docs/models
to get promising results in semantic textual similarity tasks [12]. Nowadays, also LLM-based ap-
proaches have achieved traction for tasks like ontology alignment [13], entity matching [14] and
subject annotation [15]. Since LLMs are trained on large data corpora, they can identify complex
relations and patterns between different objects in natural language, leading to more accurate
matching. LLMs demonstrate their effectiveness to improve metadata matching performance,
either through additional fine-tuning on table-based tasks [16] or without fine-tuning [3], and
show how contextual information, such as the dataset description, can be incorporated in an
LLM prompt to match a vocabulary to a table’s column topics [17]. This last task is similar to
the one we are trying to solve in this paper. However, providing the full vocabulary as context
within each prompt would not be cost-effective at scale. To this end, RAG pipelines have been
proposed [18, 19], though they have not yet been applied in this specific context. Furthermore,
advanced prompt optimization strategies remain largely unexplored.
3. Metadata Datasets
Two datasets were provided in the challenge, which both contained a vocabulary on the one
hand and table metadata (i.e., table and column names) for multiple tables without the actual
cell data on the other hand:
• Dataset 1 consists of metadata from a selected set of HTML-based tables that can be
found on the web, and needs to get mapped to a subset of the DBpedia ontology4 . The goal
is to map each table column to a corresponding ontology property. While the ontology is
hierarchically structured, it is treated as a simple vocabulary of properties, each defined
by labels and short descriptions, with no consideration of the relationships between them.
In total, metadata for 77 web tables were provided, of which 141 columns needed to be
mapped. The vocabulary consists of 2881 different properties.
• Dataset 2 contains a set of tables, available in a database accessible for public use.
The metadata of these tables needs to be mapped to a custom vocabulary with clear
descriptions. However, the vocabulary is not derived from an existing publicly available
KG. The goal is, again, to map each table column to one vocabulary item. This dataset
contains metadata for 75 tables, of which 1181 columns needed to be mapped, and has
1181 items in the vocabulary.
4. Methodology
Our architecture adapts an advanced RAG solution with a novel LLM-based completion stage.
RAG is a technique used to incorporate new knowledge into LLMs through in-context learning
(ICL), such as a vocabulary or a knowledge graph (KG). It promises a scalable approach by
retrieving only relevant information, thereby significantly driving down costs. ICL enables
LLMs to learn from examples within the context window without altering their weights.
Our solution’s two-stage pipeline is schematically visualized in Figure 2 and consists of
a retrieval stage and a completion stage. In this section, we explain each component of the
4
http://dbpedia.org/ontology/
pipeline step by step. You can find the prompt templates used in the “appendix” directory of
our GitHub repository: https://github.com/predict-idlab/Meta2Concept
Retrieval Stage Completion Stage
Reciprocal
Rank
Fusion
Embedding Top-K Top 1
Preprocessing
Model matching + Top 5
Reranker
Self-
consistency
Figure 2: Overview of the two-stage methodology composed of a retrieval and completion stage after
an initial preprocessing step. The hexagonal knot and lama icon refer to the OpenAI and Llama models,
respectively.
confidential
4.1. Preprocessing
Both tables and vocabularies can vary in size, language, and quality. For example, the descrip-
tions of the provided vocabulary items can be rather short (see Section 3), or the provided table
metadata can contains abbreviations or be multilingual (see introduction). To address these
variations, a preprocessing step is applied to both the vocabulary and metadata as a first step as
detailed in the following subsections.
4.1.1. Enriching of vocabulary
An item within a vocabulary represents the information of a single concept or property, e.g., a
DBpedia property label and its description for the first dataset. We ask GPT-4o to translate the
item labels to English, fix grammar mistakes, convert proper names to their type (e.g., Harelbeke
becomes → City (Harelbeke)) and write acronyms in full. Closed-source LLMs like GPT-4o
enrich the embedding model in the next stage, as they are more knowledgeable and frequently
updated with the latest information. We process the vocabulary in chunks of 100 items per
prompt, which is overall inexpensive and enables the use of GPT-4o. In most cases, this results
in property labels becoming self-explanatory. Below is an example of vocabulary enrichment
using GPT-4o:
Input: prompt(label=‘ept itm’)
Output: ‘European Poker Tour (EPT) In The Money (ITM)’
This helps generate more accurate embeddings for the next stage. In this case, you might also
find matches closer to poker, tours, or money.
Item descriptions are more fine-grained representations of the label in semantic space. How-
ever, for these descriptions to be effective as embeddings, they must be specific and not open to
interpretation. For example, “season” is simply described as “season”, however “season” can
refer to the yearly seasons but also television seasons, or sports seasons. Attempting to assign
one interpretation without knowing the ground truth could skew the embeddings vector in a
specific direction, whereas it might be preferable to leave such terms undefined at this stage.
This problem is specific to Dataset 1, hence, we focus on enriching and utilizing only its labels.
For Dataset 2, we use and retain the original descriptions, as they are deemed to be of high
quality.
4.1.2. Enriching of table metadata
The metadata consists of a table name and column names, one or more of which need to be
mapped. We ask GPT-4o to transform the column and table name into a sentence in the form:
{column} {relation} {table_name}
No other information about the other column names in the table is included. Furthermore,
we ask GPT-4o to translate words to English, fix grammar mistakes, convert names to their type
and write acronyms in full, similar to what we did with the vocabulary labels. The following is
an example of table metadata enrichment using GPT-4o:
Input: prompt(column=‘Height (m)’, table_name=‘Mountain’)
Output: ‘Height in meters of a mountain’
4.2. Retrieval Stage
Figure 3 illustrates the Retrieval Stage. It consists of an advanced RAG solution including an
embedding step and a reranking step, both of which will be further explained in the subsequent
subsections.
Table Top-k
Query matches Rerank
Embedding
Query Embedding
Enrichment
RankGPT
Model
Similarity
Search Query
Preprocessing What is the best description of column {column} in the
table {table_name} with columns {table_columns}?
Prompt
{item 1}
The following are descriptions related to the query {query}
{item 2}
[1] {item 1}
{item 3}
Item Embedding {item 4}
[2] {item 2}
…
Enrichment Model {item 5} Rank the descriptions based on their relevance to the query
{item 6}
Vocabulary {item 7} [2] > [1] > [3] > [4] > …
Figure 3: Detailed overview of the retrieval stage: Vocabulary items are enriched with an LLM and
embedded with an embedding model, then stored. Table metadata is embedded similarly to create
query embeddings. A similarity search retrieves the top-k matches for the query embedding, which are confidential
reranked by RankGPT based on relevance.
4.2.1. Embedding model
Dense embeddings provide a fast way to retrieve relevant documents at scale. The main idea is
to perform a coarse search for the relevant items within the vocabulary or KG at a low cost,
which are then refined further downstream.
The process begins with indexing the enriched vocabulary, where it is chunked by item and
embedded using a dense embedding model, as shown in Figure 3. We use text-embedding-3-
large (OpenAI)5 , which converts sentences into 3072-dimensional vectors. These embeddings
are stored and can be reused for comparison with each new query. Next, the enriched table
metadata is embedded similarly to generate query embeddings. We then perform a similarity
search, comparing the embedded query with each item in the vocabulary using cosine similarity,
which can be expressed as:
𝑐𝑜𝑠𝑖𝑛𝑒 𝑠𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 = 𝑑𝑜𝑡(𝐼 𝑡𝑒𝑚𝑣𝑒𝑐𝑡𝑜𝑟 , 𝑄𝑢𝑒𝑟𝑦𝑣𝑒𝑐𝑡𝑜𝑟 )
Since the vectors are normalized, there is no need to divide by their lengths.
Finally, we rank the items according to the cosine similarity and retrieve the top 150 matches
to be reranked. This number was chosen based on cost considerations. Ideally, this number
should be empirically validated using a validation set; however, in this challenge, we were
provided with only 10 samples for Dataset 1 and none for Dataset 2. Table 2 indicates the
number is sufficient for Dataset 1 with a 100% hit rate but may be inadequate for Dataset 2,
which has a 95% hit rate. This likely stems from a larger vocabulary and a higher likelihood
of encountering similar items. We believe there is a probable relationship between the size
of the dataset and the optimal number of retrieval candidates, which we have yet to evaluate
thoroughly.
4.2.2. Reranker
The top 150 candidates identified in the previous step show high semantic similarity with the
given table query, but their order is highly dependent on the embedding model. Embedding
models have several drawbacks: both the vocabulary item and query are embedded indepen-
dently, resulting in some loss of information. They are also often trained for specific tasks and
therefore require further fine-tuning to achieve the best results. This goes against our goal of
making the pipeline generically reusable. Furthermore, they do not have the same knowledge
or reasoning capabilities of an LLM.
This is where LLM-based rerankers come in. These rerankers find logical connections
between several “passages” (our vocabulary items) and a query simultaneously, albeit at a
higher computational cost, which is why we only use it on the top-ranked results in order
to improve scalability. We use RankGPT [20], a library that provides a prompt template to
transform LLMs into reranking agents. We selected Llama-3-70B (Meta), an open-weight LLM,
to ensure cost-effectiveness given the size of our datasets. We utilized RankGPT’s permutation
generation with a sliding window (step size of 10 and window size of 20) to limit the number of
candidates needing reranking in each prompt. Moreover, this measure decreases the prompt
size and improves performance by mitigating context length sensitivity. RankGPT’s sliding
window is applied from back to front, allowing items to be promoted in rank but not demoted
outside the window size. The reranking of vocabulary items is determined by their relevance to
the following new query:
5
https://platform.openai.com/docs/models/embeddings
Query:
What is the best description of column {column} in
the table '{table_name}' with columns {table_columns}?
Notice that this query now uses all the metadata of the table. Similarly, items now contain all
the information of the original vocabulary, i.e. the labels with their descriptions, allowing for
a more fine-grained ranking. We no longer use the information obtained during enrichment
because the clean-up step in preprocessing could introduce errors that may confuse the LLM.
However, for Dataset 1, we found the descriptions to be poor, which is why we included the
domain and range based on the properties provided by DBpedia6 . If the domain is an “owl:Thing”
or the range is a datatype, the field is left empty. This results in the following items:
Item:
Dataset 1: [, , ]
Dataset 2: