<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Portorož, Slovenia</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>GC-DAM: Graph and Contextual Embeddings for Heterogeneous Data Asset Matching</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maximilian Stäbler</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markus Lange</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Samir Kipper</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chris Langdon</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frank Köster</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Drucker School of Business, Claremont Graduate University</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>German Aerospace Center (DLR) - Institute for AI Safety &amp; Security</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Data assets, such as datasets, data services, APIs, algorithms, and analytical models, are valuable digital resources that organizations use to create value, support decision-making, and optimize business processes. Matching and integrating these assets, despite differences in semantic languages, ontologies, or schemas, is essential for building scalable and interoperable dataspaces. However, existing approaches often focus solely on semantic similarities, overlooking structurally similar assets from other domains that could be highly relevant. To address this gap, we present Graph and Contextual Embeddings for Heterogeneous Data Asset Matching (GC-DAM). GC-DAM employs two embedding strategies to match data assets based on both semantic and structural attributes. Structural (morphological) features are automatically incorporated into a knowledge graph, enabling the identification of assets that are structurally similar to a query but may originate from different domains, while metadata descriptions capture the semantic (contextual) features. This dual approach overcomes the limitations of methods that rely solely on semantic descriptions. We validate our approach against a custom dataset of 10,000 Kaggle data assets. Our multimodal embedding achieves 77% agreement on our custom dataset, demonstrating its ability to identify structurally similar assets across diverse domains, even when they are semantically different. The dataset and code are publicly available to the research community.</p>
      </abstract>
      <kwd-group>
        <kwd>Multi-Modal-Embedding</kwd>
        <kwd>Heterogeneous dataspaces</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Automated Interoperability</kwd>
        <kwd>LLM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Dataspaces have emerged as a pivotal concept in the European data landscape, driven by initiatives such as the
European Data Governance Act (DGA) and the adoption of the FAIR Data Principles. These frameworks
aim to foster trust, ensure data sovereignty, and promote seamless data sharing across diverse sectors,
including research, business, and public services [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Gaia-X (https://gaia-x.eu/) exemplifies the
potential of dataspaces to enable collaborative value creation by integrating heterogeneous datasets
under common standards. However, achieving semantic interoperability remains a critical challenge
for their effective implementation.
      </p>
      <p>Consider the following scenario: a mobility researcher needs to find traffic flow datasets to develop a
congestion prediction model for a smart city project. In one dataspace, a municipality describes their
dataset as "Urban Traffic Intensity Measurements" with metadata focusing on sensor locations and
sampling frequency. In another, a similar dataset exists but is described as "City Vehicular Movement
Analytics", emphasizing the analytical methods applied to the raw data. In a third, a dataset labeled
"Metropolitan Transportation Metrics" documents comparable data but uses industry-specific
terminology. Although all three datasets contain structurally compatible traffic flow data that could benefit the
researcher, purely semantic search methods would likely identify only one or two of these resources,
missing potentially valuable data assets due to terminology discrepancies, a common challenge in the
absence of globally established standards for data description.</p>
      <p>
        The Semantic Web community has long contributed to addressing semantic heterogeneity through
RDF-based solutions and ontologies. Yet, as dataspaces expand in scale and complexity, new
methodologies are required to overcome challenges such as vocabulary mismatches, structural heterogeneity,
and contextual divergence [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. These issues are particularly pronounced in federated environments
where stakeholders operate under varying schemas and standards. Automating the discovery of similar
concepts and data assets is essential for managing the growing complexity of dataspaces. Systems must
identify and compare data assets accurately, even when expressed in different semantic languages,
ontologies, or schemas, enabling seamless use across platforms. Despite progress, challenges remain,
including the scalability of ontology-based solutions in dynamic ecosystems and the computational
complexity of reasoning across conflicting ontologies. Nonetheless, ongoing research continues to
advance semantic interoperability through refined terminological relationships (e.g., synonyms,
hyponyms, hypernyms) and improved real-time context alignment, offering hope for resolving these
issues in the near future.
      </p>
      <p>To address these challenges, this paper introduces GC-DAM (Graph and Contextual Embeddings
for Heterogeneous Data Asset Matching), a framework designed to enhance semantic interoperability
and facilitate efficient data asset discovery within dataspaces. The key research questions we address
are:</p>
      <p>1. How can we effectively match heterogeneous data assets across domains when semantic
descriptions alone are insufficient? 2. Can the combination of structural and semantic embeddings overcome
vocabulary mismatches and structural heterogeneity in dataspaces? 3. What are the optimal
techniques for integrating graph-based structural representations with contextual semantic embeddings to
maximize matching accuracy?</p>
      <p>Our contributions include:
• A novel multimodal embedding approach that combines graph-based structural embeddings with
contextual semantic embeddings to match heterogeneous data assets
• A framework that automatically extracts entities and relationships from metadata to build
knowledge graphs that capture structural similarities
• An evaluation methodology demonstrating that our approach achieves superior matching
performance compared to using either embedding type alone</p>
      <p>The unique value of combining structural and semantic embeddings lies in their complementary
strengths. While semantic embeddings excel at capturing thematic and contextual similarities based on
textual descriptions, they often miss structurally compatible assets that use different terminology or
conceptual frameworks. Structural embeddings, conversely, identify assets with similar organizational
patterns, entity relationships, and morphological characteristics, regardless of domain-specific
vocabulary. By leveraging both perspectives simultaneously, GC-DAM can discover relevant assets that would
remain hidden to approaches relying solely on semantic matching, significantly expanding the pool of
potential resources available to users in heterogeneous dataspaces.</p>
      <p>GC-DAM directly addresses key challenges outlined in the W3C Dataspaces Community Group,
particularly Issue #2: Data Discovery. By leveraging multimodal embedding techniques and advanced
semantic matching algorithms, GC-DAM enables the identification of structurally and semantically
similar data assets across heterogeneous datasets. This capability is essential for discovering relevant
resources within dynamic and distributed dataspaces. Furthermore, GC-DAM aligns especially with
the "F" in the FAIR principles by ensuring that data assets are findable. To illustrate its potential
applicability, we consider challenges that might arise in environments like Gaia-X. In such scenarios,
where diverse stakeholders share and access data assets, GC-DAM could address issues related to
semantic alignment and integration. By enabling the accurate discovery of relevant data assets and
ensuring seamless semantic interoperability, GC-DAM provides scalable solutions that help navigate the
complexities of dataspaces. Additionally, its modular design promotes reusability and standardization,
making it adaptable to a wide range of dataspace implementations.</p>
      <p>This work contributes to advancing the expressiveness and integration of semantic technologies in
dataspace architectures. By fostering collaboration between researchers and practitioners at workshops
such as Semantics in Dataspaces (SDS) 2025, we aim to drive innovation in this critical domain.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>To address our research questions on effective heterogeneous data asset matching through multimodal
embeddings, we examine three key research areas: approaches for entity matching in heterogeneous
environments, embedding techniques for structural and semantic representations, and methods for
integrating multiple embedding modalities.</p>
      <sec id="sec-2-1">
        <title>Heterogeneous Data Asset Matching</title>
        <p>
          Traditional approaches to data integration, such as Extract-Transform-Load (ETL) systems and
rule-based methods, often struggle with scalability and adaptability in dynamic federated environments
[
          <xref ref-type="bibr" rid="ref1 ref5">5, 1</xref>
          ]. This is particularly evident in dataspaces with diverse formats ranging from structured databases
to unstructured multimedia content [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Ontology-based matching approaches have long dominated
the field, with systems like LogMap using lexical matching, graph structure matching, and logical
reasoning to identify correspondences [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. While these methods offer high semantic precision and formal
consistency guarantees, they face several limitations that our approach addresses. First, they suffer
from computational bottlenecks when applied to large-scale datasets [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. As noted in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], traditional
ontology matching "severely lacks performance when dealing with large matching problems." Second,
they struggle with understanding textual variations and nuanced semantics across domains, often
relying on exact lexical matches or predefined synonyms.
        </p>
        <p>
          Recent partition-based matching approaches like COMA++ and Falcon attempt to address scalability
issues by dividing large ontologies into manageable partitions [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. However, these methods have
significant limitations that our multimodal embedding approach overcomes. COMA++ uses "relatively
simple heuristic rules to partition the input schemas resulting often in too few or too many partitions"
and relies on limited information about partitions (only root nodes) to determine similarity. In contrast,
our GC-DAM framework leverages comprehensive structural information through graph embeddings,
capturing deeper relationships between entities.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Semantic and Structural Embedding Methods</title>
        <p>
          Recent advancements in embedding-based techniques demonstrate promise in addressing heterogeneous
matching challenges. Large Language Models (LLMs) have emerged as powerful tools for generating
semantic embeddings that capture contextual relationships between data entities [
          <xref ref-type="bibr" rid="ref10 ref7">10, 7</xref>
          ]. These semantic
approaches excel at capturing thematic similarities and domain-specific vocabulary but often miss
structurally compatible assets that use different terminology. For semantic embeddings, approaches
range from basic word vector averaging to sophisticated contextual models. Word embedding methods
have been applied to ontology alignment with moderate success, with hybrid approaches incorporating
string-based similarity and semantic vector similarity showing improved performance [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. However,
as noted in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], these methods primarily capture lexical similarity and struggle with cross-domain
vocabulary differences, a limitation our multimodal approach explicitly addresses. Structural embedding
methods, conversely, focus on graph-based representations of data assets. Geometric modeling
approaches like JOIE transform RDFS ontologies into view graphs and model entities as vectors or shapes
(e.g., Concept2Box representing concepts as boxes) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. While effective for capturing relationships
within a domain, these methods typically operate independently of semantic embeddings, missing
opportunities for complementary information. Our approach diverges by using Node2Vec’s biased
random walk strategy to capture both local and global graph structures while complementing this with
rich semantic information.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Multimodal Embedding Integration</title>
        <p>
          The integration of multiple embedding modalities for comprehensive data representation remains an
underdeveloped area. Existing multimodal approaches primarily focus on integrating different data
types (e.g., text and images) rather than different representation perspectives of the same data [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
As highlighted in [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], multimodal embedding models facilitate "the integration of diverse data types
into a unified vector space," enabling "seamless cross-modality vector similarity searches." Current
multimodal models like CLIP, ImageBind, and visualBERT integrate visual and textual information but
target cross-modal retrieval rather than complementary perspectives on structured data [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. These
approaches show "a generalized advantage of multimodal representations over language-only ones on
concrete word pairs, but not on abstract ones," indicating domain-specific performance variations that
must be considered in data asset matching. Unlike these approaches, our GC-DAM framework uniquely
combines semantic embeddings (capturing contextual meaning) with structural embeddings (capturing
morphological relationships) to provide a comprehensive representation of data assets. While methods
like JOIE and EmbedS attempt to bridge knowledge graphs and ontologies, they typically use a single
embedding strategy with different targets rather than truly multimodal representations [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>In summary, while significant progress has been made in entity resolution, ontology alignment,
and data discovery using various embedding techniques, existing approaches typically focus on either
semantic or structural aspects in isolation. GC-DAM distinguishes itself by integrating both perspectives
into a unified framework that captures comprehensive similarity across heterogeneous dataspaces.
This novel integration enables identification of matches that would be missed by purely semantic or
structural approaches alone, addressing a critical gap in current dataspace interoperability solutions.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>This section details our approach to matching and aligning heterogeneous data assets across diverse
environments. By employing multimodal embeddings, we establish interoperability in complex dataspaces,
enabling identification and integration of similar assets across domains.</p>
      <p>We define three key concepts central to our approach:
• Semantic Embeddings: Dense vector representations capturing contextual meaning, thematic
content, and semantic relationships within data asset descriptions and metadata.
• Structural Embeddings: Vector representations derived from knowledge graphs encoding
morphological characteristics and entity relationships independent of specific terminology.
• Multimodal Embeddings: The integrated combination of both semantic and structural
embedding spaces for comprehensive similarity assessment.</p>
      <p>To illustrate, consider a traffic dataset: semantic embeddings capture that it contains "vehicle flow
measurements on urban roads," while structural embeddings identify its temporal sequence format
and geospatial reference structure. A dataset using different terminology (e.g., "metropolitan transit
analytics") with similar structural patterns would be recognized through multimodal matching despite
semantic differences.</p>
      <p>
        Figure 1 provides an overview of our process in three steps: (a) representation of heterogeneous data
assets, (b) transformation into distinct structural and semantic embeddings, and (c) clustering of matching
datasets using cosine similarity measures. The output, termed a "complex information object" (CIO),
represents query results by distinguishing between semantic and contextual representations [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <sec id="sec-3-1">
        <title>3.1. Semantic Embedding – Contextual Metadata</title>
        <p>
          For semantic representation, we incorporate rich contextual metadata capturing domain-specific insights
critical for accurate data interpretation. This includes functional descriptions, column annotations,
and API interface details transformed into high-dimensional embeddings using the stella_en_1.5B_v5
(S_EN) model (https://huggingface.co/dunzhang/stella_en_1.5B_v5). Built on Alibaba’s GTE architecture [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], S_EN is optimized for semantic similarity tasks
with several key adaptations:
• Embedding Dimensionality: 1536-dimensional output space balancing representational
capacity and efficiency
• Specialized Loss Function: Enhanced contrastive learning with hard negative mining
• Domain Adaptation: Transfer learning capabilities for dataspace-specific terminology
Metadata Integration and Preprocessing. We systematically integrate five metadata categories:
(1) Descriptive (titles, descriptions, keywords), (2) Structural (schemas, data types, relationships), (3)
Administrative (authorship, version history), (4) Technical (quality metrics, update frequency), and (5)
Domain-Specific annotations (sensor specifications for IoT data, geographic coordinates for spatial
data). These elements undergo preprocessing through normalization (standardizing formats, resolving
acronyms), noise reduction (removing redundant fields), domain-aware tokenization, and entity linking
to standard ontologies where available. This preprocessing is adaptive to specific domains while
maintaining cross-domain applicability.
        </p>
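<p>As an illustration, the preprocessing steps described above (normalization, acronym resolution, noise reduction) can be sketched in a few lines of Python. The acronym table and field names here are illustrative assumptions, not the implementation used in GC-DAM:</p>

```python
import re

# Hypothetical acronym table for illustration; GC-DAM's actual
# domain-aware mapping is not specified in the text.
ACRONYMS = {"gdp": "gross domestic product", "iot": "internet of things"}

def preprocess_metadata(record):
    """Normalize and concatenate metadata fields before embedding."""
    parts = []
    for field in ("title", "subtitle", "description", "keywords"):
        value = record.get(field)
        if not value:                 # noise reduction: drop empty fields
            continue
        if isinstance(value, list):   # keyword lists become plain text
            value = ", ".join(value)
        value = re.sub(r"\s+", " ", value.strip().lower())  # normalize
        # resolve known acronyms to their expanded form
        value = " ".join(ACRONYMS.get(tok, tok) for tok in value.split())
        parts.append(value)
    return " | ".join(parts)

print(preprocess_metadata({"title": "US GDP  Data", "subtitle": ""}))
```

<p>Entity linking to standard ontologies, the final preprocessing step, is omitted here for brevity.</p>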
        <p>Embedding Generation Process. The S_EN model transforms preprocessed metadata into dense
vector embeddings that capture semantic relationships beyond simple keyword matching, including
conceptual associations, hierarchical relationships, functional similarities, and implicit contextual
connections. These embeddings enable precise entity matching across diverse datasets through similarity
computation (using cosine distance), cross-domain alignment, and contextual disambiguation. This
approach enhances precision, reduces ambiguity, and enables integration of datasets that differ structurally
but share similar contextual meanings.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Structural Embedding – Structure and Dependencies</title>
        <p>To capture structural characteristics, we convert data assets into knowledge graphs (KGs). Nodes
represent real-world entities (organizations, people, products) while edges denote relationships between
them, capturing data element interconnections.</p>
        <sec id="sec-3-2-1">
          <p>The process begins with entity extraction using the GLiNER model [19], a Named Entity Recognition
system that frames NER as a matching problem in a shared latent space. We adapted GLiNER with:
• Custom entity types for dataspace-specific elements
• Improved span representation techniques using contextualized attention
• Specialized entity type embeddings aligned with common dataspace ontologies</p>
          <p>GLiNER employs a DeBERTa-v3-large encoder to generate contextualized token embeddings
aggregated into span representations. The span representation is computed as:
s_{i,j} = SpanAttn(h_i, h_{i+1}, . . . , h_j)
(1)
where h_i through h_j are the token embeddings of the span and SpanAttn weights tokens based on contextual
importance. After entity extraction, relationships are identified through dependency parsing and
domain-specific rules. The resulting KG is transformed into embeddings using Node2Vec with the
following configuration: a walk length of 80 steps per walk, 10 walks per node,
a return parameter (p) of 1, an in-out parameter (q) of 0.5,
a window size of 10, and 128 embedding dimensions. Before embedding generation, we include an
Entity Resolution step to identify and merge duplicate entities, ensuring accurate and non-redundant
structural representations.</p>
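<p>Equation (1) can be read as attention-weighted pooling over a span's token embeddings. The following NumPy sketch uses a single scoring vector as a stand-in for GLiNER's learned attention parameters, which are not detailed here:</p>

```python
import numpy as np

def span_attn(H, w):
    """Attention-weighted span representation, cf. Eq. (1).

    H: (k, d) array of token embeddings h_i ... h_j for one span.
    w: (d,) scoring vector standing in for the learned attention.
    """
    scores = H @ w                     # one relevance score per token
    scores = scores - scores.max()     # numerical stability for softmax
    alpha = np.exp(scores) / np.exp(scores).sum()   # attention weights
    return alpha @ H                   # weighted sum of token embeddings

rng = np.random.default_rng(0)
H = rng.normal(size=(3, 8))            # a 3-token span, 8-dim embeddings
w = rng.normal(size=8)
s = span_attn(H, w)                    # span representation s_{i,j}
```

<p>Because the weights form a softmax distribution, the result is a convex combination of the token embeddings, so each dimension of s stays within the range spanned by the tokens.</p>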
          <p>Figure 2 shows an example KG structure and visualization where nodes represent entities
color-coded by type, and edges indicate relationships. Node size reflects connection degree, highlighting
central entities. These structural embeddings enable matching of data assets with similar organizational
patterns regardless of terminology.</p>
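<p>The Node2Vec configuration above can be sketched as follows. This minimal pure-Python walk generator implements the second-order bias (return parameter p, in-out parameter q); the actual 128-dimensional embeddings are then learned by a skip-gram model over many such walks, which is not shown:</p>

```python
import random
from collections import defaultdict

def biased_walk(adj, start, length, p=1.0, q=0.5, seed=42):
    """One Node2Vec biased random walk; q below 1 favors outward
    (DFS-like) moves, capturing more global graph structure."""
    rng = random.Random(seed)
    walk = [start]
    for _ in range(length - 1):
        cur = walk[-1]
        nbrs = sorted(adj[cur])
        if not nbrs:                        # dead end: stop early
            break
        if len(walk) == 1:                  # first step is unbiased
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        weights = [1.0 / p if n == prev           # return to previous node
                   else 1.0 if n in adj[prev]     # stay near previous node
                   else 1.0 / q                   # move further away
                   for n in nbrs]
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk

# Toy knowledge graph as an undirected adjacency structure.
adj = defaultdict(set)
for u, v in [("A", "B"), ("B", "C"), ("C", "D"), ("B", "D")]:
    adj[u].add(v)
    adj[v].add(u)

walk = biased_walk(adj, "A", length=10)     # GC-DAM uses length 80
```

<p>In GC-DAM's configuration, 10 walks of length 80 are generated per node and fed to a skip-gram model with window size 10 to produce the 128-dimensional structural embeddings.</p>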
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. GC-DAM Framework</title>
        <p>For each embedding space, we generate cosine similarity matrices representing pairwise distances
between all data assets. For each query asset, we identify the top 10 most similar assets based on these
scores, then consolidate results into a final table combining both similarity types. This ensures matches
are both structurally aligned and semantically coherent.</p>
        <p>Formally, let D = {d_1, . . . , d_n} be a dataspace of data assets. Let Descr : D → R^1536 denote
the description embedding using the S_EN model processing titles, subtitles and descriptions. Let
G_{b,M} : D → KG_{b,M} be the graph embedding with boolean parameter b ∈ {0,1} and embedding model
M, where KG_{b,M} = (V, E) is the knowledge graph with vertices V and edges E. The structural
embedding Struc_{b,M} : D → R^128 maps data assets to Node2Vec embeddings.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Similarity Measurement and Matching</title>
        <p>We use cosine similarity to compare embedding vectors. For vectors x, y ∈ R^n, the cosine similarity is:
sim(x, y) = (x · y) / (‖x‖ ‖y‖)</p>
        <p>Given a search query d_search, we identify the ten most similar data assets in both embedding spaces:
D_Descr = {{d_1, . . . , d_10} ⊆ D :
sim(Descr(d_i), Descr(d_search)) ≥ sim(Descr(d), Descr(d_search))
for all d ∈ D ∖ {d_1, . . . , d_10}}
D_Struc = {{d_1, . . . , d_10} ⊆ D :
sim(Struc_{b,M}(d_i), Struc_{b,M}(d_search)) ≥ sim(Struc_{b,M}(d), Struc_{b,M}(d_search))
for all d ∈ D ∖ {d_1, . . . , d_10}}</p>
        <p>We then compare the entries of DDescr and DStruc to determine parameters for the structural
embedding that yield the highest number of matching entries. This achieves comparable similarity between
found data assets with respect to both semantic and structural representations. The structural
approach captures similarity influenced by morphological aspects arising from the particular graph model,
complementing the semantic approach’s focus on contextual meaning.</p>
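<p>The retrieval of the ten most similar assets in each embedding space, and the comparison of the resulting sets, can be sketched with NumPy. The random vectors below merely stand in for the S_EN and Node2Vec embeddings:</p>

```python
import numpy as np

def top_k(query_vec, embeddings, k=10):
    """Indices of the k assets most cosine-similar to the query."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = E @ q                           # cosine similarity to each asset
    return set(np.argsort(-sims)[:k])      # k highest-scoring indices

rng = np.random.default_rng(7)
n = 200                                    # toy dataspace of 200 assets
descr = rng.normal(size=(n, 1536))         # stand-in semantic embeddings
struc = rng.normal(size=(n, 128))          # stand-in structural embeddings

query = 0                                  # query with asset 0
D_descr = top_k(descr[query], descr)       # ten nearest in semantic space
D_struc = top_k(struc[query], struc)       # ten nearest in structural space

# Overlap between the two result sets; GC-DAM tunes the structural
# embedding parameters to maximize this agreement.
agreement = len(D_descr.intersection(D_struc)) / 10
```

<p>With real embeddings, this overlap is the quantity used to select structural-embedding parameters yielding the highest number of matching entries across both spaces.</p>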
        <p>The combined multimodal embedding approach enables identification of matches that would be
missed by approaches relying on either embedding type alone, providing a comprehensive framework
for heterogeneous data asset matching across diverse dataspaces.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>This section outlines the implementation and experiments conducted to evaluate the proposed method.
We present the frameworks, algorithms, datasets, experimental scenarios, and performance metrics used.
The goal is to demonstrate the method’s effectiveness across different data environments and benchmark
it against existing approaches. All implementations are available on GitHub
(https://github.com/maxistaebler/GC-DAM). The Kaggle dataset can be downloaded from
https://tinyurl.com/ty8xvzte. This section covers the creation of our heterogeneous dataset for evaluating
GC-DAM, followed by the implementation details, concluding with experiments on our Kaggle dataset.</p>
      <sec id="sec-4-1">
        <title>4.1. Kaggle Dataset Evaluation</title>
        <p>The evaluation of the Kaggle dataset focuses on assessing whether GC-DAM can effectively identify
suitable matches within a highly heterogeneous and cross-domain dataset. These matches are not
necessarily semantically similar but align with the query in terms of structure and context. The
evaluation specifically aims to quantify the relevance of the top 10 matching results returned by
GC-DAM for a given query dataset. The dataset is organized as a structured dataframe with multiple
attributes describing each dataset. Key columns include:
• id: A unique identifier for each dataset, formatted as username/dataset-slug.
• titleNullable: The title of the dataset.
• subtitle: A descriptive subtitle providing additional context about the dataset.
• description: A detailed explanation of the dataset’s contents and purpose.
• usabilityRatingNullable: A numerical rating (e.g., 0.941176) representing the dataset’s usability.
• keywords: A list of tags or keywords associated with the dataset.
• domains: Categorical classifications indicating the dataset’s domain (e.g., "CROSS_SECTOR",
"EDUCATION").
• licenses: Information about the licensing terms for each dataset.</p>
        <p>• isPrivate: A boolean value indicating whether the dataset is private or publicly accessible.</p>
        <p>This structured metadata enables a comprehensive analysis of datasets, including their usability,
domain classification, and descriptive content. The inclusion of categorical and textual attributes
supports diverse applications such as trend analysis, domain-specific studies, and metadata-driven
recommendations.</p>
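<p>For illustration, the metadata schema can be materialized as a small dataframe. The example row below is invented and only mirrors the column layout described above:</p>

```python
import pandas as pd

# One invented example row following the Kaggle metadata schema.
assets = pd.DataFrame([{
    "id": "kyanyoga/sample-sales-data",
    "titleNullable": "Sample Sales Data",
    "subtitle": "Anonymized sales records",
    "description": "Order details, customer data, sales figures.",
    "usabilityRatingNullable": 0.941176,
    "keywords": ["business", "tabular"],
    "domains": "CROSS_SECTOR",
    "licenses": "CC0",
    "isPrivate": False,
}])

# Typical metadata-driven filtering: public assets with high usability.
public = assets[~assets["isPrivate"]]
public = public[public["usabilityRatingNullable"] > 0.9]
```

<p>Such filtering steps are a natural precursor to the embedding pipeline, restricting matching to usable, publicly accessible assets.</p>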
        <p>Figure 4 provides an overview of the distribution of datasets within the Kaggle dataset based on
domain classification and keyword frequency. The top chart reveals that "CROSS-SECTOR" datasets
dominate, indicating their broad applicability across multiple fields, while other domains like "SPORT,"
"HEALTH," and "FINANCE" also feature prominently. The logarithmic scale highlights the long-tail
distribution of domain-specific datasets. The bottom chart focuses on keyword frequencies, showcasing
common tags such as "business," "tabular," and "data visualization," which reflect popular themes in data
science applications. Keywords like "classification," "image," and "computer science" further emphasize
the diversity of dataset topics available on Kaggle. Together, these visualizations underscore the
versatility and thematic richness of the platform’s dataset collection. Table 1 presents the top 10
GC-DAM matching results based on semantic and structural embeddings of the Sample Sales Data dataset
(part of the Kaggle dataset).</p>
        <sec id="sec-4-1-1">
          <p>The Sample Sales Data dataset (https://www.kaggle.com/datasets/kyanyoga/sample-sales-data), as
described by its authors, is a valuable resource for segmentation,
customer analytics, and clustering. It contains anonymized sales information, including order details,
customer data, sales figures, and shipping details. This dataset serves as an illustrative example of the
challenges posed by heterogeneous dataspaces, where datasets differ significantly in structure, content,
and application domains. The experiments conducted on this dataset demonstrate the effectiveness of
GC-DAM in enabling robust and accurate search functionality within such complex environments.</p>
        </sec>
        <sec id="sec-4-1-2">
          <p>The results highlight the complementary strengths of semantic and structural embeddings. Semantic
embeddings primarily rely on textual metadata, such as names and content descriptions, to identify
matches. Manual inspection reveals that matched datasets often share similar thematic content or
application domains. However, relying solely on semantic embeddings can overlook datasets with
structural relevance but limited textual similarity.</p>
          <p>For example, datasets like armitaraz/google-war-news or kapatsa/modelled-time-series may not
appear semantically related to the query but are highly relevant in terms of structural alignment
and contextual relevance. Structural embeddings address this limitation by capturing graph-based
relationships and dependencies within datasets. These embeddings excel at identifying matches based
on structural patterns, such as schema similarities or shared data formats. For instance, the
google-war-news dataset discusses economic impacts of war, aligning with the query’s context when viewed
through its structural focus on economic indicators. Similarly, the kapatsa/modelled-time-series dataset
provides yearly US GDP values corrected for inflation, resonating with the structural characteristics of
sales data.</p>
          <p>The GC-DAM approach combines these two embedding spaces to leverage their complementary
strengths. By integrating semantic and structural perspectives, GC-DAM identifies datasets that are
both contextually and structurally relevant to the query. This multimodal approach ensures that
matches are not only thematically coherent but also aligned in terms of schema and data organization.
Compared to using only one type of embedding, GC-DAM significantly improves recall and precision in
heterogeneous dataspaces. Currently, both semantic and structural embeddings are treated equally when
selecting the top 10 matching results for a given query. However, future applications or domain-specific
use cases may require weighting these embeddings differently based on their relative importance for
the task at hand. Such adjustments would need to be carefully evaluated for each specific use case to
ensure optimal performance. This example with the Sample Sales Data dataset is one of many used for
illustration purposes; readers are encouraged to explore additional examples and further analyses by
accessing the accompanying code or the dataset itself.</p>
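The equal-weight fusion described above can be sketched as follows; `combined_top_k`, the score dictionaries, and the dataset ids are our illustrative assumptions, not the paper's code. The `w_semantic` parameter corresponds to the domain-specific re-weighting the text mentions as future work.

```python
def combined_top_k(semantic_scores, structural_scores, k=10, w_semantic=0.5):
    """Fuse per-dataset similarity scores from two embedding spaces
    and return the top-k dataset ids. With w_semantic=0.5 both spaces
    are weighted equally, mirroring the current GC-DAM setup."""
    fused = {
        ds: w_semantic * semantic_scores.get(ds, 0.0)
            + (1 - w_semantic) * structural_scores.get(ds, 0.0)
        for ds in set(semantic_scores) | set(structural_scores)
    }
    return [ds for ds, _ in sorted(fused.items(), key=lambda kv: -kv[1])[:k]]

# Toy scores: high structural but low semantic similarity (and vice versa)
sem    = {"google-war-news": 0.2, "modelled-time-series": 0.3, "retail-sales": 0.9}
struct = {"google-war-news": 0.8, "modelled-time-series": 0.9, "retail-sales": 0.4}
print(combined_top_k(sem, struct, k=2))  # ['retail-sales', 'modelled-time-series']
```

Note how the structurally relevant time-series dataset outranks the purely semantic match once both scores are fused.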
          <p>To systematically evaluate the relevance of the top 10 matching results, we employed an
"LLM-as-a-judge" approach using ChatGPT-4o. This method leverages the advanced capabilities of Large Language
Models to act as evaluators for assessing the suitability of matches based on specific criteria derived
from the query. The primary criterion for judgment was whether the retrieved dataset’s inferred content
was comparable to the reference dataset, based solely on the dataset names provided to the LLM.</p>
          <p>Acknowledging the potential limitations and biases inherent in LLM-based evaluation, we
incorporated a Human-in-the-Loop validation step to ensure the reliability of the LLM’s judgments. Two
experienced domain scientists independently reviewed a random sample of 1000 judgments made
by the LLM across various queries and embedding types. This human validation served as a crucial
benchmark for the LLM’s performance on the relevance assessment task. The results showed a high
level of agreement between the human evaluators and the LLM, with a concordance rate of 87.4%. This
substantial agreement indicates that, for the specific task of judging dataset relevance based on names,
the LLM-as-a-judge approach provided sufficiently reliable and consistent evaluations, validating its
use for large-scale assessment in this study.</p>
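The concordance rate reported above is a plain agreement fraction over paired verdicts. A sketch with toy labels (the actual validation used a random sample of 1000 judgments and reached 87.4%):

```python
def concordance_rate(human_labels, llm_labels):
    """Fraction of judgments where the human and LLM verdicts agree."""
    assert len(human_labels) == len(llm_labels)
    agree = sum(h == m for h, m in zip(human_labels, llm_labels))
    return agree / len(human_labels)

# Toy sample of paired relevance verdicts (hypothetical, not the study's data)
human = [True, True, False, True, False, True, False, True]
llm   = [True, True, False, False, False, True, True, True]
print(f"{concordance_rate(human, llm):.1%}")  # 75.0%
```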
          <p>Figure 5 illustrates the distribution of positive matches among the top 10 results as judged by
ChatGPT-4o. We designed a specific prompt for ChatGPT-4o to evaluate all top 10 matching results for both
semantic and contextual embeddings. The prompt instructed the model to determine whether each
result was relevant to the query based on inferred similarity in dataset content, explicitly stating that
the judgment should be based on whether the name suggests comparable data, irrespective of domain
or application. The use of LLMs in this context provides several advantages:
1. Scalability: LLMs like ChatGPT-4o can efficiently evaluate large datasets without requiring
extensive manual effort [20].
2. Contextual Understanding: LLMs are capable of capturing nuanced relationships between
datasets based on structure and context rather than relying solely on surface-level semantic</p>
        </sec>
        <sec id="sec-4-1-3">
          <title>6 https://www.kaggle.com/datasets/armitaraz/google-war-news 7 https://www.kaggle.com/datasets/kapatsa/modelled-time-series</title>
          <p>similarity, particularly when considering dataset names and inferred content [20, 21].
3. Consistency: Unlike human evaluators, who may introduce variability in judgment, LLMs
can provide consistent evaluations across queries when presented with the same prompt and
input [21].</p>
        </sec>
        <sec id="sec-4-1-4">
          <title>The detailed prompt was as follows:</title>
        </sec>
        <sec id="sec-4-1-5">
          <p>The following is the name of the reference dataset: {reference_id}.</p>
          <p>Compare the names in the list below to see if they describe similar datasets to the reference
dataset. By "similar," it is meant that you can infer from the name that the dataset contains
comparable data. It does not need to be in the same domain or application—the key is
whether the dataset content aligns with the reference dataset.</p>
          <p>Please return a list of True/False values for each dataset in the list, indicating whether there
is a similarity or not.</p>
          <p>List to compare: {dataset_values}</p>
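Assuming the placeholders are filled per query and the model replies with a Python-style list of booleans, the evaluation step can be sketched as below. `build_prompt` and `parse_verdicts` are our hypothetical helpers; the LLM call itself is omitted, since API details are not specified in the paper.

```python
import ast

PROMPT_TEMPLATE = """The following is the name of the reference dataset: {reference_id}.

Compare the names in the list below to see if they describe similar datasets to the reference
dataset. By "similar," it is meant that you can infer from the name that the dataset contains
comparable data. It does not need to be in the same domain or application—the key is
whether the dataset content aligns with the reference dataset.

Please return a list of True/False values for each dataset in the list, indicating whether there
is a similarity or not.

List to compare: {dataset_values}"""

def build_prompt(reference_id, dataset_values):
    """Fill the judge prompt for one query."""
    return PROMPT_TEMPLATE.format(reference_id=reference_id,
                                  dataset_values=dataset_values)

def parse_verdicts(reply, expected_n):
    """Parse a '[True, False, ...]' reply into booleans; fail loudly otherwise."""
    verdicts = ast.literal_eval(reply.strip())
    if len(verdicts) != expected_n or not all(isinstance(v, bool) for v in verdicts):
        raise ValueError(f"malformed judge reply: {reply!r}")
    return verdicts

candidates = ["armitaraz/google-war-news", "kapatsa/modelled-time-series"]
prompt = build_prompt("kyanyoga/sample-sales-data", candidates)
print(parse_verdicts("[True, False]", expected_n=len(candidates)))  # [True, False]
```

Validating the reply length and types before use guards against the hallucinated or truncated outputs discussed below.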
          <p>However, it is crucial to critically reflect on the LLM-as-a-judge paradigm. LLMs can be sensitive
to prompt wording, prone to hallucination, and may perpetuate biases present in their training data.
Their decision-making process is often opaque, making it difficult to fully understand the rationale
behind each judgment compared to explicit rule-based systems or human reasoning. Furthermore,
relying solely on LLMs without validation can lead to unreliable results, especially in complex or
subjective evaluation tasks. The high agreement observed in our human validation step underscores
the necessity of such corroboration, demonstrating that while LLMs can be powerful evaluation tools,
their judgments require verification for academic rigor, particularly when assessing nuanced concepts
like dataset relevance across heterogeneous types.</p>
          <p>The evaluation yielded average agreement percentages (indicating the percentage of relevant matches
within the top 10) of 71.90% for semantic embeddings, 48.20% for contextual embeddings, and 77% for the
combined GC-DAM approach. Figure 5 shows the distribution of the number of relevant matches within
the top 10 for each query. Both embedding approaches achieved at least four positive matches for almost
all queries. Semantic embeddings exhibited broader coverage with approval ratings ranging from five to
nine positive matches, while contextual embeddings displayed a concentrated distribution around four
positive matches. The combined GC-DAM results show a strong propensity for 6 to 8 relevant matches
in the top 10, reflecting the benefit of integration. The "LLM-as-a-judge" paradigm, validated by human
experts, proved effective in this scenario due to its ability to assess datasets holistically by considering
inferred content and potential structural or contextual relevance suggested by the names. This approach
aligns well with GC-DAM’s goal of identifying suitable matches in heterogeneous environments while
ensuring compliance with FAIR principles through accurate data discovery mechanisms.</p>
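The per-query statistics above (average share of relevant matches and the distribution of positive counts) can be reproduced from raw verdicts with a short aggregation sketch; the verdicts here are toy values, not the study's data.

```python
from collections import Counter

def relevance_stats(judgments_per_query, k=10):
    """judgments_per_query: one boolean list of top-k verdicts per query.

    Returns (average share of relevant matches, Counter of positive counts),
    i.e. the quantities behind the reported percentages and Figure 5."""
    counts = [sum(j) for j in judgments_per_query]
    avg = sum(counts) / (len(counts) * k)
    return avg, Counter(counts)

# Toy verdicts for three queries (the paper reports 71.90% / 48.20% / 77%)
toy = [
    [True] * 7 + [False] * 3,
    [True] * 8 + [False] * 2,
    [True] * 6 + [False] * 4,
]
avg, dist = relevance_stats(toy)
print(f"{avg:.1%}", dict(dist))  # 70.0% {7: 1, 8: 1, 6: 1}
```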
          <p>In summary, leveraging LLMs as evaluators, critically validated by human experts, enhances our
ability to systematically assess GC-DAM’s performance across diverse datasets. This methodology
not only streamlines evaluation processes but also provides actionable insights into improving
embedding techniques for dataspace applications, demonstrating the utility of LLMs as a robust, albeit not
standalone, evaluation tool.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>This paper presented GC-DAM, a multimodal embedding-based framework designed to address the
challenges of data asset matching in heterogeneous and cross-domain dataspaces. By integrating
structural and semantic embeddings, GC-DAM effectively identifies data assets that align with queries
in both structure and context. This dual embedding strategy surpasses traditional approaches by
capturing nuanced relationships between datasets, enabling robust matching across diverse domains.
Our evaluation demonstrates GC-DAM’s capacity to uncover meaningful connections between datasets
that may not appear semantically similar at first glance but align contextually and structurally.</p>
      <p>The implementation of GC-DAM aligns with key challenges outlined in the W3C Dataspaces
Community Group, particularly Challenge #2: Data Discovery. The framework facilitates the discovery of
structurally and semantically similar data assets, supporting FAIR-compliant data sharing practices by
making datasets findable and interoperable. Moreover, GC-DAM’s modular design ensures adaptability
across various dataspace architectures, offering a scalable foundation for addressing interoperability
issues in federated environments like Gaia-X. To enhance the evaluation of GC-DAM’s performance,
we employed an "LLM-as-a-judge" approach using state-of-the-art models such as ChatGPT-4o. This
methodology provided consistent and scalable assessments of dataset matches by leveraging the
contextual understanding and reasoning capabilities of LLMs. The use of LLMs as evaluators proved
particularly effective in capturing subtle contextual alignments between datasets, which are often
overlooked by traditional metrics. In conclusion, GC-DAM represents a significant step forward in
advancing semantic interoperability within dataspaces. By aligning with the goals of workshops like
SDS 2025 and addressing key W3C challenges, this work lays the groundwork for future innovations in
dataspace architectures while fostering collaboration between researchers and practitioners.</p>
      <p>Challenges. Despite its promising results, several challenges remain. Scalability is a critical limitation
when applying GC-DAM to large-scale or highly dynamic dataspaces. The computational complexity
of generating and comparing multimodal embeddings can lead to increased overhead, particularly
in real-time applications. Additionally, while the integration of structural and semantic embeddings
provides a balanced perspective, further research is needed to refine this integration to prevent biases
toward one embedding type or the other.</p>
      <p>Future Work. Future work will focus on addressing these limitations by exploring efficient algorithms
for embedding generation and comparison to improve scalability. Advanced techniques for balancing
semantic and structural embeddings will be investigated to enhance robustness in cross-domain
scenarios. Furthermore, we aim to integrate GC-DAM more closely with ongoing W3C Dataspaces initiatives
by proposing new use cases and challenges, such as multimodal embedding techniques for data asset
matching (e.g., Issue #68). These efforts will ensure that GC-DAM continues to evolve as a foundational
tool for enabling trusted and efficient data sharing within dataspaces.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the writing of this paper, the author(s) used DeepL and GPT-4o to check grammar, translation,
and spelling. After using these tools/services, the author(s) reviewed and edited the content as
needed and take(s) full responsibility for the publication’s content.</p>
      <p>https://www.alibabacloud.com/blog/gte-multilingual-series-a-key-model-for-retrievalaugmented-generation_601776, 2024.
[19] U. Zaratiana, N. Tomeh, P. Holat, T. Charnois, GLiNER: Generalist Model for Named
Entity Recognition using Bidirectional Transformer, 2023. doi:10.48550/arXiv.2311.08526.
arXiv:2311.08526.
[20] D. Li, B. Jiang, L. Huang, A. Beigi, C. Zhao, Z. Tan, A. Bhattacharjee, Y. Jiang, C. Chen, T. Wu, K. Shu,
L. Cheng, H. Liu, From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge,
2025. doi:10.48550/arXiv.2411.16594. arXiv:2411.16594.
[21] J. Gu, X. Jiang, Z. Shi, H. Tan, X. Zhai, C. Xu, W. Li, Y. Shen, S. Ma, H. Liu, S. Wang, K. Zhang, Y. Wang,
W. Gao, L. Ni, J. Guo, A Survey on LLM-as-a-Judge, 2025. doi:10.48550/arXiv.2411.15594.
arXiv:2411.15594.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] B. Otto, M. ten Hompel, S. Wrobel (Eds.), Designing Data Spaces: The Ecosystem Approach to Competitive Advantage, Springer International Publishing, Cham, 2022. doi:10.1007/978-3-030-93975-5.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] J. Theissen-Lipp, M. Kocher, C. Lange, S. Decker, A. Paulus, A. Pomp, E. Curry, Semantics in Dataspaces: Origin and Future Directions, in: Companion Proceedings of the ACM Web Conference 2023, ACM, Austin TX USA, 2023, pp. 1504–1507. doi:10.1145/3543873.3587689.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] W. Kim, J. Seo, Classifying schematic and data heterogeneity in multidatabase systems, Computer 24 (1991) 12–18. doi:10.1109/2.116884.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] A. Hutterer, B. Krumay, Integrating Heterogeneous Data in Dataspaces – A Systematic Mapping Study (2022).</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Z. Boukhers, C. Lange, O. Beyan, Enhancing Data Space Semantic Interoperability through Machine Learning: A Visionary Perspective, in: Companion Proceedings of the ACM Web Conference 2023, ACM, Austin TX USA, 2023, pp. 1462–1467. doi:10.1145/3543873.3587658.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] M. Ganzha, M. Paprzycki, W. Pawłowski, P. Szmeja, K. Wasielewska, Towards Semantic Interoperability Between Internet of Things Platforms, in: R. Gravina, C. E. Palau, M. Manso, A. Liotta, G. Fortino (Eds.), Integration, Interconnection, and Interoperability of IoT Systems, Springer International Publishing, Cham, 2018, pp. 103–127. doi:10.1007/978-3-319-61300-0_6.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Z. Sun, Q. Zhang, W. Hu, C. Wang, M. Chen, F. Akrami, C. Li, A Benchmarking Study of Embedding-based Entity Alignment for Knowledge Graphs, Proceedings of the VLDB Endowment 13 (2020) 2326–2340. doi:10.14778/3407790.3407828. arXiv:2003.07743.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, J. Widom, Swoosh: A generic approach to entity resolution, The VLDB Journal 18 (2009) 255–276. doi:10.1007/s00778-008-0098-x.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] G. Papadakis, N. Kirielle, P. Christen, T. Palpanas, A Critical Re-evaluation of Benchmark Datasets for (Deep) Learning-Based Matching Algorithms, 2023. arXiv:2307.01231.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] X. Xie, Z. Li, X. Wang, Z. Xi, N. Zhang, LambdaKG: A Library for Pre-trained Language Model-Based Knowledge Graph Embeddings, 2023. arXiv:2210.00305.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] N. Barlaug, J. A. Gulla, Neural Networks for Entity Matching: A Survey, ACM Transactions on Knowledge Discovery from Data 15 (2021) 1–37. doi:10.1145/3442200.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] Y. Wu, X. Liu, Y. Feng, Z. Wang, R. Yan, D. Zhao, Relation-Aware Entity Alignment for Heterogeneous Knowledge Graphs, in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019, pp. 5278–5284. doi:10.24963/ijcai.2019/733. arXiv:1908.08210.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] M. Chen, Y. Tian, M. Yang, C. Zaniolo, Multilingual Knowledge Graph Embeddings for Cross-lingual Knowledge Alignment, 2017. arXiv:1611.03954.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] N. Tang, C. Yang, J. Fan, L. Cao, Y. Luo, A. Halevy, VerifAI: Verified Generative AI (2024).</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] R. Peeters, C. Bizer, Entity Matching using Large Language Models, 2024. arXiv:2310.11244.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] Y. Zhou, C. Cui, R. Rafailov, C. Finn, H. Yao, Aligning Modalities in Vision Large Language Models via Preference Fine-tuning, 2024. arXiv:2402.11411.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] P. Knowles, P. Page, R. Mitwicki, Decentralised semantics in distributed data ecosystems: Ensuring the structural, definitional, and contextual harmonisation and integrity of deterministic objects and objectual relationships (2022).</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] Alibaba, GTE-Multilingual Series: A Key Model for Retrieval-Augmented Generation, 2024.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>