<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>Hersonissos, Greece</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Towards Enabling FAIR Dataspaces Using Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Benedikt T. Arnold</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johannes Theissen-Lipp</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diego Collarana</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christoph Lange</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sandra Geisler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edward Curry</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Decker</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer Institute for Applied Information Technology FIT</institution>
          ,
          <addr-line>St. Augustin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>RWTH Aachen University</institution>
          ,
          <addr-line>Aachen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universidad Privada Boliviana</institution>
          ,
          <country country="BO">Bolivia</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Galway</institution>
          ,
          <addr-line>Galway</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>Dataspaces have recently gained adoption across various sectors, including traditionally less digitized domains such as culture. Leveraging Semantic Web technologies helps to make dataspaces FAIR, but their complexity poses a significant challenge to the adoption of dataspaces and increases their cost. The advent of Large Language Models (LLMs) raises the question of how these models can support the adoption of FAIR dataspaces. In this work, we demonstrate the potential of LLMs in dataspaces with a concrete example. We also derive a research agenda for exploring this emerging field.</p>
      </abstract>
      <kwd-group>
        <kwd>Dataspaces</kwd>
        <kwd>FAIR Data Principles</kwd>
        <kwd>Large Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Dataspaces. The term dataspace was coined in 2005 and has evolved since then, with scientific
definitions gathered in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In this work, we refer to a dataspace as a multi-sided data platform
connecting participants in an ecosystem [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Importantly, dataspaces do not incorporate an
integration layer to bridge heterogeneity; instead, each data source remains unaltered,
eliminating the need to transfer data to a centralized storage location. This omission of the integration
aspect in data exchange reduces the initial workload and concentrates on an as-needed,
best-effort strategy for data integration [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Significantly, the decentralized storage concept benefits
dataspace participants as they maintain sovereignty over their data [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Dataspaces commonly
use semantics to ensure a “common language” and foster FAIR data sharing [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
FAIR Principles. The FAIR Data Principles (https://www.go-fair.org/fair-principles/) aim at supporting the Findability, Accessibility,
Interoperability, and Reusability of data and have gained widespread adoption. FAIR comprises
these technical prerequisites: persistent identifiers (PIDs), rich metadata, and open protocols, of
which PIDs and rich metadata are closely linked with semantics [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In this paper, we highlight
some of the most complicated and tedious associated tasks: enhancing existing
semantic metadata schemas, creating instances from these schemas, and understanding semantic
data. We see a high potential to tackle these tasks more efficiently using LLMs, fostering the
incorporation of the FAIR principles if done correctly.
      </p>
      <p>
        Large Language Models. Generative AI, in the form of Large Language Models (LLMs),
represents a fundamental shift in text processing and knowledge generation methodologies.
LLMs have shown excellent capabilities in understanding and generating human-like text,
leading to breakthroughs in various downstream applications, including text summarization,
content generation, and conversational systems [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. LLMs such as GPT-4 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] or Mixtral 8x7B (https://mistral.ai/news/mixtral-of-experts/)
predict the next tokens in an auto-regressive manner, given an input sequence (prompt). They
are first trained on a massive and diverse dataset in an unsupervised manner to produce
base models and subsequently fine-tuned on a smaller and more specific dataset, aiming at
performance improvement in specific tasks, like instruction-following [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Tasks in FAIR Dataspaces and the Potential of LLMs</title>
      <p>With a practical example, we demonstrate how tasks related to providing and consuming FAIR
data via dataspaces can be aided by LLMs, here GPT-4. Following this proof of concept, we
outline paths to solving these tasks in a more sophisticated manner for more realistic cases in section 3.</p>
      <p>Rich metadata especially benefits Findability and Interoperability. This can be achieved using
semantics (cf. section 1). For simplicity, consider the specification of an offered dataset in
listing 1. In the listing, two SHACL (Shapes Constraint Language, https://www.w3.org/TR/shacl/) shapes ensure that each
dataset has a title and a usage policy. The policy is needed to make the dataset a valid dataspace
offering for other participants to conclude a usage contract. Our example considers the cultural
domain and how digitized paintings can be offered. A relevant ontology in the cultural domain is</p>
      <sec id="sec-2-1">
        <p>the Integrated Authority File (GND), associated with a Knowledge Graph (KG) that represents “the
largest collection of cultural and research authority data in the German-speaking countries”
(https://gnd.network/Webs/gnd/EN/Home/home_node.html).
We start our example with a museum curator who wants to offer her data in the dataspace and is
the first to do so from the art domain. Therefore, she must extend the existing metadata schema
with corresponding properties. The curator is familiar with domain-specific metadata standards
but wants to focus on the concepts instead of syntactical subtleties of formats such as Turtle
(https://www.w3.org/TR/turtle/).
She knows that a painter can be represented as a painting’s gndo:firstArtist that is a
gndo:DifferentiatedPerson, which has the property gndo:gndIdentifier that
provides the person’s GND id. gndo denotes the GND namespace. Using this id is important
as it can serve as a unique Persistent Identifier (PID) for this person and thus supports the
FAIR principles. The property gndo:dateOfProduction gives the painting’s creation date.
With this domain knowledge, we employ the following prompt:</p>
        <p>User: As a Semantic Web expert, please extend the SHACL shapes given below such that I can also describe
digital versions of paintings with them, using properties from the Integrated Authority File, the GND. For
the painter, please use a ‘gndo:firstArtist’ that is a ‘gndo:DifferentiatedPerson’ which has its GND id as a
property. For the year of creation, please use ‘gndo:dateOfProduction’.
```turtle ...</p>
        <p>Note that we have not included the exact identifier for the GND id in the prompt. The
response of GPT-4, with surrounding explanatory comments omitted, consists of the shapes presented
in listing 2; importantly, the existing shapes were left intact. The painter with their identifier
and the date of production have been correctly added. However, the painter is also referenced
with their gndo:preferredNameForThePerson (lines 17–22), which was not intended.
Besides, the gndo:dateOfProduction’s datatype (line 33) is either an rdfs:Literal or
xsd:dateTime. Hence, we ask GPT-4 to correct these mistakes using a subsequent prompt:
User: Please do not include the preferred name of the painter, and consider that the dateOfProduction
should be of type ‘xsd:dateTime’.</p>
        <p>The resulting shapes (listing 3) again remain consistent with their previous versions, and the
issues are correctly resolved. To support people new to the metadata schema and semantics,
GPT-4 can also be used to visualize it, e.g., in UML, using a text-based notation such as PlantUML (https://plantuml.com/).</p>
        <p>Besides creating the initial schema, an important task is creating the instances based on the
schema. We use the real example of Caspar David Friedrich’s painting “Der Wanderer über dem
Nebelmeer”, which was painted around 1818, according to the GND (https://d-nb.info/gnd/1069695750/about/lds.ttl). For this, we can make use
of the OpenAI API’s Function Calling feature (https://platform.openai.com/docs/guides/function-calling?lang=python). This allows GPT-4 to invoke externally defined
functions in a two-turn interaction. We define a function to extract the id for a painter from
the GND given the name with a SPARQL query. Based on the following prompt, we let GPT-4
create the instance, using the function if needed:</p>
      </sec>
      <sec id="sec-2-2">
        <p>User: You are given the following schema:
```turtle
... omitted for conciseness ...
```
Please create an instance of it for the painting “Der Wanderer über dem Nebelmeer” by Caspar David
Friedrich, which was created in 1818. You may look up the GND ID of the artist if you need to.</p>
        <p>First, the supplied function is invoked with the painter’s name. Subsequently, the instance
in listing 4 is generated, alongside a description of how GPT-4 resolved prompt ambiguities:
GPT-4: Please note that:
- I’ve used `de` language tag with the title to specify the language is German.
- For the `odrl:hasPolicy`, I created a placeholder URI (`http://example.org/policy/12345`) as the policy
detail wasn’t specified.
- The date of production `1818-01-01` is set to January 1st, 1818, as the exact date wasn’t provided. In this
example, the time component is set to midnight. Adjust the date according to specific requirements or
available data.</p>
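        <p>The function-calling round trip used here can be sketched as follows; the tool schema mirrors the OpenAI-style function definition format, while the lookup is a stub returning the GND id from this paper’s example rather than issuing a live SPARQL query, and all helper names are illustrative assumptions.

```python
import json

# Hypothetical sketch of the two-turn function-calling flow. The tool schema
# follows the OpenAI-style function definition; the lookup itself is stubbed
# with the GND id from the paper's example instead of a real SPARQL request.
GND_LOOKUP_TOOL = {
    "type": "function",
    "function": {
        "name": "lookup_gnd_id",
        "description": "Look up the GND identifier for a person by name.",
        "parameters": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
            "required": ["name"],
        },
    },
}

def lookup_gnd_id(name):
    # Stub: a real implementation would send a SPARQL query for the
    # person's gndo:gndIdentifier to a GND endpoint.
    known = {"Caspar David Friedrich": "118535889"}
    return json.dumps({"gnd_id": known.get(name, "unknown")})

# Turn 1: the model answers with a tool call and JSON-encoded arguments.
tool_call_args = '{"name": "Caspar David Friedrich"}'
# Turn 2: we execute the function and feed the JSON result back to the model.
result = lookup_gnd_id(**json.loads(tool_call_args))
print(result)  # {"gnd_id": "118535889"}
```
</p>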
        <p>
          We generate the usage policy based on the Open Digital Rights Language (ODRL, https://www.w3.org/TR/odrl-model/), following
the approach presented by Mustafa et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], who supplement GPT-4 with a sophisticated
description of the key parts of the ODRL ontology. We provide the following prompt:
User: Given the following instance, create an ODRL policy that allows the dataset to be used within
Germany until 2024-05-10. ... omitted for conciseness ...
        </p>
        <p>Listing 5 shows the resulting usage policy, preserving the policy’s Internationalized Resource
Identifier (IRI) from listing 4.</p>
        <p>Finally, we start a new conversation with GPT-4 and ask it, given the SHACL shapes, to
explain the created instance together with its associated usage policy, and get this result:
GPT-4: This set of information is essentially a structured way to describe a dataset, in this case, related to
a digital representation of a painting. Here’s the breakdown:
1. Dataset Identification: The data snippet describes a dataset named “Der Wanderer über dem Nebelmeer”
which is German for “The Wanderer above the Sea of Fog.” This dataset is identified on the web at
ex:DerWandererÜberDemNebelmeer.
2. Title: The title of the dataset is given as “Der Wanderer über dem Nebelmeer,” which aligns with the
actual painting’s title.
3. Policy Association: The dataset is associated with a policy indicated by the URI &lt;http://example.org/
policy/12345&gt;. This policy defines how the dataset can be used or accessed.
4. Artist Information:
• The dataset specifies an artist associated with it, identified under the system gndo:firstArtist.
• This artist is a "Differentiated Person", meaning they are uniquely identified, not just a name but a
specific individual.
• The artist has a unique identifier "118535889," which helps in uniquely identifying and retrieving
information about the artist from a database or registry.
5. Date of Production: The dataset includes a production date, 1818-01-01T00:00:00, formatted in the standard
date-time format. This likely refers to the creation date of the painting.</p>
        <p>Listing 1: The base SHACL shapes.
... omitted for conciseness ...</p>
        <p>Listing 2: The SHACL shapes with extensions related to paintings, by GPT-4.
... omitted for conciseness ...</p>
        <p>Listing 3: The corrected shapes from listing 2.
... omitted for conciseness ...
sh:property [
    sh:path gndo:firstArtist ;
    sh:class gndo:DifferentiatedPerson ;
    sh:minCount 1 ;
    sh:maxCount 1 ;
    sh:message "A digital representation of a painting must have exactly one gndo:firstArtist." ;
    sh:node [
        a sh:PropertyShape ;
        sh:path gndo:gndIdentifier ;
        ... omitted for conciseness ...
    ] ;
] ;
sh:property [
    sh:path gndo:dateOfProduction ;
    sh:minCount 1 ;
    sh:maxCount 1 ;
    sh:datatype xsd:dateTime ;
    sh:message "A digital representation of a painting must have exactly one gndo:dateOfProduction of type xsd:dateTime." ;
] .</p>
        <p>Listing 4: The instance as created by GPT-4 based on the description prompt and the GND lookup.
ex:DerWandererÜberDemNebelmeer a dcat:Dataset ;
    dct:title "Der Wanderer über dem Nebelmeer"@de ;
    odrl:hasPolicy &lt;http://example.org/policy/12345&gt; ;
    gndo:firstArtist [
        a gndo:DifferentiatedPerson ;
        gndo:gndIdentifier "118535889" ;
    ] ;
    gndo:dateOfProduction "1818-01-01T00:00:00"^^xsd:dateTime .</p>
        <p>Listing 5: The policy generated for the instance in listing 4. The IRI has been preserved.
@prefix odrl: &lt;http://www.w3.org/ns/odrl/2/&gt; .
@prefix ex: &lt;http://example.org/data/&gt; .
@prefix xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt; .
ex:policy12345 a odrl:Policy ;
    odrl:permission [
        a odrl:Permission ;
        odrl:action odrl:use ;
        odrl:target ex:DerWandererÜberDemNebelmeer ;
        odrl:constraint [
            a odrl:Constraint ;
            odrl:leftOperand odrl:spatial ;
            odrl:operator odrl:eq ;
            odrl:rightOperand "DE"^^xsd:string ;
        ] , [
            a odrl:Constraint ;
            odrl:leftOperand odrl:dateTime ;
            odrl:operator odrl:lteq ;
            odrl:rightOperand "2024-05-10T23:59:59"^^xsd:dateTime ;
        ] ;
    ] .</p>
        <p>
6. Policy Details:
• The dataset is associated with a specific usage policy (ex:policy12345), which is an object defining
permissions related to the dataset.
• This policy allows the action of use on the dataset but has constraints:
– It can only be used within Germany (“DE”), indicated by the spatial constraint.</p>
        <p>– It must be used before 2024-05-10T23:59:59, as indicated by the dateTime constraint.</p>
        <p>This structured format helps machines and applications understand and process detailed metadata about
digital resources, ensuring they’re used and shared according to specified rules and providing essential
context such as creation details and usage rights.</p>
        <p>This is mostly a valid description of the instance. However, some important details get lost,
including the reference to the source of the object’s id, the GND.</p>
        <p>In summary, we have shown that GPT-4 can help with some of the key tasks for preparing
data for a FAIR exchange using dataspaces. The example was intentionally simple, and the LLM
outputs had some issues. Still, it shows the general applicability of LLMs in this domain. In the
next section, we propose a research agenda for using open LLMs in real dataspace settings, in
combination with KGs to address some of the LLMs’ intrinsic limitations.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Research Agenda</title>
      <p>
        As Fernandez et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] point out, LLMs have drastically changed the capabilities of Natural
Language Processing systems, as they can generate text and code of viable quality through a
natural language interface, with a better understanding of the underlying data’s semantics. We
are convinced that LLMs can aid many tasks in dataspaces, enhancing their adoption and
FAIRness. In this section, we provide a research agenda with research questions across multiple areas
related to dataspaces that we consider pivotal for assessing how LLMs can benefit dataspaces.
Interactive and Automated Systems. LLMs can be used in conversational settings and as
prompt-based generators, raising the question of how assisting systems should be designed:
interactively or as automated background services. We believe that this depends on the context.
      <p>
        One example of an automated system could be a dataspace metadata broker like the Federated
Catalogue in Gaia-X. The stored Self-Descriptions in RDF can automatically be enhanced with
an explanation by a specifically prompted LLM that can access the ontologies underlying the
catalogue (see Integration of Knowledge and Correctness below). Besides, mapping user data schemas to
common standards can be automated using LLMs, fostering FAIRness [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>In recent unpublished work, we use an interactive method to create Self-Descriptions with
a natural language interface, leveraging conventional technology: Users provide attributes
conversationally, and can ask questions. The system auto-corrects inputs using known facts
and generates the self-description with standard RDF software, thus avoiding syntax errors.</p>
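      <p>As a minimal sketch of that idea (not the authors’ unpublished system), the snippet below normalizes a user-supplied attribute against known facts and serializes the result from a triple list instead of pasting strings together, so the emitted Turtle stays syntactically valid; the property names and the country table are illustrative assumptions.

```python
# Minimal sketch: auto-correct conversational input against known facts,
# then serialize from structured triples so the Turtle syntax stays valid.
# Property names and the country table are illustrative only.
KNOWN_COUNTRIES = {"germany": "DE", "ireland": "IE"}

def normalize_country(value):
    # auto-correct free-text input using known facts
    return KNOWN_COUNTRIES.get(value.strip().lower(), value)

def to_turtle(subject, triples):
    # prefixed names only; literals are escaped before serialization
    lines = [subject]
    for i, (pred, obj) in enumerate(triples):
        escaped = obj.replace('"', '\\"')
        sep = " ;" if i + 1 != len(triples) else " ."
        lines.append(f'    {pred} "{escaped}"{sep}')
    return "\n".join(lines)

triples = [("dct:title", "My Dataset"),
           ("ex:country", normalize_country(" Germany "))]
print(to_turtle("ex:myDataset", triples))
```

A production system would delegate serialization to a standard RDF library instead of a hand-rolled writer; the point is that syntax errors are ruled out by construction rather than by prompting.</p>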
      <p>
        For interactive approaches, it will be crucial to determine how the system has to be designed
to be user-friendly. For automated approaches, ways have to be found to deal with unexpected
output from the LLM, i.e., both hallucinations and unmatched output format requirements. For
the latter, token sampling based on formal grammars [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is an interesting research direction.
Adaptation: Prompt Engineering and Fine-Tuning. When it comes to adapting a model
to a new context like dataspaces and to new tasks, two options are possible: The first technique,
called Prompt Engineering, describes crafting specific prompts to obtain the desired
output. Typically, additional context or example outputs are given in the prompt that the
LLM can use in addition to the knowledge encoded in its parameters. The second approach,
fine-tuning, encodes knowledge and the desired output format directly into
the parameters of the LLM, which is efficiently possible with adapter parameters [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. For this, a dataset
has to be carefully crafted, and enough data has to be available.
      </p>
      <p>
        Both approaches differ cost- and time-wise. Fine-tuning incurs an initial additional cost
for constructing the dataset and performing the resource-intensive fine-tuning process. Prompt
Engineering adds no upfront cost apart from building the prompt but requires
more time during inference to process the sophisticated, context-enhanced prompts. Both
approaches are worth investigating in the context of dataspaces: Few-shot prompting can
be used to obtain the desired results when data scarcity or restrictive usage policies prohibit
fine-tuning. However, fine-tuning is of particular interest for generating structured data, e.g., in
JSON or Turtle, and can be required to make smaller LLMs adhere to the syntax.
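As a sketch of the prompt-engineering side, the snippet below assembles a few-shot prompt for Turtle generation; the instruction text and the two examples are illustrative assumptions, not taken from this paper:

```python
# Sketch of few-shot prompt assembly for structured Turtle output when
# fine-tuning is not an option. Instruction and examples are illustrative.
EXAMPLES = [
    ("a dataset titled 'Alpha'",
     'ex:alpha a dcat:Dataset ; dct:title "Alpha" .'),
    ("a dataset titled 'Beta'",
     'ex:beta a dcat:Dataset ; dct:title "Beta" .'),
]

def few_shot_prompt(task):
    parts = ["Translate dataset descriptions into Turtle."]
    for description, turtle in EXAMPLES:
        parts.append(f"Input: {description}\nOutput: {turtle}")
    parts.append(f"Input: {task}\nOutput:")
    return "\n\n".join(parts)

print(few_shot_prompt("a dataset titled 'Gamma'"))
```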
Integration of Knowledge and Correctness. Due to the intrinsic limitations of LLMs, i.e.,
hallucination, expensive updates, and lack of provenance, extensive research is carried out
on interfacing them with KGs [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. This is possible based on a more sophisticated variant of
Prompt Engineering, Retrieval Augmented Generation (RAG) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In RAG, the prompt is encoded
using an embedding model and matched against the embeddings of knowledge in the form of
documents saved in vector stores. The most similar results are then injected into the prompt.
KGs can also be encoded into embeddings, making the knowledge stored in them accessible for
LLMs. For outputs generated from injected KG data, the source part of the graph can be given
as provenance. This way, users can even verify the results.
      </p>
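      <p>The retrieval step of RAG can be sketched as follows; the word-count “embedding” below is a toy stand-in for a real embedding model, and the snippets are made-up examples in the spirit of this paper.

```python
import math

# Toy RAG retrieval: prompt and knowledge snippets are embedded (here with
# a naive word-count "embedding" standing in for a real embedding model),
# and the most similar snippet is injected into the prompt as context.
SNIPPETS = [
    "gndo:gndIdentifier provides the GND id of a person.",
    "odrl:hasPolicy links a dataset to its usage policy.",
]

def embed(text):
    vec = {}
    for word in text.lower().split():
        word = word.strip(".?,;:")
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def retrieve(question, snippets):
    q = embed(question)
    return max(snippets, key=lambda s: cosine(q, embed(s)))

question = "Which property gives the usage policy of a dataset?"
context = retrieve(question, SNIPPETS)
augmented_prompt = f"Context: {context}\n\nQuestion: {question}"
print(augmented_prompt)
```

When the retrieved snippet originates from a KG, the matched subgraph can be reported alongside the answer as provenance.</p>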
      <p>
        More broadly, Pan et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] present a roadmap summarizing the possibilities for interfacing
both technologies. The integration of KGs and LLMs is pivotal, especially in the context of FAIR
dataspaces, as the metadata exchange in dataspaces is KG-based, especially in IDS and Gaia-X.
Exploiting them as a source of knowledge can make LLMs more reliable for assisting dataspace
tasks. Besides, open alternatives to OpenAI’s function calling, like agent settings, are worth
investigating for interfacing with data sources, e.g., for PID retrieval.
      </p>
      <p>
        Open Models for Data Sovereignty. In our example in section 2, we used the highly
advanced but proprietary LLM GPT-4 by OpenAI to demonstrate general applicability. While valid
for examples, two main reasons make using GPT-4 in dataspaces problematic: First, its
proprietary nature makes users dependent on the availability of OpenAI’s API. Software built
upon this API immediately becomes useless once availability is not given. As the effectiveness
of prompts is often highly specific to the LLM at hand, there is a significant vendor lock-in.
Second, data sovereignty is one of the core selling points of dataspaces, i.e., the data stays in
the participants’ infrastructures and is exchanged peer-to-peer. This contradicts the widespread
adoption of closed models like GPT-4 for dataspaces, which require sending sensitive data to a
third-party service. For GDPR-relevant data, this can even be illegal. For these reasons, freely
available LLMs should be preferred, although their performance does not yet match that of
GPT-4. Multiple models with different strengths and vastly differing sizes have emerged, from
the small Phi-2 (https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/, 2.7B parameters) to large models like the aforementioned Mixtral 8x7B (46.7B
parameters). For this to be feasible, we propose to investigate the following research questions:
• Given an application domain and a specific LLM use case: What is a reasonable trade-off
between model size (i.e., inference cost), fine-tuning effort, model performance, and safety?
• How can synergies between dataspace participants be used to fine-tune models?
• Where is using GPT-4 appropriate, such that its benefits outweigh the downsides?
• How can dataspace participants be empowered to perform inference on the edge with
equal or less (personnel) cost compared to the hosted OpenAI API?
Efficiency and Latency. Energy efficiency and latency are major aspects for self-hosted LLMs
and are out of one’s control for closed models like GPT-4. Since widespread interest in LLMs has grown,
performance has drastically improved, e.g., with QLoRA [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] for fine-tuning. Quantization, i.e.,
representing the LLM parameters with fewer than the canonical 16 bits, strongly reduces the
required amount of RAM and the inference latency, allowing for larger models on less powerful
hardware, even CPUs, at often only marginally reduced performance [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
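      <p>The core idea of quantization can be shown with a minimal sketch; the symmetric 8-bit scheme below is a simplified illustration, and real LLM quantization schemes (such as those cited above) are considerably more elaborate.

```python
# Toy symmetric 8-bit quantization of a weight vector, illustrating why
# quantized models need less RAM at a small accuracy cost: each 16-bit
# weight becomes an 8-bit integer plus one shared float scale.
def quantize(weights, bits=8):
    qmax = 2 ** (bits - 1) - 1           # 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.03, -1.2, 0.57, 0.99, -0.44]
q, scale = quantize(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, round(max_err, 4))  # integers in [-127, 127]; error bounded by scale/2
```
</p>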
      <p>
        Subsequent research in the context of dataspaces will focus on how those scientific and
engineering advances can be translated into dataspace-related tools. Specifically, we believe it is
crucial to determine how the dataspace participants’ acceptance of LLM-based tools is affected
by latency, energy consumption, and the balance of latency and output quality.
Safety. LLMs generate text based on their internal parameters, optimized on large amounts
of text, cf. section 1. Therefore, the outputs reflect what the model has “seen” during training,
leading to the adoption of biases in the used data, which is problematic for marginalized groups
in particular, as analyzed in depth by Bender et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. This is especially critical because, for most
LLMs, the underlying training data is not openly available for review.
      </p>
      <p>
        It is pivotal for our research to keep safety in mind and to be careful while developing and
deploying LLMs for dataspace applications. As an overarching consideration for our research,
we must maintain awareness of this topic and its potential problems. Unified data-model
life-cycles are needed to ensure the provenance of and trust in the data used to train LLMs [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
Guardrails must be implemented to avoid problematic LLM generations being displayed to users
or used in automated systems. However, this must be balanced with preserving the system’s
core functionality. The integration of factual knowledge with provenance using KGs aims at
exactly this. To ensure safety, the actual dataspace participants must be involved during research, e.g.,
through user studies, to detect problematic cases. Besides ethical safety, technical measures must also
be applied for securely handling input and output, especially when code is executed.
      </p>
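      <p>A naive guardrail can be sketched as an output screen applied before generations reach users or automated components; the patterns below are illustrative assumptions, and real guardrail frameworks rely on classifiers and policy engines rather than a handful of regular expressions.

```python
import re

# Naive guardrail sketch: generated text is screened before being shown to
# users or passed on to automated components. Patterns are illustrative.
BLOCKED_PATTERNS = [
    re.compile(r"rm\s+-rf", re.IGNORECASE),  # destructive shell command
    re.compile(r"\b\d{16}\b"),               # looks like a card number
]

def screen(llm_output):
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(llm_output):
            return "[output withheld by guardrail]"
    return llm_output

print(screen("Here is the Turtle snippet you asked for."))  # passes unchanged
print(screen("Just run rm -rf / to clean up."))  # prints "[output withheld by guardrail]"
```
</p>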
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>This work has been partially funded by the German Federal Ministry of Education and Research
(BMBF) under grant number FAIRDS05, by the German Federal Ministry for Economic Affairs
and Climate Action (BMWK) under grant number 01MJ22011B, by the German Federal Government
Commissioner for Culture and the Media (BKM) under grant number 2522DIG012, and with
financial support of Science Foundation Ireland (SFI) under grant number SFI/12/RC/2289_P2.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bader</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pullmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mader</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tramp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Quix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. W.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Akyürek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Böckmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. T.</given-names>
            <surname>Imbusch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lipp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Geisler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lange</surname>
          </string-name>
          ,
          <article-title>The International Data Spaces Information Model - An Ontology for Sovereign Exchange of Digital Content</article-title>
          , in: ISWC,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Theissen-Lipp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kocher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lange</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Decker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Paulus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pomp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Curry</surname>
          </string-name>
          ,
          <article-title>Semantics in Dataspaces: Origin and Future Directions</article-title>
          ,
          <source>WWW '23 Companion</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Otto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jarke</surname>
          </string-name>
          ,
          <article-title>Designing a multi-sided data platform: findings from the International Data Spaces case</article-title>
          ,
          <source>Electronic Markets</source>
          <volume>29</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Curry</surname>
          </string-name>
          , Future Research Directions for Dataspaces,
          <source>Data Ecosystems, and Intelligent Systems</source>
          , Springer International Publishing, Cham,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Geisler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-E.</given-names>
            <surname>Vidal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cappiello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. F.</given-names>
            <surname>Lóscio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lenzerini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Missier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Otto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Paja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pernici</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rehof</surname>
          </string-name>
          ,
          <article-title>Knowledge-driven data ecosystems toward data transparency</article-title>
          ,
          <source>Journal of Data and Information Quality</source>
          <volume>14</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          , et al.,
          <article-title>Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks</article-title>
          ,
          <source>NeurIPS</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] OpenAI,
          <article-title>GPT-4 Technical Report</article-title>
          ,
          <year>2023</year>
          . arXiv:2303.08774.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Mustafa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. T.</given-names>
            <surname>Arnold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Collarana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nadgeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Quix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lange</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Decker</surname>
          </string-name>
          ,
          <article-title>From Instructions to ODRL Usage Policies: An Ontology Guided Approach</article-title>
          , under review (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Elmore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Krishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <article-title>How large language models will disrupt data management</article-title>
          ,
          <source>Proceedings of the VLDB Endowment</source>
          <volume>16</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Sundaram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Musen</surname>
          </string-name>
          ,
          <article-title>Making Metadata More FAIR Using Large Language Models</article-title>
          ,
          <year>2023</year>
          . arXiv:2307.13085.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Josifoski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Peyrard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>West</surname>
          </string-name>
          ,
          <article-title>Grammar-constrained decoding for structured NLP tasks without finetuning</article-title>
          , in: EMNLP, ACL,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Dettmers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pagnoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holtzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>QLoRA: Efficient finetuning of quantized LLMs</article-title>
          , in: NeurIPS,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Collarana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Busch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lange</surname>
          </string-name>
          ,
          <article-title>Knowledge graph treatments for hallucinating large language models</article-title>
          ,
          <source>ERCIM News</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Unifying Large Language Models and Knowledge Graphs: A Roadmap</article-title>
          ,
          <source>IEEE T-KDE</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <article-title>Efficient LLM Inference on CPUs</article-title>
          , in: NeurIPS,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Bender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gebru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McMillan-Major</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shmitchell</surname>
          </string-name>
          ,
          <article-title>On the dangers of stochastic parrots: Can language models be too big?</article-title>
          , in:
          <source>ACM FAccT</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>E.</given-names>
            <surname>Curry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zaarour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Timilsina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Al-Qatf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Haque</surname>
          </string-name>
          ,
          <article-title>Foundation Data Space Models: Bridging the Artificial Intelligence and Data Ecosystems (Vision Paper)</article-title>
          , in:
          <source>2023 IEEE International Conference on Big Data (BigData)</source>
          , IEEE,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>