<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Towards LLM-augmented Creation of Semantic Models for Dataspaces</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Sayed</forename><surname>Hoseini</surname></persName>
							<email>sayed.hoseini@hs-niederrhein.de</email>
							<affiliation key="aff0">
								<orgName type="institution">Hochschule Niederrhein</orgName>
								<address>
									<settlement>Krefeld</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Andreas</forename><surname>Burgdorf</surname></persName>
							<email>burgdorf@uni-wuppertal.de</email>
							<affiliation key="aff1">
								<orgName type="department">Institute for Technologies and Management of Digital Transformation</orgName>
								<orgName type="institution">University of Wuppertal</orgName>
								<address>
									<settlement>Wuppertal</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Alexander</forename><surname>Paulus</surname></persName>
							<email>paulus@uni-wuppertal.de</email>
							<affiliation key="aff1">
								<orgName type="department">Institute for Technologies and Management of Digital Transformation</orgName>
								<orgName type="institution">University of Wuppertal</orgName>
								<address>
									<settlement>Wuppertal</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Tobias</forename><surname>Meisen</surname></persName>
							<email>meisen@uni-wuppertal.de</email>
							<affiliation key="aff1">
								<orgName type="department">Institute for Technologies and Management of Digital Transformation</orgName>
								<orgName type="institution">University of Wuppertal</orgName>
								<address>
									<settlement>Wuppertal</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Christoph</forename><surname>Quix</surname></persName>
							<email>christoph.quix@hs-niederrhein.de</email>
							<affiliation key="aff0">
								<orgName type="institution">Hochschule Niederrhein</orgName>
								<address>
									<settlement>Krefeld</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution" key="instit1">Fraunhofer FIT</orgName>
								<address>
									<settlement>St. Augustin</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">André</forename><surname>Pomp</surname></persName>
							<email>pomp@uni-wuppertal.de</email>
							<affiliation key="aff1">
								<orgName type="department">Institute for Technologies and Management of Digital Transformation</orgName>
								<orgName type="institution">University of Wuppertal</orgName>
								<address>
									<settlement>Wuppertal</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Towards LLM-augmented Creation of Semantic Models for Dataspaces</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">CFAEE9F86E37A5C401515F456DEE40C6</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:36+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Dataspace</term>
					<term>Semantic Modeling</term>
					<term>LLMs</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Dataspaces aim to enable smooth and reliable data exchange between different organizations. They have gained increasing attention in Europe following the enactment of the European Data Governance Act. This legislation emphasizes trust, accessibility, and shared dataspaces, which require semantic interoperability grounded in the FAIR principles. Although semantic descriptions in the form of semantic models and ontologies are integral to dataspaces, their full potential remains underutilized. Meaningful metadata, including contextual information, enhances data usability, but manually creating semantic models can be challenging. Large Language Models (LLMs) offer a new way to utilize data in dataspaces. Their advanced natural language processing capabilities enable context-aware data processing and semantic understanding. This paper presents initial experiments on customizing and optimizing LLMs for semantic labeling and modeling tasks. The contributions of this work include research questions for future investigations, early experiments demonstrating the applicability of LLMs for semantic labeling, and proposed directions to address discovered challenges.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The European Data Governance Act <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref> outlines definitions and objectives aimed at bolstering trust, broadening data accessibility, and promoting shared dataspaces. Its impact extends across various data consumers and providers, from academia as well as business. Efficient data sharing within dataspaces necessitates semantic interoperability as an essential design principle, grounded in the required adherence to the FAIR principles <ref type="bibr" target="#b2">[3]</ref>. Accordingly, semantic descriptions and ontologies are already part of many dataspaces, but their potential is far from fully realized in actual implementations <ref type="bibr" target="#b3">[4]</ref>. An example of this is semantic interoperability and integration, a key aspect of dataspaces that requires aggregating and integrating large amounts of heterogeneous data from different sources. Managing such data is challenging, not only due to the variety of data formats, such as XML, CSV, JSON, relational data, and graph data, but also because data is often distributed across different departments within an organization, under different governance regimes and data models. A clear and logical structure of information fosters a common understanding in dataspaces, i.e., a lingua franca for data moderation <ref type="bibr" target="#b4">[5]</ref> based on the Linked Data principles.</p><p>Meaningful metadata is crucial for enhancing data usability, particularly for users with limited domain knowledge or those unfamiliar with a dataset. Annotating raw data from heterogeneous data sources with semantically rich models enhances data interpretability and usability <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7]</ref>. 
This type of semantic data expands beyond typical extractable metadata, such as schema, data types, sizes, and formats, to include contextual information that is not inherent to the specific data source. The field of Semantic Data Management (SDM) <ref type="bibr" target="#b7">[8]</ref> aims to represent the metadata about heterogeneous data sources in the form of ontologies or knowledge graphs (KG) serialized in a language of the Semantic Web. Hence, the goal is to establish an additional layer between the data and the knowledge layer <ref type="bibr" target="#b8">[9]</ref>. This is highly relevant for dataspaces because they integrate data from various systems and platforms, which requires data to be interoperable and seamlessly exchangeable between systems. In order to implement SDM in practice, conceptualizations in the form of KGs and/or ontologies <ref type="bibr" target="#b9">[10]</ref>, and a mapping between concepts and data items are required. Semantic models provide these mappings from single datasets to a common data model to represent data consistently across different applications in a way that is understandable and interpretable by both humans and machines <ref type="bibr" target="#b10">[11]</ref>.</p><p>With companies increasingly acknowledging the importance of data for their business operations, semantic descriptions are often integrated into data management and governance strategies <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13]</ref>, where an ontology or KG serves as a conceptual representation of an organization's data assets. A data source that is semantically well-annotated can be identified and interpreted by leveraging conceptual representations of the data and by comprehending the provided context information stored in the model. 
However, the huge initial overhead of the time-consuming manual process of creating meaningful semantic descriptions for data sources hampers the widespread adoption of SDM in practice <ref type="bibr" target="#b7">[8]</ref>. Creating semantic models entails deciphering the existing data source, consulting appropriate conceptualizations, and establishing connections between data attributes and concepts provided by the conceptualization.</p><p>Automating this task is challenging and complex. Futia et al. <ref type="bibr" target="#b13">[14]</ref> present a method based on graph neural networks covering the process of semantic modeling. However, the model can only optimize semantic models for which historic training data exists. Xu et al. <ref type="bibr" target="#b14">[15]</ref> train a cross-modal network to learn semantic features between data sources and semantic models. They admit that the method has shortcomings in dynamically augmenting the semantic models to cover concepts that are not part of the original training data. Moreover, challenges remain with detecting and correcting potentially incorrect attribute types if a source attribute has more than one attribute type, and with distinguishing similar attributes with the same entities and semantic types.</p><p>Following the rise of Large Language Models (LLMs), one can expect a major impact on the landscape of data utilization and exchange within dataspaces. LLMs, such as OpenAI's GPT-3.5 and GPT-4.0, have demonstrated remarkable capabilities in understanding, generating, and processing vast amounts of textual data <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17]</ref>. Their natural language processing abilities enable advanced semantic understanding and context-aware data processing within dataspaces. 
A promising field of LLM application is the integration of heterogeneous data sources stored in a dataspace.</p><p>In this article, we highlight some initial experiments in this direction to examine the question of how such general-purpose AI systems can be customized and optimized for data integration tasks in the sense of SDM. In particular, we make the following contributions:</p><p>• A set of research questions to be answered in future research endeavors • Early experiments to illustrate the applicability of LLMs to the task of semantic labeling • Potential future research directions to address the identified challenges in applying LLMs to the tasks of semantic labeling and modeling.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Semantic Data Management</head><p>Figure <ref type="figure" target="#fig_0">1</ref> illustrates the basic idea of semantic models. The raw datasets in the dataspace are represented at the bottom; they can be in different formats and structures, such as tabular data or hierarchical JSON data, but have partially overlapping content. The semantic model is a projection from a shared conceptualization onto the different datasets. It utilizes relevant entities and relationships of the conceptualization, in this case the schema.org ontology (prefix schema:), to formalize the context information of the dataset. An essential part of each semantic model are the mappings, indicated as dotted lines, which link attributes in the datasets to classes in the semantic model using properties of these classes. These elementary mappings are referred to as semantic labels. The semantic model captures the precise meaning of the dataset, explicitly encoding the semantic types and relationships among its attributes within the graph.</p><p>Following the definition of semantic labeling by Pham et al. <ref type="bibr" target="#b17">[18]</ref>, a semantic label is an annotation of a dataset attribute by a tuple consisting of a class (subject) and a property (predicate). In this work, a semantic label is represented as a triple: (subject, predicate, schema attribute). For example, the semantic label of the table's column 'Title' is constructed through the subject 'schema:Movie' and the predicate 'schema:title' modeling the relationship between them. 
This connects the table's content to the attribute 'titel' in the JSON object, indicating an entry point for data integration between the two (heterogeneous) datasets. Moreover, the semantic model doesn't merely rely on a static conceptualization; it can also introduce novel classes and properties <ref type="bibr" target="#b18">[19]</ref>. The necessity for evolution becomes apparent when users contribute datasets containing concepts and relationships not yet included in the conceptualization. The semantic label for the JSON key 'verkaeufe' is represented as the triple: (schema:ScreeningEvent, :ticketsSold, 'verkaeufe'). Here, the predicate is a novel property for that specific domain, which is not (yet) present in this form in the general-purpose schema.org ontology. This new knowledge can be systematically integrated, thus perpetually advancing the conceptualization layer <ref type="bibr" target="#b19">[20]</ref>. Semantic models complement extractable metadata (such as data types, sizes, formats, etc.) to convey context information that may not be inherent to the dataset at hand, for instance, a starting date of a 'schema:ScreeningEvent' as shown in Figure <ref type="figure" target="#fig_0">1</ref>.</p><p>The underlying process of semantic model creation has been formalized by <ref type="bibr" target="#b20">[21]</ref> and is visualized in Figure <ref type="figure" target="#fig_1">2</ref>, starting with the identification of the schema of the dataset, followed by a semantic labeling phase, in which basic concepts are assigned to the identified attributes. Automated semantic labeling, referred to as Semantic Type Detection <ref type="bibr" target="#b21">[22]</ref>, is the process of identifying these labels using algorithms and machine learning models. Subsequent to the semantic labeling, the semantic modeling phase builds the remaining semantic model by formalizing the context information. 
During semantic modeling, semantic relation inference <ref type="bibr" target="#b13">[14]</ref> refers to the process of automated identification of relationships and additional concepts, resulting in a generated semantic model. All automation is followed by the semantic refinement phase, where the modeler is involved in the modeling process to correct any errors present before the semantic model is finalized and stored for documentation purposes. In practice, semantic relation inference depends heavily on accurate semantic labels <ref type="bibr" target="#b22">[23,</ref><ref type="bibr" target="#b23">24]</ref>, which underscores the importance of semantic type detection in fully automated systems to induce as few errors as possible for the modeler to correct.</p></div>
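The triple-based semantic labels described in this section can be sketched in code. The following Python snippet is a minimal illustration under our own naming (the `SemanticLabel` class is hypothetical, not part of any cited system); it reuses the Figure 1 example, where the table column 'Title' and the JSON key 'titel' receive the same (class, property) pair:

```python
from dataclasses import dataclass

# A semantic label links one dataset attribute to a (class, property)
# pair from a shared conceptualization such as schema.org.
@dataclass(frozen=True)
class SemanticLabel:
    subject: str    # ontology class, e.g. "schema:Movie"
    predicate: str  # ontology property, e.g. "schema:title"
    attribute: str  # raw attribute in the dataset, e.g. a column name

# Labels for the example in Figure 1: the table column 'Title' and the
# JSON key 'titel' map to the same (class, property) pair, marking an
# entry point for integrating the two heterogeneous datasets.
labels = [
    SemanticLabel("schema:Movie", "schema:title", "Title"),
    SemanticLabel("schema:Movie", "schema:title", "titel"),
    # A novel property, not (yet) part of schema.org:
    SemanticLabel("schema:ScreeningEvent", ":ticketsSold", "verkaeufe"),
]

# Attributes that share a (subject, predicate) pair are join candidates.
by_concept = {}
for lab in labels:
    by_concept.setdefault((lab.subject, lab.predicate), []).append(lab.attribute)

print(by_concept[("schema:Movie", "schema:title")])  # ['Title', 'titel']
```

Grouping attributes by their (subject, predicate) pair, as in the last loop, is exactly the integration entry point the text describes: two heterogeneous datasets meet where their labels coincide.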
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Research Questions</head><p>With an introduction to the field of SDM at hand, we now formulate the research questions that have motivated this work. The goal is to leverage LLMs to improve the automation of semantic model creation for large quantities of heterogeneous data sources that share a common domain in a dataspace.</p><p>• RQ1: How can LLMs be utilized to perform semantic type detection with a fixed set of labels coming from a pre-selected conceptualization (such as WikiData or schema.org)? • RQ2: How can LLMs be utilized to perform semantic type detection against an arbitrary domain ontology, i.e., with no labeled dataset (zero-shot classification)? • RQ3: How can LLMs be utilized to identify and formalize the context of a given dataset, creating a full semantic model?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Related Work</head><p>First, in the scope of this contribution, we consider only works published after the launch of ChatGPT in November 2022. While one can make a strong case that (large) language models existed before this date, we draw the line there because of the massive performance increase that quite suddenly became accessible to the public. Most of the works found in this limited range are pre-prints that have not yet been peer-reviewed and published in scientific journals. To the best of our knowledge, except for the approaches mentioned below, there seem to be no further LLM-based efforts on integrating the semantics of several heterogeneous data sources modeled directly in a language of the Semantic Web in order to generate semantic models in the sense of Figure <ref type="figure" target="#fig_0">1</ref>. Korini et al. <ref type="bibr" target="#b24">[25]</ref> are among the first to report the application of LLMs to the Column Type Annotation (CTA) task. CTA is a schema-level annotation task that represents a simplified form of our interpretation of semantic labeling (no predicate), as it aims to map the underlying table schema to a conceptualization. They view CTA as a multi-class classification problem and evaluate different prompt designs. One important observation they highlight is that ChatGPT tends to ignore the instruction to use terms from the label space and instead answers using different terms. This is a known drawback of contemporary LLMs, the hallucination problem <ref type="bibr" target="#b25">[26]</ref>. Their proposed solution to this challenge involves first determining the set of classes of the entities described in the table and, depending on this set, asking ChatGPT to annotate columns using only the relevant subset of the overall vocabulary. 
Their evaluation reports competitive performance against more traditional models, which are mostly fine-tuned directly for the CTA task and require significant amounts of task-specific training data <ref type="bibr" target="#b26">[27]</ref>.</p><p>A dataspace stores heterogeneous data of any type; the majority may be relational in some form, but an important fraction may be more complex, e.g., nested or even unstructured (video, text, audio, ...). We found several works <ref type="bibr" target="#b27">[28,</ref><ref type="bibr" target="#b28">29,</ref><ref type="bibr" target="#b29">30</ref>] that aim at customizing LLMs via fine-tuning for tables in particular. The goal is to solve the challenge of table understanding, which is closely related to understanding the semantics of a data source, as it includes, for example, the CTA task. Usmani et al. <ref type="bibr" target="#b12">[13]</ref> highlight the importance of multi-modal knowledge graphs for dataspaces, present a review of the current state, and propose an ontology towards further development. Furthermore, since important use cases for datasets involve numerical data, solid numerical reasoning skills are essential <ref type="bibr" target="#b30">[31]</ref>. Several works suggest that modern LLMs excel in simple problem settings but fall short of human expert performance in problems requiring numerical reasoning over long contexts. As the complexity of challenging mathematical problems increases, LLMs currently exhibit suboptimal performance <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b31">32]</ref>.</p><p>There exist several works that investigate the use of LLMs for Knowledge Graph Engineering <ref type="bibr" target="#b32">[33,</ref><ref type="bibr" target="#b33">34,</ref><ref type="bibr" target="#b34">35]</ref>. Here the goal is to utilize the LLM for common tasks related to KGs. For example, Meyer et al. 
<ref type="bibr" target="#b35">[36]</ref> investigate SPARQL query generation as well as knowledge extraction from fact sheets and KG exploration, among others. They present several prompts that essentially test how to pose questions in natural language against serialized KGs. The experiments show that LLMs can return syntactically correct SPARQL queries and even entire serialized RDF models of the desired form based on a task formulated in natural language. Further, LLMs can find relationships in KGs and answer basic questions correctly, e.g., "Are there any connections between US and UK?". They report major performance differences between GPT-3.5 and GPT-4 in favor of the latter. One particular prompt aims at extracting knowledge from tables and converting it into a serialized KG, which is very close to the idea of semantic modeling. The experiment illustrates several problems with the output of contemporary LLMs:</p><p>• A tendency to prioritize the schema.org vocabulary. While this works well for well-known entities and properties, the LLMs invent reasonable but non-existent classes and properties (in the schema.org namespace) for concepts and relations that are too specific. • Non-deterministic output: For multiple runs of the same prompt, the output varies. For instance, while in three out of four runs a printer manufacturer was represented as a separate typed entity, in one run it was only expressed as a string literal. • Invention of non-existent properties, prefixes, and classes: If the LLM cannot identify a fully matching class for a concept or a relation, URIs for those elements are invented in the raw RDF output. While this would be acceptable in a dedicated namespace, the classes and properties are placed in existing namespaces, such as schema.org, resulting in generated URIs that are not resolvable. 
• Non-functional queries: SPARQL queries generated by ChatGPT did not return the expected results when executed against a knowledge graph, despite being syntactically correct. All queries needed slight modifications to work, such as correcting references to non-existent classes.</p><p>To conclude, the results obtained by Meyer et al. show that the problems commonly observed with LLMs also limit their ability to conduct tasks in the semantic domain. It is therefore not possible to use LLMs out-of-the-box for semantic relation inference. Since semantic type detection is simpler than semantic relation inference and also relies heavily on obtaining context, we investigate this area of automation more closely.</p></div>
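One practical mitigation for the invented-URI problem described above is to validate every term an LLM emits against the actual vocabulary before accepting it into a semantic model. The following sketch is our own illustration, not a technique from the cited experiments; the vocabulary subset and the generated triples are invented for the example:

```python
# Tiny excerpt of the real schema.org vocabulary; in practice the full
# term list can be loaded from the released schema.org vocabulary files.
SCHEMA_ORG_TERMS = {
    "schema:Movie", "schema:title", "schema:ScreeningEvent",
    "schema:startDate", "schema:Person", "schema:name",
}

def validate_triples(triples):
    """Split LLM-generated triples into accepted ones and ones that
    reference a non-existent term in the schema.org namespace."""
    accepted, hallucinated = [], []
    for s, p, o in triples:
        bad = [t for t in (s, p) if t.startswith("schema:")
               and t not in SCHEMA_ORG_TERMS]
        (hallucinated if bad else accepted).append((s, p, o))
    return accepted, hallucinated

# 'schema:ticketsSold' looks plausible but does not exist in the
# vocabulary, so the triple is flagged for refinement instead of being
# written to the model.
generated = [
    ("schema:Movie", "schema:title", "Oppenheimer"),
    ("schema:ScreeningEvent", "schema:ticketsSold", "412"),
]
ok, flagged = validate_triples(generated)
print(len(ok), len(flagged))  # 1 1
```

Flagged triples could either be rejected, rewritten into a dedicated local namespace, or routed to the semantic refinement phase for a human decision, in line with the verification requirement stated in Section 5.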
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Semantic Type Detection with LLMs</head><p>To investigate the suitability of LLMs for semantic type detection, we conducted four exemplary experiments using ChatGPT 4.0. To this end, we manually selected three evaluation datasets from the VC-SLAM corpus <ref type="bibr" target="#b36">[37]</ref>, which contains datasets created by human modelers together with their data descriptions and semantic models. Dataset 1 (VC-SLAM 0001) has seven labels that are close to natural language, Dataset 2 (VC-SLAM 0018) has 21 labels that are mostly human-readable, and Dataset 3 (VC-SLAM 0068) consists of 24 labels, some of which are abbreviations. The experiments are briefly described in the following, and the results are given in Table <ref type="table">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Experiments</head><p>Experiment 1 -Mapping to VC-SLAM: This experiment explores ChatGPT's ability to map dataset labels to the corresponding concepts within the VC-SLAM ontology, provided solely in Turtle (TTL) format, without any additional contextual information. This setup aims to assess the base capability of ChatGPT to utilize the ontology's structure and content for semantic type detection. The task involves presenting dataset labels to ChatGPT and instructing it to identify the most fitting ontology concept for each label.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Prompt Experiment 1</head><p>• You are a tool for semantic type detection. I will provide you an owl ontology that consists of all the concepts you know. This ontology is called VC-SLAM. Later I will additionally provide the labels of three data sets. For each label you return the fitting concept from the ontology. • The first data set consists of the following labels: type, longitude, address, latitude, tvm_identifier, pay_by_credit_card, pay_by_cash Please return the results in the following form: label,concept Experiment 2 -VC-SLAM with Documentation: In this experiment, the methodology is similar to the Mapping to VC-SLAM experiment, but it includes comprehensive documentation of the VC-SLAM ontology and datasets. This tests the hypothesis that additional contextual information enhances the accuracy of semantic type detection.</p><p>Experiment 3 -schema.org Ontology: This experiment shifts the focus to a general-purpose ontology to compare ChatGPT's adaptability and performance with a different ontology structure. This experiment provides insights into the model's versatility and the challenges of applying a broad ontology like schema.org to a specific dataset, highlighting differences in specificity and applicability.</p><p>Experiment 4 -Simplified VC-SLAM: The final experiment investigates the impact of ontology complexity on semantic type detection accuracy. By reducing the VC-SLAM ontology to only concept names and their descriptions, without further relations, this experiment seeks to determine whether a simplified ontology would enhance ChatGPT's mapping accuracy due to decreased complexity and ambiguity. </p></div>
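A prompt of the Experiment-1 style and its expected `label,concept` answer format can be handled programmatically. The sketch below is our own scaffolding around a hypothetical LLM call (no actual API is invoked); the prompt wording paraphrases, but is not identical to, the one used in the experiments:

```python
def build_prompt(ontology_ttl: str, labels: list[str]) -> str:
    """Compose an Experiment-1-style prompt: ontology first, then the
    dataset labels, with a fixed 'label,concept' answer format."""
    return (
        "You are a tool for semantic type detection. I will provide you "
        "an owl ontology that consists of all the concepts you know.\n\n"
        f"{ontology_ttl}\n\n"
        "The data set consists of the following labels: "
        + ", ".join(labels)
        + "\nPlease return the results in the following form: label,concept"
    )

def parse_answer(answer: str) -> dict[str, str]:
    """Parse 'label,concept' lines into a mapping, skipping malformed
    lines instead of failing (LLM output is not guaranteed to comply)."""
    mapping = {}
    for line in answer.splitlines():
        parts = [p.strip() for p in line.split(",", 1)]
        if len(parts) == 2 and all(parts):
            mapping[parts[0]] = parts[1]
    return mapping

# A possible (illustrative, not actual) model answer for Dataset 1;
# the malformed last line is silently skipped by the parser.
raw = "longitude,vcslam:longitude\nlatitude,vcslam:latitude\n???"
print(parse_answer(raw))
```

Tolerant parsing matters here because, as noted in Section 3, LLMs tend to deviate from the requested answer format; malformed lines are better handed to the refinement phase than allowed to crash an automated pipeline.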
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>The table shows the results of the four experiments that we ran on three datasets. The datasets refer to the following original datasets from VC-SLAM <ref type="bibr" target="#b36">[37]</ref>: 1: 0001, 2: 0018, 3: 0068</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Results</head><p>The outcomes of the experiments are measured across the three different datasets. Experiment 1 -Mapping to VC-SLAM reveals varying performance, with a 57.1% accuracy rate for the first dataset (4 out of 7 labels correctly mapped), 33.3% for the second (7 out of 21), and a notably lower 12.5% for the third (3 out of 24). These results indicate that ChatGPT can achieve some level of correct mapping based on the ontology structure alone, although accuracy drops sharply for the larger, less readable label sets. Experiment 2 -VC-SLAM with Documentation showed improved performance with a 71.4% accuracy rate for the first dataset (5 out of 7 labels correctly mapped), 61.9% for the second (13 out of 21), and 20.8% for the third (5 out of 24). These results highlight the significant impact of extra contextual information in enhancing ChatGPT's semantic type detection capabilities, leading to more accurate mappings.</p><p>Experiment 3 -schema.org Ontology further demonstrated ChatGPT's adaptability with impressive accuracies: 100% for the first dataset (7 out of 7 labels correctly mapped), 76.2% for the second (16 out of 21), and 45.8% for the third (11 out of 24). Reasons for this may be that ChatGPT is better at handling ontologies that were already part of its training data, or that the descriptions in schema.org are more meaningful than those of the VC-SLAM ontology.</p><p>Experiment 4 -Simplified VC-SLAM yielded mixed results: 57.1% accuracy for the first dataset (4 out of 7 labels correctly mapped), 42.9% for the second (9 out of 21), and 50% for the third (12 out of 24). These outcomes suggest that simplification of the ontology does not necessarily lead to improved performance across all datasets, reflecting the complex balance between ontology complexity and the effectiveness of semantic type detection with ChatGPT.</p><p>During these experiments, several key findings emerged. 
First, the availability of additional context, such as ontology documentation, significantly improves ChatGPT's ability to accurately map dataset labels to ontology concepts, underscoring the importance of rich contextual information for semantic type detection tasks. Second, the experiments revealed ChatGPT's adaptability to different ontologies, with performance variations highlighting the model's capability to handle both specialized and general-purpose ontologies. Lastly, simplifying the ontology structure can in some cases enhance semantic type detection accuracy, suggesting that the complexity of an ontology affects the efficiency and effectiveness of label mapping. These findings contribute valuable insights into the potential of leveraging LLMs for semantic type detection, indicating promising pathways for automating and refining the data categorization process. The experiments underscore the significance of ontology design and contextual information in optimizing the performance of semantic type detection tasks using AI models like ChatGPT.</p></div>
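The reported accuracies are simple exact-match ratios over each label set. A small helper reproduces the arithmetic behind the figures quoted above (the count pairs are taken directly from the text):

```python
def accuracy(correct: int, total: int) -> float:
    """Exact-match labeling accuracy in percent, rounded to one decimal."""
    return round(100 * correct / total, 1)

# (correct, total) pairs as reported for the four experiments;
# e.g. Experiment 1 on Dataset 1 mapped 4 of 7 labels correctly.
results = {
    ("exp1", "dataset1"): (4, 7),    # 57.1 %
    ("exp2", "dataset1"): (5, 7),    # 71.4 %
    ("exp3", "dataset1"): (7, 7),    # 100.0 %
    ("exp3", "dataset2"): (16, 21),  # 76.2 %
    ("exp4", "dataset3"): (12, 24),  # 50.0 %
}
for key, (hit, total) in sorted(results.items()):
    print(key, accuracy(hit, total))
```

The same helper makes it easy to recheck any entry of Table 1 against the counts given in the prose.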
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Semantic Model Creation with LLMs</head><p>Integrating LLMs into the process of semantic model creation provides automation algorithms with the ability to profit from the advantages that these pre-trained models provide. Taking the findings from both related work and our experiments into account, it can be deduced that results obtained from LLMs need to be verified and checked before being applied in a semantic model creation scenario within dataspaces. In the following, two approaches for utilizing LLM-generated output in automated semantic model creation are conceptualized. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Unifying KGs with LLMs for Semantic Modeling</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>The pros and cons of LLMs vs. KGs as described by <ref type="bibr" target="#b37">[38]</ref> Although the technologies for linked data and the Semantic Web have become more mature in recent years, the amount of data considered in Semantic Web applications is far less than in Big Data applications <ref type="bibr" target="#b38">[39]</ref>. Thus, scalability to large, heterogeneous datasets is a major challenge for applying Semantic Web technologies in dataspaces, and here LLMs can be a great help. However, even though LLMs effectively possess rich knowledge learned from massive amounts of training data and benefit downstream tasks at the fine-tuning stage, as previously described, they still have significant limitations due to the lack of external knowledge <ref type="bibr" target="#b16">[17]</ref>. In contrast, KGs are structured knowledge models that explicitly store rich factual knowledge. However, KGs are difficult to construct and evolve by nature, making it challenging to generate new facts and represent unseen knowledge <ref type="bibr" target="#b39">[40]</ref>. Therefore, it is reasonable to view KGs and LLMs as two complementary technologies whose integration has the potential to produce synergy, capitalizing on the strengths of each while mitigating their respective weaknesses.</p><p>For a detailed discussion of research towards the unification of language models and KGs, we refer to the survey by Pan et al. <ref type="bibr" target="#b37">[38]</ref>. They contrast the pros and cons of (large) language models vs. KGs. Table <ref type="table">2</ref> confirms the previous findings about the drawbacks of LLMs, namely hallucination, a lack of domain-specific knowledge, indecisiveness, and a lack of interpretability. 
Conversely, the automated construction of KGs is equally challenging, and current approaches are inadequate for handling the incomplete and dynamically changing nature of real-world KGs. Additionally, many current KG techniques are tailored to particular tasks and are therefore not easy to generalize to broader applications. This suggests that KGs and LLMs indeed complement and may synergize with each other. Pan et al. further predict three main directions for future research toward this goal: First, KG-enhanced LLMs, which incorporate KGs during the pre-training and inference phases of LLMs to enhance the understanding of the knowledge learned by LLMs; here we direct the interested reader to the survey by Hu et al. <ref type="bibr" target="#b40">[41]</ref>. Second, LLM-augmented KGs, mentioned in a similar form by Meyer et al. <ref type="bibr" target="#b35">[36]</ref> (see Section 3). Ultimately, these directions may be integrated to produce Synergized LLMs + KGs, in which LLMs and KGs play equal roles and work in a mutually beneficial way to facilitate reasoning driven by both data and knowledge. This fusion may address some of the contemporary challenges discussed in Section 3 and represents one answer to the third research question (RQ3) on how LLMs can enhance the semantic model creation process. Figure <ref type="figure" target="#fig_2">3</ref> illustrates the integration of the two directions and its stated goals. As the most basic semantic unit, entities play a crucial role, and incorporating their knowledge into LLMs helps to improve semantic understanding. In addition, knowledge graphs contain a large number of relational triples, which provide structured information that can improve semantic understanding further. 
Since conventional LLMs trained on plain text are not designed to understand (graph-)structured data such as knowledge graphs, they might not fully grasp the information conveyed by the KG structure. This assumption is confirmed by our experiments (see Section 4): reducing the representation of ontologies to plain text significantly improves performance, indicating that ChatGPT does not handle the graph structure well. Synergized LLMs + KGs promise to understand the underlying graph structure, which could improve the performance of KG technology, e.g. in discovering unseen facts and in exploration.</p><p>Multimodal KGs are becoming increasingly important for dataspaces as they integrate different modalities, including text, image, audio, and video data, into a single graph, allowing for a comprehensive representation of complex data <ref type="bibr" target="#b12">[13]</ref>. As previously described, solid numerical reasoning skills are important for semantic modeling. Similar to KGs, regular datasets such as CSV or JSON files can be viewed as (semi-)structured data that represent a further modality. Therefore, effectively leveraging representations from multiple modalities, in particular tables and spreadsheets, would be a significant milestone towards the unification of KGs and LLMs.</p><p>Finally, to remedy the problems of hallucination and of updating the internal knowledge of LLMs as real-world situations change, incorporating knowledge from KGs represents a logical solution. To counter hallucination, a KG can be leveraged as an external source to validate or fact-check the output of an LLM. Editing the knowledge of an LLM live, without re-training, is an attractive idea; however, current methods have severe problems, and further research is required <ref type="bibr" target="#b41">[42,</ref><ref type="bibr" target="#b42">43]</ref>. 
A potential solution to this problem is presented in the next section.</p></div>
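The fact-checking role of the KG described above can be illustrated with a minimal sketch. This is not a production implementation: triples are modeled as plain (subject, predicate, object) tuples, and all entity and property names are hypothetical examples.

```python
# Sketch: using a KG as an external source to fact-check LLM output.
# Triples are modeled as plain (subject, predicate, object) tuples;
# all entity and property names are hypothetical examples.

KG = {
    ("Sensor42", "rdf:type", "TemperatureSensor"),
    ("TemperatureSensor", "rdfs:subClassOf", "Sensor"),
    ("Sensor42", "hasUnit", "Celsius"),
}

def validate_llm_triples(proposed, kg):
    """Split LLM-proposed triples into KG-confirmed facts and
    unverified claims that need review before being applied."""
    confirmed, unverified = [], []
    for triple in proposed:
        (confirmed if triple in kg else unverified).append(triple)
    return confirmed, unverified

# Candidate triples extracted from an LLM response; the second one
# is a hallucination that the KG check flags for review.
llm_output = [
    ("Sensor42", "hasUnit", "Celsius"),
    ("Sensor42", "hasUnit", "Fahrenheit"),
]
confirmed, unverified = validate_llm_triples(llm_output, KG)
```

In a real system the set-membership test would be replaced by a query against an RDF store, and unverified triples would be routed to a human modeler rather than silently discarded.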
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">LLM-Supported Interactive Semantic Model Design</head><p>Since semantic model creation is usually performed inside a semantic modeling platform, integrating LLMs into semantic model creation in an interactive fashion requires the surrounding platform to offer additional functions. None of the existing semantic modeling platforms, such as SAND <ref type="bibr" target="#b43">[44]</ref>, MantisTable <ref type="bibr" target="#b44">[45]</ref> or PLASMA <ref type="bibr" target="#b45">[46]</ref>, offer the ability to communicate with a generative AI to refine semantic models. Future platforms supporting manual semantic model creation will likely integrate interaction with an LLM as a central component of their design. For example, the creation of a semantic model could be fitted into a session with an interactive LLM. Any type of generative LLM can be used; however, using knowledge-enhanced LLMs (see Section 5.1) helps to reduce the effects of unwanted phenomena such as hallucinations.</p><p>Users input their desired changes in natural language, and the LLM alters the model accordingly. Realizing this process requires two central features: First, users must be able to formulate their changes through a prompt-like interface. Any identified shortcomings can be expressed in text form, even in natural language, which improves the usability of the semantic modeling platform for users with little or no previous knowledge of semantic technologies. Second, the platform should provide a process for piping and filtering LLM output to minimize the impact of known drawbacks, such as hallucinations.</p><p>Figure <ref type="figure" target="#fig_3">4</ref> visualizes this process, in which the modeler and the LLM serve as the interacting participants of a communication. 
All interactions between both parties are conducted through various services, such as the modeling platform and the LLM's API. These services apply modifications and transform the contained data to match the other side's data model. For example, when a semantic model generation is requested from an LLM, the current semantic model is provided to the LLM, preferably using a pre-configured, session-based GPT specialized in semantic model creation. The request is appended to the interactive session, and the LLM generates an updated model. The LLM's extensive knowledge and its capability to process natural language input allow it to modify the semantic model according to the modeler's intentions, proposing a formalized solution to shortcomings such as syntactical errors. The LLM-generated output undergoes post-processing to ensure presentability to the modeler, particularly when large semantic models are generated. The changes made by the LLM in the last iteration are highlighted in the generated model, so that when the results are displayed in the modeling platform it is easy to identify which changes stem from the last input. If the LLM generates corresponding textual output, it is parsed and attached to the updated model using a special set of RDF properties. This enables the modeler to verify the reasoning behind the modifications made to specific elements. Once post-processing is complete, the proposed semantic model is transferred back to the user and displayed.</p></div>
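One iteration of this refinement loop can be sketched as follows. The LLM call is stubbed, the model is again reduced to a set of tuples, and the property name `ex:llmRationale` is a hypothetical stand-in for the special set of RDF properties mentioned above; the sketch only illustrates the diff-highlighting and rationale-attachment steps of the post-processing.

```python
# Sketch of one iteration of the interactive refinement loop: the
# modeler's request and the current model go to the LLM (stubbed here),
# and post-processing diffs the old and new model to highlight changes
# and attaches the LLM's textual rationale via a dedicated RDF property.
# "ex:llmRationale" is a hypothetical property name.

EXPLAIN_PROP = "ex:llmRationale"

def call_llm(model, request):
    """Stand-in for the session-based LLM API: returns the updated
    model plus a textual explanation of the modification."""
    updated = set(model) | {("Sensor42", "hasUnit", "Celsius")}
    return updated, "Added the missing unit annotation as requested."

def refine(model, request):
    updated, explanation = call_llm(model, request)
    added = updated - set(model)      # new statements, to be highlighted
    removed = set(model) - updated    # dropped statements
    # attach the rationale to each affected subject for later inspection
    rationale = [(s, EXPLAIN_PROP, explanation) for (s, _, _) in sorted(added)]
    return updated, added, removed, rationale

model = {("Sensor42", "rdf:type", "TemperatureSensor")}
new_model, added, removed, notes = refine(model, "Annotate the sensor's unit")
```

The `added`/`removed` sets correspond to the change highlighting in the modeling platform, and `notes` to the parsed textual output attached to the model for the modeler's verification.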
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>This article explores the applicability of modern LLMs to semantic data management in dataspaces, in particular to the task of semantic model creation. Our objective is to address the stated research questions and to offer directions for future research on preparing LLMs for the complex task of creating semantic models for vast amounts of heterogeneous data sources in dataspaces.</p><p>Regarding RQ1 and RQ2, the experiments in Section 4 demonstrate the feasibility of utilizing LLMs for semantic type detection with a fixed or limited set of labels derived from legacy knowledge graphs. LLMs show promise in achieving significant accuracy in semantic type detection tasks, especially when additional contextual information or documentation is provided alongside the ontology. In particular, Experiment 3, which used the schema.org ontology, showcases the high adaptability and potential of LLMs to accurately map dataset labels to ontology concepts, with an accuracy reaching up to 100% for certain datasets. This indicates that LLMs can serve as a powerful tool for semantic type detection. Experiment 4, using a simplified version of the VC-SLAM ontology, offers insight into how LLMs might tackle semantic type detection when the ontology is reduced to basic concept names and descriptions, achieving up to 57.1% accuracy in some cases. The findings suggest that LLMs, including ChatGPT, can effectively engage in semantic type detection tasks even when presented with new, unfamiliar, or arbitrary domain ontologies, by leveraging their inherent understanding of language and context.</p><p>Regarding RQ3, exploiting the vast knowledge and reasoning capabilities of LLMs to automate semantic modeling is an attractive idea. However, significant research is still necessary to integrate KGs with LLMs and produce synergy between these two complementary technologies (see Section 5.1). 
LLMs do not navigate graphs or handle numerical data sets well. They may suffer from hallucinations and cannot easily acquire domain-specific knowledge <ref type="bibr" target="#b46">[47]</ref>.</p><p>The LLM-supported interactive semantic model design (see Section 5) establishes a unique way of generating semantic models, providing another possible answer to RQ3 on how LLMs can enhance the semantic model creation process. However, it requires several additions to today's semantic modeling platforms. In theory, the creation of a semantic model can become a fully immersive experience in which modifications are made even through voice commands. These modifications are converted to prompts and interpreted by the natural language processing capabilities of LLMs, and the resulting changes are automatically visualized, effectively utilizing the LLM as a semantic modeling system. While the presented results and concepts represent a first approach to the topic, the stated research questions remain open to inspire future research in this area.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Semantic models creating a unified mapping from different data sources (datasets) to an ontology.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: The semantic model creation process as formalized by <ref type="bibr" target="#b20">[21]</ref>. Specific terms used for available automation are stated in purple.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: The unification of KGs and LLMs. Adapted from <ref type="bibr" target="#b37">[38]</ref>.</figDesc><graphic coords="10,150.55,84.18,291.64,126.96" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Communication-based interactive semantic model generation process</figDesc></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>This work has been sponsored by the German Federal Ministry of Education and Research in the funding program "Forschung an Fachhochschulen" (grant no. 13FH557KX0) and funding program "Datenkompetenzzentren für die Wissenschaft" (grant no. 16DKZ2056B). Declaration of generative AI and AI-assisted technologies in the writing process: During the preparation of this work, the author(s) used OpenAI's generative AI (ChatGPT v3.5 &amp; v4), DeepL and Grammarly to improve the writing, make suggestions, and for rephrasing. After using this tool/service, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the content of the publication.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">White paper on the data governance act</title>
		<author>
			<persName><forename type="first">J</forename><surname>Baloup</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Bayamlıoğlu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Benmayor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ducuing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dutkiewicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lalova-Spinks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Miadzvetskaya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Peeters</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">CiTiP Working Paper</title>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Nagel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">J</forename><surname>Hierro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Perea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lycklama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Mertens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A.-S</forename><surname>Taillandier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Marques</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gelhaar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Marguglio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Ahle</surname></persName>
		</author>
		<title level="m">Design Principles for Data Spaces: Position Paper</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
		<respStmt>
			<orgName>E. ON Energy Research Center</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">The fair guiding principles for scientific data management and stewardship</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D</forename><surname>Wilkinson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dumontier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">J</forename><surname>Aalbersberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Appleton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Axton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Baak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Blomberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-W</forename><surname>Boiten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">B</forename><surname>Da Silva Santos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">E</forename><surname>Bourne</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Scientific data</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Semantics in dataspaces: Origin and future directions</title>
		<author>
			<persName><forename type="first">J</forename><surname>Theissen-Lipp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kocher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Lange</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Decker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Paulus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pomp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Curry</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Companion Proceedings of the ACM Web Conference 2023, WWW &apos;23 Companion</title>
				<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Semantic Integration and Interoperability</title>
		<author>
			<persName><forename type="first">S</forename><surname>Auer</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2022">2022</date>
			<publisher>Springer International Publishing</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">The web and linked data as a solid foundation for dataspaces</title>
		<author>
			<persName><forename type="first">S</forename><surname>Meckler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Dorsch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Henselmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Harth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Companion Proceedings of the ACM Web Conference</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Semantic web and knowledge graphs for industry 4.0</title>
		<author>
			<persName><forename type="first">M</forename><surname>Yahya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">G</forename><surname>Breslin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">I</forename><surname>Ali</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Applied Sciences</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Hoseini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Theissen-Lipp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Quix</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.15373</idno>
		<title level="m">Semantic data management in data lakes</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Applying semantics to reduce the time to analytics within complex heterogeneous infrastructures</title>
		<author>
			<persName><forename type="first">A</forename><surname>Pomp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Paulus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kirmse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kraus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Meisen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Technologies</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Knowledge graphs</title>
		<author>
			<persName><forename type="first">A</forename><surname>Hogan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Comput. Surv</title>
		<imprint>
			<biblScope unit="volume">54</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Enabling data spaces: existing developments and challenges</title>
		<author>
			<persName><forename type="first">G</forename><surname>Solmaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Cirillo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Fürst</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Jacobs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kovacs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Santana</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Sánchez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 1st International Workshop on Data Economy, DE &apos;22</title>
				<meeting>the 1st International Workshop on Data Economy, DE &apos;22</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Using semantic technologies to manage a data lake: Data catalog, provenance and access control</title>
		<author>
			<persName><forename type="first">H</forename><surname>Dibowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Schmid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Svetashova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Henson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Tran</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Scalable Semantic Web Knowledge Base Systems Workshop</title>
				<meeting>Scalable Semantic Web Knowledge Base Systems Workshop</meeting>
		<imprint>
			<publisher>CEUR WS</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">2757</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Towards multimodal knowledge graphs for data spaces</title>
		<author>
			<persName><forename type="first">A</forename><surname>Usmani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Khan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">G</forename><surname>Breslin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Curry</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Companion Proceedings of the ACM Web Conference 2023</title>
				<imprint/>
	</monogr>
	<note>WWW &apos;23</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Semi: A semantic modeling machine to build knowledge graphs with graph neural networks</title>
		<author>
			<persName><forename type="first">G</forename><surname>Futia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Vetrò</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>De Martin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SoftwareX</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Automatic semantic modeling of structured data sources with cross-modal retrieval</title>
		<author>
			<persName><forename type="first">R</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Mayer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H.-Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Feng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition Letters</title>
		<imprint>
			<biblScope unit="volume">177</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Evaluating the logical reasoning ability of chatgpt and gpt-4</title>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Teng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2304.03439</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">A survey on evaluation of large language models</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Trans. Intell. Syst. Technol</title>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note>Just Accepted</note>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Semantic Labeling: A Domain-Independent Approach</title>
		<author>
			<persName><forename type="first">M</forename><surname>Pham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Alse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">A</forename><surname>Knoblock</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Szekely</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Semantic Web -ISWC 2016</title>
				<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Bottom-up Knowledge Graph-based Data Management</title>
		<author>
			<persName><forename type="first">A</forename><surname>Pomp</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Berichte aus dem Maschinenbau</title>
				<meeting><address><addrLine>Shaker</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">You are missing a concept! enhancing ontology-based data access with evolving ontologies</title>
		<author>
			<persName><forename type="first">A</forename><surname>Pomp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lipp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Meisen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ICSC, IEEE</title>
				<meeting>ICSC, IEEE</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Recent advances and future challenges of semantic modeling</title>
		<author>
			<persName><forename type="first">A</forename><surname>Paulus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Burgdorf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pomp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Meisen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 15th IEEE ICSC, IEEE</title>
				<meeting>15th IEEE ICSC, IEEE</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Sherlock: A deep learning approach to semantic data type detection</title>
		<author>
			<persName><forename type="first">M</forename><surname>Hulsebos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bakker</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 25th ACM SIGKDD</title>
				<meeting>the 25th ACM SIGKDD</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Learning Semantic Models of Data Sources Using Probabilistic Graphical Models</title>
		<author>
			<persName><forename type="first">B</forename><surname>Vu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Knoblock</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pujara</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The World Wide Web Conference, WWW &apos;19</title>
				<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Leveraging Linked Data to Infer Semantic Relations within Structured Sources</title>
		<author>
			<persName><forename type="first">M</forename><surname>Taheriyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">A</forename><surname>Knoblock</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Szekely</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Ambite</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 6th International Workshop on Consuming Linked Data</title>
				<meeting>the 6th International Workshop on Consuming Linked Data<address><addrLine>COLD</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Korini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bizer</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2306.00745</idno>
		<title level="m">Column type annotation using ChatGPT</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Survey of hallucination in natural language generation</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Ji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Frieske</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ishii</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">J</forename><surname>Bang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Madotto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Comput. Surv</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Annotating columns with pre-trained language models</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Suhara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Demiralp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-C</forename><surname>Tan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2022 International Conference on Management of Data, SIGMOD &apos;22</title>
				<meeting>the 2022 International Conference on Management of Data, SIGMOD &apos;22</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yashar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Cui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">R</forename><surname>Fainman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chaudhuri</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.09263</idno>
		<title level="m">Table-GPT: Table-tuned GPT for diverse table tasks</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">TabLLM: Few-shot classification of tabular data with large language models</title>
		<author>
			<persName><forename type="first">S</forename><surname>Hegselmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Buendia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Sontag</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Artificial Intelligence and Statistics</title>
		<imprint>
			<publisher>PMLR</publisher>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Sun</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2311.09206</idno>
		<title level="m">TableLlama: Towards open large generalist models for tables</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Tab2KG: Semantic table interpretation with lightweight semantic profiles</title>
		<author>
			<persName><forename type="first">S</forename><surname>Gottschalk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Demidova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Semantic Web</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Long</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2311.09805</idno>
		<title level="m">DocMath-Eval: Evaluating numerical reasoning capabilities of LLMs in understanding long documents with tabular data</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">P</forename><surname>Allen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Stork</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Groth</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.00637</idno>
		<title level="m">Knowledge engineering using large language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<monogr>
		<author>
			<persName><forename type="first">L.-P</forename><surname>Meyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Frey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Junghanns</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Brei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Bulert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gründer-Fahrer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Martin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2308.16622</idno>
		<title level="m">Developing a scalable benchmark for assessing large language models in knowledge graph engineering</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Frey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L.-P</forename><surname>Meyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Arndt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Brei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Bulert</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2309.17122</idno>
		<title level="m">Benchmarking the abilities of large language models for RDF knowledge graph creation and comprehension: How well do LLMs speak Turtle?</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<author>
			<persName><forename type="first">L.-P</forename><surname>Meyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Stadler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Frey</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2307.06917</idno>
		<title level="m">LLM-assisted knowledge graph engineering: Experiments with ChatGPT</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">VC-SLAM -A Handcrafted Data Corpus for the Construction of Semantic Models</title>
		<author>
			<persName><forename type="first">A</forename><surname>Burgdorf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Paulus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pomp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Meisen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Data</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<title level="a" type="main">Unifying large language models and knowledge graphs: A roadmap</title>
		<author>
			<persName><forename type="first">S</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Knowledge and Data Engineering</title>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">An analysis of links in wikidata</title>
		<author>
			<persName><forename type="first">A</forename><surname>Haller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Polleres</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dobriy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ferranti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Rodríguez Méndez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">European Semantic Web Conference</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<analytic>
		<title level="a" type="main">Scaling up knowledge graph creation to large and heterogeneous data sources</title>
		<author>
			<persName><forename type="first">E</forename><surname>Iglesias</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jozashoori</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-E</forename><surname>Vidal</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Web Semantics</title>
		<imprint>
			<biblScope unit="volume">75</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<analytic>
		<title level="a" type="main">A survey of knowledge enhanced pre-trained language models</title>
		<author>
			<persName><forename type="first">L</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Nie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Knowledge and Data Engineering</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Tian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.13172</idno>
		<title level="m">Editing large language models: Problems, methods, and opportunities</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b42">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Biran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Yoran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Globerson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Geva</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2307.12976</idno>
		<title level="m">Evaluating the ripple effects of knowledge editing in language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b43">
	<analytic>
		<title level="a" type="main">SAND: A Tool for Creating Semantic Descriptions of Tabular Sources</title>
		<author>
			<persName><forename type="first">B</forename><surname>Vu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">A</forename><surname>Knoblock</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The semantic web</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">13384</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b44">
	<analytic>
		<title level="a" type="main">MantisTable V: A novel and efficient approach to semantic table interpretation</title>
		<author>
			<persName><forename type="first">R</forename><surname>Avogadro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cremaschi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SemTab@ ISWC</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b45">
	<analytic>
		<title level="a" type="main">PLASMA: Platform for Auxiliary Semantic Modeling Approaches</title>
		<author>
			<persName><forename type="first">Alexander</forename><surname>Paulus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andreas</forename><surname>Burgdorf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lars</forename><surname>Puleikis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tristan</forename><surname>Langer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">André</forename><surname>Pomp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tobias</forename><surname>Meisen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Enterprise Information Systems</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b46">
	<monogr>
		<title level="m" type="main">Large Language Models Struggle to Learn Long-Tail Knowledge</title>
		<author>
			<persName><forename type="first">N</forename><surname>Kandpal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Wallace</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
