<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semantic Data Management for Managing Heterogeneous Data Sources in Chemistry 4.0</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sayed Hoseini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Electr. Engineering and Computer Science, Niederrhein University of Applied Sciences</institution>
          ,
          <addr-line>Krefeld</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Managing large volumes of data poses significant challenges due to the variety of formats, distribution across departments, and different governance structures within organizations. In research and industry environments, this complexity is compounded by the need for streamlined data handling processes to support automated workflows and machine learning (ML) applications. Integrating implicit contextual knowledge alongside data artifacts is critical, especially for non-expert users accessing the data. Data lakes provide a scalable solution by aggregating raw data from disparate sources with minimal upfront integration costs. However, without proper integration, data analysis and interpretation are hindered, rendering the data lake effectively inoperable. This PhD research addresses these challenges by applying semantic data management (SDM) techniques inside a semantic data lake. While initial milestones have been achieved through a systematic literature review and a concrete implementation, further efforts lie ahead. First, the emergence of large language models offers numerous opportunities for automating previously manual processes. Leveraging these models can significantly improve the efficiency of common SDM tasks. Second, extending the application of SDM techniques to data analytics can facilitate the integration of diverse data sources into ML pipelines. Ultimately, we aim to bridge the gap between Big Data and Semantic Web technologies, anticipating the development of advanced semantic data lake solutions in the foreseeable future.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic Data Management</kwd>
        <kwd>Semantic Data Lakes</kwd>
        <kwd>Semantic Machine Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        very heterogeneous, e.g., streams of machine sensors, interfaces of control software, databases
and images of microscopes, and scripts and model checkpoints from ML. As a result, tasks
for analyzing data, such as collecting, accessing, searching, understanding and processing
data, become very time-consuming. This makes it difficult to realize visions such as Chemistry
4.0, which refers to the digital transformation of the chemical industry and emphasizes the
integration of data-driven systems for increasing degrees of automation [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The centralized
management of all (meta-)data with integrated data analytics using a uniform data management
system is thus very attractive and actively researched [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ].
      </p>
      <p>
        Data lakes are scalable schema-less repositories to ingest raw data in its original format
from heterogeneous data sources. Only minimal effort is required for ingesting data into a
data lake, making it an efficient tool for collecting, storing, linking, and transforming datasets
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However, this approach only postpones the upfront cost of integration, which is why data lakes
risk turning into data swamps [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. In addition, many existing systems
lack mature functions to support data analytics [
        <xref ref-type="bibr" rid="ref3 ref8">3, 8</xref>
        ]. Furthermore, industrial ML suffers from
low transparency towards non-ML experts and from poor, non-unified descriptions of ML
practices, which hamper review and comprehension. These problems stem from custom-made ad hoc
solutions tailored only to specific applications, which also limits their reusability [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>The main goal of this research is to develop a prototype for the industrial chemistry context
of the HIT that not only manages the various (meta-)data assets, but also facilitates data
integration, ultimately empowering users unfamiliar with data analytics to derive ML models.</p>
      <p>Data integration matters because the users who ingest data into the lake and are responsible
for it may not belong to the group of data scientists who use the data later on. Likewise, a data
scientist crafting a specific model wants its design details to be clear and easy to understand.
Thus, the implicit context knowledge needs to be committed alongside any created artifacts so
that a third party with limited domain knowledge can interpret and use the received assets later on.</p>
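      <p>The problem statement can be formulated mathematically. Let:</p>
      <list list-type="bullet">
        <list-item><p><inline-formula><tex-math>D = \{d_1, d_2, \dots, d_n\}</tex-math></inline-formula>: the set of heterogeneous data sources,</p></list-item>
        <list-item><p><inline-formula><tex-math>M = \{m_1, m_2, \dots, m_k\}</tex-math></inline-formula>: the set of analytical models applied to <inline-formula><tex-math>D</tex-math></inline-formula> for generating insights,</p></list-item>
        <list-item><p><inline-formula><tex-math>A = \{a_1, a_2, \dots, a_l\}</tex-math></inline-formula>: the set of metadata artefacts that describe and link data items managed and stored by a data lake.</p></list-item>
      </list>
      <p>The objective is to minimize the human effort required to
prepare and integrate heterogeneous data sources through metadata, leveraging the capability
of the lake to derive insights from ML with maximum automation and smart assistance.</p>
      <disp-formula><tex-math>\min \; E_{\mathrm{total}}(D, M, A) = E_{\mathrm{prep}}(D, A) + E_{\mathrm{use}}(D, M, A) + E_{\mathrm{meta}}(A)</tex-math></disp-formula>
      <list list-type="bullet">
        <list-item><p><inline-formula><tex-math>E_{\mathrm{prep}}(D, A)</tex-math></inline-formula>: Effort required to harmonize, transform, and integrate heterogeneous data
sources <inline-formula><tex-math>D</tex-math></inline-formula> using the available metadata <inline-formula><tex-math>A</tex-math></inline-formula>.</p></list-item>
        <list-item><p><inline-formula><tex-math>E_{\mathrm{use}}(D, M, A)</tex-math></inline-formula>: Effort required for users to interpret and utilize <inline-formula><tex-math>D</tex-math></inline-formula>, <inline-formula><tex-math>M</tex-math></inline-formula>, and <inline-formula><tex-math>A</tex-math></inline-formula> for deriving
insights and crafting ML models.</p></list-item>
        <list-item><p><inline-formula><tex-math>E_{\mathrm{meta}}(A)</tex-math></inline-formula>: Effort required to create and maintain metadata <inline-formula><tex-math>A</tex-math></inline-formula>.</p></list-item>
      </list>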
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        Semantic data management (SDM) is one way of formalizing the context and domain
knowledge of data sources [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. SDM proposes the linkage of metadata to knowledge graphs (KG)
based on the Linked Data principles [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] to provide more meaning to the data in the lake by
establishing an additional semantic layer between the data and the knowledge layer [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. A
semantic layer can be used not only for data management but also to address the challenge of
integrating data from heterogeneous sources [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
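      <p>The idea of a semantic layer that links source metadata to KG concepts can be illustrated with a minimal sketch. It assumes a toy triple store held in a Python list; the source names, the <monospace>ex:represents</monospace> predicate, and the concept identifiers are all hypothetical and are not taken from any of the cited systems. Two differently named columns from different sources are linked to the same concept, so a lookup by concept finds both.</p>
      <preformat>
```python
# Hypothetical semantic layer: each triple links a source attribute to a KG concept.
# All identifiers below are invented for illustration.
TRIPLES = [
    ("sensors.csv#temp_c", "ex:represents", "ex:Temperature"),
    ("lab_db.measurements.T", "ex:represents", "ex:Temperature"),
    ("lab_db.measurements.visc", "ex:represents", "ex:Viscosity"),
]


def attributes_for(concept, triples=TRIPLES):
    """Return all source attributes that are linked to the given KG concept."""
    return sorted(s for s, p, o in triples if p == "ex:represents" and o == concept)


# Both temperature columns are found, although their names and sources differ.
temperature_columns = attributes_for("ex:Temperature")
```
      </preformat>
      <p>In a real deployment the triples would live in an RDF store and use dereferenceable IRIs, in line with the Linked Data principles; the list of tuples here only stands in for that store.</p>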
      <p>
        Semantic data lakes store and manage this serialized semantics between data sources. They
are a specific form of traditional data lakes that extend the capabilities through a semantic
layer that enriches and connects the stored data semantically. The semantic data lake explicitly
integrates semantic descriptions into its data management and governance capability [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ],
where an ontology or KG serves as a universal data model, offering a conceptual representation
of an organization’s data assets. Figure 1 shows the four-layered data lake architecture that we proposed in
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], where especially metadata-related functions are enriched with semantics. For example,
a semantic labeling component in the ingestion layer adds semantic labels to the extracted
metadata elements. The semantic information (labels, models, KG, etc.) is managed in the storage
layer in an extended semantic metadata repository. To facilitate the usage and interpretation
of data, the interaction layer has several additional components, e.g., for browsing the KG
and semantic models and editors for refining the semantic mappings and models. Figure 2
represents a particular instantiation of this architecture (see C2) illustrating the various utilized
technologies along the four layers.
      </p>
      <p>
        Data Management for ML has been well-researched for at least ten years [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and one
subfield is also known as MLOps [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Hai et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] underline the importance of ML-driven
metadata management and in-lake ML which means supporting the training and inference
process directly inside the data lake platform. Zhao et al. and Schlegel et al. [
        <xref ref-type="bibr" rid="ref19 ref8">8, 19</xref>
        ] present
metadata models for data lakes to capture not only descriptive but also analytical information about
datasets and performed analyses. MLSea [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] is a resource consisting of MLSO, an ontology to
model ML pipelines, MLST, a collection of taxonomies of ML-related concepts, and MLSea-KG, a
KG containing ML datasets, pipelines and scientific works from diverse sources. By leveraging
semantic technologies MLSea integrates ML datasets, experiments, software and scientific works
for improving the search, explainability and reproducibility.
      </p>
      <p>
        Large Language Models (LLMs) are expected to have a major impact on the landscape of data
utilization and exchange. LLMs have demonstrated remarkable capabilities in understanding,
generating, and processing vast amounts of textual data [
        <xref ref-type="bibr" rid="ref21 ref22 ref23">21, 22, 23</xref>
        ]. Promising fields of LLM
application are the integration of heterogeneous data sources in the sense of SDM [
        <xref ref-type="bibr" rid="ref24 ref25">24, 25</xref>
        ] and
automated machine learning (AutoML) [
        <xref ref-type="bibr" rid="ref26 ref27">26, 27</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Research Questions</title>
      <p>
        Closer collaboration between human-machine and machine-machine systems has revolutionized
the current industrial landscape, leading to Industry 4.0 [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. Here, challenges in terms of data
management are to be addressed [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The advantage of employing a data lake system lies in
the centralized management of (meta-)data and analytics. Thus, all model artifacts and their
associated datasets are accessible, registered, documented, and understandable by both humans
and machines. The main goal of this research is to install such a prototype in the industrial
chemistry context of the HIT, leading to the following research questions and related hypotheses:
      </p>
      <p>
        RQ1: What role does SDM play in improving the integration and usability of
heterogeneous data generated in an industrial context, particularly when facilitated within a semantic data lake?
H1: SDM facilitates the integration of heterogeneous data sources and enhances data usability by
providing a unified structure and enabling interoperability based on Linked Data principles.
To manage heterogeneous data, it is important to have a clear and logical structure when
presenting this information. This demands a common understanding across the data landscape,
i.e., a lingua franca for data moderation [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] based on the Linked Data principles.
      </p>
      <p>
        RQ2: How can LLMs be utilized to identify and formalize the context of given datasets,
creating a full semantic model?
H2: LLMs automate substeps in semantic model creation, in particular semantic labeling.
Automating the semantic modeling task is complex because creating semantic models entails
deciphering the existing data source and establishing connections between data attributes and
concepts drawn from a KG. It remains open whether to use the LLM for individual
tasks along a pipeline or to prepare it for the entire task.
      </p>
      <p>RQ3: How can semantic descriptions of data sources, ML pipelines, and their context be
used to enhance data analytics within the data lake?
H3: Structured semantic knowledge about ML pipelines improves the accuracy and efficiency of
contemporary methods for automating ML workflows.</p>
      <p>
        While the demand is increasing, ML models are still often manually created by humans, because
the need for statistical and technical knowledge poses significant challenges for non-technical
users [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. Current methods are only capable of assisting in the substep of model creation [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ],
but data integration is a major obstacle [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Research Methods</title>
      <p>
        Contribution 1 (C1): Systematic Literature Review: We systematically reviewed the literature of
the last 20 years of research in the field of SDM related to semantic data lakes in particular [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
The article gives an overview of semantics-based methods for data management, access, and
integration and relates those findings to current semantic data lake proposals. Furthermore, we
identified a gap in today’s landscape between present data lakes, semantic technologies for data
accessing, and the semantic modeling of heterogeneous datasets.
      </p>
      <p>
        Contribution 2 (C2): The Semantic Data Reservoir (SEDAR) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] is an implementation to bridge
this gap. SEDAR is a prototype (see Figure 2) of a semantic data lake built on existing open-source
technologies in the area of big data management. For the implementation of SEDAR we were
inspired by the SDM pipeline (see Figure 3). The pipeline models data at the
schema level. After schema extraction, its first phase is automated semantic labeling,
because semantic labels are a prerequisite for deriving a full semantic model automatically.
It is followed by semantic refinement, i.e., manual oversight to verify the automated outcomes. We
then extended the pipeline and reinterpreted the storage phase conceptually, in the sense that
we convert the resulting semantic model into RML mappings [33] to be used for
ontology-based data access (OBDA) [34]. OBDA allows for on-demand translation of queries against
heterogeneous data sources directly in their original form without having to know how the
data is organized physically, which is particularly attractive in data lake environments. Thus,
SEDAR implements a polystore with a semantic query processing engine grounded on semantic
models. The synergy between the automation platform at the HIT and SEDAR has been utilized
in production and presented as original research at the ICPS’24 conference [35].
      </p>
      <p>
        Contribution 3 (C3): Automated Semantic Labeling using LLMs. In a publication for the ESWC
conference [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], we conduct experiments demonstrating the applicability of LLMs for semantic
labeling and propose directions to address discovered challenges.
      </p>
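      <p>The phases of the SDM pipeline described for C2 (automated semantic labeling, semantic refinement, and storage of the resulting model as a mapping) can be sketched as follows. This is a minimal illustration under stated assumptions and not SEDAR's implementation: plain string similarity from Python's standard library stands in for the actual labeling methods (including the LLM-based labeling of C3), a dictionary stands in for an RML mapping, and the concept IRIs, attribute names, the threshold of 0.8, and all function names are hypothetical.</p>
      <preformat>
```python
from dataclasses import dataclass
from difflib import SequenceMatcher

# Hypothetical KG concepts (label -> IRI), invented for illustration.
CONCEPTS = {
    "temperature": "http://example.org/kg/Temperature",
    "viscosity": "http://example.org/kg/Viscosity",
    "batch identifier": "http://example.org/kg/BatchIdentifier",
}


@dataclass
class SemanticLabel:
    attribute: str           # attribute name from the extracted schema
    concept: str             # IRI of the proposed KG concept
    score: float             # string similarity in [0, 1]
    confirmed: bool = False  # set during semantic refinement


def similarity(attribute: str, concept_label: str) -> float:
    """String similarity between a normalized attribute name and a concept label."""
    normalized = attribute.replace("_", " ").replace("-", " ").lower().strip()
    return SequenceMatcher(None, normalized, concept_label).ratio()


def propose_labels(attributes):
    """Automated semantic labeling: pick the most similar concept per attribute."""
    labels = []
    for attribute in attributes:
        best = max(CONCEPTS, key=lambda label: similarity(attribute, label))
        labels.append(SemanticLabel(attribute, CONCEPTS[best], similarity(attribute, best)))
    return labels


def refine(labels, corrections, threshold=0.8):
    """Semantic refinement: manual corrections win; confident proposals are accepted."""
    for label in labels:
        if label.attribute in corrections:
            label.concept = corrections[label.attribute]
            label.confirmed = True
        elif label.score >= threshold:
            label.confirmed = True
    return labels


def to_mapping(source, labels):
    """Storage phase: keep only confirmed attribute-to-concept links for one source."""
    return {
        "source": source,
        "columns": {l.attribute: l.concept for l in labels if l.confirmed},
    }


labels = propose_labels(["Temperature", "batch_id", "visc"])
labels = refine(labels, corrections={"visc": CONCEPTS["viscosity"]})
mapping = to_mapping("sensors.csv", labels)
```
      </preformat>
      <p>In this sketch, the attribute that matches a concept label exactly is accepted automatically, the abbreviated attribute is confirmed only through the manual correction, and the remaining low-confidence proposal stays out of the mapping until a human reviews it. This mirrors the role of semantic refinement as oversight over automated outcomes.</p>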
      <p>Contribution 4 (C4): Standardizing ML pipelines. Recently, we have continued to extend SEDAR
to support standard ML pipelines with higher degrees of automation [36].</p>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <p>
        This PhD is already in a later stage, hence some research questions can already be addressed.
Through C1, we have illuminated how SDM can help with the management of heterogeneous
process data and gained knowledge of the current state of the field to understand how other
researchers aim to implement particular SDM techniques. Open questions remain on how to
convert these formal ideas into a particular implementation. Through C2, we proved to a wider
audience how semantic processing can meet modern big data requirements. Therefore, we
accept H1 by providing a comprehensive field survey and demonstrating practically how the
semantic layer of SEDAR enables more expressive data management, integration, and access.
Through C3 we address the applicability of LLMs for the first steps in the semantic model
creation process. The experiments demonstrate the feasibility of utilizing LLMs for semantic
type detection with a fixed or limited set of labels derived from legacy KGs. The findings further
suggest that LLMs can effectively engage in semantic type detection tasks even when presented
with new, unfamiliar, or arbitrary domain ontologies, by leveraging their inherent knowledge
and understanding of language as well as additional contextual information that is possibly
provided alongside the ontology. Therefore, we accept the premise of H2. Through C4, we have
been progressing towards standardizing ML pipelines. In the future, we plan to research how
to perform a fusion between the SDM techniques and the existing works towards automating
ML. To this end, we want to propose a software system that allows data analytics to be reused and generalized
for arbitrary use cases. The goal is to answer RQ3 by incorporating structured
semantic knowledge about previously conducted ML experiments, such as the MLSea KG [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]
to improve the efficiency and accuracy of current automated ML methods. By addressing the
more challenging preceding phases of any ML project, i.e. business &amp; data understanding, and
especially data preparation &amp; integration [37], this research agenda will advance the SOTA.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>This doctoral research addresses the challenge of managing diverse data sources and
integrating them semantically into common ML pipelines. To face this issue, we first conducted
a systematic literature review and then presented SEDAR, an open-source data management
platform. We then investigated the applicability of LLMs for semantic labeling and
enhanced SEDAR to standardize ML pipelines by integrating principles from AutoML and
MLOps. As this Ph.D. is already in a later stage, these contributions allowed us to
answer two of the three research questions. The remaining phase will focus on integrating
semantically standardized ML pipelines to improve the efficiency of automated ML methods.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The author thanks Maribel Acosta and Christoph Quix for reviewing this article. This work
has been sponsored by the German Federal Ministry of Education and Research in the funding
program “Forschung an Fachhochschulen”, project 2 (grant no. 13FH557KX0).</p>
      <p>[32] A. Paulus, A. Burgdorf, A. Pomp, T. Meisen, Recent advances and future challenges of
semantic modeling, in: 2021 IEEE 15th International Conference on Semantic Computing
(ICSC), IEEE, 2021, pp. 70–75.</p>
      <p>[33] A. Dimou, M. Vander Sande, P. Colpaert, R. Verborgh, E. Mannens, R. Van de Walle, RML:
A generic language for integrated RDF mappings of heterogeneous data, LDOW 1184 (2014).</p>
      <p>[34] G. Xiao, D. Calvanese, R. Kontchakov, D. Lembo, A. Poggi, R. Rosati, M. Zakharyaschev,
Ontology-based data access: A survey, International Joint Conferences on Artificial
Intelligence, 2018.</p>
      <p>[35] S. Hoseini, et al., Coatings intelligence: Data-driven automation for chemistry 4.0, in:
2024 IEEE 7th (ICPS), 2024, pp. 1–8. In press.</p>
      <p>[36] S. Hoseini, M. Ibbels, C. Quix, Enhancing machine learning capabilities in data lakes with
AutoML and LLMs, in: European Conference on Advances in Databases and Information
Systems, Springer, 2024. Accepted.</p>
      <p>[37] R. Wirth, J. Hipp, CRISP-DM: Towards a standard process model for data mining, in:
Proceedings of the 4th international conference on the practical applications of knowledge
discovery and data mining, volume 1, Manchester, 2000, pp. 29–39.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. A. C.</given-names>
            <surname>Beck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Carothers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. R.</given-names>
            <surname>Subramanian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pfaendtner</surname>
          </string-name>
          ,
          <article-title>Data science: Accelerating innovation and discovery in chemical engineering</article-title>
          ,
          <source>AIChE Journal</source>
          <volume>62</volume>
          (
          <year>2016</year>
          ). doi:10.1002/aic.15192.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Strehmel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cremanns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Göttert</surname>
          </string-name>
          ,
          <article-title>Photochemistry with cyanines in the near infrared: A step to chemistry 4.0 technologies</article-title>
          ,
          <source>Chemistry – A European Journal</source>
          <volume>25</volume>
          (
          <year>2019</year>
          )
          <fpage>12855</fpage>
          -
          <lpage>12864</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Hai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Koutras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Quix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jarke</surname>
          </string-name>
          ,
          <article-title>Data lakes: A survey of functions and systems</article-title>
          ,
          <source>IEEE TKDE</source>
          <volume>35</volume>
          (
          <year>2023</year>
          )
          <fpage>12571</fpage>
          -
          <lpage>12590</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Nargesian</surname>
          </string-name>
          , E. Zhu,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Pu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Arocena</surname>
          </string-name>
          ,
          <article-title>Data lake management: challenges and opportunities</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>12</volume>
          (
          <year>2019</year>
          )
          <fpage>1986</fpage>
          -
          <lpage>1989</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T. P.</given-names>
            <surname>Raptis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passarella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Conti</surname>
          </string-name>
          ,
          <article-title>Data management in industry 4.0: State of the art and open challenges</article-title>
          ,
          <source>IEEE Access</source>
          <volume>7</volume>
          (
          <year>2019</year>
          )
          <fpage>97052</fpage>
          -
          <lpage>97093</lpage>
          . doi:10.1109/ACCESS.2019.2929296.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W.</given-names>
            <surname>Brackenbury</surname>
          </string-name>
          , R. Liu,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mondal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Elmore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          ,
          <article-title>Draining the data swamp: A similarity-based approach</article-title>
          ,
          <source>in: Proceedings of the Workshop on Human-In-the-Loop Data Analytics, HILDA '18</source>
          ,
          Association for Computing Machinery, New York, NY, USA,
          <year>2018</year>
          . doi:10.1145/3209900.3209911.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sawadogo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Darmont</surname>
          </string-name>
          ,
          <article-title>On data lake architectures and metadata management</article-title>
          ,
          <source>JIIS</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          , et al.,
          <article-title>Analysis-oriented metadata for data lakes</article-title>
          ,
          <source>in: Proceedings of the 25th IDEAS, ACM</source>
          ,
          <year>2021</year>
          , p.
          <fpage>194</fpage>
          -
          <lpage>203</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zheng</surname>
          </string-name>
          , G. Cheng, A.
          <string-name>
            <surname>Soylu</surname>
          </string-name>
          , E. Kharlamov,
          <article-title>Executable knowledge graphs for machine learning: a bosch case of welding monitoring</article-title>
          , in: International Semantic Web Conference, Springer,
          <year>2022</year>
          , pp.
          <fpage>791</fpage>
          -
          <lpage>809</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Theissen-Lipp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Quix</surname>
          </string-name>
          ,
          <article-title>A survey on semantic data management as intersection of ontology-based data access, semantic modeling and data lakes</article-title>
          ,
          <source>Journal of Web Semantics</source>
          <volume>81</volume>
          (
          <year>2024</year>
          )
          100819. doi:10.1016/j.websem.2024.100819.
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          ,
          <article-title>Linked data: The story so far</article-title>
          ,
          <source>in: Semantic services, interoperability and web applications: emerging concepts, IGI Global</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>205</fpage>
          -
          <lpage>227</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pomp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Paulus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kirmse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kraus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Meisen</surname>
          </string-name>
          ,
          <article-title>Applying semantics to reduce the time to analytics within complex heterogeneous infrastructures</article-title>
          ,
          <source>Technologies</source>
          <volume>6</volume>
          (
          <year>2018</year>
          )
          <fpage>86</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shaker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Quix</surname>
          </string-name>
          ,
          <article-title>SEDAR: A semantic data reservoir for heterogeneous datasets</article-title>
          ,
          <source>in: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management</source>
          , Birmingham, UK,
          October 21-25
          ,
          <year>2023</year>
          , ACM,
          <year>2023</year>
          , pp.
          <fpage>5056</fpage>
          -
          <lpage>5060</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>H.</given-names>
            <surname>Dibowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schmid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Svetashova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Henson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <article-title>Using semantic technologies to manage a data lake: Data catalog, provenance and access control</article-title>
          ,
          <source>in: Proc. Scalable Semantic Web Knowledge Base Systems Workshop</source>
          , volume
          <volume>2757</volume>
          of CEUR WS
          ,
          <year>2020</year>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Usmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Breslin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Curry</surname>
          </string-name>
          ,
          <article-title>Towards multimodal knowledge graphs for data spaces</article-title>
          ,
          <source>in: Companion Proceedings of the ACM Web Conference</source>
          <year>2023</year>
          , WWW '23 Companion, Association for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          , pp.
          <fpage>1494</fpage>
          -
          <lpage>1499</lpage>
          . doi:10.1145/3543873.3587665.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>C.</given-names>
            <surname>Quix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hai</surname>
          </string-name>
          ,
          <article-title>Data lake</article-title>
          ,
          <source>in: Encyclopedia of Big Data Technologies</source>
          , Springer,
          <year>2019</year>
          . doi:10.1007/978-3-319-63962-8_7-1.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>C.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Data management for machine learning: A survey</article-title>
          ,
          <source>IEEE TKDE</source>
          <volume>35</volume>
          (
          <year>2023</year>
          )
          <fpage>4646</fpage>
          -
          <lpage>4667</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Alla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Adari</surname>
          </string-name>
          ,
          <article-title>What is MLOps?</article-title>
          , Apress, Berkeley, CA,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Schlegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sattler</surname>
          </string-name>
          ,
          <article-title>Extracting provenance of machine learning experiment pipeline artifacts</article-title>
          ,
          <source>in: 27th ADBIS Conference</source>
          , Barcelona, Spain, volume
          <volume>13985</volume>
          of LNCS
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>238</fpage>
          -
          <lpage>251</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>I.</given-names>
            <surname>Dasoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <article-title>MLSea: A semantic layer for discoverable machine learning</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Meroño-Peñuela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hartig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Acosta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Paulheim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lisena</surname>
          </string-name>
          (Eds.),
          <source>The Semantic Web - 21st International Conference, ESWC</source>
          <year>2024</year>
          , Hersonissos, Crete, Greece, May 26-30,
          <year>2024</year>
          , Proceedings, Part II
          , volume
          <volume>14665</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2024</year>
          , pp.
          <fpage>178</fpage>
          -
          <lpage>198</lpage>
          . URL: https://doi.org/10.1007/978-3-031-60635-9_11. doi:10.1007/978-3-031-60635-9_11.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Teng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Evaluating the logical reasoning ability of ChatGPT and GPT-4</article-title>
          ,
          <source>arXiv preprint arXiv:2304.03439</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>T.</given-names>
            <surname>Krüger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gref</surname>
          </string-name>
          ,
          <article-title>Performance of large language models in a computer science degree program</article-title>
          ,
          <source>in: Artificial Intelligence. ECAI 2023 International Workshops</source>
          , Springer Nature Switzerland, Cham,
          <year>2024</year>
          , pp.
          <fpage>409</fpage>
          -
          <lpage>424</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <article-title>A survey on evaluation of large language models</article-title>
          ,
          <source>ACM Trans. Intell. Syst. Technol</source>
          . (
          <year>2024</year>
          ). doi:10.1145/3641289, just accepted.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>K.</given-names>
            <surname>Korini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <article-title>Column type annotation using ChatGPT</article-title>
          ,
          <source>arXiv preprint arXiv:2306.00745</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Burgdorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Paulus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Meisen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Quix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pomp</surname>
          </string-name>
          ,
          <article-title>Towards LLM-augmented creation of semantic models for dataspaces</article-title>
          , in:
          <string-name>
            <given-names>J.</given-names>
            <surname>Theissen-Lipp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Colpaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Sowe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Curry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Decker</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Second International Workshop on Semantics in Dataspaces (SDS 2024) co-located with the 21st Extended Semantic Web Conference (ESWC 2024)</source>
          , Hersonissos, Greece, May 26,
          <year>2024</year>
          , volume
          <volume>3705</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Hassan</surname>
          </string-name>
          , et al.,
          <article-title>ChatGPT as your personal data scientist</article-title>
          ,
          <year>2023</year>
          . arXiv:2305.13657.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>N.</given-names>
            <surname>Hollmann</surname>
          </string-name>
          , et al.,
          <article-title>Large language models for automated data science: Introducing CAAFE for context-aware automated feature engineering</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ustundag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Cevikcan</surname>
          </string-name>
          ,
          <source>Industry 4.0: Managing the Digital Transformation</source>
          , Springer,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <source>Semantic Integration and Interoperability</source>
          , Springer International Publishing, Cham,
          <year>2022</year>
          , pp.
          <fpage>195</fpage>
          -
          <lpage>210</lpage>
          . doi:10.1007/978-3-030-93975-5_12.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Karmaker</surname>
          </string-name>
          , et al.,
          <article-title>AutoML to date and beyond: Challenges and opportunities</article-title>
          ,
          <source>ACM Comput. Surv</source>
          .
          <volume>54</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bozzon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hai</surname>
          </string-name>
          ,
          <article-title>Amalur: Data integration meets machine learning</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>