Semantic Data Management for Managing Heterogeneous Data Sources in Chemistry 4.0

Sayed Hoseini
sayed.hoseini@hs-niederrhein.de
Department of Electr. Engineering and Computer Science, Niederrhein University of Applied Sciences, Krefeld, Germany

ISSN 1613-0073

Keywords: Semantic Data Management, Semantic Data Lakes, Semantic Machine Learning

Managing large volumes of data poses significant challenges due to the variety of formats, distribution across departments, and different governance structures within organizations. In research and industry environments, this complexity is compounded by the need for streamlined data handling processes to support automated workflows and machine learning (ML) applications. Integrating implicit contextual knowledge alongside data artifacts is critical, especially for non-expert users accessing the data. Data lakes provide a scalable solution by aggregating raw data from disparate sources with minimal upfront integration costs. However, without proper integration, data analysis and interpretation are hindered, rendering the data lake effectively inoperable. This PhD research addresses these challenges by applying semantic data management (SDM) techniques inside a semantic data lake. While initial milestones have been achieved through a systematic literature review and a concrete implementation, further efforts lie ahead. First, the emergence of large language models offers numerous opportunities for automating previously manual processes. Leveraging these models can significantly improve the efficiency of common SDM tasks. Second, extending the application of SDM techniques to data analytics can facilitate the integration of diverse data sources into ML pipelines. Ultimately, we aim to bridge the gap between Big Data and Semantic Web technologies, anticipating the development of advanced semantic data lake solutions in the foreseeable future.

Problem statement

Large amounts of data are generated every second to enable the subsequent collection, storage, usage and analysis for various applications. However, managing data can be challenging not only because of the variety of data formats, but also because data is often distributed across different departments within a company, under different governance regimes, network topologies and data models. At the Institute for Surface Technology of the Niederrhein University of Applied Sciences (HIT), where this research takes place, similar challenges need to be addressed [1]. There, an automation platform for material development is in operation, and various datasets of experiments for paints and coatings need to be captured, cataloged, and made available so that ML can be applied in chemistry for experiment suggestions and analysis. Data sources are very heterogeneous, e.g., streams of machine sensors, interfaces of control software, databases and images of microscopes, and scripts and model checkpoints from ML. As a result, tasks for analyzing data, such as collecting, accessing, searching, understanding and processing data, become very time-consuming. This makes it difficult to realize visions such as Chemistry 4.0, which refers to the digital transformation of the chemical industry and emphasizes the integration of data-driven systems for increasing degrees of automation [2]. The centralized management of all (meta-)data with integrated data analytics using a uniform data management system is thus very attractive and actively researched [3,4,5].

Data lakes are scalable, schema-less repositories that ingest raw data in its original format from heterogeneous data sources. Only minimal effort is required for ingesting data into a data lake, making it an efficient tool for collecting, storing, linking, and transforming datasets [4]. However, this approach only postpones the upfront cost of integration, which is why data lakes risk turning into data swamps [6,7]. In addition, many existing systems lack mature functions to support data analytics [3,8]. Furthermore, industrial ML suffers from low transparency towards non-ML experts and from poor, non-unified descriptions of ML practices for review or comprehension, because custom-made ad-hoc solutions are tailored only to specific applications, which limits their reusability [9].

The main goal of this research is to develop a prototype for the industrial chemistry context of the HIT that not only manages the various (meta-)data assets, but also facilitates data integration, ultimately empowering users unfamiliar with data analytics to derive ML models.

The importance of data integration is rooted in the fact that the users who ingest the data into the lake and are responsible for it may not belong to the group of data scientists who are going to use the data later on. Likewise, a data scientist crafting a specific model seeks clarity and an easy understanding of the design details. Thus, the implicit context knowledge needs to be committed alongside any created artifacts to help a third party with limited domain knowledge interpret and use the received assets later on.

The problem statement can be formulated mathematically. Let:

• $D = \{d_1, d_2, \dots, d_n\}$: the set of heterogeneous data sources,

• $A = \{a_1, a_2, \dots, a_k\}$: the set of analytical models applied to $D$ for generating insights,

• $M = \{m_1, m_2, \dots, m_m\}$: the set of metadata artefacts that describe and link data items managed and stored by a data lake.

The objective is to minimize the human effort required to prepare and integrate heterogeneous data sources through metadata, leveraging the capability of the lake to derive insights from ML with maximum automation and smart assistance.

Minimize:

$$E_{\text{total}}(D, A, M) = E_{\text{prep}}(D, M) + E_{\text{use}}(D, A, M) + E_{\text{meta}}(M)$$

• $E_{\text{prep}}(D, M)$: Effort required to harmonize, transform, and integrate heterogeneous data sources $D$ using the available metadata $M$.

• $E_{\text{use}}(D, A, M)$: Effort required for users to interpret and utilize $D$, $A$, and $M$ for deriving insights and crafting ML models.

• $E_{\text{meta}}(M)$: Effort required to create and maintain metadata $M$.
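To make the objective more tangible, the following minimal Python sketch compares the total effort for a few candidate metadata configurations and picks the cheapest one. The configuration names, the effort numbers (read them as person-hours), and the helper `total_effort` are purely hypothetical; the text does not prescribe how the individual efforts are measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EffortEstimate:
    """Hypothetical effort estimates for one metadata configuration M."""
    name: str
    prep: float  # E_prep(D, M): harmonizing and integrating the sources
    use: float   # E_use(D, A, M): interpreting data and models, crafting ML models
    meta: float  # E_meta(M): creating and maintaining the metadata itself


def total_effort(e: EffortEstimate) -> float:
    """E_total = E_prep + E_use + E_meta, following the objective above."""
    return e.prep + e.use + e.meta


# Illustrative numbers only (an assumption): richer metadata costs more to
# maintain but is assumed to reduce preparation and usage effort.
candidates = [
    EffortEstimate("no metadata", prep=40.0, use=30.0, meta=0.0),
    EffortEstimate("basic catalog", prep=25.0, use=20.0, meta=5.0),
    EffortEstimate("semantic models", prep=10.0, use=8.0, meta=12.0),
]

best = min(candidates, key=total_effort)
print(best.name, total_effort(best))
```

The sketch only illustrates the trade-off encoded in the sum: metadata is worthwhile as long as its maintenance effort is outweighed by the savings in preparation and usage.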

Related work

Semantic data management (SDM) is one way of formalizing the context and domain knowledge of data sources [10]. SDM proposes the linkage of metadata to knowledge graphs (KG) based on the Linked Data principles [11] to provide more meaning to the data in the lake by establishing an additional semantic layer between the data and the knowledge layer [12]. A semantic layer can be used not only for data management but also to address the challenge of integrating data from heterogeneous sources [13].
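The idea of a semantic layer between the data and the knowledge layer can be illustrated with a small, dependency-free Python sketch that stores the links as subject-predicate-object triples. All IRIs, the dataset name, and the predicates `hasColumn`, `mapsTo` and `subClassOf` are made up for illustration and do not stem from a concrete ontology or from the cited systems.

```python
# Hypothetical namespace; every IRI below is an illustrative assumption.
EX = "http://example.org/"

triples = {
    # Data layer: a column of an ingested dataset (extracted metadata).
    (EX + "dataset/coating_runs", EX + "hasColumn", EX + "dataset/coating_runs#visc"),
    # Semantic layer: the column is linked to a concept of the knowledge graph.
    (EX + "dataset/coating_runs#visc", EX + "mapsTo", EX + "onto/Viscosity"),
    # Knowledge layer: the concept is placed in a (hypothetical) domain ontology.
    (EX + "onto/Viscosity", EX + "subClassOf", EX + "onto/MaterialProperty"),
}


def objects(subject: str, predicate: str) -> list:
    """Return all objects of triples matching the given subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def concepts_of_dataset(dataset: str) -> list:
    """Follow hasColumn and then mapsTo to find the KG concepts describing a dataset."""
    found = set()
    for column in objects(dataset, EX + "hasColumn"):
        found.update(objects(column, EX + "mapsTo"))
    return sorted(found)


print(concepts_of_dataset(EX + "dataset/coating_runs"))
```

In practice, such links would be expressed in RDF and stored in a triple store; the plain tuples here only show how a lookup moves from raw columns to shared concepts.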

Semantic data lakes store and manage this serialized semantics between data sources. They are a specific form of traditional data lakes that extend the capabilities through a semantic layer that enriches and connects the stored data semantically. The semantic data lake explicitly integrates semantic descriptions into its data management and governance capability [14,15], where an ontology or KG serves as a universal data model, offering a conceptual representation of an organization's data assets. In Figure 1, we propose a four-layered data lake architecture in [16], where especially metadata-related functions are enriched with semantics. For example, a semantic labeling component in the ingestion layer adds semantic labels to the extracted metadata elements. The semantic information (labels, models, KG, etc.) is managed in the storage layer in an extended semantic metadata repository. To facilitate the usage and interpretation of data, the interaction layer has several additional components, e.g., for browsing the KG and semantic models, and editors for refining the semantic mappings and models. Figure 2 represents a particular instantiation of this architecture (see C2), illustrating the various utilized technologies along the four layers.

Data management for ML has been well-researched for at least ten years [17], and one subfield is also known as MLOps [18]. Hai et al. [3] underline the importance of ML-driven metadata management and in-lake ML, which means supporting the training and inference process directly inside the data lake platform. Zhao et al. and Schlegel et al. [8,19] present metadata models for data lakes to capture not only descriptive but also analytical information about datasets and performed analyses. MLSea [20] is a resource consisting of MLSO, an ontology to model ML pipelines, MLST, a collection of taxonomies of ML-related concepts, and MLSea-KG, a KG containing ML datasets, pipelines and scientific works from diverse sources. By leveraging semantic technologies, MLSea integrates ML datasets, experiments, software and scientific works to improve search, explainability and reproducibility.

Large Language Models (LLMs) are expected to have a major impact on the landscape of data utilization and exchange. LLMs have demonstrated remarkable capabilities in understanding, generating, and processing vast amounts of textual data [21,22,23]. Promising fields of LLM application are the integration of heterogeneous data sources in the sense of SDM [24,25] and automated machine learning (AutoML) [26,27].

Research Questions

Closer collaboration between human-machine and machine-machine systems has revolutionized the current industrial landscape, leading to Industry 4.0 [28]. Here, challenges in terms of data management are to be addressed [5]. The advantage of employing a data lake system lies in the centralized management of (meta-)data and analytics. Thus, all model artifacts and their associated datasets are accessible, registered, documented, and understandable by both humans and machines. The main goal of this research is to install such a prototype in the industrial chemistry context of the HIT, leading to the following research questions and related hypotheses:

RQ1: What role does SDM play in improving the integration and usability of heterogeneous data generated in an industrial context, particularly facilitated within a semantic data lake?

H1: SDM facilitates the integration of heterogeneous data sources and enhances data usability by providing a unified structure and enabling interoperability based on Linked Data principles.

To manage heterogeneous data it is important to have a clear and logical structure when presenting this information. This demands a common understanding across the data landscape, i.e., a lingua franca for data moderation [29] based on the Linked Data principles.

RQ2: How can LLMs be utilized to identify and formalize the context of given datasets, creating a full semantic model?

H2: LLMs automate substeps in semantic model creation, in particular semantic labeling.

Automating the semantic modeling task is complex, because creating semantic models entails deciphering the existing data source and establishing connections between data attributes and concepts drawn from a KG. It remains open whether the LLM should be used for individual tasks along a pipeline or instead be prepared for the entire task.
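As an illustration of the first option, using an LLM for a single substep, the following Python sketch frames semantic labeling as a constrained choice: each column is described by its name and a few sample values, and the model must answer with one label from a fixed set. The prompt wording, the function names, and the `fake_llm` stand-in are assumptions made for this example; no real model API is called, and this is not the experimental setup of the cited work.

```python
from typing import Callable, Dict, List


def build_prompt(column: str, samples: List[str], labels: List[str]) -> str:
    """Describe one column and restrict the answer to a fixed label set."""
    return (
        f"Column: {column}\n"
        f"Sample values: {', '.join(samples)}\n"
        f"Choose exactly one label from: {', '.join(labels)}\n"
        "Answer with the label only."
    )


def label_columns(
    table: Dict[str, List[str]],
    labels: List[str],
    llm: Callable[[str], str],
) -> Dict[str, str]:
    """Ask the model for one semantic label per column."""
    result = {}
    for column, samples in table.items():
        answer = llm(build_prompt(column, samples, labels)).strip()
        # Answers outside the label set are flagged for manual refinement.
        result[column] = answer if answer in labels else "UNKNOWN"
    return result


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: a keyword heuristic on the column name."""
    header = prompt.splitlines()[0].lower()  # the "Column: ..." line
    if "temp" in header:
        return "Temperature"
    if "visc" in header:
        return "Viscosity"
    return "Something else"


table = {"temp_c": ["21.5", "22.0"], "visc_mpas": ["120", "118"], "op": ["A", "B"]}
labels = ["Temperature", "Viscosity", "Operator"]
print(label_columns(table, labels, fake_llm))
```

Passing the model as a callable keeps the labeling step independent of a specific provider, and the `UNKNOWN` fallback marks the columns that would need human review.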

RQ3: How can semantic descriptions of data sources, ML pipelines, and their context be used to enhance data analytics within the data lake?

H3: Structured semantic knowledge about ML pipelines improves accuracy and efficiency of contemporary methods for automating ML workflows.

While the demand is increasing, ML models are still often manually created by humans, because the need for statistical and technical knowledge poses significant challenges for non-technical users [30]. Current methods are only capable of assisting in the substep of model creation [30], but data integration is a major obstacle [31].

Research Methods

Contribution 1 (C1): Systematic Literature Review. We systematically reviewed the literature of the last 20 years of research in the field of SDM, related to semantic data lakes in particular [10]. The article gives an overview of semantics-based methods for data management, access, and integration and relates those findings to current semantic data lake proposals. Furthermore, we identified a gap in today's landscape between present data lakes, semantic technologies for data access, and the semantic modeling of heterogeneous datasets.

Contribution 2 (C2): The Semantic Data Reservoir (SEDAR) [13] is an implementation to bridge this gap. SEDAR is a prototype (see Figure 2) of a semantic data lake built on existing open-source technologies in the area of big data management. The implementation of SEDAR was inspired by the SDM pipeline (see Figure 3). The pipeline is designed for modeling data at the schema level. After schema extraction, its first phase is automated semantic labeling, because semantic labels are a prerequisite for deriving a full semantic model automatically. This is followed by semantic refinement, i.e., manual oversight to verify the automated outcomes. We then extended the pipeline and reinterpreted the storage phase conceptually, in the sense that we convert the resulting semantic model into RML mappings [33] to be used for ontology-based data access (OBDA) [34]. OBDA allows for on-demand translation of queries against heterogeneous data sources directly in their original form without having to know how the data is organized physically, which is particularly attractive in data lake environments. Thus, SEDAR implements a polystore with a semantic query processing engine grounded on semantic models. The synergy between the automation platform at the HIT and SEDAR has been utilized in production and presented as original research at the ICPS'24 conference [35].

Contribution 3 (C3): Automated Semantic Labeling using LLMs. In a publication for the ESWC conference [25], we conduct experiments demonstrating the applicability of LLMs for semantic labeling and propose directions to address discovered challenges.

Contribution 4 (C4): Standardizing ML pipelines. Recently, we have continued progressing SEDAR towards the support of standard ML pipelines with higher degrees of automation [36].
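The conversion of a semantic model into RML mappings mentioned under C2 can be sketched as follows. The Python function below renders a very small semantic model (one class, a subject IRI template, and a column-to-predicate assignment) as an RML mapping in Turtle for a CSV source. The function `to_rml`, the `ex:` namespace, the file name and the column names are hypothetical; this is a simplified illustration under those assumptions, not the mapping generator used in SEDAR.

```python
from typing import Dict


def to_rml(source_file: str, subject_template: str, rdf_class: str,
           columns: Dict[str, str]) -> str:
    """Render a minimal RML mapping (Turtle) for one CSV file.

    `columns` maps a CSV column name to the predicate it should populate.
    """
    lines = [
        "@base <http://example.org/mapping/> .",
        "@prefix rr: <http://www.w3.org/ns/r2rml#> .",
        "@prefix rml: <http://semweb.mmlab.be/ns/rml#> .",
        "@prefix ql: <http://semweb.mmlab.be/ns/ql#> .",
        "@prefix ex: <http://example.org/onto/> .",
        "",
        "<#Mapping> a rr:TriplesMap ;",
        "  rml:logicalSource [",
        f'    rml:source "{source_file}" ;',
        "    rml:referenceFormulation ql:CSV",
        "  ] ;",
        "  rr:subjectMap [",
        f'    rr:template "{subject_template}" ;',
        f"    rr:class {rdf_class}",
        "  ]",
    ]
    for column, predicate in columns.items():
        lines[-1] += " ;"  # continue the predicate list of the triples map
        lines += [
            "  rr:predicateObjectMap [",
            f"    rr:predicate {predicate} ;",
            f'    rr:objectMap [ rml:reference "{column}" ]',
            "  ]",
        ]
    lines[-1] += " ."  # terminate the triples map
    return "\n".join(lines)


rml = to_rml(
    "runs.csv",
    "http://example.org/run/{run_id}",
    "ex:CoatingRun",
    {"visc_mpas": "ex:viscosity", "temp_c": "ex:temperature"},
)
print(rml)
```

An OBDA engine can use such a mapping to answer queries over the ontology vocabulary against the CSV file in place, without materializing the data as RDF first.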

Evaluation

This PhD is already in a later stage, hence some research questions can already be addressed. Through C1, we have illuminated how SDM can help with the management of heterogeneous process data and gained knowledge of the current state of the field to understand how other researchers aim to implement particular SDM techniques. Open questions remain on how to convert these formal ideas into a particular implementation. Through C2, we demonstrated to a wider audience how semantic processing can meet modern big data requirements. Therefore, we accept H1 by providing a comprehensive field survey and demonstrating practically how the semantic layer of SEDAR enables more expressive data management, integration, and access. Through C3, we address the applicability of LLMs for the first steps in the semantic model creation process. The experiments demonstrate the feasibility of utilizing LLMs for semantic type detection with a fixed or limited set of labels derived from legacy KGs. The findings further suggest that LLMs can effectively engage in semantic type detection tasks even when presented with new, unfamiliar, or arbitrary domain ontologies, by leveraging their inherent knowledge and understanding of language as well as additional contextual information that may be provided alongside the ontology. Therefore, we accept the premise of H2. Through C4, we have been progressing towards standardizing ML pipelines. In the future, we plan to research how to fuse the SDM techniques with existing work on automating ML. To this end, we want to propose a software system that allows reusing and generalizing data analytics for arbitrary use cases. The goal is to answer RQ3 by incorporating structured semantic knowledge about previously conducted ML experiments, such as the MLSea KG [20], to improve the efficiency and accuracy of current automated ML methods. By addressing the more challenging preceding phases of any ML project, i.e., business & data understanding, and especially data preparation & integration [37], this research agenda will advance the state of the art.

Conclusion and Future Work

This doctoral research addresses the challenge of managing diverse data sources and semantically integrating them into common ML pipelines. So far, to address this issue, we first conducted a systematic literature review and then presented SEDAR, an open-source data management platform. We then investigated the applicability of LLMs for semantic labeling and enhanced SEDAR to standardize ML pipelines by integrating principles from AutoML and MLOps. As this Ph.D. is already in a later stage, these contributions allowed us to answer two of the three research questions. The remaining phase will focus on integrating semantically standardized ML pipelines to improve the efficiency of automated ML methods.

Figure 1: Semantic data lake architecture from [10]

Figure 2: SEDAR architecture from [13]

Figure 3: SDM pipeline inspired by [32] and extended by Ontology-based data access (OBDA)

Acknowledgments

The author thanks Maribel Acosta and Christoph Quix for reviewing this article. This work has been sponsored by the German Federal Ministry of Education and Research in the funding program "Forschung an Fachhochschulen", project i²DACH (grant no. 13FH557KX0).

References

[1] D. A. C. Beck, J. M. Carothers, V. R. Subramanian, J. Pfaendtner, Data science: Accelerating innovation and discovery in chemical engineering, AIChE Journal 62 (2016). doi:10.1002/aic.15192.

[2] B. Strehmel, C. Schmitz, K. Cremanns, J. Göttert, Photochemistry with cyanines in the near infrared: A step to chemistry 4.0 technologies, Chemistry - A European Journal 25 (2019).

[3] R. Hai, C. Koutras, C. Quix, M. Jarke, Data lakes: A survey of functions and systems, IEEE TKDE 35 (2023).

[4] F. Nargesian, E. Zhu, R. J. Miller, K. Q. Pu, P. C. Arocena, Data lake management: challenges and opportunities, Proc. VLDB Endow. 12 (2019).

[5] T. P. Raptis, A. Passarella, M. Conti, Data management in industry 4.0: State of the art and open challenges, IEEE Access 7 (2019). doi:10.1109/ACCESS.2019.2929296.

[6] W. Brackenbury, R. Liu, M. Mondal, A. J. Elmore, B. Ur, K. Chard, M. J. Franklin, Draining the data swamp: A similarity-based approach, in: Proceedings of the Workshop on Human-In-the-Loop Data Analytics, HILDA '18, Association for Computing Machinery, New York, NY, USA, 2018. doi:10.1145/3209900.3209911.

[7] P. Sawadogo, J. Darmont, On data lake architectures and metadata management, JJIS (2021).

[8] Y. Zhao, Analysis-oriented metadata for data lakes, in: Proceedings of the 25th IDEAS, ACM, 2021.

[9] Z. Zheng, B. Zhou, D. Zhou, X. Zheng, G. Cheng, A. Soylu, E. Kharlamov, Executable knowledge graphs for machine learning: a Bosch case of welding monitoring, in: International Semantic Web Conference, Springer, 2022.

[10] S. Hoseini, J. Theissen-Lipp, C. Quix, A survey on semantic data management as intersection of ontology-based data access, semantic modeling and data lakes, Journal of Web Semantics 81 (2024) 100819. doi:10.1016/j.websem.2024.100819.

[11] C. Bizer, T. Heath, T. Berners-Lee, Linked data: The story so far, in: Semantic services, interoperability and web applications: emerging concepts, IGI Global, 2011.

[12] A. Pomp, A. Paulus, A. Kirmse, V. Kraus, T. Meisen, Applying semantics to reduce the time to analytics within complex heterogeneous infrastructures, Technologies 6 (2018) 86.

[13] S. Hoseini, A. Ali, H. Shaker, C. Quix, SEDAR: A semantic data reservoir for heterogeneous datasets, in: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, Birmingham, UK, October 21-25, 2023, ACM, 2023.

[14] H. Dibowski, S. Schmid, Y. Svetashova, C. Henson, T. Tran, Using semantic technologies to manage a data lake: Data catalog, provenance and access control, in: Proc. Scalable Semantic Web Knowledge Base Systems Workshop, volume 2757 of CEUR WS, 2020.

[15] A. Usmani, M. J. Khan, J. G. Breslin, E. Curry, Towards multimodal knowledge graphs for data spaces, in: Companion Proceedings of the ACM Web Conference 2023, WWW '23 Companion, Association for Computing Machinery, New York, NY, USA, 2023. doi:10.1145/3543873.3587665.

[16] C. Quix, R. Hai, Data lake, in: Encyclopedia of Big Data Technologies, Springer, 2019. doi:10.1007/978-3-319-63962-8_7-1.

[17] C. Chai, J. Wang, Y. Luo, Z. Niu, G. Li, Data management for machine learning: A survey, IEEE TKDE 35 (2023).

[18] S. Alla, S. K. Adari, What Is MLOps?, Apress, Berkeley, CA, 2021.

[19] M. Schlegel, K. Sattler, Extracting provenance of machine learning experiment pipeline artifacts, in: 27th ADBIS Conference, Barcelona, Spain, volume 13985, Springer, 2023.

[20] I. Dasoulas, D. Yang, A. Dimou, MLSea: A semantic layer for discoverable machine learning, in: A. Meroño-Peñuela, A. Dimou, R. Troncy, O. Hartig, M. Acosta, M. Alam, H. Paulheim, P. Lisena (Eds.), The Semantic Web - 21st International Conference, ESWC 2024, Hersonissos, Crete, Greece, May 26-30, 2024, Proceedings, Part II, volume 14665 of Lecture Notes in Computer Science, Springer, 2024. doi:10.1007/978-3-031-60635-9_11.

[21] H. Liu, R. Ning, Z. Teng, J. Liu, Q. Zhou, Y. Zhang, Evaluating the logical reasoning ability of ChatGPT and GPT-4, arXiv preprint arXiv:2304.03439 (2023).

[22] T. Krüger, M. Gref, Performance of large language models in a computer science degree program, in: Artificial Intelligence. ECAI 2023 International Workshops, Springer Nature Switzerland, Cham, 2024.

[23] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang, W. Ye, Y. Zhang, Y. Chang, P. S. Yu, Q. Yang, X. Xie, A survey on evaluation of large language models, ACM Trans. Intell. Syst. Technol. (2024). doi:10.1145/3641289. Just accepted.

[24] K. Korini, C. Bizer, Column type annotation using ChatGPT, arXiv preprint arXiv:2306.00745 (2023).

[25] S. Hoseini, A. Burgdorf, A. Paulus, T. Meisen, C. Quix, A. Pomp, Towards LLM-augmented creation of semantic models for dataspaces, in: J. Theissen-Lipp, P. Colpaert, S. K. Sowe, E. Curry, S. Decker (Eds.), Proceedings of the Second International Workshop on Semantics in Dataspaces (SDS 2024) co-located with the 21st Extended Semantic Web Conference (ESWC 2024), Hersonissos, Greece, May 26, 2024, volume 3705 of CEUR Workshop Proceedings, 2024.

[26] M. M. Hassan, ChatGPT as your personal data scientist, arXiv:2305.13657 (2023).

[27] N. Hollmann, Large language models for automated data science: Introducing CAAFE for context-aware automated feature engineering, Advances in Neural Information Processing Systems 36 (2024).

[28] A. Ustundag, E. Cevikcan, Industry 4.0: Managing the Digital Transformation, Springer, 2018.

[29] S. Auer, Semantic Integration and Interoperability, Springer International Publishing, Cham, 2022. doi:10.1007/978-3-030-93975-5_12.

[30] S. K. Karmaker, AutoML to date and beyond: Challenges and opportunities, ACM Comput. Surv. 54 (2021).

[31] Z. Li, W. Sun, D. Zhan, Y. Kang, L. Chen, A. Bozzon, R. Hai, Amalur: Data integration meets machine learning, IEEE Transactions on Knowledge and Data Engineering (2024).

[32] A. Paulus, A. Burgdorf, A. Pomp, T. Meisen, Recent advances and future challenges of semantic modeling, in: IEEE 15th International Conference on Semantic Computing (ICSC), IEEE, 2021.

[33] A. Dimou, M. Vander Sande, P. Colpaert, R. Verborgh, E. Mannens, R. Van de Walle, RML: A generic language for integrated RDF mappings of heterogeneous data, LDOW 1184 (2014).

[34] G. Xiao, D. Calvanese, R. Kontchakov, D. Lembo, A. Poggi, R. Rosati, M. Zakharyaschev, Ontology-based data access: A survey, in: International Joint Conferences on Artificial Intelligence, 2018.

[35] S. Hoseini, Coatings intelligence: Data-driven automation for chemistry 4.0, in: IEEE 7th (ICPS), 2024. In press.

[36] S. Hoseini, M. Ibbels, C. Quix, Enhancing machine learning capabilities in data lakes with AutoML and LLMs, in: European Conference on Advances in Databases and Information Systems, Springer, 2024. Accepted.

[37] R. Wirth, J. Hipp, CRISP-DM: Towards a standard process model for data mining, in: Proceedings of the 4th international conference on the practical applications of knowledge discovery and data mining, volume 1, Manchester, 2000.