<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CLARUS MLOps: Boosting Collaboration in Manufacturing Data Value Chains</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aitor Agirre</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ana Adell</string-name>
          <email>aadell@ikerlan.es</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Blanca Kremer</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francisco Fraile</string-name>
          <email>ffraile@cigip.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Ikerlan S.Coop.</institution>
          ,
          <addr-line>Arizmendiarrieta, Mondragón</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff0">
          <label>0</label>
          <institution>Universitat Politècnica de València</institution>
          ,
          <addr-line>Valencia</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the MLOps framework adopted in the EU-funded CLARUS Project. This framework combines the International Data Spaces (IDS) framework, a secure and sovereign data sharing system, with state-of-the-art Machine Learning Operations (MLOps) services aimed at deploying and maintaining machine learning models in production. This innovative stack allows stakeholders (for instance, manufacturers and external data scientists) to form effective value chains that leverage data efficiently.</p>
      </abstract>
      <kwd-group>
        <kwd>MLOps</kwd>
        <kwd>IDS</kwd>
        <kwd>Data Space</kwd>
        <kwd>Clarus</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>MLOps (Machine Learning Operations) is a set of practices, principles and methods aimed at
enhancing the lifecycle management of machine learning (ML) models in production environments,
streamlining their training, tracking, deployment, monitoring, and maintenance in operation. It can
be seen as an extension of DevOps (Development and Operations) principles applied specifically to
the field of machine learning.</p>
      <p>MLOps relies on automation technologies to ensure the use of the right model in the right place
at the right time. It abstracts the developer away from the manual, repetitive, and error-prone plumbing
work that consumes much time without adding value, freeing the developer to focus
on the development of the AI models themselves.</p>
      <p>When developing a solution based on machine learning technologies, there are two main stages
to be considered: model training and model execution (inference). Whilst model training usually
needs more computing resources (even specialized hardware) and thus is likely to be performed in
the cloud, model inference is a much lighter process and can benefit from the advantages of edge
computing. Thus, the edge-to-cloud computing continuum becomes a natural space in which to
develop, store, deploy, and execute the AI models.</p>
      <p>This, however, raises a major challenge that must be addressed to enable an effective deployment of AI-based
industrial solutions on the Edge: data openness. On the one hand, the MLOps architecture must
access the data that is generated on the Edge (plant floor). On the other hand, the models running on
the Edge require updates when new model versions are available, or when concept drift is detected.
Thus, a reliable resource (i.e. data and models) exchange mechanism needs to be integrated between
the Edge, where the data is produced and the models are executed, and the Cloud, where the models
are trained and stored.</p>
      <p>Traditionally, this interchange of resources and artifacts between the different stakeholders of the
value chain has been performed manually, leading to costly and inefficient processes that lack the
advantages a truly automated workflow can provide. In this sense, the IDSA (International Data
Spaces Association) has positioned itself [1] to provide a trustworthy data space that treats security,
privacy, and data sovereignty as first-class citizens, following the trend set out in the European Strategy
for Data [2].</p>
      <p>This work proposes a cutting-edge MLOps framework for the edge-to-cloud continuum, integrated
with IDS, which enables a contract-based resource exchange that guarantees data privacy, security,
and sovereignty. This way, an effective and trustworthy data space is established between the different
stakeholders, e.g. the factories producing the data and the ML model developers. Thus, the factories
can expose the data to be consumed by the ML model providers and, vice versa, the AI providers can
make the models available to the end users, in a cost-effective and trustworthy way.</p>
    </sec>
    <sec id="sec-2">
      <title>2. CLARUS MLOps Framework</title>
      <p>The CLARUS MLOps framework [3] comprises an Edge2Cloud solution that includes both (1) a
workflow that covers the most usual stages that the MLOps concept defines and (2) the toolkit
support based on the integration of several open-source tools and new components developed in the
CLARUS EU Project. Figure 1 shows the architecture of the CLARUS MLOps Framework.</p>
      <p>The left part details the cloud services of the framework, whilst the right part depicts the services
and components on the Edge. The workflow comprises several stages over the Edge2Cloud
continuum. The stages related to the training, tracking, storage, and registration of the model are
executed in the Cloud training platform, whilst the stages related to the model update and model
inference are executed on the Edge.</p>
      <p>Briefly explained, the workflow enabled by the framework is as follows: first, the Edge exposes
the data obtained from the source (e.g. an OPC-UA server on the plant floor) through the IDS
connector of the Edge, registering it as a resource. Then, when the Airflow pipeline is triggered, either
manually through the Airflow web UI or automatically through its API, the different steps
implemented in the Airflow pipeline are executed.</p>
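      <p>As a minimal sketch of the automatic trigger, the following builds the request an external component would send to Airflow's stable REST API to start a DAG run. The base URL, DAG id, and configuration keys are hypothetical examples; only the endpoint shape (POST to /api/v1/dags/{dag_id}/dagRuns) comes from the Airflow 2.x documentation. Nothing is sent here.</p>
      <preformat>
```python
import json

# Hypothetical values: the real ones depend on the user's deployment and DAG file.
AIRFLOW_BASE = "http://airflow.example.org"
DAG_ID = "clarus_training_pipeline"

def build_trigger_request(base_url, dag_id, conf=None):
    """Build the Airflow 2.x stable REST API call that triggers a DAG run.

    Returns the (url, json_body) pair that would be POSTed,
    e.g. with requests.post(url, data=body, auth=...).
    """
    url = f"{base_url}/api/v1/dags/{dag_id}/dagRuns"
    body = {"conf": conf or {}}
    return url, json.dumps(body)

url, body = build_trigger_request(AIRFLOW_BASE, DAG_ID, conf={"dataset": "line3.csv"})
print(url)
print(body)
```
      </preformat>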
      <p>The initial stage, Read Data, accesses the IDS connector in the cloud, which in turn communicates
with the IDS connector on the Edge to get the previously registered resource (the source data). After
that, a function that preprocesses this dataset (e.g. a .csv file of raw data) is executed prior to the
training stage, to perform the data preparation. Using this processed data, the training stage can
train one or more models in parallel, since the CLARUS MLOps framework executes over a
Kubernetes cluster. These models are tracked by MLflow [4] and finally, once the best model is selected
according to the considered metrics, the model is registered in the cloud IDS connector, thus making
it ready for download to the Edge.</p>
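      <p>The best-model selection step can be sketched in plain Python. Run ids and metric names below are illustrative; in the framework the runs are tracked by MLflow and the winning model is registered through the cloud IDS connector.</p>
      <preformat>
```python
# Sketch of the "select best model" step after parallel training.
# The runs list mimics the per-run metrics that MLflow tracking would hold.

def select_best_run(runs, metric="val_accuracy"):
    """Pick the run with the highest value of the given metric."""
    return max(runs, key=lambda run: run["metrics"][metric])

runs = [
    {"run_id": "run-a", "metrics": {"val_accuracy": 0.91}},
    {"run_id": "run-b", "metrics": {"val_accuracy": 0.94}},
    {"run_id": "run-c", "metrics": {"val_accuracy": 0.89}},
]

best = select_best_run(runs)
print(best["run_id"])  # the id that would be stored and registered
```
      </preformat>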
      <p>The second part of the workflow involves the Edge. The models published in the cloud can be
downloaded to the edge using the clarus-agent microservice, which downloads the models through the IDS
connector of the edge, connected in turn to the IDS connector in the cloud. Once downloaded and
persisted, the updated model is available for execution by the CLARUS inference engine.</p>
    </sec>
    <sec id="sec-2-1">
      <title>2.1. Cloud services</title>
      <p>The left part of Figure 1 depicts the cloud services of the MLOps framework. They are provided
by the integration of several open-source tools and other artifacts provided by CLARUS, which
facilitate the integration of the user project Git repositories into the MLOps framework.</p>
      <p>From a user perspective, the usage of the platform is quite straightforward. If the repository is
public, the only two artifacts to include in the user repo are (1) a file that implements the Airflow DAG
(Directed Acyclic Graph) [5] template provided by CLARUS and (2) a file to enable the
synchronization of the user repo with the central Git repo hosted by the MLOps platform. If the user
repo is private, the user must additionally generate an access token to grant access rights to the MLOps
platform.</p>
      <p>From an admin perspective, a standalone installer utility is provided to ease the
installation of the framework. It covers the installation of Kubernetes and of the following
components:</p>
      <p>Airflow: Airflow [6] is the pipeline orchestrator. It coordinates the execution of the pipeline,
which is defined in a user's DAG file. The source code of the AI project can be hosted in a private
user repository, independent from other users. Moreover, the repository can be either public or
private, as the MLOps framework supports token-based mutual authentication.</p>
      <p>MLflow: the framework uses MLflow to track experiments, store model artifacts, and provide a
user interface for monitoring and managing models. MLflow uses PostgreSQL and MinIO as backend
services.</p>
      <p>PostgreSQL: a relational database used for storing MLflow tracking data and other information
used by the framework, such as the id of the best model selected and stored after parallel training.</p>
      <p>MinIO: an object storage server compatible with Amazon S3. It is used to store the models
resulting from the training, although access to them from ENG's IDS TrueConnector [7] is
performed through the MLflow API.</p>
      <p>Redis: Redis is an open-source, high-performance in-memory data storage system. It is used in
the DAG template to exchange data frames between Airflow tasks, since the tasks are dockerized and
may be executed on different hardware nodes of the Kubernetes cluster. This is the recommended
(optional) data exchange mechanism. Compared to the previous version [8], there is now no size limit
for the data passed between tasks.</p>
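      <p>The task-to-task exchange can be sketched as follows, with a plain dict standing in for the Redis server; with the redis-py client, the two helpers would call set and get on a Redis connection instead. Key names and the sample frame are illustrative.</p>
      <preformat>
```python
import pickle

# A dict stands in for Redis here so the sketch is self-contained; the
# serialization and key-naming logic is the part that matters.
store = {}

def push_frame(key, frame):
    """Serialize a (stand-in) data frame and store it under a task-scoped key."""
    store[key] = pickle.dumps(frame)

def pull_frame(key):
    """Retrieve and deserialize the frame written by an upstream task."""
    return pickle.loads(store[key])

# Upstream task (e.g. data preparation) writes...
push_frame("dag_run_42:prepared_data", {"col_a": [1, 2, 3], "col_b": [4.0, 5.0, 6.0]})
# ...and the downstream training task, possibly on another node, reads it back.
frame = pull_frame("dag_run_42:prepared_data")
print(frame["col_a"])
```
      </preformat>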
      <p>Cloud IDS connector: the IDS connector in the cloud serves two main purposes. On the one
hand, it is used by the Read Data task to access the data exposed by the edge connector through the
IDS ecosystem. This way, the data can be securely exchanged, following a contract that specifies the
data usage terms and ensures the required levels of privacy and sovereignty. On the other hand, the
cloud connector exposes the trained models as resources and makes them available for download.</p>
      <p>Note: the Jupyter Notebooks [9] shown in Figure 1 are not installed by the latest version, v3.0.0, of the
CLARUS MLOps Framework setup script, as they are an optional component used not for automation
but for experimentation when developing the AI models. Nevertheless, they can be manually added to the
toolkit, following the readme of version 2.0.0 of the CLARUS MLOps framework [8].</p>
      <p>Apart from these services, the framework provides the user with a template for implementing a
CLARUS-compatible DAG. In this template DAG the user can configure the execution of each AI task
that makes up the pipeline. More specifically, the user can (optionally) specify, for each task, both the
docker image and the hardware node of the Kubernetes cluster where the task will be executed. This
way, specialized hardware (e.g. GPUs) available in the cluster can be used for the tasks
requiring it, and ad-hoc docker images can be used for different task needs, such as a specific
Python version or library.</p>
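      <p>A minimal sketch of such a per-task configuration, expressed as plain data. Task names, image tags, and node labels are hypothetical; in the real template these values end up as parameters of the Airflow tasks scheduled on the Kubernetes cluster.</p>
      <preformat>
```python
# Each pipeline task declares the docker image it runs in and the cluster
# node it should be scheduled on (e.g. the training task targets a GPU node).
PIPELINE = {
    "read_data":  {"image": "clarus/base:py3.10",   "node": "cpu-node"},
    "preprocess": {"image": "clarus/base:py3.10",   "node": "cpu-node"},
    "train":      {"image": "clarus/train-gpu:cu12", "node": "gpu-node"},
}

def pod_spec(task_name):
    """Render a simplified scheduling spec for one task of the pipeline."""
    cfg = PIPELINE[task_name]
    return {
        "name": task_name,
        "image": cfg["image"],
        "nodeSelector": {"kubernetes.io/hostname": cfg["node"]},
    }

print(pod_spec("train"))
```
      </preformat>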
    </sec>
    <sec id="sec-3">
      <title>2.2. Edge services</title>
      <p>The right part of Figure 1 depicts the edge services of the MLOps framework. The access to the
raw data source is represented by the components at the bottom of the figure and is out of the scope
of the framework. As an example, they could access an OPC-UA server or a CAN bus, harmonize
the data into .csv format, and store the dataset locally to be further consumed by the cloud training
platform.</p>
      <p>At this point, the clarus-hmi service comes into play. It provides a REST API that allows the
registration of the dataset resource in the edge IDS connector. Thereafter, when the Airflow pipeline
is triggered, the Read Data task can retrieve this dataset from the edge, using the IDS connector.</p>
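      <p>A sketch of what such a dataset registration request could look like. The endpoint path and payload fields below are assumptions for illustration only, since the actual API is defined by the clarus-hmi service; the sketch only builds the request, it sends nothing.</p>
      <preformat>
```python
import json

def build_registration(base_url, dataset_path, title):
    """Build a hypothetical resource-registration request for the edge connector."""
    url = f"{base_url}/resources"  # hypothetical endpoint path
    payload = {
        "title": title,
        "mediaType": "text/csv",   # the harmonized dataset format
        "location": dataset_path,  # where the edge stored the file
    }
    return url, json.dumps(payload)

url, payload = build_registration(
    "http://edge.local:8080", "/data/harmonized/line3.csv", "Line 3 sensor data"
)
print(url)
```
      </preformat>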
      <p>In the opposite direction, the clarus-agent provides a REST API to enable the download of the
trained models that are provided by the cloud IDS connector. In its current version, the framework
supports the download of AI models in pickle format [10].</p>
      <p>Once a model is downloaded, the clarus-agent stores the model in a local database and uses the
REST API provided by the clarus-inference-engine to update the active model for local edge
inference. This inference can also be triggered through this API.</p>
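      <p>The pickle round trip can be illustrated with a toy model; any picklable object with a predict() method (e.g. a scikit-learn estimator) behaves the same way. The model class below is invented for the example.</p>
      <preformat>
```python
import pickle

class ThresholdModel:
    """Toy stand-in for a trained model: classifies values against a threshold."""
    def __init__(self, threshold):
        self.threshold = threshold
    def predict(self, values):
        return [1 if v >= self.threshold else 0 for v in values]

# Cloud side: the trained model is serialized for download.
blob = pickle.dumps(ThresholdModel(threshold=0.5))

# Edge side: the agent persists the blob and the inference engine loads it
# as the active model.
model = pickle.loads(blob)
print(model.predict([0.2, 0.7, 0.5]))  # [0, 1, 1]
```
      </preformat>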
      <p>The clarus-concept-drift-detection service (still in development) uses updated raw data
provided by the data harmonization component to eventually detect concept drift of the AI model.
In its current state, it uses the Evidently library [11] to detect only data drift. When a drift in the
source data is detected, the service may eventually trigger a retraining of the model in the cloud.</p>
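      <p>As a deliberately simplified stand-in for the Evidently data-drift check, the following compares the binned distribution of new data against the training reference. The bin count, the score definition, and the sample values are arbitrary choices for illustration, not the service's actual algorithm.</p>
      <preformat>
```python
def drift_score(reference, current, bins=4):
    """Mean absolute difference between the two binned distributions.

    Returns 0.0 when the distributions match; larger values mean more drift.
    """
    lo = min(reference + current)
    hi = max(reference + current)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def hist(sample):
        counts = [0] * bins
        for v in sample:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        return [c / len(sample) for c in counts]

    ref_h, cur_h = hist(reference), hist(current)
    return sum(abs(a - b) for a, b in zip(ref_h, cur_h)) / bins

reference = [0.1, 0.2, 0.2, 0.3, 0.4, 0.4, 0.5]
shifted   = [0.6, 0.7, 0.7, 0.8, 0.9, 0.9, 1.0]
print(drift_score(reference, reference))  # 0.0: no drift
print(drift_score(reference, shifted))    # clearly above zero: drift detected
```
      </preformat>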
      <p>Finally, an edge IDS connector has been developed to achieve two objectives. First, it enables the
registration of the data coming from the plant floor and makes it available. These data can be stored in
any local database on the edge, although currently the edge IDS connector is limited to MinIO and HTTP
servers as storage mediums for the data files. Second, the edge IDS connector is used by the
clarus-agent to download trained models, retrieving them from the cloud IDS connector.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Conclusions</title>
      <p>This paper has presented the CLARUS MLOps framework, which provides innovative MLOps
functionalities over a trustworthy data space for the edge-to-cloud continuum. To achieve this goal, a
novel approach that integrates an Edge2Cloud MLOps workflow over IDS is proposed. As the
framework integrates with IDS, it natively considers relevant aspects such as security, privacy, and data
sovereignty. This enables a trustworthy playground where all the involved stakeholders can contribute
to effectively industrialize real Edge AI.</p>
      <p>The framework leverages a set of well-known and widely used open-source tools such as Airflow
and MLflow and provides additional services to enable a seamless integration of the MLOps workflow
over IDS. Moreover, the framework supports useful features such as parallel training over a
Kubernetes cluster or unlimited-size data frame passing between tasks, and can be installed either on
premises (to take advantage of owned specialized hardware) or on the public cloud.</p>
      <p>As of writing, the framework is still under development, although version 3 is already available
at [3]. Preliminary results show that the proposed MLOps platform can be easily used by data
scientists who are not necessarily software engineers, and can help in managing, organizing, and
optimizing access to the hardware infrastructure on premises. By using the platform, AI model
developers can focus on data engineering and model training and validation, leaving the
industrialization aspects of the AI models to the platform.</p>
      <p>Nevertheless, several aspects can be enhanced and will be addressed soon. (1) Regarding the
infrastructure, the installation of the framework over hybrid on-premises/public clouds will provide
more flexibility and scalability when on-premises hardware resources are not enough. (2) The concept
drift service will also be integrated in the platform as part of the workflow, to eventually trigger the
model retraining automatically. (3) To enhance the user experience, a graphical tool to generate the
Airflow DAG is being developed. By using this zero-code tool, the AI model developer will be able to
easily define the pipeline of the AI-related tasks, such as data gathering, data processing or model
training. For each task, only the library dependencies (e.g. PyTorch, TensorFlow or Scikit-learn),
Python version and required hardware need to be specified, and the platform will use this
information to provision the required docker images and machines. (4) Also, from the library
dependency specification, the platform will automatically build the docker image to be deployed on
the edge for model inferencing.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This work has been supported by the project “CLARUS”, which has received funding from the
European Union’s Horizon Europe research and innovation program under grant agreement No.
101070076.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] International Data Spaces Association (IDSA), "Implementing the European Strategy of Data (position paper)," April 2020. [Online]. Available: https://internationaldataspaces.org/wp-content/uploads/IDSA-Position-Paper-implementing-european-data-strategy-role-of-IDS.pdf.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] European Commission, "A European strategy for data," 2020. [Online]. Available: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A52020DC0066.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] Ikerlan, "CLARUS MLOps framework v3.0.0," 2024. [Online]. Available: https://github.com/CLARUS-Project/ai-toolkit/tree/v3.0.0-main.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Linux Foundation, "mlflow". [Online]. Available: https://mlflow.org/.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Apache, "DAGs - Airflow Documentation - Apache Airflow". [Online]. Available: https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Apache, "Airflow". [Online]. Available: https://airflow.apache.org/.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] ENGineering, "True Connector". [Online]. Available: https://github.com/Engineering-Researchand-Development/true-connector.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Ikerlan, "CLARUS MLOps framework v2.0.0," 2023. [Online]. Available: https://github.com/CLARUS-HE-Project/ai-toolkit/tree/v2.0.0-main/Cloud.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Jupyter, "Jupyter Notebooks". [Online]. Available: https://jupyter.org/.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] Python.org, "Pickle format". [Online]. Available: https://docs.python.org/3/library/pickle.html.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] Evidently AI, "Evidently library". [Online]. Available: https://www.evidentlyai.com/.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>