<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>October</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Towards a Process Reference Model for Machine Learning Applications: Challenges and Opportunities</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antonio Crespi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Monserrat</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonia Mas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antoni-Lluís Mesquida</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universitat de les Illes Balears (UIB)</institution>
          ,
          <addr-line>Cra. De Valldemossa, km 7.5, 07122 Palma</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>25</volume>
      <issue>2025</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>The development of machine learning applications is increasingly subject to quality, compliance, and maintainability demands that exceed the capabilities of current workflow-oriented approaches. While numerous lifecycle models for machine learning have been proposed, they lack the formal structure required to support process assessment, standardization, or maturity evaluation. In contrast, software engineering has long benefited from process reference models grounded in international standards such as ISO/IEC/IEEE 24774 and ISO/IEC 33004. This paper presents the initial foundation for a process reference model for machine learning, developed after a systematic analysis of existing machine learning lifecycle models and aligned with ISO-style process specification practices. The approach formalizes machine learning processes in terms of purpose, inputs, outputs, and outcomes, and supports their eventual use in capability and maturity frameworks.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine Learning</kwd>
        <kwd>Artificial Intelligence</kwd>
        <kwd>Lifecycle Models</kwd>
        <kwd>ISO Standards</kwd>
        <kwd>Process Reference Model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Unlike traditional software engineering, where process-oriented development is supported by decades
of standardization efforts, machine learning (ML) development remains largely unstructured. Most
current practices rely on high-level workflows or tool-driven pipelines that offer limited support for
systematic process definition, assessment, or improvement. Although numerous lifecycle models for
ML have been proposed [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], they typically lack formal specifications of process purpose, inputs, outputs,
and outcomes. This impedes reproducibility, complicates quality assurance, and hinders alignment with
compliance requirements in regulated domains.
      </p>
      <p>
        Efforts to address this gap have emerged from both academic and industrial initiatives. Frameworks
such as CRISP-DM [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and TDSP [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] have been widely used to organize ML workflows, and recent
standards such as ISO/IEC 5338 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] represent important steps toward lifecycle-oriented guidance for
Artificial Intelligence (AI) systems. Nonetheless, these models stop short of providing the type of process
reference model (PRM) needed to support structured development, cross-organizational alignment, or
process capability assessment. Existing ML lifecycle models are diverse in terminology, inconsistent in
structure, and not readily compatible with established software engineering standards.
      </p>
      <p>
        This paper presents the initial foundation for a PRM tailored to the development of ML applications.
The proposed approach draws on ISO/IEC 33004 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and ISO/IEC/IEEE 24774 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which define the
structure and attributes required for process reference and assessment models. By reinterpreting
lifecycle models identified in a prior systematic literature review [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] through the lens of ISO-style
process specification, this work defines a methodology to formalize ML processes in a consistent
and evaluable manner. The aim is not to create a new lifecycle model, but to provide a structured
and standard-aligned basis for process modeling, assessment, and future maturity modeling in ML
engineering.
      </p>
      <p>The paper is structured as follows. Section 2 reviews the background and related work, including
existing ML lifecycle models, the role of standardization in software and systems engineering, and
recent efforts to structure ML development through international standards. Section 3 outlines gaps
and opportunities for a standardized PRM. Section 4 introduces the methodology used to define the
PRM. Section 5 discusses the rationale behind the model’s structure, its alignment with the literature,
and illustrates how a specific process would be described. Finally, Section 6 summarizes the main
contributions of the paper and outlines the planned phases of future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <p>This section provides an overview of prior work relevant to the definition and structuring of processes
in ML development. It begins with a summary of existing ML lifecycle models, drawing on a systematic
literature review that catalogues and synthesizes their phases and activities. It then introduces the
concept of process models as used in software engineering, describing their structure and function
within established lifecycle standards. Finally, it outlines the main international standardization efforts
in ML and AI, focusing on the most prominent ISO/IEC initiatives and their contributions to formalizing
processes in this domain.</p>
      <sec id="sec-2-1">
        <title>2.1. ML Lifecycle Models</title>
        <p>
          The systematic literature review conducted in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] identified and analyzed 18 primary studies, from
which 14 distinct ML lifecycle models and 4 sub-models were extracted. These models vary in structure
and scope, but collectively reflect increasing efforts to address the specific demands of ML development.
        </p>
        <p>A key finding of the review is the considerable variability in how lifecycle models structure and define
development processes. Some models are narrowly scoped, focusing primarily on model development
and deployment, while others attempt to address broader concerns such as compliance, collaboration,
or long-term monitoring. However, few models provide an integrated view that spans the full range of
activities—from early-stage planning and data governance to post-deployment adaptation. Notably,
activities related to model updating, ethical considerations, and organizational roles are underrepresented
across the literature. This heterogeneity suggests a lack of convergence on the necessary components
of a complete ML lifecycle and highlights the absence of a shared conceptual foundation.</p>
        <p>The review also categorized lifecycle activities into four broad groups: (1) objective and scope
definition, (2) data, (3) model, and (4) operation activities. While some categories, such as data preparation
and model training, are frequently addressed, others—such as model monitoring, project review, or risk
analysis—appear only in a subset of models. These inconsistencies indicate that current ML lifecycle
models tend to emphasize specific concerns (e.g., technical implementation or data workflows) while
neglecting others (e.g., governance, transparency, or traceability). This fragmented coverage complicates
comparisons between models and limits their applicability as general-purpose development frameworks.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Process Models in Software Engineering</title>
        <p>Process models in software engineering provide structured frameworks that define how software
is developed, validated, deployed, and maintained. Their main objective is to support systematic
development by describing the activities, roles, artefacts, and relationships involved in producing and
evolving software systems. These models serve as a foundation for planning, execution, monitoring,
and improvement, helping teams achieve consistent results, maintain traceability, and ensure quality
across the software lifecycle.</p>
        <p>
          In the literature and in practice, the term lifecycle model often refers to high-level development
strategies that organize the overall sequence of phases in a project, such as analysis, design, implementation,
and maintenance. Common lifecycle models include the Waterfall model [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], which follows a linear and
sequential flow; the Spiral model [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], which incorporates iterative development with risk management;
and the V-model [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], which emphasizes parallel validation and verification for each development phase.
Later approaches such as incremental and iterative models, Agile methodologies [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], and DevOps
practices [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] have placed more emphasis on adaptability, collaboration, and automation.
        </p>
        <p>These lifecycle models are often supported or operationalized through process models, which describe
in more detail the individual processes that occur within each phase. A process model defines what each
process is intended to achieve (its purpose), what outcomes should be produced, and what activities,
inputs, and outputs are involved. Process models provide the structure needed to implement a lifecycle
model consistently, and they allow teams to tailor practices, assign responsibilities, and measure
performance.</p>
        <p>
          The ISO/IEC/IEEE 12207 [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] standard is a prominent example of a process model framework.
It defines a comprehensive set of processes for software lifecycle management, including primary
(development), supporting (verification, validation), and organizational (management, improvement)
processes. ISO/IEC/IEEE 12207 is designed to be compatible with various lifecycle models and can
be used in combination with ISO/IEC/IEEE 15288 [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] for systems engineering. It has been widely
adopted in both industry and public sectors and serves as a basis for software process assessments and
certifications.
        </p>
        <p>Process models contribute to software development in several important ways. First, they support
standardization by providing a shared terminology and structure for organizing development activities.
This reduces ambiguity and improves communication within teams and across organizational boundaries.
It also facilitates training, knowledge transfer, and integration with external stakeholders.</p>
        <p>Second, process models promote repeatability and traceability. When activities are clearly defined, it
becomes easier to execute them consistently across projects. Artefacts, decisions, and responsibilities
can be tracked more effectively, which is particularly important in large or long-lived systems where
documentation and auditability are necessary.</p>
        <p>A third key contribution is to quality assurance. Many models, such as the V-Model and ISO/IEC/IEEE
12207, explicitly include verification and validation steps within the process structure. This supports
early detection of defects and helps ensure that the final software meets its intended requirements. In
regulated domains, such integration is often essential for demonstrating compliance with safety or
reliability standards.</p>
        <p>
          Process models also enable process improvement. When the processes are well-defined, they can be
assessed and improved using formal frameworks such as the Capability Maturity Model Integration
(CMMI) [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] or the ISO/IEC 33000 series. These assessments help organizations identify weaknesses,
benchmark their practices, and establish goals for continuous improvement.
        </p>
        <p>Finally, in many domains, structured processes are a prerequisite for regulatory compliance. Sectors
such as aerospace, automotive, finance, and healthcare require formal lifecycle documentation and
traceability to meet legal or certification requirements. In these contexts, adopting a recognized process
model is not only a best practice but often a necessity.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. ISO Standards for ML and AI</title>
        <p>Recent efforts to formalize the development of ML and AI systems have led to the emergence of several
ISO/IEC standards specifically addressing lifecycle processes, data quality, and governance in AI. These
standards are primarily developed under the ISO/IEC JTC 1/SC 42 subcommittee, which focuses on
AI-specific standardization.</p>
        <p>A central contribution is ISO/IEC 5338, which defines lifecycle processes for AI systems and explicitly
adapts lifecycle models such as those in ISO/IEC/IEEE 12207 and ISO/IEC/IEEE 15288 to AI-specific
contexts. ISO/IEC 5338 introduces novel processes such as AI data engineering, continuous validation,
and iterative model updates, accounting for the dynamic nature of ML systems. While it follows the
structure of software lifecycle standards, it also incorporates AI-specific characteristics such as data
dependencies, probabilistic behavior, and retraining.</p>
        <p>
          Other standards address supporting concerns. ISO/IEC 5259-1 [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] focuses on data quality for
analytics and ML, establishing terminology and practices to ensure reliable datasets. ISO/IEC 22989 [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]
and ISO/IEC 23053 [17] contribute to terminological consistency and provide high-level conceptual
frameworks for describing AI systems using ML.
        </p>
        <p>Governance and risk management are addressed by ISO/IEC 42001 [18], which defines a management
system for AI and provides organizational processes to ensure compliance, accountability, and ethical
oversight. It aligns structurally with ISO 9001 [19], but introduces AI-specific provisions such as risk
and impact assessment, lifecycle documentation, and fairness assurance. Yet, like ISO/IEC 5338, it does
not offer detailed guidance for change control, stakeholder responsibility, or lifecycle traceability.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Gaps and Opportunities</title>
      <p>While the previous section described recent advances in ML lifecycle models and the development
of ISO standards for AI, important limitations remain. Even though there is growing interest [20]
in formalizing and standardizing ML development, current models and standards still have several
problems that reduce their usefulness in practice. This section summarizes the main issues found in
existing ML lifecycles and in the ISO standards that aim to support ML and AI systems.</p>
      <p>Regarding lifecycle models, many have a limited scope. They often focus on model development and
deployment, but give little attention to earlier stages like project planning or later stages like monitoring
and maintenance. As a result, important aspects such as traceability, documentation, and compliance
with regulations are not well supported [21][22]. Even when feedback loops or post-deployment steps
are included, they are usually described in general terms and not as clear, structured processes. In
addition, the terminology and grouping of activities vary a lot across models, which makes comparison
difficult and prevents a shared understanding of the lifecycle.</p>
      <p>Post-deployment activities are another weak point [23]. Only a few models clearly describe how
to monitor, update, or retrain models once they are deployed. These steps are needed to effectively
deal with problems like model drift or performance drop over time. Ethical concerns such as fairness,
transparency, and explainability are also not well represented [24]. They are sometimes mentioned,
but rarely included as formal activities in the lifecycle. Furthermore, many models do not consider the
different roles in ML teams, such as MLOps engineers or data engineers, which leads to an incomplete
view of how real ML projects are organized and managed.</p>
      <p>Similar gaps can be found in the current ISO/IEC standards for ML and AI. Standards like ISO/IEC
5338 and ISO/IEC 42001 introduce useful structures for lifecycle management and governance, but they
are not yet as detailed or mature as traditional software engineering standards like ISO/IEC/IEEE 12207
or the ISO/IEC 33000 series. In particular, there is no standard that defines how to assess or improve
the maturity of ML development processes. This makes it difficult for organizations to evaluate their
practices or track progress over time.</p>
      <p>There are also missing definitions in the standards. Important terms and roles used in ML
development—such as federated learning, data versioning, or MLOps—are not included in current terminology
standards like ISO/IEC 22989 or ISO/IEC 23053. ISO/IEC 5338 describes the iterative nature of AI
development and includes processes such as data engineering and validation, but it does not clearly
define how to manage changes, assign responsibilities, or ensure accountability. Similarly, ISO/IEC
42001 deals with AI-specific quality and risk issues but does not explain how to include these concerns
in daily development activities. Overall, while the standardization of ML processes is improving, it still
lacks the completeness and practical guidance found in established software engineering standards.</p>
      <p>A standardized PRM for ML could help address many of these gaps by providing a clear and consistent
structure for defining ML development processes. Unlike current lifecycle models, which vary in scope
and terminology, a PRM would define a common set of processes, each with a clear purpose, expected
inputs and outputs, and recommended practices. This would make it easier for teams to align their
workflows, improve collaboration, and reduce misunderstandings about responsibilities or development
steps.</p>
      <p>A PRM could also support traceability and compliance by making sure that important activities—such
as documentation, risk analysis, monitoring, and model updates—are formally included in the lifecycle.
This would help teams manage technical debt, improve reproducibility, and meet regulatory
requirements, especially in high-risk domains like healthcare or finance. Including ethical and
governance-related processes in the PRM would also encourage teams to treat fairness, transparency, and
accountability as core development tasks, not just optional concerns.</p>
      <p>In addition, a PRM designed according to ISO principles could support process assessment and
continuous improvement. By defining capability levels or maturity indicators for each process, the PRM
would allow organizations to evaluate their practices and identify areas for growth. This would fill the
current gap left by the absence of ML-specific process assessment standards, and provide a foundation
for more reliable, maintainable, and auditable ML systems.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Process Description Methodology</title>
      <p>This section presents the methodology followed to define a PRM for ML. The approach is based on two
ISO/IEC standards: ISO/IEC 33004:2015, which defines the requirements for building process reference
and assessment models, and ISO/IEC/IEEE 24774:2021, which provides a specification for writing process
descriptions.</p>
      <p>The first step in creating a PRM is to define its domain and scope. According to ISO/IEC 33004, the
PRM must clearly state which area it applies to. In this case, the area is development and management of
ML systems. It must also explain the purpose of the model and which community of users it is intended
for, such as developers, researchers, or organizations working with AI systems. If the PRM is meant to
reflect a shared view among practitioners, the model should also document how that agreement was
reached, or clarify if no formal consensus process was used.</p>
      <p>Each process in the PRM must include three required elements: a name, a purpose, and a set of
outcomes. These elements form the core of the model and are required to comply with ISO/IEC 33004.
The process name should be short and descriptive, typically ending with the word “process”. The purpose
is a high-level goal that explains why the process is needed, written in plain language beginning with
“The purpose of the X process is...”. The outcomes are observable results that show when the purpose
has been achieved. Each outcome must describe a concrete and positive result, such as a decision being
made, a quality being verified, or a resource being delivered. These outcomes should be written clearly
and be easy to assess in practice.</p>
      <p>To support practical use of the model, we also include several optional elements recommended
by ISO/IEC/IEEE 24774. These help organizations implement and tailor the model to their needs. In
addition to purpose and outcomes, each process may define:
• Inputs and outputs, which describe what information or artefacts the process receives and
produces.
• Base practices, which are typical actions that help achieve the outcomes.
• Roles, which identify who is responsible for carrying out the process.
• Controls and constraints, such as policies, standards, or technical limitations that apply to the
process.</p>
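      <p>The mandatory elements (name, purpose, outcomes) and the optional elements listed above can be sketched as a simple data structure. This is an illustrative aid only: the field names and the consistency checks below are assumptions drawn from this section, not definitions taken from ISO/IEC 33004 or ISO/IEC/IEEE 24774.</p>

```python
from dataclasses import dataclass, field

@dataclass
class ProcessDescription:
    """Sketch of an ISO/IEC/IEEE 24774-style process description.

    `name`, `purpose`, and `outcomes` are the mandatory elements;
    the remaining fields are the optional ones listed above.
    Field names are illustrative assumptions, not standard-defined.
    """
    name: str                 # short and descriptive, ends with "process"
    purpose: str              # "The purpose of the X process is..."
    outcomes: list[str]       # observable, assessable results
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)
    base_practices: list[str] = field(default_factory=list)
    roles: list[str] = field(default_factory=list)
    controls: list[str] = field(default_factory=list)

    def check(self) -> list[str]:
        """Flag violations of the mandatory-element conventions."""
        issues = []
        if not self.name.lower().endswith("process"):
            issues.append("name should end with the word 'process'")
        if not self.purpose.startswith("The purpose of"):
            issues.append("purpose should follow the standard phrasing")
        if not self.outcomes:
            issues.append("at least one outcome is required")
        return issues

data_cleaning = ProcessDescription(
    name="Data Cleaning process",
    purpose="The purpose of the Data Cleaning process is to detect, "
            "correct, or remove inaccuracies and inconsistencies in raw data.",
    outcomes=["Cleaned datasets are produced and documented"],
    roles=["Data engineer"],
)
print(data_cleaning.check())  # -> []
```

      <p>A description failing any of the three mandatory conventions would be flagged by the same check, which is the kind of mechanical consistency a standard-aligned template enables.</p>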
      <p>This structure helps teams understand what each process is for, how to perform it, and how it connects
with other parts of the ML lifecycle. It also supports tailoring for different domains or organizations.</p>
      <p>An important part of the PRM is describing the relationships between processes. ISO/IEC 33004
requires that the PRM includes a process architecture, showing how processes are related. This can
include sequencing (e.g., that data preparation comes before model training), feedback loops (e.g., from
monitoring back to data collection), or dependencies (e.g., that one process needs the results of another
to start). This structure helps users navigate the model and adapt it to real workflows.</p>
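      <p>One lightweight way to make such a process architecture machine-checkable is to record dependencies as a directed graph and derive a valid execution ordering from it. The process names and edges below are illustrative assumptions, not part of the PRM.</p>

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Illustrative dependencies: each process maps to the set of
# processes whose results it needs before it can start.
dependencies = {
    "data collection":  set(),
    "data preparation": {"data collection"},
    "model training":   {"data preparation"},
    "model evaluation": {"model training"},
    "deployment":       {"model evaluation"},
    "monitoring":       {"deployment"},
    # The feedback loop (monitoring -> data collection) is deliberately
    # not modelled as a dependency, which would make the graph cyclic;
    # it would be recorded separately as a feedback relationship.
}

order = list(TopologicalSorter(dependencies).static_order())
print(order)  # a valid ordering: data collection first, monitoring last
```

      <p>The same graph can be inverted to answer dependency questions (e.g., which processes must be revisited when a data-related process changes), supporting the navigation use described above.</p>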
      <p>Although this paper does not define a complete process assessment model (PAM), the structure of
the PRM is designed to support future assessment. In ISO/IEC 33004, processes must be described in a
way that allows their outcomes to be used as a basis for capability or maturity evaluation. This means
that each outcome must be written clearly enough to assess whether it has been achieved, providing a
solid foundation for later work on process measurement and improvement.</p>
      <p>Finally, the methodology supports the creation of process views, as described in ISO/IEC/IEEE 24774.
A process view is a customized version of the PRM for a specific context, such as a healthcare application
or a regulated environment. These views reuse the same processes but may highlight or modify specific
elements to suit domain-specific needs. This flexibility allows the PRM to be adapted without redefining
its core structure.</p>
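      <p>The reuse-and-extend relationship between a PRM process and a process view can be sketched as follows. The base process, the extra outcomes, and the healthcare control named here are hypothetical examples, not content defined by ISO/IEC/IEEE 24774.</p>

```python
import copy

# Illustrative base PRM process: elements shared by all contexts.
base_process = {
    "name": "Model Monitoring process",
    "purpose": "The purpose of the Model Monitoring process is to detect "
               "drift and performance degradation in deployed models.",
    "outcomes": ["Drift is detected and reported",
                 "Retraining is triggered when thresholds are exceeded"],
    "controls": [],
}

def make_view(process, extra_outcomes=(), extra_controls=()):
    """Derive a process view: reuse the process, extend selected elements."""
    view = copy.deepcopy(process)  # the base definition stays untouched
    view["outcomes"].extend(extra_outcomes)
    view["controls"].extend(extra_controls)
    return view

# Hypothetical healthcare view: same process, added audit obligations.
healthcare_view = make_view(
    base_process,
    extra_outcomes=["Monitoring evidence is archived for audit"],
    extra_controls=["Applicable medical-device regulations"],
)
```

      <p>Because the view is derived rather than redefined, changes to the core process propagate naturally, which is the adaptability property this subsection describes.</p>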
      <p>In summary, this methodology combines the formal structure of ISO/IEC 33004 with the practical
guidance of ISO/IEC/IEEE 24774 to define a process reference model that is both standard-compliant
and useful for ML development. It provides a clear format for defining processes, supports consistent
implementation, and creates the basis for future evaluation and improvement.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conceptual Approach Towards a PRM for ML</title>
      <p>This section presents the structure of the proposed PRM for ML application development. It describes
how the model organizes processes into categories that align with both the findings from the systematic
literature review and the methodology detailed in the previous section. The goal is to define a modular
and assessable architecture that supports clarity, consistency, and future standardization. Finally, an
example is provided to illustrate how an individual process would be formally defined within the PRM.</p>
      <sec id="sec-5-1">
        <title>5.1. Domain and Scope</title>
        <p>The proposed PRM is defined for the domain of ML application development. It addresses the specific
needs of designing, implementing, deploying, and maintaining ML systems, including activities related
to data handling, model lifecycle management, validation, monitoring, and adaptation.</p>
        <p>The scope of this PRM includes both technical and organizational processes across the entire lifecycle
of ML systems. It covers project initiation, data acquisition and processing, model development, system
integration, deployment, and post-deployment activities such as monitoring, retraining, and governance.
The model is technology-agnostic with respect to ML algorithms and platforms, and it can be adapted
to support a broad range of application domains, including but not limited to healthcare, finance, and
scientific computing.</p>
        <p>The community of interest includes ML engineers, data scientists, software engineers, quality
assurance professionals, project managers, and compliance officers engaged in the development or governance
of ML systems. This proposal reflects a synthesis of insights from academic literature via a systematic
review, practical lifecycle models, and process modeling standards. Although formal consensus has not
yet been established, the model is grounded in established practices and documented gaps identified in
the literature.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Process Categories and Architecture</title>
        <p>Drawing on the results of the systematic literature review, the proposed PRM structures ML development
into six major categories of related processes. These categories reflect the specific demands of ML
workflows while aligning with the structural requirements of ISO/IEC 33004. Each category groups
processes that share related concerns and lifecycle roles, building on the four high-level activity types
identified in the literature review (objective and scope definition, data-related activities, model-related
activities, and operation-related activities) but refining them into a more granular and assessable form.
This division also facilitates alignment with process architectures from traditional software engineering
standards such as ISO/IEC/IEEE 12207.</p>
        <p>• Initiation and Planning: This category includes processes such as project scoping, stakeholder
analysis, feasibility study, regulatory and ethical risk identification, and resource planning. These
processes establish the initial conditions under which ML work is defined and executed, and
provide an early alignment between business goals, legal requirements, and technical constraints. It
refines the SLR category of objective and scope definition.</p>
        <p>
• Data Engineering: This category includes processes for data acquisition, validation, cleaning,
enrichment, versioning, and storage management. These processes aim to make data artefacts
that are reliable, traceable, and reusable. Treating data engineering as a standalone category
also allows for better alignment with MLOps and data-centric development practices, where
data quality and control are maintained independently from modelling activities. It groups the
data-related activities included in the SLR.
• Model Development: This category includes processes for designing, training, evaluating, and
selecting machine learning models. Specific processes include model selection and configuration,
hyperparameter tuning, evaluation against defined metrics, and interpretability assessment. These
processes are iterative and often experimental, and this category provides the necessary structure
to describe them in a reproducible and assessable way. It groups the model-related activities
included in the SLR.
• System Integration and Testing: This category covers the processes required to integrate the
model into a broader software system. It includes system-level testing, interface specification,
integration testing, and verification of technical and functional requirements. Although often
grouped under general “operation-related activities” in the SLR, these integration tasks require a
distinct treatment in the PRM due to their role in validating the readiness of ML components for
production use, especially in regulated or high-reliability environments.
• Deployment and Operation: This category focuses on the runtime configuration and management
of ML systems. Processes in this category include environment setup, version control of models
and configurations, deployment planning, performance monitoring, and incident management. It
includes processes from the operation-related category in the SLR.
• Monitoring, Adaptation, and Governance: This category introduces processes that support the
long-term maintenance and responsible use of ML systems. It includes model monitoring for
drift detection, performance degradation analysis, triggering of retraining cycles, documentation
of updates, compliance checks, and periodic reviews of fairness or bias. These processes are
increasingly emphasized in MLOps and responsible AI frameworks, and their explicit inclusion
responds to one of the main deficiencies identified in current lifecycle models.</p>
        <p>Each category in this architecture consists of one or more processes that will be defined according to
ISO/IEC 33004 and ISO/IEC/IEEE 24774. For each process, the PRM will provide a name, a purpose,
and a set of outcomes that must be achieved. Where appropriate, additional elements such as inputs,
outputs, base practices, roles, and controls and constraints will be included to support the implementation
and integration of the PRM. All processes will be independent but connected through a process architecture
that specifies the dependencies, order, and feedback relationships between them.</p>
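        <p>To make the intended specification format concrete, the following sketch models a single PRM process entry with the elements listed above. The class, its field names, and the completeness check are illustrative assumptions for this paper's design, not normative content of ISO/IEC 33004 or ISO/IEC/IEEE 24774.</p>

```python
from dataclasses import dataclass, field

@dataclass
class ProcessSpec:
    """One PRM process entry: mandatory name, purpose, and outcomes,
    plus the optional descriptive elements discussed in the text."""
    name: str
    purpose: str
    outcomes: list[str]
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)
    base_practices: list[str] = field(default_factory=list)
    roles: list[str] = field(default_factory=list)
    controls: list[str] = field(default_factory=list)

    def is_complete(self) -> bool:
        # Minimal check: a process needs a name, a purpose, and >= 1 outcome.
        return bool(self.name and self.purpose and self.outcomes)

data_cleaning = ProcessSpec(
    name="Data Cleaning",
    purpose="Detect, correct, or remove inaccuracies and inconsistencies in raw data.",
    outcomes=["Cleaned datasets are produced and documented.",
              "Cleaning decisions and transformations are recorded."],
)
assert data_cleaning.is_complete()
```

        <p>A process architecture would then be a set of such entries plus the dependency and feedback relations between them.</p>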
        <p>Base practices for each process will be identified using a combined standards-based and
evidence-based approach. Each practice will be defined to directly support one or more of the process outcomes
and to contribute to the generation of the expected outputs, supporting traceability and internal
consistency. To complement this outcome-driven approach, academic literature will be reviewed to
identify concrete activities described in real-world ML workflows and empirical studies. These activities
will be abstracted into generalized practices and incorporated when they align with the intended
outcomes and add practical value.</p>
        <p>This architecture supports modularity and allows for tailored views of the PRM, enabling adaptation
for specific organizational contexts, domains (e.g., healthcare, finance), or lifecycle models. It also
enables future capability evaluations aligned with ISO/IEC 33020 [25] and maturity models aligned
with ISO/IEC TS 33061 [26].</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Process Example: Data Cleaning</title>
        <p>This section illustrates the application of the methodology described in the previous section by defining
the Data Cleaning process, one of the core processes in the Data Engineering category. The process
is presented below, including its purpose, outcomes, and additional descriptive elements to support
clarity, traceability, and practical use within the PRM.</p>
        <sec id="sec-5-3-1">
          <title>Purpose</title>
          <p>The purpose of the Data Cleaning Process is to detect, correct, or remove inaccuracies and inconsistencies
in raw data to ensure that datasets are suitable for training, validation, and deployment of machine
learning models.</p>
        </sec>
        <sec id="sec-5-3-2">
          <title>Outcomes</title>
          <p>• Cleaned datasets are produced and documented for use in downstream modelling or evaluation
processes.
• The cleaning decisions and applied transformations are recorded to support reproducibility and
auditability.</p>
        </sec>
        <sec id="sec-5-3-3">
          <title>Inputs</title>
          <p>• Exploratory Data Analysis (EDA) Report to guide feature engineering, model selection, and other
downstream processes (output of the Data Analysis Process).
• An integrated dataset (output of the Data Integration Process).
• A Data Quality Analysis Report documenting recommendations for improving the dataset (output
of the Data Quality Analysis Process).</p>
        </sec>
        <sec id="sec-5-3-3b">
          <title>Outputs</title>
          <p>• Data Cleaning Plan: A documented data cleaning plan detailing identified issues, selected methods,
and a timeline for execution.
• Cleaned dataset: A cleaned dataset that meets the specified quality standards.
• Data cleaning documentation: Documentation detailing the identified issues, the cleaning steps and
decisions, and any transformations applied to fix them.</p>
        </sec>
        <sec id="sec-5-3-4">
          <title>Roles and Responsibilities</title>
          <p>• Data Engineer: Executes the data cleaning procedures and documents results.</p>
        </sec>
        <sec id="sec-5-3-5">
          <title>Base Practices</title>
          <p>• BP1. Plan Data Cleaning Activities [27]: Establish a structured plan for data cleaning by selecting
appropriate techniques, tools, and strategies based on the dataset’s characteristics and quality
requirements. This planning should take into account the EDA Report produced by the Data
Analysis Process, as well as the findings documented in the Data Quality Analysis Report.
• BP2. Handle Missing Values [28][29]: Address missing values in the dataset according to the data
type, modelling requirements, and decisions documented in the Data Cleaning Plan. Common
tasks include removing rows or columns with excessive missing values, imputing values using
statistical methods such as mean, median, or mode, or applying predictive models for more
complex imputations.
• BP3. Correct Data Inconsistencies [28]: Standardize formats, units, and categorical values to ensure
internal consistency across the dataset. This activity may involve correcting inconsistent date
formats, aligning units of measurement (e.g. converting all weights to kilograms), reconciling
variant labels in categorical variables (e.g. "yes", "Yes", "Y"), or making sure that numerical values
are within expected ranges.
• BP4. Remove Outliers and Noise [28][29]: Detect and handle outliers and noisy data points that
may negatively affect model training. Common techniques include threshold-based filtering,
domain-specific rules, or model-based approaches. Depending on the context, outliers may be
removed, capped, or retained with appropriate flags.
• BP5. Normalize and Scale Data [28][29]: Prepare numerical features for model input by applying
normalization or scaling techniques appropriate to the modelling context. Common methods
include min-max normalization, standardization, and robust scaling. The choice of method should
consider the distribution of the data and the requirements of downstream models (e.g. sensitivity
of distance-based algorithms to feature magnitude).
• BP6. Document Cleaning Process [27]: Maintain a detailed log of all actions taken during the data
cleaning process to support transparency, reproducibility, and auditability. The documentation
should include the type of transformation applied, the affected variables or records, the rationale
for each decision, and any thresholds or parameters used. This log should be linked to the Data
Cleaning Plan and stored alongside the cleaned dataset to ensure that the process can be reviewed,
repeated, or validated by other stakeholders.</p>
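          <p>Base practices BP2 to BP4 and BP6 can be illustrated in a few lines of code. The sketch below applies median imputation, label reconciliation, and outlier capping to toy records while appending each action to a cleaning log; the field names, the 300 kg cap, and the label map are assumptions made for the example, not prescribed by the PRM.</p>

```python
from statistics import median

def clean_records(rows, log):
    """Apply BP2 (imputation), BP3 (label reconciliation), and BP4 (outlier
    capping) to a list of record dicts, appending each action to `log` (BP6)."""
    # BP2: impute missing numeric values with the median (one common option).
    present = [r["weight_kg"] for r in rows if r["weight_kg"] is not None]
    med = median(present)
    for r in rows:
        if r["weight_kg"] is None:
            r["weight_kg"] = med
            log.append(("impute_median", "weight_kg", med))
    # BP3: reconcile variant categorical labels ("Y", "Yes", "yes" -> "yes").
    yes_variants = {"yes", "y", "true"}
    for r in rows:
        canon = "yes" if r["smoker"].strip().lower() in yes_variants else "no"
        if canon != r["smoker"]:
            log.append(("standardize_label", "smoker", r["smoker"], canon))
            r["smoker"] = canon
    # BP4: cap outliers at a domain-motivated threshold instead of dropping them.
    cap = 300.0
    for r in rows:
        if r["weight_kg"] > cap:
            log.append(("cap_outlier", "weight_kg", r["weight_kg"], cap))
            r["weight_kg"] = cap
    return rows

log = []  # BP6: the accumulated log doubles as the cleaning documentation.
rows = [{"weight_kg": 70.0, "smoker": "Y"},
        {"weight_kg": None, "smoker": "no"},
        {"weight_kg": 80.0, "smoker": "yes"},
        {"weight_kg": 950.0, "smoker": "no"}]
cleaned = clean_records(rows, log)
assert cleaned[1]["weight_kg"] == 80.0    # imputed with the median
assert cleaned[3]["weight_kg"] == 300.0   # capped, not removed
```

          <p>Persisting the log alongside the cleaned dataset, as BP6 requires, makes each transformation reviewable and repeatable.</p>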
        </sec>
        <sec id="sec-5-3-6">
          <title>Controls and Constraints</title>
          <p>• Data cleaning procedures must comply with organizational data policies and regulatory
requirements (e.g. GDPR).
• Cleaning operations should be non-destructive where possible.</p>
          <p>• All actions must be reproducible and auditable, particularly for high-stakes or regulated domains.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>The PRM proposed in this work contributes to an emerging body of research that seeks to formalize the
development of ML applications by drawing on principles from software engineering and international
standards. Existing ML lifecycle frameworks like CRISP-DM and TDSP have been widely adopted
in practice, but they lack formal process structures. These models are typically defined as high-level
workflows with loosely specified activities, and they do not meet the criteria required for process
reference models as defined in ISO/IEC 33004, such as explicit declarations of purpose, outcomes, and
assessment-ready structure.</p>
      <p>In contrast, the proposed PRM is aligned with the structural and terminological foundations of
standards like ISO/IEC/IEEE 24774 and ISO/IEC/IEEE 12207. It formalizes ML development processes by
identifying their purpose, inputs, outputs, and expected outcomes, and bridges the gap between current
data science practices and the structured lifecycle management traditions of software and systems
engineering. The model complements the broader goals of emerging AI standards, particularly ISO/IEC
5338, by offering a more granular and evaluable process structure that can support future assessments,
governance frameworks, and compliance efforts.</p>
      <p>However, the transition to such a structured process model presents several open challenges. One
major issue is domain specificity. ML applications vary significantly across sectors in terms of their
development constraints, system boundaries, and lifecycle characteristics. While the proposed PRM
is designed to be domain-independent, its practical application will likely require tailoring to address
regulatory requirements, risk profiles, and artifact expectations that differ across contexts. The tension
between generalizability and specificity remains unresolved.</p>
      <p>Another challenge concerns the integration of formal process models with the operational practices
of MLOps. Whereas the PRM is conceptual and process-oriented, MLOps frameworks emphasize
automation, monitoring, and infrastructure. Current industry tools and pipelines are typically organized
around deployment and retraining workflows, and it is not yet clear how these operational stages
map onto process reference models. This lack of alignment makes traceability between lifecycle-level
governance and implementation-level activities more difficult, particularly in iterative or continuous
delivery environments.</p>
      <p>Furthermore, there are adoption barriers associated with industry practices and culture. Many
ML teams operate with informal or ad hoc workflows driven by experimentation and rapid iteration.
Adhering to a structured PRM in such contexts may be perceived as burdensome unless clear benefits can
be demonstrated. This raises questions about incentives, tooling support, and the need for intermediate
representations that mediate between process models and practice.</p>
      <p>Future work will proceed in several phases. The first phase will focus on completing the full
specification of the PRM, defining all relevant ML lifecycle processes in compliance with ISO/IEC
standards. In the second phase, the model will be reviewed and refined based on feedback from its
application in multiple organizations and industrial domains. The third phase will explore how the PRM
can be customized to suit different environmental conditions, regulatory contexts, or sector-specific
requirements. Finally, the fourth phase will involve the development of a process assessment model and
a process maturity model for systematic evaluation and improvement of ML development practices.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Declaration on Generative AI</title>
      <p>The authors acknowledge the use of GPT-4o to assist with translation, as well as with grammar and
syntax improvements in this manuscript. The final content, including its structure and arguments, remains the
authors’ own work, with all decisions regarding terminology and interpretation made independently.</p>
      <p>[17] Framework for artificial intelligence (AI) systems using machine learning (ML), ISO/IEC 23053:2022.
[18] Information technology - Artificial intelligence - Management system, ISO/IEC 42001.
[19] Quality management systems - Requirements, ISO 9001.
[20] S. Martínez-Fernández, J. Bogner, X. Franch, M. Oriol, J. Siebert, A. Trendowicz, A. M. Vollmer,
S. Wagner, Software engineering for AI-based systems: A survey (2021). URL: http://arxiv.org/abs/2105.01984.
doi:10.1145/3487043.
[21] V. Chandrasekaran, H. Jia, A. Thudi, A. Travers, M. Yaghini, N. Papernot, SoK: Machine learning
governance, arXiv (2021). URL: http://arxiv.org/abs/2109.10870. doi:10.48550/arXiv.2109.10870.
[22] P. Sugimura, H. Florian, Building a reproducible machine learning pipeline, arXiv (2018).
doi:10.48550/arXiv.1810.04570.
[23] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J.-F.
Crespo, D. Dennison, Hidden technical debt in machine learning systems, Neural Information
Processing Systems (2015).
[24] R. Tong, H. Li, J. Liang, Q. Wen, Developing and deploying industry standards for artificial
intelligence in education (AIED): Challenges, strategies, and future directions (2024).
[25] Information technology - Process assessment - Process measurement framework for assessment of
process capability, ISO/IEC 33020:2019.
[26] Information technology - Process assessment - Process assessment model for software life cycle
processes, ISO/IEC TS 33061:2021.
[27] E. Breck, S. Cai, E. Nielsen, M. Salib, D. Sculley, The ML test score: A rubric for ML production
readiness and technical debt reduction, Proceedings of IEEE Big Data (2017).
[28] P. O. Côté, A. Nikanjam, N. Ahmed, D. Humeniuk, F. Khomh, Data cleaning and machine
learning: A systematic literature review, Automated Software Engineering 31 (2023).
doi:10.1007/s10515-024-00453-w.
[29] E. Rahm, H. H. Do, Data cleaning: Problems and current approaches, IEEE Data Engineering
Bulletin (2000).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Crespí</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mesquida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Monserrat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mas</surname>
          </string-name>
          ,
          <article-title>Lifecycle models in machine learning development</article-title>
          ,
          <source>Expert Systems</source>
          <volume>42</volume>
          (
          <year>2025</year>
          ). URL: https://onlinelibrary.wiley.com/doi/10.1111/exsy.70029. doi:10.1111/exsy.70029.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Wirth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hipp</surname>
          </string-name>
          ,
          <article-title>CRISP-DM: Towards a standard process model for data mining</article-title>
          ,
          <source>in: Proceedings of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          Microsoft Learn,
          <article-title>Team data science process (TDSP)</article-title>
          . URL: https://learn.microsoft.com/en-us/azure/architecture/data-science-process.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          Information technology
          <article-title>- artificial intelligence - ai system life cycle processes</article-title>
          ,
          <source>ISO/IEC</source>
          <volume>5338</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          Information technology
          <article-title>- process assessment - requirements for process reference, process assessment and maturity models</article-title>
          , ISO/IEC 33004:
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <article-title>[6] Systems and software engineering - life cycle management - specification for process description</article-title>
          , ISO/IEC/IEEE 24774:
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>W. W.</given-names>
            <surname>Royce</surname>
          </string-name>
          ,
          <article-title>Managing the development of large software systems</article-title>
          ,
          <source>Proceedings of IEEE WESCON</source>
          (
          <year>1970</year>
          )
          <fpage>328</fpage>
          -
          <lpage>388</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Boehm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Hansen</surname>
          </string-name>
          ,
          <article-title>Spiral Development: Experience, Principles, and Refinements</article-title>
          ,
          <source>Spiral Development Workshop, Technical Report</source>
          ,
          <year>2000</year>
          . URL: http://www.sei.cmu.edu/publications/pubweb.html.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Forsberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mooz</surname>
          </string-name>
          ,
          <article-title>The relationship of system engineering to the project cycle</article-title>
          ,
          <source>INCOSE International Symposium</source>
          <volume>1</volume>
          (
          <year>1991</year>
          )
          <fpage>57</fpage>
          -
          <lpage>65</lpage>
          . doi:10.1002/j.2334-5837.1991.tb01484.x.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Beck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Beedle</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. van Bennekum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cockburn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Cunningham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fowler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Grenning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Highsmith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hunt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jeffries</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Marick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mellor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Schwaber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sutherland</surname>
          </string-name>
          , D. Thomas,
          <source>Manifesto for agile software development</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bass</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Weber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>DevOps: a software architect's perspective</article-title>
          , first ed.,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <article-title>Systems and software engineering-software life cycle processes</article-title>
          , ISO/IEC/IEEE 12207 (
          <year>2017</year>
          ). doi:10.1109/IEEESTD.2017.8100771.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <article-title>Systems and software engineering - system life cycle processes</article-title>
          , ISO/IEC/IEEE 15288.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Chrissis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Konrad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shrum</surname>
          </string-name>
          ,
          <article-title>CMMI for Development: Guidelines for Process Integration and Product Improvement</article-title>
          , third ed.,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          Artificial intelligence
          <article-title>- data quality for analytics and machine learning (ml)</article-title>
          ,
          <source>ISO/IEC 5259-1</source>
          :
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <source>Information technology - artificial intelligence - artificial intelligence concepts and terminology</source>
          , ISO/IEC 22989 (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>