Monitoring Machine Learning Systems from the Point of View of AI Ethics

Kai-Kristian Kemell1,∗, Jukka K. Nurminen1 and Ville Vakkuri2

1 University of Helsinki, Department of Computer Science, Yliopistonkatu 3, 00014 University of Helsinki, Finland
2 School of Marketing and Communication, University of Vaasa, Wolffintie 32, FI-65200 Vaasa, PL 700, FI-65101 Vaasa, Finland

7th Conference on Technology Ethics (TETHICS2024), November 6–7, 2024, Tampere, Finland
∗ Corresponding author.
kai-kristian.kemell@helsinki.fi (K. Kemell); jukka.k.nurminen@helsinki.fi (J. K. Nurminen); ville.vakkuri@uwasa.fi (V. Vakkuri)
ORCID: 0000-0002-0225-4560 (K. Kemell); 0000-0001-5083-1927 (J. K. Nurminen); 0000-0002-1550-1110 (V. Vakkuri)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Abstract
The practical implementation of AI ethics remains a challenge. Guidelines and principles are numerous, but converting them into practice appears difficult for organizations developing ML systems. It has been argued that bringing AI ethics closer to software engineering practice could help. In this regard, monitoring of ML systems and metrics related to ethics could be one way of making ethics more tangible. While various existing papers discuss technical approaches to, for example, monitoring fairness, a more holistic view on monitoring AI ethics is lacking, as is discussion on MLOps and ethics. In this paper, we discuss AI ethics from the point of view of monitoring, building on existing research from AI ethics, software engineering, and machine learning, to propose a typology of metrics for monitoring ML systems during their operational lives. We then discuss monitoring ML systems from the point of view of AI ethics by building on this typology and using the Ethics Guidelines for Trustworthy AI (AI HLEG) as a framework to illustrate what monitoring AI ethics might mean in practice. In doing so, we highlight that (a) some issues related to AI ethics are hardly unique to AI ethics and are frequently tackled in ML monitoring, and (b) though AI ethics involves many high-level design decisions made early on in the development of a system, there are still various aspects of AI ethics that may be monitored. Overall, this paper presents an initial discussion on the topic in hopes of encouraging further studies on it.

Keywords
AI ethics, monitoring, metrics, software engineering, ML development, ethical guidelines, MLOps

1. Introduction

While AI ethics has become a prominent topic of discussion that companies developing AI systems are also increasingly aware of, AI ethics remains challenging in practice for various reasons. Guidelines and principles, which are often utilized to approach AI ethics, are seen as ineffective [1], and indeed seem to have little impact on practice [2, 3]. AI ethics overall is still difficult to approach, and amidst the numerous guidelines and principles [4, 5], can appear fuzzy to practitioners who may find it difficult to define exactly what it is they are trying to tackle. It has been argued that one problem in this regard is that AI ethics is distant from Software Engineering (SE) practice, and that bringing AI ethics closer to the typical work of software engineers might help [6, 7]. The operational life of AI systems remains a less explored point of view in AI ethics [8].
Conversely, SE has increasingly come to emphasize continuity through what has become known as continuous SE [9]. Especially DevOps, a well-known continuous SE approach, has made operations (or the operational life of the system) an integral part of the SE process [9]. Rarely does active development of a system stop after its initial release anymore, as new features and new versions of the system are continuously deployed, especially with software increasingly offered as a service [10]. This has also come to be the case for ML systems, especially with MLOps [11] extending DevOps principles into ML development. Whereas DevOps is a portmanteau of development and operations, MLOps is one of Machine Learning (ML) and operations. However, few AI ethics papers discuss continuous SE, MLOps, or even just monitoring of ML systems, outside ML papers discussing technical solutions to ethics-related issues. For example, and primarily, there are ML papers discussing monitoring aspects of algorithmic fairness (e.g., [12]) and various safety and robustness aspects (which are certainly concerns outside the specific point of view of ethics as well). Other types of AI ethics papers, on the other hand, seldom discuss monitoring. To this end, Lu et al. [13] posit that ”there is a strong desire for continuously monitoring and validating AI systems post deployment for ethical requirements but current operation practices provide limited guidance”. We also consider this a relevant gap, as monitoring, and especially the metrics used to monitor ML systems, are one way in which AI ethics may very tangibly manifest in practice, making it more approachable to practitioners struggling to implement it. Some existing papers have also begun to explore MLOps in relation to ethics, using the concept of sustainable MLOps (e.g., [14]) and presenting a link between MLOps and AI ethics. Thus, in this paper, we discuss AI ethics from the point of view of monitoring ML systems, to build on this nascent topic of study. This is a conceptual paper that takes a broad perspective on the topic and aims to encourage further studies into ML system monitoring and MLOps from the point of view of AI ethics. Based on existing literature, we (1) propose a typology for categorizing metrics for monitoring ML systems during their operational life, and include examples of various metrics for each category, in order to provide a framework for future studies, and (2) discuss AI ethics from the point of view of monitoring and metrics, using this typology and the Ethics Guidelines for Trustworthy AI [15] as frameworks.

2. Background

In this section, we discuss relevant literature. In Section 2.1, we discuss monitoring in SE and ML systems. In Section 2.2, we discuss MLOps. In Section 2.3, we discuss AI ethics.

2.1. Monitoring in Software Engineering and Machine Learning Systems

Software monitoring has historically focused largely on quality, specifically from the point of view of errors and stability, and especially in security-critical systems [16]. This is a view on monitoring still often adopted today (e.g., [17]).
Schröder & Schulz [17] define monitoring as ”a process involving the end-to-end extraction, analysis, and interpretation of metrics of an object under observation.” This definition builds on an earlier paper by Kitchenham & Walker [18], who discuss the process of interpreting the results of monitoring as follows: (1) identification of abnormal values, (2) determination of possible causes, and (3) possible corrective actions. While ensuring error-free operation remains an important aspect of monitoring, it can be considered a narrow view on monitoring today. Users today expect systems to be continuously developed past their initial release, with more value being added through continuous deployment of updates [10]. Indeed, continuous SE, and especially DevOps, is now the de facto industry standard [19]. It follows that monitoring also focuses on improving the software based on data, instead of simply maintaining it. Rodríguez et al. [10] highlight this shift: ”the main objective of continuous monitoring is to constantly monitor and measure both business indicators and infrastructure-related metrics in order to facilitate and improve business and technical decision-making.” Overall, SE has recently drawn from the lean startup [20] philosophy that closely links business aspects with software development [9], focusing on collecting data to determine user needs in contexts where they may not be readily apparent (vs. commissioned projects). As for ML systems, ML system monitoring in and of itself is not a novel topic, with numerous studies discussing various aspects of ML system monitoring. ML systems differ from traditional systems through the addition of ML components, and these ML models also have to be monitored [17]. Moreover, ML system monitoring places much emphasis on data-related metrics [17], as (lots of) data is typically used to train the ML models. Overall, many of the studies discussing ML system monitoring focus on issues unique to or particularly relevant for ML systems, such as concept drift and data drift (see, e.g., [21]). ML systems also pose unique challenges from the point of view of quality through model accuracy, whereas conventional software, at its simplest, merely needs to operate in an error-free manner. Though the system may be functioning correctly, the ML model may become less accurate over time due to concept and/or data drift resulting from changes to the context the system operates in [22], necessitating the deployment of new, updated models [11]. When done in a continuous manner, this is referred to as MLOps, which we discuss next. However, few studies take a more holistic view of ML system monitoring as we do in this paper. One example of a more general paper is that of Naser & Alavi [23], who provide an extensive list of error metrics and performance fitness indicators for ML systems.

2.2. MLOps

MLOps has been characterized as DevOps for ML [11]. It refers to the application of continuous SE practices, namely the ones present in DevOps, to ML system development. MLOps is largely concerned with improving collaboration between data scientists and other ML developers and the rest of the development team [19], much like DevOps focuses on collaboration between the development and operations teams [9]. Indeed, many of the SE challenges in ML development stem from communication and collaboration issues between the ML developers and the rest of the development team [24, 25].
Monitoring is also discussed in the context of MLOps, with a focus on automating model monitoring [11]. This is discussed in relation to, e.g., tooling and processes for measuring model degradation in deployed models [11]. Mäkinen et al. [22] consider MLOps to involve ”automation and monitoring at all steps of ML system development and deployment, including integration, testing, releasing, deployment and infrastructure management.”

2.3. AI Ethics

The primary motivation behind this paper is to explore ML system monitoring from the point of view of AI ethics. AI ethics is a long-standing area of research that has recently been highly active following major practical advances in ML in the past decade [26]. Companies, researchers, and governmental actors alike have begun to show interest in AI ethics. This interest is perhaps best highlighted through the sheer number of guidelines for ethical AI that have been proposed in recent years (see Jobin et al. [4] and Hagendorff [5] for reviews of guidelines). These guidelines typically approach ethical issues through principles that ethical AI should adhere to. Such principles have become a common way of conceptualizing AI ethics. For example, one commonly discussed principle is fairness, which is concerned with issues such as bias, diversity, and equality [4]. Fairness is also the focus of various ML tools [27, 28]. However, abstract principles are challenging to implement in practice [1], and AI ethics is a field characterized by a gap between research and practice, as companies seem to struggle to implement AI ethics in practice [2, 3] (or are not interested in doing so [29]). Though various tools related to AI ethics have been introduced [27], their adoption seems not to be widespread [2, 3, 29]. Some recent studies have proposed SE methods for implementing AI ethics (e.g., ECCOLA [7], and RE4AI [30]), which take a more holistic view of AI ethics, looking past individual ethical principles. However, arguably much still needs to be done to make AI ethics practical [7], and empirical studies in AI ethics are still sorely lacking [28]. This paper aims to contribute towards making AI ethics more practical.

3. ML System Monitoring: A Typology

In this section, we propose a typology of metrics for ML system monitoring, focusing specifically on systems that are already operational. Various other metrics may be relevant, for example, while training the initial models for the system, and while some of the metric types discussed here may also serve such purposes, our focus is on metrics that are relevant for operational ML systems. To this end, we also focus on systems where new ML model versions are deployed, possibly in a continuous manner utilizing MLOps. We propose the following seven categories of metrics, which are further discussed in Sections 3.1 to 3.7:

1. Data Metrics
2. ML Model Metrics
3. System/Infrastructure Metrics
4. Process Metrics
5. Business Metrics
6. User Metrics
7. Domain-specific Metrics

This typology takes a holistic point of view to ML system monitoring. There are various ways to categorize metrics in the field of software engineering. In existing literature, metrics are typically discussed in specific contexts. For example, there are papers discussing DevOps metrics [31, 32], software startup metrics [33], error metrics and ”performance fitness indicators” for ML systems [23], and user experience [34].
In the future, we would also like to see such papers on ethics in software engineering, and ML development specifically. On the other hand, a more general view on metrics such as the one presented here is more difficult to come by, as its vast scope makes it arguably less useful for practice. By focusing on specific contexts such as DevOps, it is possible to have a clearer goal for the metrics presented in a paper or typology, resulting in more detail, and thus practical relevance. Our goal is to highlight how metrics related to ethics can be found across the different categories, and for this purpose such a generic typology is useful as an illustration. To this end, this paper builds on a wide variety of existing literature on more specific categories of metrics. One general way of approaching software metrics found in extant literature is to consider product, process, and resource metrics [35]. This typology includes all of these with some more granularity, given the ML context (data, ML model, and system/infrastructure for product, etc.), while also building on other more recent literature on metrics in software engineering. Literature on ML system monitoring typically focuses on data metrics and model metrics, as these are the types of metrics most focused on what separates ML systems from other software systems: the ML models and their training data. We include five other categories of metrics that are more generic, as these are also relevant for ML systems, given that much of ML development is ultimately still just software development [36]. Yet this typology is not exhaustive, nor do we aim to provide an exhaustive list of metrics for any of the categories. As Fenton & Neil [37] point out, there were already thousands of SE metrics discussed in extant literature by the year 2000, making such a goal unrealistic. We simply wish to present a typology that could be utilized to categorize relevant metrics, while providing examples of metrics for each category, and to then use this as a starting point for discussing metrics and monitoring in relation to AI ethics specifically. Finally, it should also be noted that the frequency of monitoring varies between (types of) metrics. Some metrics are continuously monitored in real-time during run-time, such as ones measuring the live performance of the system (compute, uptime, etc.), while others may be measured on a regular basis but not in real-time, such as some model and data metrics that may be relevant to measure every time a new model version is trained or deployed. To this end, some of the metrics we discuss are not necessarily continuously measured during run-time, but are nonetheless measured occasionally during the operational life of an ML system.

3.1. Data Metrics

Given that ML systems are significantly data-driven, literature on ML system monitoring places much emphasis on data metrics. The performance of ML models is highly dependent on the quality and relevance of the data they are trained on, and the way data is processed and managed also plays a notable role in ML development. ”Garbage in, garbage out” is a phrase often used in relation to ML systems to describe the relationship between training data and model output quality. This category focuses on metrics related to the data utilized by the system, and specifically input data. While output data metrics are important, we consider them model metrics in this typology, as they reflect the attributes of an ML model.
Thus, examples of data metrics include:

• Data quality and integrity metrics (such as accuracy, consistency, completeness, and timeliness, etc.).
• Data preprocessing metrics (feature extraction, normalization, imputation, etc.).
• Data diversity metrics.
• Data privacy metrics (such as the percentage of data anonymized, what is stored, where, for how long, etc.).
• Data usage metrics (such as which data is ingested and processed, data velocity, etc.).
• Data drift detection metrics (Kullback–Leibler divergence, Chi-squared test, etc.; see the illustrative sketch at the end of Section 3.3).

3.2. ML Model Metrics

ML model metrics measure attributes related to the ML models in an ML system, such as performance (e.g., accuracy) and fairness. The most conventional model metrics focus on accuracy, such as error rates, though more sophisticated ones have been adopted over time (e.g., precision and recall for classification tasks). Monitoring ML model performance is important as data and/or concept drift can degrade model performance over time. Data drift refers to the ”gradual change in input data that impacts ML model performance” [38], while concept drift refers to a situation where the ”statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways” [39]. Examples of model metrics include:

• Numerous performance metrics (such as accuracy, precision, recall, and F1 score for binary and multiclass classification; Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) for regression tasks, etc.; including also subtypes such as probabilistic metrics like log loss and Brier score, as well as temporal metrics for concept drift detection; Naser & Alavi [23] provide an extensive review of such metrics).
• Fairness metrics (demographic parity, equal opportunity, equalized odds, etc.).
• Training metrics (time, computational resources, etc.).
• Model complexity (number of parameters, size, depth, etc.).
• Versioning (version, training data (set(s)) used, etc.).
• Interpretability and explainability metrics (LIME, counterfactual explanations, feature importance, etc.).

3.3. System/Infrastructure Metrics

System/infrastructure metrics evaluate the overall health and performance of the hardware and software infrastructure that hosts and executes the machine learning components. An ML model is ultimately just one part of a larger system [17]. System metrics are very conventional metrics in software run-time monitoring, although the inclusion of ML components both adds new metrics and may change the relative importance of existing ones. For example, ML components may encourage the monitoring of model inference time, data storage requirements, and other ML-specific infrastructure characteristics. Types of system/infrastructure metrics include:

• Hardware metrics (such as CPU/GPU/TPU usage, memory utilization, disk I/O, network latency/bandwidth, etc.).
• Software metrics (such as system load, application response time, error rates, model inference time, end-to-end latency, etc.).
• System health metrics (such as uptime/downtime, availability, failure rate, etc.).
• API metrics (such as response time, request rate, etc., if exposing ML models through an API, or for APIs utilized by the system itself).
• Environmental impact metrics (energy efficiency, carbon footprint, e-waste, etc.).
• Cybersecurity metrics (though such metrics may also be relevant for the model and data categories in ML systems).
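To make parts of this typology more concrete, the following minimal Python sketch illustrates how three of the metrics named above might be computed: a data drift metric from Section 3.1 (Kullback–Leibler divergence between a reference and a live feature distribution), a fairness metric from Section 3.2 (demographic parity difference), and a system metric from Section 3.3 (mean inference latency). It is a sketch only; the function names, constants, and the dummy model are our own illustrative choices rather than part of any particular monitoring tool.

```python
import time
import numpy as np
from scipy.stats import entropy  # entropy(p, q) yields the KL divergence D(p || q)

def kl_divergence(reference, live, bins=10):
    """Data drift metric (Section 3.1): KL divergence between the binned
    distributions of one feature in the reference (training) and live data."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    p, _ = np.histogram(reference, bins=edges, density=True)
    q, _ = np.histogram(live, bins=edges, density=True)
    return entropy(p + 1e-9, q + 1e-9)  # small constant avoids empty-bin division

def demographic_parity_difference(y_pred, group):
    """Fairness metric (Section 3.2): difference in positive prediction
    rates between the two groups encoded in `group` (0/1)."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def inference_latency(predict_fn, batch, repeats=100):
    """System metric (Section 3.3): mean wall-clock latency of a prediction call."""
    start = time.perf_counter()
    for _ in range(repeats):
        predict_fn(batch)
    return (time.perf_counter() - start) / repeats

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(0.0, 1.0, 10_000)   # feature values seen at training time
    live = rng.normal(0.3, 1.2, 1_000)         # slightly shifted live data
    print("KL divergence:", kl_divergence(reference, live))

    y_pred = rng.integers(0, 2, 1_000)          # dummy binary predictions
    group = rng.integers(0, 2, 1_000)           # dummy protected attribute
    print("Demographic parity difference:", demographic_parity_difference(y_pred, group))

    dummy_model = lambda x: np.clip(x, -1, 1)   # stand-in for a deployed model
    print("Mean inference latency (s):", inference_latency(dummy_model, live))
```

In an MLOps pipeline, functions of this kind would typically be wrapped in whatever monitoring and alerting infrastructure the organization already uses, rather than run as standalone scripts.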
3.4. Process Metrics

In software engineering, various metrics are utilized to measure attributes related to the software development processes. These include metrics related to the development approach (e.g., DevOps metrics [31, 32]) and metrics related to the performance of individual developers (e.g., many recent studies have explored the productivity impact of generative AI tools in software development). Ultimately, much of ML system development is still software development [36], making many of these metrics also relevant for ML system development. In ML development contexts, these metrics also include metrics related to the ML workflows. Examples of process metrics include:

• Development metrics (such as code churn, code quality metrics, defect density, etc.).
• Project management metrics (such as cycle time, lead time, story points completed, burn-up/burn-down charts, requirements engineering metrics related to user stories, etc.).
• Operations metrics (such as (model) deployment frequency, mean time to recovery, model retraining frequency, model rollback rate, model deployment success rate, etc.).
• Team metrics (such as team velocity, turnover rate, employee satisfaction, number of meetings, code reviews, etc.).

3.5. Business Metrics

In this context, business metrics aim to measure the economic and strategic performance of ML systems. These metrics often capture the alignment of the ML system with business objectives, customer satisfaction, and financial performance, and thus provide valuable insight for decision-making at the strategic and operational levels. The importance of business aspects in SE is well-acknowledged especially in the context of continuous SE [9, 10]. While these are largely generic business metrics (e.g., return-on-investment), some ML-specific metrics can still be identified. These should still be related to the system at hand, however, rather than the overall business of the organization, although such a distinction can be difficult to make when that one system is the entire business of a small software company, for example. For generic business metrics for SE, and specifically for software startups, Kemell et al. [33] provide an extensive list. Examples of business metrics include:

• Financial metrics (such as ROI, development cost, revenue generated, model training, deployment, and maintenance costs, data acquisition and storage costs, etc.).
• System impacts (such as process efficiency improvement, cost savings, employee productivity, etc., either internally or in external organizations using the system).

Many key business metrics in SE are arguably related to users. However, in this typology, we feel that it is prudent to separate user metrics into their own category (Section 3.6).

3.6. User Metrics

User metrics measure the interactions between the system and its users. Current SE approaches often emphasize the importance of users, with Agile placing emphasis on user involvement [40] and continuous SE involving the collection of user data to further improve the system in a data-driven manner (e.g., through A/B testing [41]). Understanding users is important in order to ensure that the system meets their needs and provides value for them [20]. Thus, user metrics are usually closely related to business aspects in practice [9]. User Experience (UX) is also a well-established research area. A large list of user metrics can be found in Kemell et al. [33]. Rodden et al. [34] discuss measuring UX overall, while, for example, Arifin et al.
[42] discuss measuring UX in the more specific context of augmented reality applications. Moreover, in terms of user conversion (visitor to paid user), various so-called funnels are utilized in business practice, with various related metrics. Examples of user metrics include:

• Behavioral metrics (such as use frequency, session length, user engagement, various metrics measuring how a system is used during sessions, etc.).
• User satisfaction metrics (such as user satisfaction score, usability scale, feedback, reviews, social media posts, etc.).
• Impact metrics (conversion rate, retention rate, etc.).

3.7. Domain-specific Metrics

Domain-specific metrics not directly related to any of the preceding categories may be found in various domains. For example, in the medical field, data related to patients and treatment outcomes may be of interest in ML development as well. Organizations developing ML systems should be aware of such metrics for their domain. However, given the diversity of such metrics, we do not discuss them further in this paper beyond acknowledging them in this typology due to their potential importance for practice.

4. Monitoring Ethical Aspects in ML Systems

In this section, we leverage the typology of metrics presented in Section 3 to discuss the monitoring of ethics in ML systems. We utilize the Ethics Guidelines for Trustworthy AI [15] (AI HLEG) and their seven requirements as a framework for this discussion, as we look at what kinds of monitoring these requirements might necessitate in practice. This section is split into seven subsections based on the seven requirements of the AI HLEG guidelines. The individual requirements are elaborated on in their respective subsections. The discussion here is hardly exhaustive and is simply meant to illustrate on a very general level how AI ethics principles (or in this case requirements) might manifest in monitoring and metrics.

4.1. Monitoring the Requirement of Human Agency and Oversight

This requirement comprises fundamental rights, human agency, and human oversight. As for fundamental rights, an assessment of an ML system’s fundamental rights impact ”should be done prior to the system’s development” [15]. No clear avenues for monitoring are indicated. Human agency posits that users ”should be able to make informed autonomous decisions regarding AI systems” and ”not to be subject to a decision based solely on automated processing when this produces legal effects on users or similarly significantly affects them” [15], and is also a design-level issue first and foremost. This is also closely related to the GDPR [15]. Human oversight, however, presents clear considerations for monitoring. Human oversight involves human intervention, including, for example, the ability to decide when and how to use the system in different contexts, to oversee its activity in a very broad sense, and to override decisions made by the system [15]. Monitoring can play a key role in understanding when human intervention is required. Certain metrics can be used to raise alerts, pointing to a need for human intervention of varying kinds and of varying urgency, as illustrated in the sketch at the end of this subsection. From the point of view of ethics in particular, this may encompass fairness drift [43], for example, which may indicate a need for model retraining in the near future, but may not be a crucial alert necessitating immediate emergency action. Such metrics may be found across multiple categories in our typology, and especially in model metrics.
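As a minimal illustration of the kind of alerting described above, the sketch below checks a handful of monitored values against thresholds and assigns each breach an urgency level, so that a human operator can decide whether and how quickly to intervene. The metric names, thresholds, and urgency levels are purely illustrative assumptions on our part, not prescriptions; as discussed in Section 5, choosing them is itself an ethical decision.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    metric: str
    value: float
    threshold: float
    urgency: str  # e.g., "review" vs. "immediate"

# Illustrative thresholds; in practice these are ethical and domain-specific
# decisions that the organization operating the system has to make.
THRESHOLDS = {
    "fairness_drift": (0.10, "review"),      # suggests scheduling model retraining
    "error_rate": (0.05, "immediate"),       # calls for immediate human intervention
    "missing_input_ratio": (0.20, "review"),
}

def check_for_oversight_alerts(monitored_values: dict) -> list:
    """Return an alert for every monitored metric that exceeds its threshold."""
    alerts = []
    for metric, value in monitored_values.items():
        if metric in THRESHOLDS:
            threshold, urgency = THRESHOLDS[metric]
            if value > threshold:
                alerts.append(Alert(metric, value, threshold, urgency))
    return alerts

# Example: fairness drift has crept past its threshold, while the error rate has not.
for alert in check_for_oversight_alerts({"fairness_drift": 0.14, "error_rate": 0.01}):
    print(f"[{alert.urgency}] {alert.metric} = {alert.value} (threshold {alert.threshold})")
```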
4.2. Monitoring the Requirement of Technical Robustness and Safety

The requirement of technical robustness encompasses resilience to attack and security, fallback plan and general safety, accuracy, and reliability and reproducibility. Cybersecurity is an issue for ML systems as for all software systems [15]. Cybersecurity is not a concern unique to AI ethics and is (or should be) a well-established concern in organizations. Various existing papers discuss monitoring ML systems in relation to cybersecurity. Attacks against ML systems can take on forms not seen against conventional software, such as attacks against data (data poisoning, etc.) or the ML model (model leakage, etc.) [15]. Fallback plan and general safety posits that AI systems should be able to deal with problems through a fallback plan, by, e.g., switching from a statistical to a rule-based procedure, or alerting a human operator before continuing. Such safety measures should be proportionate to the potential risk posed by the system [15]. Monitoring is necessary for determining when such actions are needed, by defining through metrics what such problems could be. Such safety concerns are also not unique to AI ethics. Accuracy is another conventional concern for ML developers, discussed in a large number of existing papers and with various established metrics. AI ethics (and in this case the AI HLEG) simply places additional emphasis on being able to mitigate the risk or harm posed by inaccuracies and on the system being able to indicate how likely errors are (tying into transparency) [15]. Reliability and reproducibility are related to accuracy, with reliability referring to a system working ”properly with a range of inputs and in a range of situations” and reproducibility referring to ”whether an AI experiment exhibits the same behavior when repeated under the same conditions” [15].

4.3. Monitoring the Requirement of Privacy and Data Governance

The requirement of privacy and data governance encompasses privacy and data protection, quality and integrity of data, and access to data [15]. Privacy and data protection is more focused on design-level decisions (training data sets used, etc.) and establishing the relevant protocols than on continuous monitoring. Quality and integrity of data necessitate more continuous monitoring in the context of self-learning systems. Otherwise, metrics may be checked when any changes happen, for example, upon acquiring new training data. Access to data is an aspect of privacy and data governance that can be continuously monitored (logged) aside from establishing the relevant protocols; a simple logging sketch is given below.
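As a small, hedged illustration of what such access logging might look like, the following sketch appends a structured record for every read of a governed data set. The field names and the choice of a JSON-lines file as a destination are our own illustrative assumptions; real deployments would typically rely on the access-control and audit mechanisms of their data platform.

```python
import json
from datetime import datetime, timezone

ACCESS_LOG = "data_access_log.jsonl"  # illustrative destination; could be any audit sink

def log_data_access(user: str, dataset: str, purpose: str, records_read: int) -> None:
    """Append one structured audit record describing a data access event."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "dataset": dataset,
        "purpose": purpose,
        "records_read": records_read,
    }
    with open(ACCESS_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Example: a data scientist reads part of a training set for model retraining.
log_data_access("data_scientist_17", "loan_applications_2024", "model retraining", 25_000)
```

Simple quantities derived from such logs, such as accesses per data set or the share of accesses with a stated purpose, then become continuously monitorable metrics in the sense of Section 3.1.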
4.4. Monitoring the Requirement of Transparency

The requirement of transparency encompasses traceability, explainability, and communication [15]. Traceability posits that ”the data sets and the processes that yield the AI system’s decision, including those of data gathering and data labeling as well as the algorithms used, should be documented to the best possible standard”. In AI HLEG [15], these are presented as project documentation issues, aside from the reference to model outputs that ties to explainability. Explainability ”concerns the ability to explain both the technical processes of an AI system and the related human decisions” [15]. In AI ethics literature, explainability is often seen as a trade-off, where ML approaches with higher accuracy are seen to generally result in less explainability, although this is an issue primarily relevant when training ML models. Ensuring and evaluating explainability past deployment may be done via user collaboration, for which Hoffman et al. [44] extensively discuss different approaches. However, this is not a continuous monitoring issue but rather something to evaluate when a model is deployed. Transparency also encompasses communication in AI HLEG, which includes disclaimers for informing users that they are interacting with an AI, as well as communicating the capabilities and limitations of the system (accuracy, etc.) to relevant stakeholders [15]. Of these two aspects, the latter ties to monitoring in that such communications may need to be updated with metrics that reflect the evolving capabilities and limitations of the system as it is continuously developed. For the most part, however, these are issues related to the organization and its communication rather than the ML system.

4.5. Monitoring the Requirement of Diversity, Non-Discrimination and Fairness

The requirement of diversity, non-discrimination, and fairness encompasses avoidance of unfair bias, accessibility and universal design, and stakeholder participation. Extant literature extensively discusses unfair bias from the point of view of data and model outputs, and some studies in relation to (continuous) monitoring of fairness also exist [12, 43]. Fairness is indeed one of the most explored aspects of AI ethics [4] and is an issue businesses are also well aware of [45], although this hardly means that it is an easily solved one, given the various practical challenges associated with detecting and tackling it. Accessibility and universal design, as the name implies, is more related to design decisions. However, monitoring may be relevant to ensure that the system is indeed usable by different groups of users, in addition to user feedback. Indeed, in terms of user feedback (stakeholder participation), AI HLEG considers it beneficial to ”solicit regular feedback even after deployment and set up longer term mechanisms for stakeholder participation” [15]. Continuing to develop the system based on user data (though not necessarily direct, explicit feedback) is something continuous SE generally advocates. It could be argued that continuous SE in this regard supports AI ethics, although AI ethics places more emphasis on soliciting direct feedback rather than simply utilizing system use data. Some existing practices like user exit surveys or customer satisfaction surveys may also contribute in this regard.

4.6. Monitoring the Requirement of Societal and Environmental Well-Being

The requirement of societal and environmental well-being encompasses sustainable and environmentally friendly AI, social impact, and society and democracy. Sustainable and environmentally friendly AI places emphasis on environmental friendliness, encompassing the ”development, deployment and use process, as well as its entire supply chain” [15]. The resources used to train models and the environmental impacts of the system during deployment are both relevant for monitoring (see the sketch at the end of this subsection). On the other hand, social impact and society and democracy are more challenging from the point of view of metrics and monitoring, and are arguably more related to design decisions and decisions related to how the system is utilized in practice (where, by whom, etc.). While AI HLEG posits that, in terms of social impact, the effects of ML systems ”should be carefully monitored” [15], finding relevant metrics may be challenging. User feedback and monitoring of social media response, in addition to other types of user data, may provide some indication of the wider impacts of an ML system.
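Returning to the environmental side of this requirement, comparatively simple resource-based estimates are possible even without dedicated tooling. The sketch below multiplies measured GPU-hours by an assumed average power draw, data-centre overhead (PUE), and grid carbon intensity; every number in it is a placeholder assumption to be replaced with values measured for the actual system and its hosting environment.

```python
def estimate_training_footprint(gpu_hours: float,
                                avg_gpu_power_kw: float = 0.3,    # assumed per-GPU draw
                                pue: float = 1.5,                 # assumed data-centre overhead
                                grid_kgco2_per_kwh: float = 0.4,  # assumed grid carbon intensity
                                ) -> dict:
    """Rough energy and CO2 estimate for one training (or retraining) run."""
    energy_kwh = gpu_hours * avg_gpu_power_kw * pue
    return {"energy_kwh": energy_kwh, "co2_kg": energy_kwh * grid_kgco2_per_kwh}

# Example: a retraining run that used 4 GPUs for 12 hours (48 GPU-hours).
print(estimate_training_footprint(gpu_hours=48))  # roughly 21.6 kWh and 8.6 kg of CO2
```

Tracked per training or deployment cycle, such estimates yield a metric that can be followed over the operational life of the system, alongside the run-time energy and carbon metrics listed under the system/infrastructure category in Section 3.3.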
4.7. Monitoring the Requirement of Accountability

The requirement of accountability encompasses auditability, minimisation and reporting of negative impacts, trade-offs, and redress. Auditability ”entails the enablement of the assessment of algorithms, data and design processes”, for internal and external auditors [15]. Various metrics can support auditability by providing attributes to assess. Minimisation and reporting of negative impacts may involve monitoring to detect any relevant negative impacts during the operational life of the system, but, like social impacts, these may be difficult to measure. Moreover, the AI HLEG [15] places more emphasis on reporting by parties other than the organizations operating the ML systems (whistle-blowing, trade unions, etc.). In placing emphasis on transparency, AI ethics (and AI HLEG [15]) also includes documenting trade-offs and making them in an informed manner. However, these can be considered largely high-level decisions, rather than something to repeatedly measure or continuously monitor, although when experimenting with different model training approaches or training data, relevant trade-offs may arise more regularly. Finally, redress refers to organizational processes related to enabling users to seek redress, and while such processes may provide some relevant data on issues with the system after the fact, they are not primarily monitoring issues.

5. Discussion

In this section, we discuss the implications of this typology (Section 3) and our example of how one set of ethical guidelines [15] might be relevant for monitoring (Section 4). We also highlight issues that future research could look into, or that practitioners might consider of interest.

Relevant literature is diverse. This is a potential challenge for any researcher looking to conduct studies on this topic, and one limitation of this paper. For example, various technical papers include discussion on issues related to AI ethics, such as safety, fairness, and explainability, but they may not explicitly discuss ethical aspects at all and may not use concepts typically seen in AI ethics literature. Similarly, monitoring related to data aspects may be discussed in literature not at all related to AI or ML (big data, etc.). Thus, there is already much potentially relevant research that is simply framed differently.

Ethics is not just numbers. We wish to highlight that, though AI ethics may ultimately manifest in practice through metrics to monitor, ethics should not be reduced to just numbers. As the AI HLEG guidelines [15] also highlight, ethics involves various trade-offs and design decisions that are made during the design and development of an ML system. The selection of metrics itself may also involve trade-offs when, for example, deciding between different fairness metrics. There is no one-size-fits-all in ethics, and there should also be ethical decision-making involved in the process of selecting the metrics used to monitor ethical aspects. To this end, however, not everything about AI ethics necessitates monitoring. Not all ethical principles (or requirements in the case of AI HLEG) directly manifest as system features [8], let alone ones that require monitoring. AI ethics also involves various design-level decisions made early on in the development of a system [15].

Some existing concerns are closely related to AI ethics.
”Doing” AI ethics involves things already being done in many organizations developing ML. Well-established concerns related to AI ethics include cybersecurity and model robustness and safety. These concerns are a part of AI ethics but are nonetheless something most organizations developing ML systems are familiar with. Thus, it is possible to ”unintentionally” tackle AI ethics to some extent, and some aspects of AI ethics may be closer to existing practices. This is also relevant for monitoring.

MLOps may support the implementation of AI ethics through sustainable MLOps [14]. Indeed, through an automated ML pipeline, MLOps may support the implementation of such principles as transparency (versioning, established processes, etc.). However, the practicalities of monitoring ethical attributes warrant consideration in MLOps contexts. For example: which (ethics-related) metrics are to be monitored constantly in an automated fashion? Which metrics should be checked only in certain situations, such as upon training or deploying a new model? Which metrics should result in automated alerts, what are the thresholds for such alerts, and which metrics are less crucial and can be monitored more leisurely? We consider these interesting avenues for future research.

AI ethics requires a framework. To ”do” AI ethics, one needs to define what it is in the given context [6, 46]. The most straightforward way of doing this is by utilizing an existing framework, as we have done in this paper (through AI HLEG [15]). This also makes it possible to define what exactly it is one wishes to monitor (e.g., which principles). Various guidelines [4, 5], tools [27], and methods (e.g., [7]) may be utilized for this purpose.

Challenges with know-how, communication, and collaboration. ML development in and of itself involves collaboration challenges, as the rest of the development team needs to collaborate with the ML experts, which can be difficult in practice [24]. Similar collaboration challenges between different types of experts have been studied in the context of DevOps [47]. In terms of monitoring, Kourouklidis et al. [48] remark that ”domain experts in the area of ML, who produce the ML models, commonly lack the required expertise in the area of software engineering, needed to implement a robust and scalable monitoring solution”, pointing to an issue that improved collaboration could alleviate. Adding AI ethics expertise into this mix is arguably likely to only pose further challenges. On this note, real-world development contexts may also pose challenges for monitoring. ML development capabilities are not something found in every company. In the case of commissioned systems developed by an external organization, the monitoring of ethics-related metrics is something that needs to be discussed between the involved parties.

6. Conclusions

In this paper, we have discussed AI ethics from the point of view of monitoring in general, as well as MLOps more specifically. We have presented a holistic typology of metrics for ML system monitoring, which we have then utilized to look at ML system monitoring overall, as well as in relation to AI ethics. This typology includes seven categories of metrics: (1) data metrics, (2) ML model metrics, (3) system/infrastructure metrics, (4) process metrics, (5) business metrics, (6) user metrics, and (7) domain-specific metrics.
We argue that this typology, which is based on a breadth of existing literature from the fields of both software engineering and machine learning, provides one way of conceptualizing ML system monitoring in a holistic manner. Additionally, as a framework for the discussion on monitoring related to AI ethics, we have utilized the Ethics Guidelines for Trustworthy AI [15] (AI HLEG). Through this discussion, we have provided some examples of how AI ethics may manifest in ML system monitoring. While this paper begins to illustrate the relevance of AI ethics for ML system monitoring, its main purpose is to encourage further discussion and studies into the topic. As one tangible future research suggestion, we recommend a literature review of what is currently known about ethics-related metrics in AI development. A large number of papers discussing ethics-related metrics already exists but this discussion is split across disciplines and the concepts utilized in these papers are hardly uniform. Acknowledgments This work was partly funded by local authorities (“Business Finland”) under grant agreement ITEA-2020-20219-IML4E of the ITEA4 programme. Declaration on Generative AI The author(s) have not employed any Generative AI tools. References [1] B. Mittelstadt, Principles alone cannot guarantee ethical ai, Nature Machine Intelligence (2019) 1–7. [2] V. Vakkuri, K. Kemell, J. Kultanen, P. Abrahamsson, The current state of industrial practice in artificial intelligence ethics, IEEE Software 37 (2020) 50–57. [3] V. Vakkuri, K. Kemell, J. Kultanen, M. T. Siponen, P. Abrahamsson, Ethically aligned design of autonomous systems: Industry viewpoint and an empirical study, EJBO - Electronic Journal of Business Ethics and Organization Studies 27 (2022). [4] A. Jobin, M. Ienca, E. Vayena, The global landscape of ai ethics guidelines, Nature Machine Intelligence 1 (2019) 389–399. [5] T. Hagendorff, The ethics of ai ethics: An evaluation of guidelines, Minds and Machines (2020) 1–22. [6] E. Halme, M. Jantunen, V. Vakkuri, K.-K. Kemell, P. Abrahamsson, Making ethics practical: User stories as a way of implementing ethical consideration in software engineering, Information and Software Technology 167 (2024) 107379. [7] V. Vakkuri, K.-K. Kemell, M. Jantunen, E. Halme, P. Abrahamsson, ECCOLA — a method for implementing ethically aligned ai systems, Journal of Systems and Software 182 (2021) 111067. doi:https://doi.org/10.1016/j.jss.2021.111067 . [8] K. K. Kemell, V. Vakkuri, F. Sohrab, How do ai ethics principles work? from process to product point of view, in: Conference on Technology Ethics, Tethics, CEUR-WS, 2023, pp. 24–38. [9] B. Fitzgerald, K. Stol, Continuous software engineering: A roadmap and agenda, Journal of Systems and Software 123 (2017) 176–189. [10] P. Rodríguez, A. Haghighatkhah, L. E. Lwakatare, S. Teppola, T. Suomalainen, J. Eskeli, T. Karvonen, P. Kuvaja, J. M. Verner, M. Oivo, Continuous deployment of software intensive products and services: A systematic mapping study, Journal of Systems and Software 123 (2017) 263–291. doi:https://doi.org/10.1016/j.jss.2015.12.015 . [11] M. M. John, H. H. Olsson, J. Bosch, Towards mlops: A framework and maturity model, in: 2021 47th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 2021, pp. 1–8. doi:10.1109/SEAA53835.2021.00050 . [12] A. Ghosh, A. Shanbhag, C. Wilson, Faircanary: Rapid continuous explainable fairness, in: Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society, 2022, pp. 307–316. [13] Q. Lu, L. 
Zhu, X. Xu, J. Whittle, D. Douglas, C. Sanderson, Software engineering for responsible ai: An empirical study and operationalised patterns, in: Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice, 2022, pp. 241–242. [14] D. A. Tamburri, Sustainable mlops: Trends and challenges, in: 2020 22nd international symposium on symbolic and numeric algorithms for scientific computing (SYNASC), IEEE, 2020, pp. 17–23. [15] High-Level Expert Group on AI (AI HLEG), Ethics guidelines for trustworthy ai, 2019. URL: https://ec.europa.eu/digital-single-market/en/news/ethics-guidelines-trustworthy-ai. [16] L. Gao, M. Lu, L. Li, C. Pan, A survey of software runtime monitoring, in: 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), 2017, pp. 308–313. doi:10.1109/ICSESS.2017.8342921. [17] T. Schröder, M. Schulz, Monitoring machine learning models: a categorization of challenges and methods, Data Science and Management 5 (2022) 105–116. doi:https://doi.org/10.1016/j.dsm.2022.07.004. [18] B. Kitchenham, J. Walker, A quantitative approach to monitoring software development, Software Engineering Journal 4 (1989). [19] S. Moreschini, F. Lomio, D. Hästbacka, D. Taibi, Mlops for evolvable ai intensive software systems, in: 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 2022, pp. 1293–1294. doi:10.1109/SANER53432.2022.00155. [20] E. Ries, The Lean Startup: How Today’s Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses, New York: Crown Business, 2011. [21] P. Kourouklidis, D. Kolovos, N. Matragkas, J. Noppen, Towards a low-code solution for monitoring machine learning model performance, in: Proceedings of the 23rd ACM/IEEE international conference on model driven engineering languages and systems: companion proceedings, 2020, pp. 1–8. [22] S. Mäkinen, H. Skogström, E. Laaksonen, T. Mikkonen, Who needs mlops: What data scientists seek to accomplish and how can mlops help?, in: 2021 IEEE/ACM 1st Workshop on AI Engineering-Software Engineering for AI (WAIN), IEEE, 2021, pp. 109–112. [23] M. Naser, A. H. Alavi, Error metrics and performance fitness indicators for artificial intelligence and machine learning in engineering and sciences, Architecture, Structures and Construction (2021) 1–19. [24] G. Giray, A software engineering perspective on engineering machine learning systems: State of the art and challenges, Journal of Systems and Software 180 (2021) 111031. [25] S. Martínez-Fernández, J. Bogner, X. Franch, M. Oriol, J. Siebert, A. Trendowicz, A. M. Vollmer, S. Wagner, Software engineering for ai-based systems: A survey, ACM Trans. Softw. Eng. Methodol. 31 (2022). doi:10.1145/3487043. [26] J. Borenstein, F. S. Grodzinsky, A. Howard, K. W. Miller, M. J. Wolf, Ai ethics: A long history and a recent burst of attention, Computer 54 (2021) 96–102. [27] J. Morley, L. Floridi, L. Kinsey, A. Elhalal, From what to how: An initial review of publicly available ai ethics tools, methods and research to translate principles into practices, Science and Engineering Ethics 26 (2020) 2141–2168. [28] M. Sloane, J. Zakrzewski, German ai start-ups and “ai ethics”: Using a social practice lens for assessing and implementing socio-technical innovation, in: 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 935–947. [29] V. Vakkuri, K.-K. Kemell, M. Jantunen, P.
Abrahamsson, “this is just a prototype”: How ethics are ignored in software startup-like environments, in: V. Stray, R. Hoda, M. Paasivaara, P. Kruchten (Eds.), Agile Processes in Software Engineering and Extreme Programming, Springer International Publishing, Cham, 2020, pp. 195–210. [30] J. A. Siqueira De Cerqueira, A. Pinheiro De Azevedo, H. Acco Tives, E. Dias Canedo, Guide for artificial intelligence ethical requirements elicitation-re4ai ethical guide, in: 55th Hawaii International Conference on System Sciences, 2022. [31] N. Forsgren, M. Kersten, Devops metrics, Commun. ACM 61 (2018) 44–48. URL: https://doi.org/10.1145/3159169. doi:10.1145/3159169. [32] R. Amaro, R. Pereira, M. M. da Silva, Capabilities and metrics in devops: A design science study, Information & Management 60 (2023) 103809. doi:https://doi.org/10.1016/j.im.2023.103809. [33] K.-K. Kemell, X. Wang, A. Nguyen-Duc, J. Grendus, T. Tuunanen, P. Abrahamsson, Startup metrics that tech entrepreneurs need to know, in: Fundamentals of Software Startups: Essential Engineering and Business Aspects, Springer, 2020, pp. 111–127. [34] K. Rodden, H. Hutchinson, X. Fu, Measuring the user experience on a large scale: user-centered metrics for web applications, in: Proceedings of the SIGCHI conference on human factors in computing systems, 2010, pp. 2395–2398. [35] T. Honglei, S. Wei, Z. Yanan, The research on software metrics and software complexity metrics, in: 2009 International Forum on Computer Science-Technology and Applications, volume 1, 2009, pp. 131–136. doi:10.1109/IFCSTA.2009.39. [36] T. Mikkonen, J. K. Nurminen, M. Raatikainen, I. Fronza, N. Mäkitalo, T. Männistö, Is machine learning software just software: A maintainability view, in: D. Winkler, S. Biffl, D. Mendez, M. Wimmer, J. Bergsmann (Eds.), Software Quality: Future Perspectives on Software Engineering Quality, Springer International Publishing, Cham, 2021, pp. 94–105. [37] N. E. Fenton, M. Neil, Software metrics: roadmap, in: Proceedings of the Conference on the Future of Software Engineering, 2000, pp. 357–370. [38] S. Ackerman, O. Raz, M. Zalmanovici, A. Zlotnick, Automatically detecting data drift in machine learning classifiers, arXiv preprint arXiv:2111.05672 (2021). [39] J. Lu, A. Liu, F. Dong, F. Gu, J. Gama, G. Zhang, Learning under concept drift: A review, IEEE transactions on knowledge and data engineering 31 (2018) 2346–2363. [40] M. Bano, D. Zowghi, A systematic review on the relationship between user involvement and system success, Information and software technology 58 (2015) 148–169. [41] H. H. Olsson, J. Bosch, From opinions to data-driven software r&d: A multi-case study on how to close the ’open loop’ problem, in: 2014 40th EUROMICRO Conference on Software Engineering and Advanced Applications, 2014, pp. 9–16. doi:10.1109/SEAA.2014.75. [42] Y. Arifin, T. G. Sastria, E. Barlian, User experience metric for augmented reality application: A review, Procedia Computer Science 135 (2018) 648–656. URL: https://www.sciencedirect.com/science/article/pii/S187705091831514X. doi:https://doi.org/10.1016/j.procs.2018.08.221. The 3rd International Conference on Computer Science and Computational Intelligence (ICCSCI 2018): Empowering Smart Technology in Digital Era for a Better Life. [43] Z. Wang, N. Saxena, T. Yu, S. Karki, T. Zetty, I. Haque, S. Zhou, D. Kc, I. Stockwell, X.
Wang, et al., Preventing discriminatory decision-making in evolving data streams, in: Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, 2023, pp. 149–159. [44] R. R. Hoffman, S. T. Mueller, G. Klein, J. Litman, Metrics for explainable ai: Challenges and prospects, arXiv preprint arXiv:1812.04608 (2018). [45] TechNewsWorld, More than one in three firms burned by ai bias, 2022. URL: https://www.technewsworld.com/story/more-than-one-in-three-firms-burned-by-ai-bias-87387.html. [46] K.-K. Kemell, V. Vakkuri, What is the cost of ai ethics? initial conceptual framework and empirical insights, in: International Conference on Software Business, Springer, 2023, pp. 247–262. [47] M. S. Khan, A. W. Khan, F. Khan, M. A. Khan, T. K. Whangbo, Critical challenges to adopt devops culture in software organizations: A systematic review, IEEE Access 10 (2022) 14339–14349. doi:10.1109/ACCESS.2022.3145970. [48] P. Kourouklidis, D. Kolovos, J. Noppen, N. Matragkas, A model-driven engineering approach for monitoring machine learning models, in: 2021 ACM/IEEE International Conference on Model Driven Engineering Languages and Systems Companion (MODELS-C), IEEE, 2021, pp. 160–164.