AI Evaluation and ISO Standards: Is the Gap Closed?

Andrea Trenta¹
¹ UNINFO UNI TC 533 Technical Committee Artificial Intelligence, Turin, Italy

Abstract
This paper is part of a set of papers showing how newly defined data and software quality measures can be described in the ISO 25000 format. In the first group of papers [1], [2], [3], [4] we discussed, with the help of some examples, the general approach to conformance when new quality measures are defined; in the most recent papers [6], [7] we showed how to build practical ISO/IEC 25000 compliant product quality measures for AI, starting from measures developed in several public projects. In this paper we analyze the feasibility of evaluating an AI product according to the AI ISO/IEC standards, through examples drawn from existing practices. Moreover, this paper can be considered as input to ongoing work in the AI standardization area.

Keywords
product quality, measures, accuracy, metric, machine learning, artificial intelligence

IWESQ'24: International Workshop on Experience with SQuaRE Family and its Future Direction, December 03, 2024, Chongqing, CN
andrea.trenta@dataqualitylab.it (A. Trenta)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings, ceur-ws.org, ISSN 1613-0073

1. Introduction

Policy makers, industries, and academia are facing the problem of building trust in AI, and trustworthiness, in turn, requires [15] that the AI product is measured, evaluated, and finally assessed. It is to be noted that an organization can accomplish measurement and evaluation based on existing practices, as is now advised and supported by [13]. Starting from this point, in the present paper we show, through a sort of reverse engineering, why the measurement and evaluation activities of existing practices are coherent with the SQuaRE and AI ISO standards, and which issues possibly remain. Figure 1 gives an overview of the entities that contribute to AI product assessment.

Figure 1: AI evaluation overview (UML-like)

2. AI standardization (2024 update)

The task of drafting standards for AI trustworthiness was mainly, but not only, assigned to the international standardization body ISO/IEC SC42 and to the European standardization body CEN/CENELEC JTC21, which are in charge of drafting technical standards in support of industry and of lawful rules.

SC42 achieved important results in drafting AI standards, and this success leveraged, among others, the foundational SQuaRE and testing standards from SC7, the implementation standards, the assessment standards and the management system standards. The work of ISO/IEC SC42 has thus given birth to a set of AI standards covering topics such as definitions [8], software and data quality [19], testing, risk management, assessment, management systems and applications, according to the non-official scheme of figure 2.

Figure 2: Non-official ISO standards by topic

It is to be noted that the SQuaRE product quality standards play a central role in AI measurement and evaluation, and SC42 has developed extensions [12], [14] to the standards of the ISO/IEC 25000 series. Indeed, the ISO/IEC 25000 series itself foresees the possibility to extend the model to specific technologies like AI, through the definition of new characteristics and new measures. This role and its reasons are also well explained in the ISO/IEC news given at https://www.iec.ch/blog/new-international-standard-ensuring-quality-ai-systems.

In this context, the assessment of product quality, possibly together with the assessment of process quality [13], will be performed on a voluntary or, in the near future, on a mandatory basis: in the former case to promote trustworthiness in AI systems, in the latter case to achieve compliance with rules [14].
3. ISO standards for AI evaluation

We focus on the topic of AI product quality evaluation and, in the following, we analyze² a set of ISO standards on AI and their mutual relationships that are relevant for organizations that develop, deploy and use AI, namely:

• ISO/IEC 42001 'AI management system' [13]
• ISO/IEC CD TS 17847 'Verification and validation analysis of AI systems' [15]
• ISO/IEC TS 25058 'Guidance for quality evaluation of artificial intelligence (AI) systems' [14]
• ISO/IEC 25059 'Quality model for AI systems' [12]
• ISO/IEC 25040 'Quality evaluation framework' [16]
• ISO/IEC AWI TR 42106³ 'Overview of differentiated benchmarking of AI system quality characteristics' [17]

Firstly, we recall that [13] gives guidance on the management system mainly for AI products and services, as existing management systems for processes are applicable to AI as well. The full picture in [13] is completed by the reference to [12], as even AI performance can be addressed through an AI quality model.

According to [15], verification and validation analysis for AI systems⁴ is made of: (1) quality evaluation, (2) formal methods and (3) simulation. The reason why [15] was introduced in addition to the existing V&V standards is that testing and quality evaluation are considered not exhaustive when formal methods⁵ and simulation are also feasible (e.g. in the case of neural networks), so that these should be applied jointly with traditional quality evaluation of AI systems. By the way, at the moment [13] does not cite [15], but it could coherently be cited, when necessary, in future revisions of [13].

In all the AI standards it is recognized that the reference for AI quality evaluation is [14], which in turn is based on the quality models defined in [9].

It is to be noted that, for the AI set of standards, many concepts can be imported from [16]: (1) quality rating module, (2) recommended scoring method, (3) categorization of software, (4) quality evaluation, (5) evaluation result.

(1) A quality rating module is defined as a 'set of quality measures, operational environment, and methods for conducting quality measurements and quality ratings on a specific category of target entities'; it takes into account the variability of environments and measures through rich documentation. The rating template contains a parametric description of the rating method and of the measurement environment, in the perspective that they will be reused, and it includes the rating level setting and the range of acceptance.

(2) The overall score can be a recommended scoring method; it consists of a weighted linear combination of the individual scores obtained from the measurements of selected measures of a group of (sub)characteristics. In this perspective, an overall quality score Qs could be⁶ the sum, over the Wi-weighted i-characteristics selected for the evaluation, of their j-measurements Mij, each compared with the relevant benchmark⁷ Bij (a code sketch illustrating this computation is given below, after the list of concepts):

    Q_s = \sum_{i} W_i \cdot \sum_{j} \frac{M_{ij}}{B_{ij}}

(3) Categorization of software is defined as a 'specific way to allocate a target system into a category'; [22] also contained the concept of differentiated levels of quality associated with an application software: for example, a lower level of reliability is required for banking software than for a meteorological satellite.

(4) Quality evaluation is composed of the following steps:
- measurement
- rating (single dimension)
- rating (multiple dimensions⁸)

(5) The evaluation result can be of two types:
- a pass/fail result, or
- a quality score

A quality score output is advised in an industry context and from a user's perspective, while a pass/fail output is sufficient for compliance with technical requirements or other kinds of prescriptions, including certification.

² Sources are the public pages of: (1) the ISO Online Browsing Platform, available at https://www.iso.org/obp; (2) the ISO Online Standards directory, available at https://www.iso.org/standards.html.
³ AWI and CD mean standards 'under development'.
⁴ This is intended as an AI system that encompasses AI system components and the interaction of non-AI components with the AI system components.
⁵ The measurement source for formal methods is a model and not a delivered machine; on the other hand, the measurement source for AI quality rating should be a deliverable source code or a delivered machine, and not a model.
⁶ The formula is taken from [6].
⁷ In the formula, Bij is defined as the best value of Mij (e.g., the best value of a historical series of Mij measurements) and acts as a normalizing factor.
⁸ The multidimensional rating is better known as quality analysis.
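To make concepts (2), (4) and (5) concrete, the following is a minimal Python sketch, not taken from any of the cited standards: it computes the overall score Qs of the formula above from hypothetical weights Wi, measurements Mij and benchmarks Bij, and then derives a pass/fail result from an assumed acceptance threshold. The characteristic names, values and threshold are illustrative assumptions, not values prescribed by SQuaRE.

```python
# Illustrative sketch (not from the standards): overall quality score Qs as a
# weighted sum of benchmark-normalized measurements, plus a pass/fail rating.

# Hypothetical characteristics with weights Wi, measurements Mij and benchmarks Bij.
characteristics = {
    "functional_correctness": {"weight": 0.6,
                               "measurements": [0.92, 0.88],   # Mij
                               "benchmarks":   [0.95, 0.90]},  # Bij (best known values)
    "performance_efficiency": {"weight": 0.4,
                               "measurements": [120.0],        # e.g. inferences per second
                               "benchmarks":   [150.0]},
}

def overall_score(chars: dict) -> float:
    """Qs = sum_i Wi * sum_j Mij / Bij, where Bij acts as a normalizing factor."""
    return sum(
        c["weight"] * sum(m / b for m, b in zip(c["measurements"], c["benchmarks"]))
        for c in chars.values()
    )

ACCEPTANCE_THRESHOLD = 1.5   # assumed rating level, to be set in the quality rating module

qs = overall_score(characteristics)
print(f"Qs = {qs:.3f}", "-> pass" if qs >= ACCEPTANCE_THRESHOLD else "-> fail")
```

In this sketch the single-dimension rating corresponds to each Mij/Bij ratio, while the weighted sum and the threshold comparison play the role of the multidimensional rating and of the pass/fail evaluation result, respectively.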
The need for concepts (1), (2) and (3) is very clear and unavoidable, as AI software is highly differentiated, in the sense that there are many algorithms for each task (recognition, classification, prediction, etc.) and, in turn, each algorithm can behave differently due to a different model, training, hyperparameters, etc.

As represented in figure 1 and detailed in [6], [7], the rich manifold of measures implies a deep categorization of the software under measurement, which includes even the specification of the code. This approach is also coherent with the fact that code libraries often include their own coded measures⁹ (a brief sketch is given at the end of this section).

This need has coherently been taken into account, for example, in [17], where benchmarking is 'differentiated' because of the difficulty of applying an 'absolute' benchmarking. It is to be pointed out that the issue of comparing measurements is relevant for a non-local quality evaluation process (i.e. product comparison) and not for the measurement process: in fact, it is always possible to carry out a measurement with any metric¹⁰; on the contrary, the comparison and replicability of the measured values are challenging, due to the extreme variability of AI products even when they perform the same task: e.g. an image classification task can be performed through thousands of combinations of models, algorithms, training data, etc.

Another widely recognized limitation to benchmarking [21], [17] arises when an ML system implemented with neural networks uses continuous learning: its hyperparameters evolve, and the measurement of the characteristics of the NN can differ (and be assessed as worse or better) from the measurement taken in the initial state. This is also the reason why some AI devices are deployed and sold as "frozen", giving the user-buyer a guarantee that the behaviour and performance values of the ML system will remain the same over time. Anyway, additional requirements (e.g. operational performance not worse than the tested one) can be satisfied, so enlarging the field of evaluation both along time and over the post-training data, and a further assessment of the ML system can be performed in operational mode.

⁹ One example is scikit-learn: https://scikit-learn.org/1.5/modules/model_evaluation.html
¹⁰ See [1], [6], [18], [13] for the definition of new measures.
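As an illustration of the remark in footnote 9 that code libraries ship their own coded measures, the sketch below uses scikit-learn's accuracy_score and f1_score on a toy set of predictions. The data are invented and the choice of measures is only an example of library-provided measures that could enter a quality rating module; nothing here is specific to the standards discussed above.

```python
# Illustrative only: a measurement taken with library-provided ("coded") measures.
from sklearn.metrics import accuracy_score, f1_score

# Toy ground truth and predictions for an image-classification-like task (invented data).
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]

accuracy = accuracy_score(y_true, y_pred)             # fraction of correct predictions
macro_f1 = f1_score(y_true, y_pred, average="macro")  # another library-coded measure

print(f"accuracy = {accuracy:.3f}, macro F1 = {macro_f1:.3f}")
```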
4. Evaluation example: Rec. ITU-T F.748.11

Rec. ITU-T F.748.11 [24] proposes metrics, benchmarks, a measurement rating and an evaluation method for AI processors. Benchmarks are calculated both for the testing environment and for the production environment, the latter being the relevant one for quality evaluation. Despite the different focus (a processor instead of a pure software application), the approach of ITU-T F.748.11 is the same as that of the other examples, as it is based on:

a) taxonomizing of deep neural networks¹¹ by 12 models¹²
b) selection of 6 characteristics¹³ and 11 measures
c) definition of a quality rating module that includes 8 standard scenarios¹⁴, benchmark calculation and measurement rating
d) evaluation by each model and by each characteristic over a single scenario

It should be noted that [24] is the only standard that details both the training scenario (e.g. ResNet_50, ImageNet) and the relevant target value, i.e. the benchmark (e.g. 74.9% Top-1 Accuracy), and this approach seems to be advised by [15]. So, we can conclude that both the measures and the evaluation method of ITU-T F.748.11 [24] can be accounted as compliant to the ISO AI quality standards.

5. Evaluation example: Holistic Evaluation of Language Models (HELM) [23]

In [7] we showed how the measures performed in the HELM research [23] are, even if unintentionally, compliant to the ISO AI quality standards. In this paper we show that not only the measures but also the evaluation method of HELM is compliant to the ISO AI quality standards. This conclusion is supported by the fact that HELM uses the same concepts as the ISO standards; in particular, the entities in https://crfm.stanford.edu/helm/v0.2.2/ map onto the entities of figure 1 as follows:

a) taxonomizing of LLMs¹⁵ by 36 models¹⁶
b) selection of 13 characteristics¹⁷ and 57 measures
c) definition of a quality rating module that includes 42 standard scenarios¹⁸, then benchmark measurement rating
d) evaluation by each model and by each characteristic over multiple scenarios, as shown in figure 3

Figure 3: Accuracy over multiple scenarios (win rate) [20]

¹¹ Deep NN is a TechMethod in fig. 1.
¹² Model is an algorithm in fig. 1.
¹³ Most of the scenarios refer to the accuracy characteristic and its measures.
¹⁴ Standard scenarios rely on standard datasets, both for training and for input prompts. A set of scenarios is in general suitable for a set of measures (e.g. the CivilComments and RealToxicityPrompts datasets are suitable for toxicity measurement).
¹⁵ LLM is a TechMethod in fig. 1.
¹⁶ Model is an algorithm in fig. 1.
¹⁷ Characteristics that are not present in the models of [5], [6] (e.g. toxicity) can still be handled through the ISO 25000 conformance mechanism [27].
¹⁸ Standard scenarios rely on standard datasets, both for training and for input prompts. A set of scenarios is in general suitable for a set of measures (e.g. the CivilComments and RealToxicityPrompts datasets are suitable for toxicity measurement).
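Figure 3 reports accuracy as a win rate over multiple scenarios. The sketch below shows one plausible way such a rating could be computed, assuming a table of per-scenario accuracy values for each model: a model's win rate is the fraction of head-to-head comparisons, over all scenarios and all other models, in which it obtains the better value. The data and the exact aggregation rule are assumptions for illustration, not a reproduction of HELM's code.

```python
# Illustrative sketch: win rate of each model over multiple scenarios.
# scores[model][scenario] = accuracy of that model on that scenario (invented data).
scores = {
    "model_A": {"scenario_1": 0.81, "scenario_2": 0.74, "scenario_3": 0.90},
    "model_B": {"scenario_1": 0.79, "scenario_2": 0.80, "scenario_3": 0.85},
    "model_C": {"scenario_1": 0.70, "scenario_2": 0.72, "scenario_3": 0.88},
}

def win_rate(model: str, scores: dict) -> float:
    """Fraction of (scenario, other model) comparisons won by `model`."""
    wins, comparisons = 0, 0
    for scenario in scores[model]:
        for other in scores:
            if other == model:
                continue
            comparisons += 1
            if scores[model][scenario] > scores[other][scenario]:
                wins += 1
    return wins / comparisons

for m in scores:
    print(f"{m}: win rate = {win_rate(m, scores):.2f}")
```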
We recall that in [7] we considered the measure of detection of toxic text¹⁹ and defined table 1 below.

Table 1: Toxicity measure (ISO/IEC 25000 format)

In conclusion, both the measures²⁰ and the evaluation method of HELM [23] can be accounted as compliant to the ISO AI quality standards.

6. Evaluation example: Papers with Code [26]

Papers with Code is a resource project hosted by Meta AI Research and based on an increasing number of research papers that are mostly uploaded together with their code; this huge repository also contains the results of measurements obtained by running the code of each paper over standard datasets (e.g. ImageNet, …).

It is easy to check that PwC applies:

a) taxonomizing of various technologies²¹ by 5189 tasks²²
b) selection of characteristics²³ and measures
c) definition of a quality rating module that includes 1 standard scenario²⁴ and benchmark calculation; the benchmark is intended as the set of performance measurements (e.g. Top-1 Accuracy) of all the available models against the same dataset in a limited simulation environment; the measurement rating is displayed through a curve that links the performance scores of the models over time (plane x=time, y=score, see figure 4)
d) evaluation (graphical) by each model and by each characteristic over a single scenario

Figure 4: Historical benchmark for image classification

It should be noted that:
- the simulation environment of PwC is simplified (same code language, single dataset, …) and cannot represent the production environment, but it gives an immediate and homogeneous graphical comparison among all the coded models solving a certain task, and
- the quality analysis is not complete, as a multidimensional rating (e.g. a weighted score over multiple characteristics) is missing.

Despite those limitations, steps a), b), c), d) are consistent with the ISO approach; in conclusion, both the measures and the evaluation method of PwC can be accounted as compliant to the ISO AI quality standards.

¹⁹ In NLP applications there is the general task of text classification and, among its instances, the specific task of detecting prompts with toxic text (e.g. biased questions, hate speech, …).
²⁰ For the scope of this paper, we do not discuss the characteristic to which the measure of table 1 is referred; as a hypothesis, it could be referred to functional correctness.
²¹ TechMethod of fig. 1; it is a miscellaneous, non-homogeneous first-level description of the technological solution, like NN, LLM, ….
²² Tasks correspond to algorithms in fig. 1.
²³ Most of the scenarios refer to the accuracy characteristic and its measures.
²⁴ Standard scenarios rely on standard datasets, both for training and for input prompts. A set of scenarios is in general suitable for a set of measures (e.g. the CivilComments and RealToxicityPrompts datasets are suitable for toxicity measurement).
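The curve described in step c) above, i.e. the best score reached over time on a fixed scenario, can be reconstructed from a list of dated results. The sketch below does this with invented entries in the spirit of figure 4; the field layout and values are assumptions for illustration, not data taken from paperswithcode.com.

```python
# Illustrative sketch: historical benchmark curve (best score over time) for one
# task/dataset, as in figure 4. Entries are invented; only the running-maximum logic matters.
from datetime import date

results = [  # (publication date, model name, Top-1 accuracy on the shared dataset)
    (date(2019, 6, 1),  "model_X", 0.842),
    (date(2020, 3, 15), "model_Y", 0.861),
    (date(2021, 1, 10), "model_Z", 0.855),  # below the current best: does not move the curve
    (date(2022, 7, 2),  "model_W", 0.889),
]

best_so_far = 0.0
curve = []   # points of the historical benchmark curve (x = time, y = score)
for when, model, score in sorted(results):
    if score > best_so_far:
        best_so_far = score
        curve.append((when, model, best_so_far))

for when, model, score in curve:
    print(f"{when}: {score:.3f}  ({model})")
```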
7. Proposal

The proposal in this paper completes the proposal in [6], [7]: there we showed how to account for and represent measures from AI practices in the ISO/IEC 25000 format; here we explore how some existing quality evaluation practices can be accounted as evaluations according to the AI ISO standards.

Finally, a difference between 'local' and 'global' benchmarking is highlighted. The former is always possible, as it is always possible to measure and to rate against the quality targets defined by the organization that handles the product; the latter, intended as a comparison among different products, is challenging, due to the manifold of environments of the products even when they perform the same task.

8. Conclusion

The set of ISO AI standards clearly leverages the SQuaRE product quality standards, both for measures and for evaluation. According to SQuaRE and ISO/IEC 42001, it is always possible to define a quality model and a set of measures, even customized ones, to accomplish any measurement campaign.

In the same manner, it is always possible to define a quality rating module and accomplish an AI product evaluation on behalf of an organization or a third party.

Benchmarking is a useful method to rate, to assign performance targets, or to compare products, but it is in general possible only locally, i.e. it is not possible for all products, even within the same category, because of the limitations discussed above.

Such considerations are supported by the practices analyzed.

References

[1] D. Natale, A. Trenta, Examples of practical use of ISO/IEC 25000, Proceedings of IWESQ@APSEC 2019. URL: http://ceur-ws.org/Vol-2545/
[2] A. Trenta, ISO/IEC 25000 quality measures for A.I.: a geometrical approach, Proceedings of IWESQ@APSEC 2020. URL: http://ceur-ws.org/Vol-2800/
[3] A. Trenta, Data bias measurement: a geometrical approach through frames, Proceedings of IWESQ@APSEC 2021. URL: http://ceur-ws.org/Vol-3114/
[4] A. Simonetta, A. Trenta, M. C. Paoletti, A. Vetrò, Metrics for identifying bias in datasets, SYSTEM, 2021.
[5] D. Natale, Extensions of ISO/IEC 25000 quality models to the context of Artificial Intelligence, Proceedings of IWESQ@APSEC 2022. URL: https://ceur-ws.org/Vol-3356
[6] A. Trenta, ISO/IEC 25000 and AI Product Quality Measurement Perspectives, Proceedings of IWESQ@APSEC 2022. URL: https://ceur-ws.org/Vol-3356
[7] A. Trenta, Accounting AI Measures as ISO/IEC 25000 Standards Measures, Proceedings of IWESQ@APSEC 2023. URL: http://ceur-ws.org/Vol-3612
[8] ISO/IEC 22989:2022 Information technology — Artificial intelligence — Artificial intelligence concepts and terminology. URL: https://www.iso.org/standard/74296.html
[9] ISO/IEC 23053:2022 Framework for Artificial Intelligence (AI) Systems Using Machine Learning (ML). URL: https://www.iso.org/standard/74438.html
[10] ISO/IEC TR 24372 Information technology — Artificial intelligence — Overview of computational approaches for AI systems. URL: https://www.iso.org/standard/78508.html
[11] ISO/IEC TR 24030:2024 Information technology — Artificial intelligence (AI) — Use cases. URL: https://www.iso.org/standard/84144.html
[12] ISO/IEC DIS 25059 Software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Quality model for AI-based systems. URL: https://www.iso.org/standard/80655.html
[13] ISO/IEC 42001 Information technology — Artificial intelligence — Management system. URL: https://www.iso.org/standard/81230.html
[14] ISO/IEC TS 25058 Software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Guidance for quality evaluation of artificial intelligence (AI) systems. URL: https://www.iso.org/standard/82570.html
[15] ISO/IEC CD TS 17847 Information technology — Artificial intelligence — Verification and validation analysis of AI systems
[16] ISO/IEC 25040 Systems and software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Quality evaluation framework
[17] ISO/IEC AWI TR 42106 Information technology — Artificial intelligence — Overview of differentiated benchmarking of AI system quality characteristics
[18] ISO/IEC DIS 25002 Systems and software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Quality models overview and usage
[19] ISO/IEC 5259-2 Artificial intelligence — Data quality for analytics and ML — Part 2: Data quality measures. URL: https://www.iso.org/standard/81860.html
[20] ISO/IEC TR 24029-1:2021 Artificial intelligence (AI) — Assessment of the robustness of neural networks — Part 1: Overview
[21] ISO/IEC 24029-2:2023 Artificial intelligence (AI) — Assessment of the robustness of neural networks — Part 2: Methodology for the use of formal methods
[22] ISO/IEC TR 12182:2015 Systems and software engineering — Framework for categorization of IT systems and software, and guide for applying it. URL: https://www.iso.org/standard/63611.html
[23] P. Liang, R. Bommasani, T. Lee, et al., Holistic Evaluation of Language Models, Stanford Institute for Human-Centered Artificial Intelligence (HAI), Stanford University, 2022.
[24] ITU-T F.748.11 Metrics and evaluation methods for a deep neural network processor benchmark, 2020.
[25] ITU-T F.748.12 Deep learning software framework evaluation methodology, 2021.
[26] Meta Research, Papers with Code resource. URL: https://paperswithcode.com/sota