=Paper=
{{Paper
|id=Vol-3916/paper4
|storemode=property
|title=AI Evaluation and ISO Standards: Is the Gap Closed?
|pdfUrl=https://ceur-ws.org/Vol-3916/paper_04.pdf
|volume=Vol-3916
|authors=Andrea Trenta
|dblpUrl=https://dblp.org/rec/conf/iwesq/Trenta24
}}
==AI Evaluation and ISO Standards: Is the Gap Closed?==
AI Evaluation and ISO Standards: Is the Gap Closed?⋆
Andrea Trenta1
1 UNINFO UNI TC 533 Technical Committee Artificial Intelligence, Turin, Italy
Abstract
This paper is part of a series of papers showing how newly defined data and software quality measures can be described in the ISO 25000 format. In the first group of papers [1], [2], [3], [4], we discussed, with the help of some examples, the general approach to conformance when new quality measures are defined; in the most recent papers [6], [7] we showed how to build practical ISO/IEC 25000 compliant product quality measures for AI, starting from measures developed in several public projects. In this paper we analyze the feasibility of the evaluation of an AI product, according to AI ISO/IEC standards, through examples from existing practices. Moreover, this paper can be considered as input for ongoing work in the AI standardization area.
Keywords
product quality, measures, accuracy, metric, machine learning, artificial intelligence
1. Introduction

Policy makers, industries, and academia are facing the problem of building trust in AI, and trustworthiness, in turn, requires [15] that the AI product is measured, evaluated, and finally assessed. It is to be noted that an organization can accomplish measurements and evaluation based on existing practices, as is now advised and supported by [13]. Starting from this point, in the present paper we show, through a sort of reverse engineering, why the measurements and evaluations of existing practices are coherent with SQuaRE and AI ISO standards, and which issues possibly remain.

Figure 1 gives an overview of the entities that contribute to AI product assessment.
Figure 1: AI evaluation overview (UML-like)
IWESQ’24: International Workshop on Experience with SQuaRE Family and its Future Direction, December 03, 2024, Chongqing, CN
andrea.trenta@dataqualitylab.it (A. Trenta)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Figure 2: Non-official ISO standards by topic
2. AI standardization (2024 update)

The issue of drafting standards for AI trustworthiness was mainly, but not only, assigned to the international standardization body ISO/IEC SC42 and to the European standardization body CEN/CENELEC JTC21, which are in charge of drafting technical standards in support of industry and of lawful rules.

SC42 achieved important results in drafting AI standards, and this success leveraged, among others, the foundational SQuaRE and testing standards from SC7, the implementation standards, the assessment standards and the management system standards. So, the work of ISO/IEC SC42 has given birth to a set of standards on AI covering topics such as definitions [8], software and data quality [19], testing, risk management, assessment, management systems and applications, according to the non-official scheme of figure 2.

It is to be noted that SQuaRE product quality standards play a central role in AI measurements and evaluation, and SC42 has developed extensions [12], [14] to the standards of the ISO/IEC 25000 series. Indeed, the ISO/IEC 25000 series itself foresees the possibility to extend the model to specific technologies like AI, through the definition of new characteristics and new measures. This role and its reasons are also well explained in the ISO/IEC news at https://www.iec.ch/blog/new-international-standard-ensuring-quality-ai-systems.

In this context, the assessment of product quality, possibly together with the assessment of process quality [13], will be performed on a voluntary or, in the near future, mandatory basis: in the former case to promote trustworthiness in AI systems, in the latter case to achieve compliance with rules [14].

3. ISO standards for AI evaluation

We focus on the topic of AI product quality evaluation and, in the following, we analyze² a set of ISO standards on AI and their mutual relationships relevant for organizations that develop, deploy and use AI, namely:

• ISO/IEC 42001 ‘AI management system’ [13]
• ISO/IEC CD TS 17847 ‘Verification and validation analysis of AI systems’ [15]
• ISO/IEC TS 25058 ‘Guidance for quality evaluation of artificial intelligence (AI) systems’ [14]
• ISO/IEC 25059 ‘Quality model for AI systems’ [12]
• ISO/IEC 25040 ‘Quality evaluation framework’ [16]
• ISO/IEC AWI TR 42106³ ‘Overview of differentiated benchmarking of AI system quality characteristics’ [17]

Firstly, we recall that [13] gives guidance on the management system mainly for AI products and services, as existing management systems for processes are applicable to AI as well.
2 Sources are public pages of: (1) ISO Online browsing platform, available at https://www.iso.org/obp; (2) ISO Online Standards directory, available at https://www.iso.org/standards.html.
3 AWI and CD mean standards ‘under development’.
The full picture in [13] is completed thanks to the reference to [12], as even AI performance can be addressed through an AI quality model.

According to [15], Validation & Verification analysis for AI systems⁴ is made up of: (1) quality evaluation, (2) formal methods and (3) simulation. The reason why [15] was introduced in addition to the existing V&V standards is that testing and quality evaluation are considered not exhaustive when formal methods⁵ and simulation are also feasible (e.g. in the case of neural networks), so that these should be applied jointly with traditional quality evaluation for AI systems. By the way, at the moment [13] does not cite [15], but it could be coherently cited, where necessary, in future revisions of [13].

In all the AI standards it is recognized that the reference for AI quality evaluation is [14], which in turn is based on the quality models defined in [9].

It is to be noted that, for the AI set of standards, many useful concepts can be imported from [16]: (1) quality rating module, (2) recommended scoring method, (3) categorization of software, (4) quality evaluation, (5) evaluation result.

(1) A quality rating module is defined as a ‘set of quality measures, operational environment, and methods for conducting quality measurements and quality ratings on a specific category of target entities’ and takes into account the variability of the environment and of the measures through rich documentation; the rating template contains a parametric description of the rating method and of the measurement environment in the perspective that they will be reused. It includes the rating level setting and the range of acceptance.

(2) The overall score can follow a recommended scoring method and consists of a weighted linear combination of the individual scores obtained from measurements of selected measures of a group of (sub)characteristics. In this perspective, an overall quality score Qs could be⁶ a sum of the j measurements Mij for each of the Wi-weighted i characteristics selected for the evaluation, and should be comparable with the relevant benchmarks⁷ Bij:

$$ Q_s = \sum_{i} W_i \cdot \sum_{j} \frac{M_{ij}}{B_{ij}} $$

(3) Categorization of software is defined as a ‘specific way to allocate a target system into a category’; [22] also contained the concept of differentiated levels of quality associated with an application software; for example, there are different levels of reliability required for a banking software (lower) than for a meteorological satellite (higher).

(4) Quality evaluation is composed of the following steps:
- measurement
- rating (single dimension)
- rating (multiple dimensions⁸)

(5) The evaluation result output can be of two types:
- a pass/fail result, or
- a quality score

A quality score output is advised in an industry context, while from a user’s perspective a pass/fail output is sufficient for compliance with technical requirements or other kinds of prescriptions, including certification.

The need for concepts (1), (2) and (3) is very clear and unavoidable, as AI software is highly differentiated, in the sense that there are a lot of algorithms for each task (recognition, classification, prediction, etc.) and, in turn, each algorithm can behave differently due to different models, training, hyperparameters, etc.

As represented in figure 1 and detailed in [6], [7], the manifold of measures implies a deep categorization of the software under measurement, which includes even the specification of the code. This approach is also coherent with the fact that code libraries often include their own coded measures⁹.

This need has coherently been taken into account, for example, in [17], where benchmarking is ‘differentiated’ due to the difficulty of applying an ‘absolute’ benchmarking.

It is to be pointed out that the issue of comparing measurements is relevant for a non-local quality evaluation process (i.e. product comparison) and not for the measurement process: in fact, it is always possible to carry out a measurement with any metric¹⁰; on the contrary, the comparison and replicability of the measured values are challenging, due to the extreme variability of AI products, even when performing the same task: e.g. an image classification task can be performed through thousands of combinations of models, algorithms, training data, etc.
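As an illustration of the overall score in concept (2) and of the pass/fail output in concept (5), the following minimal Python sketch computes Qs as a weighted sum of benchmark-normalized measurements; the characteristics, weights, measured values, benchmarks and acceptance threshold are hypothetical and not taken from any standard.

```python
# Minimal sketch of the overall quality score Qs (hypothetical values throughout).
# Qs = sum_i Wi * sum_j (Mij / Bij), followed by a pass/fail rating.

characteristics = {
    # name: (weight Wi, measurements Mij, benchmarks Bij)
    "functional correctness": (0.5, [0.92, 0.88], [0.95, 0.90]),
    "robustness":             (0.3, [0.80], [0.85]),
    "performance efficiency": (0.2, [0.70], [0.75]),
}

def quality_score(chars):
    """Weighted sum over characteristics of benchmark-normalized measurements."""
    return sum(
        weight * sum(m / b for m, b in zip(measures, benchmarks))
        for weight, measures, benchmarks in chars.values()
    )

def rate(qs, threshold):
    """Single pass/fail result against a hypothetical acceptance threshold."""
    return "pass" if qs >= threshold else "fail"

qs = quality_score(characteristics)
print(f"Qs = {qs:.3f} -> {rate(qs, threshold=1.40)}")
```

A score of this kind supports comparison against targets or among products, while the pass/fail rating alone is sufficient for compliance-style prescriptions.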
4 This is intended as an AI system that encompasses AI system components and the interaction of non-AI components with the AI system components.
5 The measurement source for formal methods is a model and not a delivered machine; on the other hand, the measurement source for AI quality rating should be a deliverable source code or a delivered machine, and not a model.
6 The formula is taken from [6].
7 In the formula, Bij is defined as the best value of Mij (e.g., the best value of an historical series of Mij measurements) and it is a normalizing factor.
8 The multidimensional rating is better known as quality analysis.
9 One example is Scikit-Learn: https://scikit-learn.org/1.5/modules/model_evaluation.html
10 See [1][6][18][13] for the definition of new measures.
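Footnote 9 recalls that code libraries ship their own coded measures; as a small sketch with made-up labels, two of Scikit-Learn's coded measures can directly provide the Mij values used above.

```python
# Coded measures from Scikit-Learn (labels below are made up for illustration).
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 1, 2, 2, 2, 0, 1]   # ground-truth classes
y_pred = [0, 1, 2, 2, 2, 1, 0, 1]   # model predictions

m_accuracy = accuracy_score(y_true, y_pred)             # candidate Mij
m_macro_f1 = f1_score(y_true, y_pred, average="macro")  # another candidate Mij

print(f"accuracy = {m_accuracy:.3f}, macro-F1 = {m_macro_f1:.3f}")
```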
Another widely recognized limitation to benchmarking [21][17] arises when an ML system implemented with neural networks uses continuous learning: its hyperparameters evolve, and the measured characteristics of the NN can differ (and be assessed as worse or better) from the measurements taken in the initial state. This is also the reason why some AI devices are deployed and sold as “frozen”, giving the user-buyer a guarantee that the behaviour and performance values of the ML will remain the same over time.

Anyway, additional requirements (e.g. operational performance not worse than the tested one) and measurements can be satisfied, so enlarging the field of evaluation, both along time and over the post-training data, and a further assessment of the ML can be performed in operational mode.

4. Evaluation example: Rec. ITU-T F.748.11

Rec. ITU-T F.748.11 [24] proposes metrics, benchmarks and an evaluation method for AI processors. Benchmarks are calculated both for the testing environment and for the production environment, the latter being the relevant one for quality evaluation.

Despite the different focus (a processor instead of a pure software application), the approach of ITU-T F.748.11 is the same as in the other examples, as it is based on:

a) Taxonomizing of Deep Neural Networks¹¹ by 12 models¹²
b) Selection of 6 characteristics¹³ and 11 measures
c) Definition of a quality rating module that includes 8 standard scenarios¹⁴; benchmarks calculation; measurements rating
d) Evaluation by each model and by each characteristic over a single scenario

It should be noted that [24] is the only standard that details both the training scenario (e.g. ResNet_50, ImageNet) and the relevant target value, i.e. the benchmark (e.g. 74.9% Top-1 Accuracy), and this approach seems advised by [15].

So, we can conclude that both the measures and the evaluation method of ITU-T F.748.11 [24] can be accounted as compliant to ISO AI quality standards.
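As a minimal illustration of the 'benchmarks calculation; measurements rating' step in item c) above, the sketch below rates measured Top-1 accuracy values against the scenario target cited in the text (ResNet_50 on ImageNet, 74.9% Top-1 accuracy); the processor names and measured values are invented.

```python
# Rating measured Top-1 accuracy against a per-scenario benchmark target.
# The 74.9% target for ResNet_50/ImageNet is the value cited in the text;
# the processors and measured values below are invented.
benchmarks = {("ResNet_50", "ImageNet"): 0.749}

measurements = [
    ("processor_A", "ResNet_50", "ImageNet", 0.762),
    ("processor_B", "ResNet_50", "ImageNet", 0.731),
]

for processor, model, dataset, top1 in measurements:
    target = benchmarks[(model, dataset)]
    verdict = "pass" if top1 >= target else "fail"
    print(f"{processor} {model}/{dataset}: {top1:.1%} vs target {target:.1%} -> {verdict}")
```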
5. Evaluation example: Holistic Evaluation of Language Models (HELM) [23]

In [7] we showed how the measures performed in the HELM research [23] are, even if unintentionally, compliant to ISO AI quality standards.

In this paper we show that not only the measures but also the evaluation method of HELM is compliant to ISO AI quality standards. This conclusion is supported by the fact that HELM uses the same concepts as the ISO standards, in particular, mapping the entities in https://crfm.stanford.edu/helm/v0.2.2/ to the entities of figure 1:

a) Taxonomizing of LLM¹⁵ by 36 models¹⁶
b) Selection of 13 characteristics¹⁷ and 57 measures
c) Definition of a quality rating module that includes 42 standard scenarios¹⁸, then measurements rating
d) Evaluation by each model and by each characteristic over multiple scenarios, as shown in fig. 3

Figure 3: Accuracy over multiple scenarios (win rate) [20]
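Figure 3 reports accuracy as a win rate over multiple scenarios; the sketch below shows one possible win-rate style aggregation (mean fraction of pairwise wins across scenarios) with invented scores, and it is not HELM's exact implementation.

```python
# Mean win rate of each model across scenarios (scores are invented).
scores = {
    # model: {scenario: accuracy}
    "model_a": {"s1": 0.81, "s2": 0.64, "s3": 0.72},
    "model_b": {"s1": 0.77, "s2": 0.70, "s3": 0.69},
    "model_c": {"s1": 0.60, "s2": 0.58, "s3": 0.75},
}

def mean_win_rate(model, all_scores):
    """Average over scenarios of the fraction of other models this model beats."""
    others = [m for m in all_scores if m != model]
    fractions = []
    for scenario in all_scores[model]:
        wins = sum(all_scores[model][scenario] > all_scores[o][scenario] for o in others)
        fractions.append(wins / len(others))
    return sum(fractions) / len(fractions)

for m in scores:
    print(f"{m}: mean win rate = {mean_win_rate(m, scores):.2f}")
```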
11 Deep NN is a TechMethod in fig. 1.
12 Model is an algorithm in fig. 1.
13 Most of the scenarios refer to the accuracy characteristic and its measures.
14 Standard scenarios rely on standard datasets, both for training and for input prompts. A set of scenarios is in general suitable for a set of measures (e.g. the CivilComments and RealToxicityPrompts datasets are suitable for toxicity measurement).
15 LLM is a TechMethod in fig. 1.
16 Model is an algorithm in fig. 1.
17 Characteristics that are not present in the models of [5], [6] (e.g. toxicity) can still be handled via the ISO 25000 conforming mechanism [27].
18 Standard scenarios rely on standard datasets, both for training and for input prompts. A set of scenarios is in general suitable for a set of measures (e.g. the CivilComments and RealToxicityPrompts datasets are suitable for toxicity measurement).
We recall that in [7] we considered the measure of detection of toxic text¹⁹ and defined table 1 below.

In conclusion, both the measures²⁰ and the evaluation method of HELM [23] can be accounted as compliant to ISO AI quality standards.

Table 1: Toxicity measure (ISO/IEC 25000 format)

6. Evaluation example: Papers with Code [26]

Papers with Code is a resource project hosted by Meta AI Research and based on an increasing number of research papers that are mostly uploaded with code; this huge repository also contains the results of measurements made by running the code of each paper over standard datasets (e.g. ImageNet, …).

It is easy to check that PwC applies:

a) Taxonomizing of various technologies²¹ by 5189 tasks²²
b) Selection of characteristics²³ and measures
c) Definition of a quality rating module that includes 1 standard scenario²⁴; benchmarks calculation; the benchmark is intended as the set of performance measurements (e.g. Top-1 Accuracy) of all the available models against the same dataset in a limited simulation environment; the measurements rating is displayed through a curve that links the performance score of each model over time (plane x=time, y=score, see fig. 4)
d) Evaluation (graphical) by each model and by each characteristic over a single scenario

It should be noted that:
- the simulation environment of PwC is simplified (same code language, single dataset, …) and cannot represent the production environment, but it gives immediate and homogeneous graphical comparisons among all the coded models solving a certain task, and
- the quality analysis is not complete, as it is missing a multidimensional rating (e.g. a weighted score over multiple characteristics).

Figure 4: Historical benchmark for image classification
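As a minimal sketch of the curve described in item c), i.e. the running best score over time for a single task and dataset as in figure 4, the snippet below keeps the best-so-far score of a leaderboard; the dates, model names and scores are invented, and real Papers with Code leaderboards are much larger.

```python
# Running best (state-of-the-art) score over time for one task/dataset,
# i.e. the x=time, y=score curve of figure 4. All entries are invented.
from datetime import date

results = [  # (publication date, model, Top-1 accuracy)
    (date(2016, 3, 1), "model_v1", 0.769),
    (date(2018, 6, 1), "model_v2", 0.823),
    (date(2019, 9, 1), "model_v3", 0.815),  # good, but not a new best
    (date(2021, 1, 1), "model_v4", 0.884),
]

best, best_model, curve = 0.0, None, []
for when, model, score in sorted(results):
    if score > best:
        best, best_model = score, model
    curve.append((when, best))                 # one point of the benchmark curve
    print(f"{when}: best so far {best:.1%} ({best_model})")
```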
19 In NLP applications there is the general task of text classification, and among its instances there is the specific task for the machine to detect prompts with toxic text (e.g. biased questions, hate speech, …).
20 For the scope of this paper, we do not discuss the characteristic to which the measure of table 1 is referred; as a hypothesis, it could be referred to Functional correctness.
21 TechMethod of fig. 1; it is a miscellaneous, non-homogeneous first-level description of the technological solution, like NN, LLM, ….
22 Tasks correspond to algorithm in fig. 1.
23 Most of the scenarios refer to the accuracy characteristic and its measures.
24 Standard scenarios rely on standard datasets, both for training and for input prompts. A set of scenarios is in general suitable for a set of measures (e.g. the CivilComments and RealToxicityPrompts datasets are suitable for toxicity measurement).
Despite those limitations, steps a), b), c), d) are consistent with the ISO approach; in conclusion, both the measures and the evaluation method of PwC can be accounted as compliant to ISO AI quality standards.

7. Proposal

The proposal in this paper completes the proposal in [7][6]; there we showed how to account for and represent measures from AI practices in the ISO/IEC 25000 format, here we explore how some existing quality evaluation practices are accountable as evaluation according to AI ISO standards.

Finally, a difference between ‘local’ and ‘global’ benchmarking is highlighted; the former is always possible, as it is always possible to measure and to rate against the quality targets defined by the organization that handles the product; the latter, intended as a comparison among different products, is challenging, due to the manifold of environments of the products, even when they are performing the same task.

8. Conclusion

The set of ISO AI standards clearly leverages the SQuaRE product quality standards, both for measures and for evaluation. According to SQuaRE and ISO/IEC 42001, it is always possible to define a quality model and a set of measures, even customized, to accomplish any measurement campaign.

In the same manner, it is always possible to define a quality rating module and accomplish an AI product evaluation on behalf of an organization or a third party.

Benchmarking is a useful method to rate, to assign performance targets, or to compare products, but it is in general possible only locally, i.e., it is not possible for all the products, even within the same category, because of the limitations discussed above.

Such considerations are supported by the practices analyzed.

References

[1] D. Natale, A. Trenta, Examples of practical use of ISO/IEC 25000, Proceedings of IWESQ@APSEC 2019. URL: http://ceur-ws.org/Vol-2545/
[2] A. Trenta, ISO/IEC 25000 quality measures for A.I.: a geometrical approach, Proceedings of IWESQ@APSEC 2020. URL: http://ceur-ws.org/Vol-2800/
[3] A. Trenta, Data bias measurement: a geometrical approach through frames, Proceedings of IWESQ@APSEC 2021. URL: http://ceur-ws.org/Vol-3114/
[4] A. Simonetta, A. Trenta, M. C. Paoletti, A. Vetrò, Metrics for identifying bias in datasets, SYSTEM, 2021.
[5] D. Natale, Extensions of ISO/IEC 25000 quality models to the context of Artificial Intelligence, Proceedings of IWESQ@APSEC 2022. URL: https://ceur-ws.org/Vol-3356
[6] A. Trenta, ISO/IEC 25000 and AI Product Quality Measurement Perspectives, Proceedings of IWESQ@APSEC 2022. URL: https://ceur-ws.org/Vol-3356
[7] A. Trenta, Accounting AI Measures as ISO/IEC 25000 Standards Measures, Proceedings of IWESQ@APSEC 2023. URL: http://ceur-ws.org/Vol-3612
[8] ISO/IEC 22989:2022 Information technology — Artificial intelligence — Artificial intelligence concepts and terminology. URL: https://www.iso.org/standard/74296.html
[9] ISO/IEC 23053:2022 Framework for Artificial Intelligence (AI) Systems Using Machine Learning (ML). URL: https://www.iso.org/standard/74438.html
[10] ISO/IEC TR 24372 Information technology — Artificial intelligence — Overview of computational approaches for AI systems. URL: https://www.iso.org/standard/78508.html
[11] ISO/IEC TR 24030:2024 Information technology — Artificial intelligence (AI) — Use cases. URL: https://www.iso.org/standard/84144.html
[12] ISO/IEC DIS 25059 Software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Quality Model for AI-based systems. URL: https://www.iso.org/standard/80655.html
[13] ISO/IEC 42001 Information technology — Artificial intelligence — Management system. URL: https://www.iso.org/standard/81230.html
[14] ISO/IEC TS 25058 Software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Guidance for quality evaluation of artificial intelligence (AI) systems. URL: https://www.iso.org/standard/82570.html
[15] ISO/IEC CD TS 17847 Information technology — Artificial intelligence — Verification and validation analysis of AI systems.
[16] ISO/IEC 25040 Systems and software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Quality evaluation framework.
[17] ISO/IEC AWI TR 42106 Information technology — Artificial intelligence — Overview of differentiated benchmarking of AI system quality characteristics.
[18] ISO/IEC DIS 25002 Systems and software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Quality models overview and usage.
[19] ISO/IEC 5259-2 Artificial intelligence — Data quality for analytics and ML — Part 2: Data quality measures. URL: https://www.iso.org/standard/81860.html
[20] ISO/IEC TR 24029-1:2021 Artificial Intelligence (AI) — Assessment of the robustness of neural networks — Part 1: Overview.
[21] ISO/IEC 24029-2:2023 Artificial intelligence (AI) — Assessment of the robustness of neural networks — Part 2: Methodology for the use of formal methods.
[22] ISO/IEC TR 12182:2015 Systems and software engineering — Framework for categorization of IT systems and software, and guide for applying it. URL: https://www.iso.org/standard/63611.html
[23] P. Liang, R. Bommasani, T. Lee et al., Holistic Evaluation of Language Models, Stanford Institute for Human-Centered Artificial Intelligence (HAI), Stanford University, 2022.
[24] ITU-T F.748.11, Metrics and evaluation methods for a deep neural network processor benchmark, 2020.
[25] ITU-T F.748.12, Deep learning software framework evaluation methodology, 2021.
[26] Meta Research, Papers with Code resource. URL: https://paperswithcode.com/sota