AI Evaluation and ISO Standards: Is the Gap Closed?

Andrea Trenta¹
¹ UNINFO UNI TC 533 Technical Committee Artificial Intelligence, Turin, Italy

Abstract
This paper is part of a set of papers showing how newly defined data and software quality measures can be described in the ISO 25000 format. In the first group of papers [1], [2], [3], [4] we discussed, with the help of some examples, the general approach to conformance when new quality measures are defined; in the most recent papers [6], [7] we showed how to build practical ISO/IEC 25000 compliant product quality measures for AI, starting from measures developed in several public projects. In this paper we analyze the feasibility of evaluating an AI product according to the AI ISO/IEC standards, through examples drawn from existing practices. Moreover, this paper can be considered as input to ongoing work in the AI standardization area.

Keywords
product quality, measures, accuracy, metric, machine learning, artificial intelligence

IWESQ'24: International Workshop on Experience with SQuaRE Family and its Future Direction, December 03, 2024, Chongqing, CN
andrea.trenta@dataqualitylab.it (A. Trenta)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings, ceur-ws.org, ISSN 1613-0073

1. Introduction

Policy makers, industries, and academia are facing the problem of building trust in AI, and trustworthiness, in turn, requires [15] that the AI product is measured, evaluated, and finally assessed. It is to be noted that an organization can accomplish measurement and evaluation based on existing practices, as is now advised and supported by [13]. Starting from this point, in the present paper we show, through a sort of reverse engineering, why the measurement and evaluation activities of existing practices are coherent with the SQuaRE and AI ISO standards, and which issues possibly remain. Figure 1 gives an overview of the entities that contribute to AI product assessment.

Figure 1: AI evaluation overview (UML-like)

2. AI standardization (2024 update)

The task of drafting standards for AI trustworthiness was mainly, but not only, assigned to the international standardization body ISO/IEC SC42 and to the European standardization body CEN/CENELEC JTC21, which are in charge of drafting technical standards in support of industry and of lawful rules.

SC42 achieved important results in drafting AI standards, and this success leveraged, among others, the foundational SQuaRE and testing standards from SC7, the implementation standards, the assessment standards and the management system standards. The work of ISO/IEC SC42 has thus given birth to a set of AI standards covering topics such as definitions [8], software and data quality [19], testing, risk management, assessment, management systems and applications, according to the non-official scheme of figure 2.

Figure 2: Non-official ISO standards by topic

It is to be noted that the SQuaRE product quality standards play a central role in AI measurement and evaluation, and SC42 has developed extensions [12], [14] to the standards of the ISO/IEC 25000 series. Indeed, the ISO/IEC 25000 series itself foresees the possibility to extend the model to specific technologies like AI, through the definition of new characteristics and new measures. This role and its reasons are also well explained in the ISO/IEC news given at https://www.iec.ch/blog/new-international-standard-ensuring-quality-ai-systems.

In this context, the assessment of product quality, possibly together with the assessment of process quality [13], will be performed on a voluntary or, in the near future, on a mandatory basis: in the former case to promote trustworthiness in AI systems, in the latter case to achieve compliance with rules [14].
3. ISO standards for AI evaluation

We focus on the topic of AI product quality evaluation and, in the following, we analyze² a set of ISO standards on AI and their mutual relationships that are relevant for organizations that develop, deploy and use AI, namely:

• ISO/IEC 42001 'AI management system' [13]
• ISO/IEC CD TS 17847 'Verification and validation analysis of AI systems' [15]
• ISO/IEC TS 25058 'Guidance for quality evaluation of artificial intelligence (AI) systems' [14]
• ISO/IEC 25059 'Quality model for AI systems' [12]
• ISO/IEC 25040 'Quality evaluation framework' [16]
• ISO/IEC AWI TR 42106³ 'Overview of differentiated benchmarking of AI system quality characteristics' [17]

Firstly, we recall that [13] gives guidance on the management system mainly for AI products and services, as existing management systems for processes are applicable to AI as well. The full picture in [13] is completed by the reference to [12], as even AI performance can be addressed through an AI quality model.

According to [15], verification and validation analysis for AI systems⁴ is made of: (1) quality evaluation, (2) formal methods and (3) simulation. The reason why [15] was introduced in addition to the existing V&V standards is that testing and quality evaluation are considered not exhaustive when formal methods⁵ and simulation are also feasible (e.g. in the case of neural networks), so that these should be applied jointly with traditional quality evaluation of AI systems. By the way, at the moment [13] does not cite [15], but it could coherently be cited, when necessary, in future revisions of [13].

In all the AI standards it is recognized that the reference for AI quality evaluation is [14], which in turn is based on the quality models defined in [9].

It is to be noted that, for the AI set of standards, many concepts can be imported from [16]: (1) quality rating module, (2) recommended scoring method, (3) categorization of software, (4) quality evaluation, (5) evaluation result.

(1) A quality rating module is defined as a 'set of quality measures, operational environment, and methods for conducting quality measurements and quality ratings on a specific category of target entities'; it takes into account the variability of environments and measures through rich documentation. The rating template contains a parametric description of the rating method and of the measurement environment, in the perspective that they will be reused, and it includes the rating level setting and the range of acceptance.

(2) The overall score can be a recommended scoring method; it consists of a weighted linear combination of the individual scores obtained from the measurements of selected measures of a group of (sub)characteristics. In this perspective, an overall quality score Qs could be⁶ the sum, over the Wi-weighted i-characteristics selected for the evaluation, of their j-measurements Mij, each compared with the relevant benchmark⁷ Bij (a code sketch illustrating this computation is given below, after the list of concepts):

    Q_s = \sum_{i} W_i \cdot \sum_{j} \frac{M_{ij}}{B_{ij}}

(3) Categorization of software is defined as a 'specific way to allocate a target system into a category'; [22] also contained the concept of differentiated levels of quality associated with an application software: for example, a lower level of reliability is required for banking software than for a meteorological satellite.

(4) Quality evaluation is composed of the following steps:
- measurement
- rating (single dimension)
- rating (multiple dimensions⁸)

(5) The evaluation result can be of two types:
- a pass/fail result, or
- a quality score

A quality score output is advised in an industry context and from a user's perspective, while a pass/fail output is sufficient for compliance with technical requirements or other kinds of prescriptions, including certification.

² Sources are the public pages of: (1) the ISO Online Browsing Platform, available at https://www.iso.org/obp; (2) the ISO Online Standards directory, available at https://www.iso.org/standards.html.
³ AWI and CD mean standards 'under development'.
⁴ This is intended as an AI system that encompasses AI system components and the interaction of non-AI components with the AI system components.
⁵ The measurement source for formal methods is a model and not a delivered machine; on the other hand, the measurement source for AI quality rating should be a deliverable source code or a delivered machine, and not a model.
⁶ The formula is taken from [6].
⁷ In the formula, Bij is defined as the best value of Mij (e.g., the best value of a historical series of Mij measurements) and acts as a normalizing factor.
⁸ The multidimensional rating is better known as quality analysis.
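To make concepts (2), (4) and (5) concrete, the following is a minimal Python sketch, not taken from any of the cited standards: it computes the overall score Qs of the formula above from hypothetical weights Wi, measurements Mij and benchmarks Bij, and then derives a pass/fail result from an assumed acceptance threshold. The characteristic names, values and threshold are illustrative assumptions, not values prescribed by SQuaRE.

```python
# Illustrative sketch (not from the standards): overall quality score Qs as a
# weighted sum of benchmark-normalized measurements, plus a pass/fail rating.

# Hypothetical characteristics with weights Wi, measurements Mij and benchmarks Bij.
characteristics = {
    "functional_correctness": {"weight": 0.6,
                               "measurements": [0.92, 0.88],   # Mij
                               "benchmarks":   [0.95, 0.90]},  # Bij (best known values)
    "performance_efficiency": {"weight": 0.4,
                               "measurements": [120.0],        # e.g. inferences per second
                               "benchmarks":   [150.0]},
}

def overall_score(chars: dict) -> float:
    """Qs = sum_i Wi * sum_j Mij / Bij, where Bij acts as a normalizing factor."""
    return sum(
        c["weight"] * sum(m / b for m, b in zip(c["measurements"], c["benchmarks"]))
        for c in chars.values()
    )

ACCEPTANCE_THRESHOLD = 1.5   # assumed rating level, to be set in the quality rating module

qs = overall_score(characteristics)
print(f"Qs = {qs:.3f}", "-> pass" if qs >= ACCEPTANCE_THRESHOLD else "-> fail")
```

In this sketch the single-dimension rating corresponds to each Mij/Bij ratio, while the weighted sum and the threshold comparison play the role of the multidimensional rating and of the pass/fail evaluation result, respectively.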
The need for concepts (1), (2) and (3) is very clear and unavoidable, as AI software is highly differentiated, in the sense that there are many algorithms for each task (recognition, classification, prediction, etc.) and, in turn, each algorithm can behave differently due to a different model, training, hyperparameters, etc.

As represented in figure 1 and detailed in [6], [7], the rich manifold of measures implies a deep categorization of the software under measurement, which includes even the specification of the code. This approach is also coherent with the fact that code libraries often include their own coded measures⁹ (a brief sketch is given at the end of this section).

This need has coherently been taken into account, for example, in [17], where benchmarking is 'differentiated' because of the difficulty of applying an 'absolute' benchmarking. It is to be pointed out that the issue of comparing measurements is relevant for a non-local quality evaluation process (i.e. product comparison) and not for the measurement process: in fact, it is always possible to carry out a measurement with any metric¹⁰; on the contrary, the comparison and replicability of the measured values are challenging, due to the extreme variability of AI products even when they perform the same task: e.g. an image classification task can be performed through thousands of combinations of models, algorithms, training data, etc.

Another widely recognized limitation to benchmarking [21], [17] arises when an ML system implemented with neural networks uses continuous learning: its hyperparameters evolve, and the measurement of the characteristics of the NN can differ (and be assessed as worse or better) from the measurement taken in the initial state. This is also the reason why some AI devices are deployed and sold as "frozen", giving the user-buyer a guarantee that the behaviour and performance values of the ML system will remain the same over time. Anyway, additional requirements (e.g. operational performance not worse than the tested one) can be satisfied, so enlarging the field of evaluation both along time and over the post-training data, and a further assessment of the ML system can be performed in operational mode.

⁹ One example is scikit-learn: https://scikit-learn.org/1.5/modules/model_evaluation.html
¹⁰ See [1], [6], [18], [13] for the definition of new measures.
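As an illustration of the remark in footnote 9 that code libraries ship their own coded measures, the sketch below uses scikit-learn's accuracy_score and f1_score on a toy set of predictions. The data are invented and the choice of measures is only an example of library-provided measures that could enter a quality rating module; nothing here is specific to the standards discussed above.

```python
# Illustrative only: a measurement taken with library-provided ("coded") measures.
from sklearn.metrics import accuracy_score, f1_score

# Toy ground truth and predictions for an image-classification-like task (invented data).
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]

accuracy = accuracy_score(y_true, y_pred)             # fraction of correct predictions
macro_f1 = f1_score(y_true, y_pred, average="macro")  # another library-coded measure

print(f"accuracy = {accuracy:.3f}, macro F1 = {macro_f1:.3f}")
```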
4. Evaluation example: Rec. ITU-T F.748.11

Rec. ITU-T F.748.11 [24] proposes metrics, benchmarks, a measurement rating and an evaluation method for AI processors. Benchmarks are calculated both for the testing environment and for the production environment, the latter being the relevant one for quality evaluation. Despite the different focus (a processor instead of a pure software application), the approach of ITU-T F.748.11 is the same as that of the other examples, as it is based on:

a) taxonomizing of deep neural networks¹¹ by 12 models¹²
b) selection of 6 characteristics¹³ and 11 measures
c) definition of a quality rating module that includes 8 standard scenarios¹⁴, benchmark calculation and measurement rating
d) evaluation by each model and by each characteristic over a single scenario

It should be noted that [24] is the only standard that details both the training scenario (e.g. ResNet_50, ImageNet) and the relevant target value, i.e. the benchmark (e.g. 74.9% Top-1 Accuracy), and this approach seems to be advised by [15]. So, we can conclude that both the measures and the evaluation method of ITU-T F.748.11 [24] can be accounted as compliant to the ISO AI quality standards.

5. Evaluation example: Holistic Evaluation of Language Models (HELM) [23]

In [7] we showed how the measures performed in the HELM research [23] are, even if unintentionally, compliant to the ISO AI quality standards. In this paper we show that not only the measures but also the evaluation method of HELM is compliant to the ISO AI quality standards. This conclusion is supported by the fact that HELM uses the same concepts as the ISO standards; in particular, the entities in https://crfm.stanford.edu/helm/v0.2.2/ map onto the entities of figure 1 as follows:

a) taxonomizing of LLMs¹⁵ by 36 models¹⁶
b) selection of 13 characteristics¹⁷ and 57 measures
c) definition of a quality rating module that includes 42 standard scenarios¹⁸, then benchmark measurement rating
d) evaluation by each model and by each characteristic over multiple scenarios, as shown in figure 3

Figure 3: Accuracy over multiple scenarios (win rate) [20]

¹¹ Deep NN is a TechMethod in fig. 1.
¹² Model is an algorithm in fig. 1.
¹³ Most of the scenarios refer to the accuracy characteristic and its measures.
¹⁴ Standard scenarios rely on standard datasets, both for training and for input prompts. A set of scenarios is in general suitable for a set of measures (e.g. the CivilComments and RealToxicityPrompts datasets are suitable for toxicity measurement).
¹⁵ LLM is a TechMethod in fig. 1.
¹⁶ Model is an algorithm in fig. 1.
¹⁷ Characteristics that are not present in the models of [5], [6] (e.g. toxicity) can still be handled through the ISO 25000 conformance mechanism [27].
¹⁸ Standard scenarios rely on standard datasets, both for training and for input prompts. A set of scenarios is in general suitable for a set of measures (e.g. the CivilComments and RealToxicityPrompts datasets are suitable for toxicity measurement).
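Figure 3 reports accuracy as a win rate over multiple scenarios. The sketch below shows one plausible way such a rating could be computed, assuming a table of per-scenario accuracy values for each model: a model's win rate is the fraction of head-to-head comparisons, over all scenarios and all other models, in which it obtains the better value. The data and the exact aggregation rule are assumptions for illustration, not a reproduction of HELM's code.

```python
# Illustrative sketch: win rate of each model over multiple scenarios.
# scores[model][scenario] = accuracy of that model on that scenario (invented data).
scores = {
    "model_A": {"scenario_1": 0.81, "scenario_2": 0.74, "scenario_3": 0.90},
    "model_B": {"scenario_1": 0.79, "scenario_2": 0.80, "scenario_3": 0.85},
    "model_C": {"scenario_1": 0.70, "scenario_2": 0.72, "scenario_3": 0.88},
}

def win_rate(model: str, scores: dict) -> float:
    """Fraction of (scenario, other model) comparisons won by `model`."""
    wins, comparisons = 0, 0
    for scenario in scores[model]:
        for other in scores:
            if other == model:
                continue
            comparisons += 1
            if scores[model][scenario] > scores[other][scenario]:
                wins += 1
    return wins / comparisons

for m in scores:
    print(f"{m}: win rate = {win_rate(m, scores):.2f}")
```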
We recall that in [7] we considered the measure of detection of toxic text¹⁹ and defined table 1 below.

Table 1: Toxicity measure (ISO/IEC 25000 format)

In conclusion, both the measures²⁰ and the evaluation method of HELM [23] can be accounted as compliant to the ISO AI quality standards.

6. Evaluation example: Papers with Code [26]

Papers with Code is a resource project hosted by Meta AI Research and based on an increasing number of research papers that are mostly uploaded together with their code; this huge repository also contains the results of measurements obtained by running the code of each paper over standard datasets (e.g. ImageNet, …).

It is easy to check that PwC applies:

a) taxonomizing of various technologies²¹ by 5189 tasks²²
b) selection of characteristics²³ and measures
c) definition of a quality rating module that includes 1 standard scenario²⁴ and benchmark calculation; the benchmark is intended as the set of performance measurements (e.g. Top-1 Accuracy) of all the available models against the same dataset in a limited simulation environment; the measurement rating is displayed through a curve that links the performance scores of the models over time (plane x=time, y=score, see figure 4)
d) evaluation (graphical) by each model and by each characteristic over a single scenario

Figure 4: Historical benchmark for image classification

It should be noted that:
- the simulation environment of PwC is simplified (same code language, single dataset, …) and cannot represent the production environment, but it gives an immediate and homogeneous graphical comparison among all the coded models solving a certain task, and
- the quality analysis is not complete, as a multidimensional rating (e.g. a weighted score over multiple characteristics) is missing.

Despite those limitations, steps a), b), c), d) are consistent with the ISO approach; in conclusion, both the measures and the evaluation method of PwC can be accounted as compliant to the ISO AI quality standards.

¹⁹ In NLP applications there is the general task of text classification and, among its instances, the specific task of detecting prompts with toxic text (e.g. biased questions, hate speech, …).
²⁰ For the scope of this paper, we do not discuss the characteristic to which the measure of table 1 is referred; as a hypothesis, it could be referred to functional correctness.
²¹ TechMethod of fig. 1; it is a miscellaneous, non-homogeneous first-level description of the technological solution, like NN, LLM, ….
²² Tasks correspond to algorithms in fig. 1.
²³ Most of the scenarios refer to the accuracy characteristic and its measures.
²⁴ Standard scenarios rely on standard datasets, both for training and for input prompts. A set of scenarios is in general suitable for a set of measures (e.g. the CivilComments and RealToxicityPrompts datasets are suitable for toxicity measurement).
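The curve described in step c) above, i.e. the best score reached over time on a fixed scenario, can be reconstructed from a list of dated results. The sketch below does this with invented entries in the spirit of figure 4; the field layout and values are assumptions for illustration, not data taken from paperswithcode.com.

```python
# Illustrative sketch: historical benchmark curve (best score over time) for one
# task/dataset, as in figure 4. Entries are invented; only the running-maximum logic matters.
from datetime import date

results = [  # (publication date, model name, Top-1 accuracy on the shared dataset)
    (date(2019, 6, 1),  "model_X", 0.842),
    (date(2020, 3, 15), "model_Y", 0.861),
    (date(2021, 1, 10), "model_Z", 0.855),  # below the current best: does not move the curve
    (date(2022, 7, 2),  "model_W", 0.889),
]

best_so_far = 0.0
curve = []   # points of the historical benchmark curve (x = time, y = score)
for when, model, score in sorted(results):
    if score > best_so_far:
        best_so_far = score
        curve.append((when, model, best_so_far))

for when, model, score in curve:
    print(f"{when}: {score:.3f}  ({model})")
```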
7. Proposal

The proposal in this paper completes the proposal in [6], [7]: there we showed how to account for and represent measures from AI practices in the ISO/IEC 25000 format; here we explore how some existing quality evaluation practices can be accounted as evaluations according to the AI ISO standards.

Finally, a difference between 'local' and 'global' benchmarking is highlighted. The former is always possible, as it is always possible to measure and to rate against the quality targets defined by the organization that handles the product; the latter, intended as a comparison among different products, is challenging, due to the manifold of environments of the products even when they perform the same task.

8. Conclusion

The set of ISO AI standards clearly leverages the SQuaRE product quality standards, both for measures and for evaluation. According to SQuaRE and ISO/IEC 42001, it is always possible to define a quality model and a set of measures, even customized ones, to accomplish any measurement campaign.

In the same manner, it is always possible to define a quality rating module and accomplish an AI product evaluation on behalf of an organization or a third party.

Benchmarking is a useful method to rate, to assign performance targets, or to compare products, but it is in general possible only locally, i.e. it is not possible for all products, even within the same category, because of the limitations discussed above.

Such considerations are supported by the practices analyzed.

References

[1] D. Natale, A. Trenta, Examples of practical use of ISO/IEC 25000, Proceedings of IWESQ@APSEC 2019. URL: http://ceur-ws.org/Vol-2545/
[2] A. Trenta, ISO/IEC 25000 quality measures for A.I.: a geometrical approach, Proceedings of IWESQ@APSEC 2020. URL: http://ceur-ws.org/Vol-2800/
[3] A. Trenta, Data bias measurement: a geometrical approach through frames, Proceedings of IWESQ@APSEC 2021. URL: http://ceur-ws.org/Vol-3114/
[4] A. Simonetta, A. Trenta, M. C. Paoletti, A. Vetrò, Metrics for identifying bias in datasets, SYSTEM, 2021.
[5] D. Natale, Extensions of ISO/IEC 25000 quality models to the context of Artificial Intelligence, Proceedings of IWESQ@APSEC 2022. URL: https://ceur-ws.org/Vol-3356
[6] A. Trenta, ISO/IEC 25000 and AI Product Quality Measurement Perspectives, Proceedings of IWESQ@APSEC 2022. URL: https://ceur-ws.org/Vol-3356
[7] A. Trenta, Accounting AI Measures as ISO/IEC 25000 Standards Measures, Proceedings of IWESQ@APSEC 2023. URL: http://ceur-ws.org/Vol-3612
[8] ISO/IEC 22989:2022 Information technology — Artificial intelligence — Artificial intelligence concepts and terminology. URL: https://www.iso.org/standard/74296.html
[9] ISO/IEC 23053:2022 Framework for Artificial Intelligence (AI) Systems Using Machine Learning (ML). URL: https://www.iso.org/standard/74438.html
[10] ISO/IEC TR 24372 Information technology — Artificial intelligence — Overview of computational approaches for AI systems. URL: https://www.iso.org/standard/78508.html
[11] ISO/IEC TR 24030:2024 Information technology — Artificial intelligence (AI) — Use cases. URL: https://www.iso.org/standard/84144.html
[12] ISO/IEC DIS 25059 Software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Quality model for AI-based systems. URL: https://www.iso.org/standard/80655.html
[13] ISO/IEC 42001 Information technology — Artificial intelligence — Management system. URL: https://www.iso.org/standard/81230.html
[14] ISO/IEC TS 25058 Software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Guidance for quality evaluation of artificial intelligence (AI) systems. URL: https://www.iso.org/standard/82570.html
[15] ISO/IEC CD TS 17847 Information technology — Artificial intelligence — Verification and validation analysis of AI systems
[16] ISO/IEC 25040 Systems and software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Quality evaluation framework
[17] ISO/IEC AWI TR 42106 Information technology — Artificial intelligence — Overview of differentiated benchmarking of AI system quality characteristics
[18] ISO/IEC DIS 25002 Systems and software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Quality models overview and usage
[19] ISO/IEC 5259-2 Artificial intelligence — Data quality for analytics and ML — Part 2: Data quality measures. URL: https://www.iso.org/standard/81860.html
[20] ISO/IEC TR 24029-1:2021 Artificial intelligence (AI) — Assessment of the robustness of neural networks — Part 1: Overview
[21] ISO/IEC 24029-2:2023 Artificial intelligence (AI) — Assessment of the robustness of neural networks — Part 2: Methodology for the use of formal methods
[22] ISO/IEC TR 12182:2015 Systems and software engineering — Framework for categorization of IT systems and software, and guide for applying it. URL: https://www.iso.org/standard/63611.html
[23] P. Liang, R. Bommasani, T. Lee, et al., Holistic Evaluation of Language Models, Stanford Institute for Human-Centered Artificial Intelligence (HAI), Stanford University, 2022.
[24] ITU-T F.748.11 Metrics and evaluation methods for a deep neural network processor benchmark, 2020.
[25] ITU-T F.748.12 Deep learning software framework evaluation methodology, 2021.
[26] Meta Research, Papers with Code resource. URL: https://paperswithcode.com/sota