=Paper=
{{Paper
|id=Vol-3612/IWESQ_2023_Paper_04
|storemode=property
|title=Accounting AI Measures as ISO/IEC 25000 Standards Measures
|pdfUrl=https://ceur-ws.org/Vol-3612/IWESQ_2023_Paper_04.pdf
|volume=Vol-3612
|authors=Andrea Trenta
|dblpUrl=https://dblp.org/rec/conf/apsec/Trenta23
}}
==Accounting AI Measures as ISO/IEC 25000 Standards Measures==
Accounting AI Measures as ISO/IEC 25000 Standards Measures

Andrea Trenta

UNINFO UNI TC 533 Technical Committee Artificial Intelligence, Turin, Italy

Abstract: This paper is part of a set of papers showing how newly defined data and software quality measures can be described in the ISO 25000 format. In a first group of papers [3], [1], [28], [2], we discussed, with the help of some examples, the general approach to conformance when new quality measures are defined, and in the most recent paper [20] we showed how to build practical ISO/IEC 25000 compliant product quality measures for AI, starting from measures developed in several public projects. In this paper we continue to show, through some examples, that standards and research coming from the scientific community on the topic of AI measures can easily be accounted as ISO/IEC 25000 measures. Moreover, the paper can be considered a contribution to the work in the AI standardization area.

Keywords: product quality, measures, accuracy, ISO, ISO/IEC 25059, ISO/IEC 5259, ISO/IEC 24029, ITU-T F.748.11, metric, AI, ML, Machine Learning, Artificial Intelligence

Proceedings of the 5th International Workshop on Experience with SQuaRE Series and its Future Direction (IWESQ@APSEC), December 4th, 2023, Seoul, Korea. Email: andrea.trenta@dataqualitylab.it. © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Policy makers, industries, and academia are facing the problem of building trust in AI. In this paper we show, following the approach of a previous paper [20], how some AI measures taken from non-ISO standards and from the research literature can be accounted as ISO/IEC 25000 AI product quality measures.

The items considered for AI product quality measures are recalled in the following "shopping list":

• 'method' is the high-level categorization [7], e.g. decision tree, k-means clustering, neural networks, …
• 'algorithm' is the type of method (Note 2), e.g. ResNet for method=NN
• 'library' contains the code to be invoked for evaluation (see the machine learning process in [11])
• 'parameters' are the configuration data of the algorithm
• 'training' includes the dataset (ImageNet, MNIST, …) and the process (initialization, retraining, …).

Note 2: 'algorithm' means the categorization of the code that performs the task; e.g., for the classification task, the 'algorithm' can be a neural network, a decision tree, a support vector machine, or other.

Figure 1: AI quality measure shopping list (UML-like)

For the following, it is useful to recall the measurement definitions given in [20]. The implementation I is defined as a function:

1) I = I(method, algorithm(library, parameters), training(dataset, process))

Then we can define, for the i-th characteristic and the j-th measure,

2) Mij = Mij(I)

and, taking 1) into account:

3) Mij = Mij(method, algorithm(library, parameters), training(dataset, process))

With those definitions, the benchmark Bij is the best value of Mij for the time being (e.g. for a full year) for the i-th characteristic and the j-th measure (Note 3), among all the K implementations Ik:

4) Bij = max_k Mij(Ik), k = 1, …, K

Note 3: in 4) the j-th measure is supposed to be a scalar; if the j-th measure is a vector or a matrix, expression 4) should be adapted.

Starting from those definitions, we map some existing measures to ISO 25000 measures when 1) holds.
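Definitions 1)–4) lend themselves to a direct transcription in code. The sketch below is illustrative only: the class and function names are ours, not part of [20] or of any standard, and the field values in the usage example are made up. `benchmark_B` implements expression 4) for a scalar measure.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class Implementation:
    """The implementation I of expression 1): the shopping-list items of Figure 1."""
    method: str        # high-level categorization, e.g. "neural network"
    algorithm: str     # type of method, e.g. "ResNet"
    library: str       # code to be invoked for evaluation
    parameters: dict   # configuration data of the algorithm
    dataset: str       # training dataset, e.g. "ImageNet"
    process: str       # training process, e.g. "retraining"

# A measure Mij is a function of the implementation, as in expression 2)
Measure = Callable[[Implementation], float]

def benchmark_B(measure: Measure, implementations: List[Implementation]) -> float:
    """Benchmark Bij = max_k Mij(I_k) over the K implementations, as in 4).

    Assumes the measure is a scalar (see Note 3).
    """
    return max(measure(impl) for impl in implementations)
```

With a (fictitious) accuracy measure, `benchmark_B` returns the best value among the K implementations, i.e. the benchmark of expression 4).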
In the following, we pick those existing measures from:

A. the ROC curve metric [24],
B. Recommendation ITU-T F.748.11 [21],
C. the Holistic Evaluation of Language Models [19],

and explain how they can be accounted as ISO/IEC 25000 measures; we also make some further considerations about the perspectives of the ongoing standardization work on AI product evaluation and assessment in the relevant bodies (Note 4).

Note 4: the topic of product measurement is distinct from the topic of process measurement.

2. AI Standardization (2023 Update)

Policy makers have addressed the issue of AI trustworthiness mainly, but not only, to the international standardization body ISO/IEC SC42 and to the European standardization body CEN/CENELEC JTC21, which are in charge of drafting technical standards in support of industry and of lawful rules. For the scope of this paper we consider, among others, the standards based on the ISO 25000 series that define, or contribute to defining, product quality for an AI product [8]. The assessment of product quality, possibly together with the assessment of process quality [9], will be performed in the near future on a voluntary or mandatory basis: in the former case to promote trustworthiness in AI systems, in the latter case to achieve compliance with rules [10]. In the following, we focus on ML-based AI systems [4].

The work of ISO/IEC SC42 in the last years has given birth to a set of standards on AI covering topics such as quality, testing, risk, management systems, data, and applications, according to the non-official scheme of Figure 2. It is to be noted that SC42 has developed, and is developing, extensions [5], [23] to the standards of the ISO/IEC 25000 series; this appears at the moment to be the most mature approach to AI product evaluation, as it relies on the core SQuaRE standards developed since 2008. Indeed, ISO/IEC 25000 itself foresees the possibility to extend the model to specific technologies like AI, through the definition of new characteristics and new measures. This view and its reasons are also well explained in the ISO/IEC news at https://www.iec.ch/blog/new-international-standard-ensuring-quality-ai-systems.

At the moment, the ISO 25000 extensions for AI are the technical specification for AI product quality evaluation [23], which is under development, and the quality model for AI already published [5], which is to be read in conjunction with [23]. The considerations of this paper support the current set of ISO standards.

Figure 2: Non-official AI ISO standards by topics

3. Example: AUC (Area Under the Receiver Operating Curve)

A receiver operating characteristic (ROC) curve is a graphical method for displaying the true positive rates and false positive rates of a binary classifier across multiple thresholds [7]. To express performance across all thresholds, the area under the receiver operating characteristic curve (AUC) can be calculated. Higher AUCs indicate more robust performance, ranging from 0 (worst) to 1 (best); classifiers that perform no better than chance have an AUC of 0.5.

AUC is an example of how statistical methods for assessing NNs can be accounted as ISO 25000 quality measures (Table 1).

Table 1: Accuracy measure – AUC in ISO 25000 format

In conclusion, an AI measure like AUC, well known in the scientific literature and classified according to [24] in the category of statistical methods, can be represented in the ISO/IEC 25000 format (Note 5).

Note 5: for the scope of this paper, we do not discuss the characteristic to which the measure of Table 1 refers; as a hypothesis, it could refer to Functional correctness.
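The AUC discussed above can be computed without drawing the ROC curve, via its probabilistic interpretation: it equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties counting one half). A minimal, library-independent sketch, not the evaluation procedure of any standard:

```python
def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney statistic.

    Ranges from 0 (worst) to 1 (best); a chance-level classifier gives 0.5.
    Assumes both classes are present in `labels` (1 = positive, 0 = negative).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # count positive-negative pairs ranked correctly; ties count 1/2
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfectly separating classifier yields 1.0, while scores carrying no information about the labels yield 0.5, matching the ranges quoted in Section 3.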
4. Example: Rec. ITU-T F.748.11

Rec. ITU-T F.748.11 [21] proposes, among other things, metrics for AI applications. The approach is to define benchmarks: the history of processor evolution, which involves architecture, clock, energy consumption, and more parameters, has shown that it is impossible to make comparisons without a common challenging metric, such as FLOP/s. In the same manner, for each triplet composed of

• Application (e.g. image classification, speech recognition, …),
• Dataset (e.g. ImageNet, LibriSpeech, …),
• ML model (e.g. ResNet, DeepSpeech2, …),

the benchmark

• Accuracy

is defined, with a specific metric for each triplet (e.g. Word Error Rate for speech recognition implemented with DeepSpeech2 and tested against the LibriSpeech dataset). The ML model is further detailed with neuron layers, input size, and source code, e.g.:

• ML model, detailed (e.g. ResNet_50)
• ML model source code (e.g. https://github.com/KaimingHe/deep-residual-networks)

This corresponds to use case 1, Accuracy, shown in [20], and can be represented as in Table 2 for the triplet (image classification, ImageNet, ResNet), implemented with ResNet_50 with source code https://github.com/KaimingHe/deep-residual-networks.

Table 2: Accuracy measure – Prediction accuracy in ISO 25000 format

So, we can conclude that the ITU-T F.748.11 [21] measures can be accounted as ISO/IEC 25000 conforming measures.
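The Word Error Rate cited for the speech-recognition triplet is the word-level edit distance between the reference transcript and the hypothesis, divided by the number of reference words. A minimal sketch for illustration; the normative benchmark procedure is the one defined in [21]:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed as the Levenshtein distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note that a WER measured this way fits directly into the Mij description of Section 1: the metric is fixed, while method, algorithm, library, parameters, and training vary per implementation.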
5. Example: LLM (Large Language Models)

In this clause we try to show how the measures performed in the research [19] can be accounted as ISO/IEC 25000 measures. To do this, we note that in [19] the following criteria are applied, which are the same criteria used for defining ISO 25000 compliant measures [20].

Firstly, [19] taxonomizes the LLM applications, as proposed in [20].

Secondly, the characteristics of the model are identified; new characteristics are introduced (calibration, toxicity) that are not present in the models of [5], [6] but can be handled as ISO 25000 conforming [27].

Thirdly, the measures contain the same description used in [20]:

Mij = Mij(method, algorithm(library, parameters), training(dataset, process))

With those considerations, it is easy to identify the full description of the measures according to the ISO/IEC 25000 format. For example, we consider the measure of detection of toxic text (Note 6) and draw Table 3 below.

Note 6: in NLP applications there is the general task of text classification and, within it, the specific task for the machine to detect prompts with toxic text (e.g. biased questions, hate speech, …).

Table 3: Toxicity – Detection of toxic text in ISO 25000 format

In conclusion, the measures (Note 7) for LLMs presented in [19] can also be mapped to ISO 25000 quality measures.

Note 7: for the scope of this paper, we do not discuss the characteristic to which the measure of Table 3 refers; as a hypothesis, it could refer to Functional correctness.

6. Measurements in Operation

As highlighted in [25], when an ML implemented with neural networks uses continuous learning, its hyperparameters evolve, and the measurement of the characteristics of the NN can differ (and be assessed as worse or better) from the measurement taken in the initial state. This is also the reason why AI medical devices are deployed and sold as "frozen" [26], giving the user-buyer a guarantee that the behaviour of the ML will stay the same over time. Anyway, additional requirements (e.g. operational performance not worse than tested performance) and measurements can be satisfied, thus enlarging the field of evaluation, both along time and over post-training data, and allowing a further assessment of the ML in operational mode.

7. Formal Methods

It is worth noting the awareness in the scientific community of the need for an a-priori guarantee of the robustness of NNs: many papers ([12], [13], [14], [15], [16], [17], [18]) contain the words "certification", "formal guarantees", "verification", or "provably", as they research the proof of a target performance.

To understand how the topic is presently addressed, and to complete the landscape of AI measures, we recall the approach represented by formal methods. According to the classification given in [24], formal methods can successfully answer the question whether or not, for a given neural network with input and output (e.g. input image of an airplane and output label "airplane"), a modification of the input leads to the same output or to a different one (e.g. input image of an airplane with noise, output label "helicopter"). This question can be formulated as a formal mathematical statement that is either true or false for a given neural network and image.

Research has proven that, for neural networks using the piece-wise linear activation function (ReLU), it is possible to measure robustness in terms of a lower bound of the minimal adversarial distortion for given input data points [13]. The results for ReLU were subsequently extended to NNs with common activation functions like sigmoid [13]. The results of various research works on this topic are summarized in §6.2 of [25], which supports formal methods for the engineering or quality evaluation of some NNs and characteristics.

Even if formal methods are at the moment considered [25] a quality approach complementary to ISO/IEC 25000, we could consider the mathematical function that defines the distortion bounds (Note 8) as a SQuaRE measurement function, and then account any formal method as an ISO/IEC 25000 measure.

Note 8: formal methods are based on the theorem that, under certain conditions, there exist upper and lower bounds for an m-layer neural network function f with n_m output neurons, such that for all j ∈ [n_m] and all x with ‖x − x0‖ ≤ ε the following holds: f_j^L(x) ≤ f_j(x) ≤ f_j^U(x), where x is the perturbed input vector, centered in the reference data point x0 and bounded in a sphere of radius ε.

8. Proposal

The proposal in this paper completes the proposal in [20]: there we showed how to design and document a product quality measure that includes algorithm, training dataset, library code, and parameters; here we have shown, in a sort of reverse engineering, how to account and represent measures from standards and scientific literature in the ISO/IEC 25000 format. Finally, some investigation areas (formal methods and operational measures) and the relevant considerations have been presented in the perspective of an even wider application of the proposals of the present paper and of [20].
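The bounds of Note 8 can be illustrated numerically with naive interval bound propagation on a toy ReLU network. This sketch is ours and deliberately simple: it is not the tighter certification algorithm of [13], but it produces valid, if loose, bounds f^L and f^U by propagating the input box [x0 − ε, x0 + ε] layer by layer.

```python
def interval_bounds(layers, x0, eps):
    """Propagate the box [x0 - eps, x0 + eps] through a ReLU network.

    `layers` is a list of (W, b) pairs (weight matrix as list of rows,
    bias vector); ReLU is applied after every layer except the last.
    Returns elementwise lower/upper bounds on the network output.
    """
    lo = [x - eps for x in x0]
    hi = [x + eps for x in x0]
    for li, (W, b) in enumerate(layers):
        new_lo, new_hi = [], []
        for row, bias in zip(W, b):
            # each weight picks its worst-case end of the input interval
            l = bias + sum(w * (lo[k] if w >= 0 else hi[k]) for k, w in enumerate(row))
            u = bias + sum(w * (hi[k] if w >= 0 else lo[k]) for k, w in enumerate(row))
            new_lo.append(l)
            new_hi.append(u)
        if li < len(layers) - 1:  # ReLU on hidden layers only
            new_lo = [max(0.0, v) for v in new_lo]
            new_hi = [max(0.0, v) for v in new_hi]
        lo, hi = new_lo, new_hi
    return lo, hi
```

If, for a given ε, the lower bound of the correct class output stays above the upper bounds of all other outputs, the label provably cannot change inside the sphere: the robustness question of Section 7 is answered affirmatively for that input.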
9. Conclusion

The role of ISO/IEC 25000 in the measurement and assessment of AI product quality is widely recognized and of growing interest. A further confirmation comes from the similarity between the measurement methods developed in scientific literature and projects and the ISO 25000 conforming measurement method, as shown in [20] and in this paper. The paper analyzed this similarity and came to the conclusion that most of the measurement methods used for AI can be easily mapped into the ISO 25000 format.

10. References

[1] A. Trenta, Data bias measurement: a geometrical approach through frames, Proceedings of IWESQ@APSEC 2021. URL: http://ceur-ws.org/Vol-3114/
[2] A. Trenta, ISO/IEC 25000 quality measures for A.I.: a geometrical approach, Proceedings of IWESQ@APSEC 2020. URL: http://ceur-ws.org/Vol-2800/
[3] D. Natale, A. Trenta, Examples of practical use of ISO/IEC 25000, Proceedings of IWESQ@APSEC 2019. URL: http://ceur-ws.org/Vol-2545/
[4] International Organization for Standardization, ISO/IEC 22989:2022 Information technology — Artificial intelligence — Artificial intelligence concepts and terminology. URL: https://www.iso.org/standard/74296.html
[5] International Organization for Standardization, ISO/IEC DIS 25059 Software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Quality model for AI-based systems. URL: https://www.iso.org/standard/80655.html
[6] International Organization for Standardization, ISO/IEC CD 5259-2 (under development) Artificial intelligence — Data quality for analytics and ML — Part 2: Data quality measures. URL: https://www.iso.org/standard/81860.html
[7] International Organization for Standardization, ISO/IEC 23053:2022 Framework for Artificial Intelligence (AI) Systems Using Machine Learning (ML). URL: https://www.iso.org/standard/74438.html
[8] D. Natale, Extensions of ISO/IEC 25000 quality models to the context of Artificial Intelligence, Proceedings of IWESQ@APSEC 2022. To appear.
[9] International Organization for Standardization, ISO/IEC 42001 (draft) Information technology — Artificial intelligence — Management system. URL: https://www.iso.org/standard/81230.html
[10] European Commission, COM/2021/206 'Proposal for a regulation of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) and amending certain Union legislative acts', 2021. URL: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52021PC0206
[11] International Organization for Standardization, ISO/IEC 23053:2022 Framework for Artificial Intelligence (AI) Systems Using Machine Learning (ML). URL: https://www.iso.org/standard/74438.html
[12] M. Hein, M. Andriushchenko, Formal guarantees on the robustness of a classifier against adversarial manipulation, NIPS, 2017.
[13] H. Zhang, T.-W. Weng, P.-Y. Chen, C.-J. Hsieh, L. Daniel, Efficient Neural Network Robustness Certification with General Activation Functions, Neural Information Processing Systems Conference, 2018, 31, 4944–4953.
[14] A. Sinha, H. Namkoong, J. Duchi, Certifiable distributional robustness with principled adversarial training, ICLR, 2018.
[15] A. Raghunathan, J. Steinhardt, P. Liang, Certified defenses against adversarial examples, ICLR, 2018.
[16] T.-W. Weng, H. Zhang, H. Chen, Z. Song, C.-J. Hsieh, D. Boning, I. S. Dhillon, L. Daniel, Towards fast computation of certified robustness for ReLU networks, ICML, 2018.
[17] T. Gehr, M. Mirman, D. Drachsler-Cohen, P. Tsankov, S. Chaudhuri, M. Vechev, AI2: Safety and robustness certification of neural networks with abstract interpretation, IEEE Symposium on Security and Privacy (SP), 2018, pp. 948–963.
[18] M. Mirman, T. Gehr, M. Vechev, Differentiable Abstract Interpretation for Provably Robust Neural Networks, Proceedings of the 35th International Conference on Machine Learning, 2018, 80, 3575–3583.
[19] P. Liang, R. Bommasani, T. Lee et al., Holistic Evaluation of Language Models, Stanford Institute for Human-Centered Artificial Intelligence (HAI), Stanford University, 2022.
[20] A. Trenta, ISO/IEC 25000 and AI Product Quality Measurement Perspectives, Proceedings of IWESQ@APSEC 2022. URL: https://ceur-ws.org/Vol-3356/, ISSN 1613-0073.
[21] ITU-T F.748.11 Metrics and evaluation methods for a deep neural network processor benchmark, 2020.
[22] ITU-T F.748.12 Deep learning software framework evaluation methodology, 2021.
[23] International Organization for Standardization, ISO/IEC DTS 25058 Software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Guidance for quality evaluation of artificial intelligence (AI) systems.
[24] International Organization for Standardization, ISO/IEC TR 24029-1:2021 Artificial Intelligence (AI) — Assessment of the robustness of neural networks — Part 1: Overview.
[25] International Organization for Standardization, ISO/IEC 24029-2:2023 Artificial intelligence (AI) — Assessment of the robustness of neural networks — Part 2: Methodology for the use of formal methods.
[26] M. van Hartskamp, S. Consoli et al., Artificial Intelligence in Clinical Health Care Applications: Viewpoint, Interactive Journal of Medical Research, 2019.
[27] International Organization for Standardization, ISO/IEC DIS 25002 Systems and software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Quality models overview and usage.
[28] A. Simonetta, A. Trenta, M. C. Paoletti, A. Vetrò, Metrics for identifying bias in datasets, SYSTEM, 2021.