=Paper= {{Paper |id=Vol-3612/IWESQ_2023_Paper_04 |storemode=property |title=Accounting AI Measures as ISO/IEC 25000 Standards Measures |pdfUrl=https://ceur-ws.org/Vol-3612/IWESQ_2023_Paper_04.pdf |volume=Vol-3612 |authors=Andrea Trenta |dblpUrl=https://dblp.org/rec/conf/apsec/Trenta23 }} ==Accounting AI Measures as ISO/IEC 25000 Standards Measures== https://ceur-ws.org/Vol-3612/IWESQ_2023_Paper_04.pdf
                          Accounting AI Measures as ISO/IEC 25000 Standards Measures
                          Andrea Trenta 1
                          1
                                UNINFO UNI TC 533 Technical Committee Artificial Intelligence, Turin, Italy


                                                               Abstract
                                                               This paper is part of a series of papers showing how newly defined data and software quality
                                                               measures can be described in ISO/IEC 25000 format. In the first group of papers [3], [1], [28], [2],
                                                               we discussed, with the help of some examples, the general approach to conformance when new
                                                               quality measures are defined, and in the last paper [20] how to build practical ISO/IEC 25000
                                                               compliant product quality measures for AI, starting from measures developed in several public
                                                               projects. In this paper we continue to show, through some examples, that standards and research
                                                               coming from the scientific community on the topic of AI measures can be easily accounted as
                                                               ISO/IEC 25000 measures. Moreover, the paper can serve as input to the work in the AI
                                                               standardization area.

                                                               Keywords
                                                               product quality, measures, accuracy, ISO, ISO/IEC 25059, ISO/IEC 5259, ISO/IEC 24029,
                                                               ITU-T F.748.11, metric, AI, ML, Machine Learning, Artificial Intelligence


1. Introduction

   Policy makers, industries, and academia are facing the problem of building trust in AI. In this
paper we show, following the approach of a previous paper [20], how some AI measures taken from
non-ISO standards and research literature can be accounted as ISO/IEC 25000 AI product quality
measures.
   The items considered for AI product quality measures are recalled in the following “shopping
list” (Figure 1).
   For the following, it is useful to recall the definitions given in [20].
   The implementation I is defined as a function

   1) I = I(method, algorithm(library, parameters), training(dataset, process))

where:
   ‘method’ is the high-level categorization [7], such as decision tree, k-means clustering, neural
networks,…
   ‘algorithm’ is the type of method2 (e.g. ResNet for method=NN)
   ‘library’ contains the code to be invoked for evaluation (see machine learning process in [11])
   ‘parameters’ are the configuration data of the algorithm
   ‘training’ includes dataset (ImageNet, MNIST,…) and process (initialization, retraining,…).

   Then, for the i-characteristic and the j-measure, we can define the measurement

   2) Mij = Mij(I)

and, taking into account 1):

   3) Mij = Mij(method, algorithm(library, parameters), training(dataset, process))
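The definitions 1) to 3) can be illustrated with a minimal Python sketch. It is only one possible reading, not the representation used in [20]: the class names, the library name `example_lib`, the algorithm name `example_tree`, the parameter values and the recorded accuracy values are all hypothetical and chosen for illustration. The sketch also anticipates the benchmark of expression 4), read as the best (here: highest) value of a scalar measure over K implementations.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Algorithm:
    name: str                                   # type of method, e.g. "ResNet" for a neural network
    library: str                                # code invoked for evaluation (hypothetical name below)
    parameters: Tuple[Tuple[str, object], ...]  # configuration data, kept as pairs so it is hashable


@dataclass(frozen=True)
class Training:
    dataset: str   # e.g. "ImageNet", "MNIST"
    process: str   # e.g. "initialization", "retraining"


@dataclass(frozen=True)
class Implementation:
    """Expression 1): I = I(method, algorithm(library, parameters), training(dataset, process))."""
    method: str
    algorithm: Algorithm
    training: Training


# Expression 2): a measure M_ij is modelled as a function of the implementation I.
Measure = Callable[[Implementation], float]


def benchmark(measure: Measure, implementations: Sequence[Implementation]) -> float:
    """Expression 4), assuming a scalar measure where a higher value is better."""
    return max(measure(impl) for impl in implementations)


resnet = Implementation(
    method="neural network",
    algorithm=Algorithm("ResNet", "example_lib", (("layers", 50),)),
    training=Training("MNIST", "initialization"),
)
tree = Implementation(
    method="decision tree",
    algorithm=Algorithm("example_tree", "example_lib", (("max_depth", 8),)),
    training=Training("MNIST", "initialization"),
)

# Made-up measurement results, standing in for a real evaluation run.
recorded_accuracy = {resnet: 0.99, tree: 0.88}


def accuracy(impl: Implementation) -> float:
    """A toy measure M_ij(I): looks up the recorded value for this implementation."""
    return recorded_accuracy[impl]


best = benchmark(accuracy, [resnet, tree])
print(best)
```

Keeping the implementation descriptor immutable means that two measurements are comparable only when every item of the “shopping list” is identical, which is the point of expression 3).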


                          Proceedings of 5th International Workshop on Experience with
                          SQuaRE series and its Future Direction IWESQ@APSEC,
                          December 4th, 2023, Seoul, Korea
                          EMAIL: andrea.trenta@dataqualitylab.it
                                                           © 2023 Copyright for this paper by its authors. Use permitted under Creative
                                                           Commons License Attribution 4.0 International (CC BY 4.0).
                                                           CEUR Workshop Proceedings (CEUR-WS.org)
2 Note: by ‘algorithm’ we mean the categorization of the code that performs the task, e.g. for the
classification task, the ‘algorithm’ can be a neural network, a decision tree, a support vector
machine, or other.

   Figure 1 AI quality measure shopping list (UML-like)

   With those definitions, the benchmark Bij is the best value of Mij over a given period (e.g. a full
year) for the i-characteristic and the j-measure3 among all the K implementations Ik:

   4) Bij = maxk Mij(Ik), k = 1,..,K

3 In 4) the j-measure is assumed to be scalar; if the j-measure is a vector or a matrix, expression
4) should be adapted.

   Starting from those definitions, we map some existing measures to ISO 25000 measures when 1)
holds. In the following, we pick those existing measures from

A. ROC curve metric [24],
B. Recommendation ITU-T F.748.11 [21],
C. Holistic Evaluation of Language Models [19]

and explain how they can be accounted as ISO/IEC 25000 measures, and we make some further
considerations about the perspectives of the ongoing standardization work in the relevant bodies
on the topic of AI product4 evaluation and assessment.

4 Note: the topic of product measurement is distinct from the topic of process measurement.

2. AI Standardization (2023 UPDATE)

   Policy makers have addressed the issue of AI trustworthiness mainly, but not only, to the
international standardization body ISO/IEC SC42 and to the European standardization body
CEN/CENELEC JTC21, which are in charge of drafting technical standards in support of industry and
of legal rules. For the scope of this paper, we consider, among others, the standards based on the
ISO 25000 series that define or contribute to defining product quality for an AI product [8]. The
assessment of product quality, possibly together with the assessment of process quality [9], will
be performed in the near future on a voluntary or mandatory basis, in the former case to promote
trustworthiness in AI systems, in the latter case to achieve compliance with rules [10]. In the
following, we focus on ML-based AI systems [4].
   The work of ISO/IEC SC42 in recent years has given birth to a set of standards on AI that cover
topics such as quality, testing, risk, management system, data, and application, according to the
non-official scheme of Figure 2.
   It is to be noted that SC42 has developed and is developing extensions [5], [23] to standards of
the ISO/IEC 25000 series, and this appears at the moment to be the most mature approach to AI
product evaluation, as it relies on the core SQuaRE standards developed since 2008. Indeed,
ISO/IEC 25000 itself foresees the possibility of extending the model to specific technologies like
AI, through the definition of new characteristics and new measures. This view and its reasons are
also well explained in the ISO/IEC news given in
https://www.iec.ch/blog/new-international-standard-ensuring-quality-ai-systems.
   At the moment, the ISO 25000 extensions for AI are the technical specification for AI product
quality evaluation [23], which is under development, and the quality model for AI already
published [5], which is to be read in conjunction with [23].
   The considerations of this paper support the current set of ISO standards.

   Figure 2 Non-official AI ISO standards by topics

3. Example: AUC (Area Under Receiver Operating Characteristic Curve)

   A receiver operating characteristic (ROC) curve is a graphical method for displaying true
positive rates and false positive rates across multiple thresholds from a binary classifier [7].
   To express performance across all thresholds, the area under the receiver operating
characteristic curve (AUC) can be calculated. Higher AUCs indicate more robust performance,
ranging from 0 (worst) to 1 (best). Classifiers that perform no better than chance will have an
AUC of 0.5.

   Table 1 Accuracy measure – AUC in ISO 25000 format

   AUC is an example of how statistical methods for assessing NNs can be accounted as ISO 25000
quality measures (Table 1).
   In conclusion, an AI measure like AUC, well known in scientific literature and classified
according to [24] into the category of statistical methods, can be represented in an ISO/IEC 25000
format5.

5 For the scope of this paper, we do not discuss the characteristic to which the measure of Table 1
refers; as a hypothesis, it could refer to Functional correctness.

4. Example: Rec. ITU-T F.748.11

   Rec. F.748.11 [21] proposes, among others, metrics for AI applications.
   The approach is to define benchmarks, as the history of processor evolution, which involves
architecture, clock, energy consumption, and more parameters, has shown that it is impossible to
make comparisons without a common challenging metric, e.g. FLOP/s. In the same manner, for each
triplet composed of
   • Application (e.g. Image classification, Speech recognition,..),
   • Dataset (e.g. Imagenet, LibriSpeech,..),
   • ML model (e.g. ResNet, DeepSpeech2,..),
the benchmark
   • Accuracy
is defined, with a specific metric for each triplet (e.g. Word Error Rate for Speech recognition
implemented with DeepSpeech2 and tested against the LibriSpeech dataset).
   The ML model is further detailed with neuron layers, input size and source code, e.g.
   • ML model detailed (e.g. ResNet_50)
   • ML model source code (e.g. https://github.com/KaimingHe/deep-residual-networks)
   This corresponds to use case 1, Accuracy, shown in [20] and can be represented as in Table 2 for
the triplet Image Classification, Imagenet, ResNet, implemented with ResNet_50 with source code
https://github.com/KaimingHe/deep-residual-networks.

   Table 2 Accuracy measure – Prediction accuracy in ISO 25000 format

   So, we can conclude that ITU-T F.748.11 [21] measures can be accounted as ISO/IEC 25000
conforming measures.

5. Example: LLM (Large Language Models)

   In this clause we try to show how the measures performed in the research [19] can be accounted
as ISO/IEC 25000 measures. To do this, we note that in [19] the following criteria are applied,
which are the same criteria used for defining ISO 25000 compliant measures [20].
   Firstly, [19] taxonomizes the LLM applications, as proposed in [20].
   Secondly, the characteristics of the model are identified; new characteristics are introduced
(calibration, toxicity) that are not present in models [5], [6] but can be handled as ISO 25000
conforming [27].
   Thirdly, the measures contain the same description used in [20]:

   Mij = Mij(method, algorithm(library, parameters), training(dataset, process))

   With those considerations, it is easy to identify the full description of the measures
according to the ISO/IEC 25000 format.
   For example, we consider the measure of detection of toxic text6 and draw Table 3.

   Table 3 Toxicity - Detection of toxic text in ISO 25000 format

6 In NLP applications, there is the general task of text classification, and within it there is the
specific task for the machine to detect prompts with toxic text (e.g. biased questions, hate
speech,…).

   In conclusion, the measures7 for LLMs presented in [19] can also be mapped to ISO 25000 quality
measures.

7 For the scope of this paper, we do not discuss the characteristic to which the measure of Table 3
refers; as a hypothesis, it could refer to Functional correctness.

6. Measurements in Operation

   As highlighted in [25], when an ML system implemented with neural networks uses continuous
learning, its hyperparameters are evolving, and the measurement of characteristics of the NN can
differ (and be assessed as worse or better) from the measurement taken in the initial state. This
is also the reason why AI medical devices are deployed and sold as “frozen” [26], giving a
guarantee to the user-buyer that the behaviour of the ML will be the same all the time.
   Anyway, additional requirements (e.g. operational performance not worse than tested ones) and
measurements can be satisfied, so
enlarging the field of evaluation, both over time and over post-training data, and performing a
further assessment of the ML in operational mode.

7. Formal Methods

   It is to be noted that the scientific community is aware of the need for an a-priori guarantee
of the robustness of NNs: many papers ([12], [13], [14], [15], [16], [17], [18]) contain the word
“certification” or “formal guarantees” or “verification” or “provably”, as they research the proof
of a target performance.
   To understand how the topic is presently addressed and to complete the landscape of AI
measures, we recall the approach represented by formal methods.
   According to the classification given in [24], formal methods can successfully answer the
question whether or not, for a given neural network, input and output (e.g. input image of an
airplane and output label “airplane”), a modification of the input leads to the same output or a
different one (e.g. input image of an airplane with noise, output label “helicopter”). This
question can be formulated as a formal mathematical statement that is either true or false for a
given neural network and image.
   Research has proven that, for neural networks using the piecewise linear activation function
(ReLU), it is possible to measure robustness in terms of a lower bound of the minimal adversarial
distortion for given input data points [13]. The results for ReLU were subsequently extended to
NNs with common activation functions like sigmoid [13]. The results of various research on this
topic are summarized in §6.2 of [25], which supports formal methods as an engineering or quality
evaluation approach for some NNs and characteristics.
   Even if formal methods are at the moment considered [25] a quality approach complementary to
ISO/IEC 25000, we could consider the math function that defines distortion bounds8 as a SQuaRE
measurement function and then account any formal method as an ISO/IEC 25000 measure.

8 Formal methods are based on the theorem that, under certain conditions, there exist upper and
lower bounds for an m-layer neural network function f with $n_m$ neurons, so that for
$\forall j \in [n_m]$ and $\forall x$ with $|x - x_0| \le \varepsilon$ the following holds:

   $f_j^L(x) \le f_j(x) \le f_j^U(x)$

where x is the perturbed input vector, centered in the reference data point $x_0$ and bounded in a
sphere of radius ε.

8. Proposal

   The proposal in this paper completes the proposal in [20]; there we showed how to design and
document a product quality measure that includes algorithm, training dataset, library code and
parameters; here we show, in a sort of reverse engineering, how to account and represent measures
from standards and scientific literature in the ISO/IEC 25000 format.
   Finally, some investigation areas (formal methods and operational measures) and relevant
considerations are presented in the perspective of an even wider application of the proposals of
the present paper and of [20].

9. Conclusion

   The role of ISO/IEC 25000 in measuring and assessing AI product quality is widely recognized
and of growing interest.
   A further confirmation comes from the similarity between measurement methods developed in
scientific literature and projects and the ISO 25000 conforming measurement method, as shown in
[20] and in this paper.
   The paper analyzed this similarity and came to the conclusion that most of the measurement
methods used for AI can be easily mapped into ISO 25000 format.

10. References

[1] A. Trenta, Data bias measurement: a geometrical approach through frames, Proceedings of
    IWESQ@APSEC 2021. URL: http://ceur-ws.org/Vol-3114/
[2] A. Trenta, ISO/IEC 25000 quality measures for A.I.: a geometrical approach, Proceedings of
    IWESQ@APSEC 2020. URL: http://ceur-ws.org/Vol-2800/
[3] D. Natale, A. Trenta, Examples of practical use of ISO/IEC 25000, Proceedings of
    IWESQ@APSEC 2019. URL: http://ceur-ws.org/Vol-2545/
[4] International Organization for Standardization, ISO/IEC 22989:2022 Information technology —
    Artificial
    intelligence — Artificial intelligence concepts and terminology. URL:
    https://www.iso.org/standard/74296.html
[5] International Organization for Standardization, ISO/IEC DIS 25059 Software engineering —
    Systems and software Quality Requirements and Evaluation (SQuaRE) — Quality Model for AI-based
    systems. URL: https://www.iso.org/standard/80655.html
[6] International Organization for Standardization, ISO/IEC CD 5259-2 (under development)
    Artificial intelligence — Data quality for analytics and ML — Part 2: Data quality measures.
    URL: https://www.iso.org/standard/81860.html
[7] International Organization for Standardization, ISO/IEC 23053:2022 Framework for Artificial
    Intelligence (AI) Systems Using Machine Learning (ML). URL:
    https://www.iso.org/standard/74438.html
[8] D. Natale, Extensions of ISO/IEC 25000 quality models to the context of Artificial
    Intelligence, Proceedings of IWESQ@APSEC 2022. To appear.
[9] International Organization for Standardization, ISO/IEC 42001 (draft) Information technology —
    Artificial intelligence — Management system. URL: https://www.iso.org/standard/81230.html
[10] European Commission, COM/2021/206 ‘Proposal for a regulation of the european parliament and
     of the council laying down harmonised rules on artificial intelligence (artificial
     intelligence act) and amending certain union legislative acts’, 2021. URL:
     https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52021PC0206
[11] International Organization for Standardization, ISO/IEC 23053:2022 Framework for Artificial
     Intelligence (AI). URL: https://www.iso.org/standard/74438.html
[12] M. Hein and M. Andriushchenko, “Formal guarantees on the robustness of a classifier against
     adversarial manipulation,” in NIPS, 2017.
[13] Zhang H., Weng T.-W., Chen P.-Y., Hsieh C.-J., Daniel L. Efficient Neural Network Robustness
     Certification with General Activation Functions. Neural Information Processing Systems
     Conference. 2018, 31, 4944–4953
[14] A. Sinha, H. Namkoong, and J. Duchi, “Certifiable distributional robustness with principled
     adversarial training,” ICLR, 2018.
[15] A. Raghunathan, J. Steinhardt, and P. Liang, “Certified defenses against adversarial
     examples,” ICLR, 2018.
[16] T.-W. Weng, H. Zhang, H. Chen, Z. Song, C.-J. Hsieh, D. Boning, I. S. Dhillon, and L. Daniel,
     “Towards fast computation of certified robustness for relu networks,” ICML, 2018.
[17] T. Gehr, M. Mirman, D. Drachsler-Cohen, P. Tsankov, S. Chaudhuri, and M. Vechev, “Ai2: Safety
     and robustness certification of neural networks with abstract interpretation,” in IEEE
     Symposium on Security and Privacy (SP), vol. 00, 2018, pp. 948–963.
[18] Mirman M., Gehr T., Vechev M. Differentiable Abstract Interpretation for Provably Robust
     Neural Networks. Proceedings of the 35th International Conference on Machine Learning. 2018,
     80, 3575–3583
[19] P. Liang, R. Bommasani, T. Lee et al., Holistic Evaluation of Language Models, Stanford
     Institute for Human-Centered Artificial Intelligence (HAI), Stanford University, 2022
[20] A. Trenta, ISO/IEC 25000 and AI Product Quality Measurement Perspectives, Proceedings APSEC
     IWESQ 2022 (https://ceur-ws.org/Vol-3356/, ISSN 1613-0073,
     https://dblp.org/db/conf/apsec/iwesq2022.html#GirirajHH22)
[21] ITU-T F.748.11 Metrics and evaluation methods for a deep neural network processor benchmark,
     2020
[22] ITU-T F.748.12 Deep learning software framework evaluation methodology, 2021
[23] International Organization for Standardization, ISO/IEC DTS 25058 Software engineering —
     Systems and software Quality Requirements and Evaluation (SQuaRE) — Guidance for quality
     evaluation of artificial intelligence (AI) systems
[24] International Organization for Standardization, ISO/IEC TR 24029-1:2021 Artificial
     Intelligence (AI) — Assessment of the robustness of neural networks — Part 1: Overview
[25] International Organization for Standardization, ISO/IEC 24029-2:2023 Artificial intelligence
     (AI) — Assessment of the robustness of neural networks — Part 2: Methodology for the use of
     formal methods
[26] M. van Hartskamp, S. Consoli et al., Artificial Intelligence in Clinical Health Care
     Applications: Viewpoint, Interactive Journal of Medical Research, 2019
[27] International Organization for Standardization, ISO/IEC DIS 25002 Systems and Software
     engineering - Systems and software Quality Requirements and Evaluation (SQuaRE) - Quality
     models overview and usage
[28] A. Simonetta, A. Trenta, M. C. Paoletti, and A. Vetrò, “Metrics for identifying bias in
     datasets,” SYSTEM, 2021.