The SQuaRE Series as a Guarantee of Ethics in the Results of AI systems

Alessandro Simonetta^{1,2,*,†}, Maria Cristina Paoletti^{3,†} and Tsuyoshi Nakajima^{4,†}

1 Department of Enterprise Engineering, University of Rome Tor Vergata, Rome, Italy
2 Italian Space Agency, via del Politecnico snc, Rome, Italy
3 Professional Association of Italian Actuaries, Rome, Italy
4 Department of Computer Science and Engineering, Shibaura Institute of Technology, Tokyo, Japan

IWESQ 2023
* Corresponding author.
† These authors contributed equally.
alessandro.simonetta@gmail.com (A. Simonetta); mariacristina.paoletti@gmail.com (M. C. Paoletti)
ORCID: 0000-0003-2002-9815 (A. Simonetta); 0000-0001-6850-1184 (M. C. Paoletti); 0000-0002-9721-4763 (T. Nakajima)

Abstract
AI is an enabling technology that can be applied in many fields with impressive results. Its adoption, however, carries risk factors that can be mitigated through the adoption of quality standards. It is no coincidence that the new ISO/IEC 25059 includes a specific quality model for AI systems. This article describes a research approach that aims to prevent a lack of quality in training data from propagating into the deductions of an AI system. The approach is based on the concept of completeness from ISO/IEC 25012; it can be related to the ISO/IEC 5259-2 characteristics of diversity, representativeness, and similarity for the evaluation of input datasets, and to ISO/IEC 25059 functional correctness for the evaluation of output results.

Keywords
Fairness, Machine Learning, Completeness, ISO/IEC 25012, Maximum Completeness, Bias, Classification

1. Introduction

The vast availability of data and tools has allowed the construction of predictive and classification models that form the foundation of Automated Decision-Making (ADM) systems. Many business decisions rely on recommendations generated by software systems, and in some cases these decisions are entirely automated. The notion that this promotes decision neutrality, because the decisions are algorithm-based, is quite prevalent. However, since the decision-making path of an AI system is heavily influenced by the data used during the learning phase, biases present in the data can transfer into the choices proposed by the system. The literature has demonstrated that AI systems trained on biased datasets can lead to situations of discrimination [1]. The risk of skewed outcomes stemming primarily from imbalanced datasets has also been studied, and it can be mitigated by the introduction of synthetic data [2]. Learning algorithms construct the model from the training data, so such disproportion can lead to conclusions that deviate from reality [3, 4]. On the other hand, in some situations it is challenging to obtain homogeneous, proportional, and, most importantly, representative data. In these cases, the ISO standards that can help us are [1]:

- ISO 31000:2018 Risk management — Guidelines [2]
- ISO/IEC 25000:2014 Systems and software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Guide to SQuaRE [3]
- ISO/IEC 27002:2022 Information security, cybersecurity and privacy protection — Information security controls [4]
- ISO/IEC DIS 5259-2 Artificial Intelligence — Data Quality for Analytics and Machine Learning (ML) — Part 2: Data Quality Measures [5]

Specifically, ISO 31000 provides risk management principles that allow for the assessment of both the risk of using incomplete data during the learning phase and the risk associated with unfair predictions [1]. Other kinds of risk, such as the ability to protect data from information leakage, are left for further study. ISO/IEC 27002 offers two possible new approaches for proactive security: threat detection and machine learning/artificial intelligence systems.
Initially, the ISO/IEC 25010 software quality model [6] did not encompass quality characteristics of AI systems. However, starting from 2023, the SQuaRE series has been enriched with a quality model for AI systems: the ISO/IEC 25059 standard. Table 1 presents the new sub-characteristics identified by the working group and their scope in relation to the original standard [7].

Table 1
AI sub-characteristics introduced by ISO/IEC 25059:2023* in relation to ISO/IEC 25010:2011

ISO/IEC 25010:2011                    ISO/IEC 25059:2023* AI sub-characteristics
4.2 Characteristics of the software product model
Functional suitability                correctness, adaptability
Usability                             controllability, transparency
Reliability                           robustness
Security                              intervenability
4.1 Characteristics of the quality in use model
Satisfaction                          transparency
Absence and mitigation of risks       ethical/social risk

* in the process of being published

2. Fairness Evaluation in ML Outputs

Evaluating fairness in machine learning models is a very sensitive and important issue. The goal is to ensure that models yield results that are independent of group membership and do not perpetuate or, in some cases, even exacerbate existing societal inequalities.

There are two different approaches: measuring the intensity of output errors or measuring the overall direction of errors. The first approach focuses on assessing disparate or unfair errors among different categories, ethnicities, or groups. The second approach evaluates whether the model tends to make errors in a particular direction or towards a specific group, ethnicity, or other sensitive attribute. Bias or fairness metrics can be used to evaluate this overall direction.

In the case of classification algorithms, the confusion matrix $P$ allows for the calculation of the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):

$$P = \begin{bmatrix} p_{11} & \dots & p_{1n} \\ \vdots & \ddots & \vdots \\ p_{n1} & \dots & p_{nn} \end{bmatrix} \quad (1)$$

$$TP(i) = p_{ii} \quad (2)$$

$$FP(i) = \sum_{k=1, k \neq i}^{n} p_{ik} \quad (3)$$

$$TN(i) = \sum_{k=1, k \neq i}^{n} p_{kk} \quad (4)$$

The concepts of precision, recall, and accuracy are well known in the literature and are reported below for the sake of completeness:

$$Accuracy = \frac{TP + TN}{TP + FP + TN + FN} \quad (5)$$

$$Precision = \frac{TP}{TP + FP} \quad (6)$$

$$Recall = \frac{TP}{TP + FN} \quad (7)$$

Accuracy is a measure of functional correctness according to ISO/IEC 25059 and ISO/IEC TS 4213.
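As an illustration of equations (1)-(7), the short Python sketch below derives per-class TP, FP, and TN counts from a multiclass confusion matrix and then computes accuracy, precision, and recall. It is only a minimal example: the helper name and the toy matrix are invented, and the FN expression is an assumption, since the text does not define it.

```python
import numpy as np

def per_class_counts(P: np.ndarray, i: int):
    """Per-class counts from a confusion matrix P, following eqs. (2)-(4).

    Here p[i, k] is read as "predicted class i, actual class k", which makes
    the row sum in eq. (3) a false-positive count. FN is not defined in the
    text; the column-based expression below is an assumption for symmetry.
    """
    tp = P[i, i]                               # eq. (2): diagonal element
    fp = P[i, :].sum() - P[i, i]               # eq. (3): row i, off-diagonal
    tn = np.trace(P) - P[i, i]                 # eq. (4): diagonal, excluding i
    fn = P[:, i].sum() - P[i, i]               # assumed: column i, off-diagonal
    return tp, fp, tn, fn

# Toy 3-class confusion matrix (invented values, for illustration only)
P = np.array([[50,  5,  2],
              [ 4, 40,  6],
              [ 3,  2, 45]])

tp, fp, tn, fn = per_class_counts(P, i=0)
accuracy  = (tp + tn) / (tp + fp + tn + fn)    # eq. (5)
precision = tp / (tp + fp)                     # eq. (6)
recall    = tp / (tp + fn)                     # eq. (7)
print(f"class 0: accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```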
3. Statistical Evaluation Methods on Output

In a classification or decision scenario, statistical criteria allow us to evaluate discrimination in terms of statistical expressions involving the random variables A (sensitive attribute), Y (target variable), and R (the classifier or score). It is therefore easy to determine whether a criterion is satisfied or not by calculating the joint distribution of these random variables. Starting from the definition of independence introduced in [8], for there to be independence between two values of the sensitive attribute we need to verify that the probability of a positive outcome is the same in both cases $a_i$ and $a_j$:

$$P(R = 1 \mid A = a_i) = P(R = 1 \mid A = a_j) \quad (8)$$

According to this hypothesis, the ideal case of perfect fairness occurs when the probabilities have the same value. As a consequence, a measure of non-independence is obtained by calculating the distance between the two values, which is zero in the ideal case of complete independence:

$$\mathfrak{U}(a_i, a_j) = |P(R = 1 \mid A = a_i) - P(R = 1 \mid A = a_j)| \quad (9)$$

Table 2 shows the calculation of these probabilities for the well-known Compas dataset [9], in which the ML system incorrectly predicted a higher degree of recidivism among African-American detainees.

Table 2
Probability for Sensitive Attribute Race

A = a_i              P(R = 1 | A = a_i)   Centroid
Caucasian            0.33
Hispanic             0.28                 0.26
Other                0.20
Asian                0.23
African-American     0.58                 0.65
Native-American      0.73

In the case of the Compas dataset, the joint probabilities cluster around two centroids, which supports the choice of these two points as representative of the two treatment groups. In fact, if the probability values cluster into subsets, this signifies fair independence within each group and, conversely, inequity between groups. If the distribution of probability values is nearly uniform and it is not possible to identify distinct groups, or if the number of groups is greater than two, the independence measure can be calculated as the average of the pairwise distances:

$$\mathfrak{U}(a_1, \dots, a_m) = \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \mathfrak{U}(a_i, a_j) \quad (10)$$

In the literature there are various clustering algorithms, with k-means and DBSCAN being used in [10]. A different approach to measuring fairness corresponds to the maximum disproportion in the values of the probabilities (range or variability interval): instead of measuring the distances between probabilities belonging to groups, we can calculate the difference between the maximum and minimum values (MinMax algorithm). What has been discussed so far is applicable, without loss of generality, to other fairness measures such as separation, sufficiency, and overall accuracy equality.

In all of these cases, the researcher is interested in identifying the presence of unfairness in a sensitive attribute A and assessing its magnitude based on a value within the range [0, 1]. However, if you calculate a fairness measure for each sensitive attribute A, you may discover that different treatment groups exist in relation to different indices. Since the original problem is to understand whether there are treatment differences in the values of sensitive attributes, rather than calculating a measure for each individual attribute we can compute fairness measures for each value of the sensitive attribute. This way, we can construct a fairness vector whose components are the fairness indices and examine the relationships between different vectors. In [9], a method was used to match treatment groups based on the Pearson correlation index.
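To make the independence criterion and its distance-based summaries concrete, the sketch below estimates P(R = 1 | A = a) for each value of a sensitive attribute and then summarizes the disparity with the pairwise-distance average of eq. (10) and with the max-min range mentioned above. It is a minimal illustration: the function names and the toy predictions and groups are invented, not taken from the paper.

```python
from itertools import combinations
import numpy as np

def group_rates(pred, attr):
    """Estimate P(R = 1 | A = a) for every value a of the sensitive attribute."""
    pred, attr = np.asarray(pred), np.asarray(attr)
    return {a: pred[attr == a].mean() for a in np.unique(attr)}

def avg_pairwise_distance(rates):
    """Average of |P(R=1|A=a_i) - P(R=1|A=a_j)| over all pairs, eq. (10)."""
    pairs = list(combinations(rates.values(), 2))
    return sum(abs(p - q) for p, q in pairs) / len(pairs)

def max_min_range(rates):
    """Maximum disproportion (range) of the per-group probabilities."""
    vals = list(rates.values())
    return max(vals) - min(vals)

# Toy data: binary predictions R and a sensitive attribute A (invented values)
pred = [1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
attr = ["a", "a", "a", "a", "b", "b", "b", "b", "c", "c", "c", "c"]

rates = group_rates(pred, attr)
print(rates)                          # {'a': 0.75, 'b': 0.5, 'c': 0.25}
print(avg_pairwise_distance(rates))   # eq. (10): average pairwise disparity
print(max_min_range(rates))           # range between most and least favoured group
```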
4. Mutual Information

The concept of mutual information allows for the measurement of the relationship between the probabilities mentioned in (9). Indeed, it measures the mutual information between A and R, which is the amount of information one random variable reveals about the other. The condition of independence between the random variables A and R, as indicated in (9), can therefore be expressed in terms of mutual information:

$$I(A, R) = H(A) + H(R) - H(A, R) \quad (11)$$

where H(R) and H(A) are the entropies associated with R and A, respectively:

$$H(R) = -\sum_{i=1}^{n} P(r_i) \log(P(r_i)) \quad (12)$$

$$H(A) = -\sum_{i=1}^{n} P(a_i) \log(P(a_i)) \quad (13)$$

The third term in equation (11) is the joint entropy:

$$H(R, A) = -\sum_{i=1}^{n} \sum_{j=1}^{m} P(r_i \cap a_j) \log(P(r_i \cap a_j)) \quad (14)$$

The other indices can also be expressed through mutual information. In particular, referring to [11] and [10], separation is calculated by:

$$I(R, A \mid Y) = H(R, Y) + H(A, Y) - H(R, Y, A) - H(Y) \quad (15)$$

sufficiency is expressed by the following equation:

$$I(Y, A \mid R) = H(Y, R) + H(A, R) - H(Y, R, A) - H(R) \quad (16)$$

and, finally, the overall accuracy equality (17) is computed by:

$$H(A, R \mid Y = R) = H(A, Y = R) + H(R, Y = R) - H(R = Y, A \mid R = Y) \quad (17)$$
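As a rough illustration of eq. (11), the sketch below estimates I(A, R) from the empirical joint distribution of a sensitive attribute and a binary prediction; a value near zero suggests empirical independence. The helper names and toy data are invented for the example, and natural logarithms are assumed since the text does not fix the base.

```python
from collections import Counter
import math

def entropy(counts):
    """Shannon entropy (natural log) of an empirical distribution given by counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

def mutual_information(a_values, r_values):
    """I(A, R) = H(A) + H(R) - H(A, R), eq. (11), from observed (A, R) pairs."""
    h_a = entropy(Counter(a_values).values())
    h_r = entropy(Counter(r_values).values())
    h_ar = entropy(Counter(zip(a_values, r_values)).values())
    return h_a + h_r - h_ar

# Toy data (invented): sensitive attribute A and model output R
A = ["x", "x", "x", "y", "y", "y", "y", "x"]
R = [1, 0, 1, 0, 0, 1, 0, 1]

print(mutual_information(A, R))   # close to 0 only if R looks independent of A
```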
5. Data Quality Measures for Input

The underlying idea of this research is to find a way to anticipate disparities in the final outcomes of an AI system by evaluating the training sets from the perspective of data quality (ISO/IEC 25012). In particular, it has been observed how the concepts of completeness, heterogeneity (Gini index), diversity (Shannon or Simpson index) and imbalance (imbalance ratio) can be used as predictive markers to highlight the risk that a data defect may propagate within the learning system.

Initially, in [12], Gini indices, imbalance ratios, and Shannon and Simpson indices were used to analyze data quality issues in the learning data. For the fairness measures, independence and separation, the latter consisting of the components True Positive Rate (TPR) and False Positive Rate (FPR), were considered, using the average of distances between probabilities (10) as the criterion for synthesizing values.

The research revealed that the Gini index has good predictive capability for low values of the TPR component of separation. The imbalance ratio indicator has good predictive capability for separation but not for independence. The Shannon index showed an acceptable level of prediction for the independence measure and an excellent one for the separation measure, but was completely ineffective for the FPR component of separation. The Simpson index did not appear to be useful as a predictive bias measure.

The results were quite encouraging, so an attempt was made to improve the approach on two fronts: the calculation method of the fairness measures and the quality index of the input data to the learning system.

Regarding the calculation method for fairness measures, the use of a central tendency index could mask compensated errors, so three different approaches were attempted: using the maximum disparity between probability values (MinMax method [13]), using the distance between groups of similar probabilities (k-means and DBSCAN), and using mutual information.

As for the quality index selected from ISO/IEC 25012, we chose the characteristic of completeness, and in particular the concept of maximum completeness as defined in [10]. The study demonstrated that the use of maximum completeness and the MinMax measurement system provided the best predictive capability for the fairness indices: independence, separation, sufficiency, and overall accuracy equality. Additionally, the MinMax technique showed better sensitivity compared to mutual information and the DBSCAN clustering system, as shown in [13].
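For readers who want to reproduce the input-side indicators named above, the sketch below computes the Gini heterogeneity index, Shannon and Simpson diversity indices, and the imbalance ratio for the value distribution of a single sensitive attribute. The formulas are the common textbook definitions, used here under the assumption that they match the indices in [12]; the function names and the sample column are illustrative only.

```python
import math
from collections import Counter

def distribution(column):
    """Relative frequencies and raw counts of the values of an attribute column."""
    counts = Counter(column)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}, counts

def gini_index(freqs):
    """Gini heterogeneity index: 1 - sum(p_i^2)."""
    return 1.0 - sum(p ** 2 for p in freqs.values())

def shannon_index(freqs):
    """Shannon diversity index: -sum(p_i * ln(p_i))."""
    return -sum(p * math.log(p) for p in freqs.values() if p > 0)

def simpson_index(freqs):
    """Inverse Simpson diversity index: 1 / sum(p_i^2) (one common variant)."""
    return 1.0 / sum(p ** 2 for p in freqs.values())

def imbalance_ratio(counts):
    """Ratio between the most and the least frequent value of the attribute."""
    return max(counts.values()) / min(counts.values())

# Illustrative sensitive-attribute column (invented values)
race = ["a"] * 60 + ["b"] * 25 + ["c"] * 10 + ["d"] * 5

freqs, counts = distribution(race)
print(f"Gini = {gini_index(freqs):.3f}, Shannon = {shannon_index(freqs):.3f}, "
      f"Simpson = {simpson_index(freqs):.3f}, IR = {imbalance_ratio(counts):.1f}")
```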
6. Conclusions

In the realm of AI systems, data governance and data quality are extremely important concepts. Since AI algorithms rely on learning datasets, the quality of the input data can impact the outcomes. In this article, we have seen how completeness can serve as a good predictor of errors in the outputs of an ML system. In this context, it is clear that the definition of guidelines for the application of data governance and data quality in AI systems is crucial. Addressing bias in the data of technological systems is a significant challenge in the digital age, as the decisions made by algorithms can have substantial societal and personal implications, which can be measured according to international ISO/IEC standards.

References

[1] A. Simonetta, A. Vetrò, M. C. Paoletti, M. Torchiano, Integrating SQuaRE data quality model with ISO 31000 risk management to measure and mitigate software bias, CEUR Workshop Proceedings 3114 (2021) pp. 17–22.
[2] International Organization for Standardization, "ISO 31000:2018(en) Risk management — Guidelines", 2018. URL: https://www.iso.org/iso-31000-risk-management.html.
[3] International Organization for Standardization, "ISO/IEC 25000:2014 Systems and software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Guide to SQuaRE", 2014. URL: https://www.iso.org/standard/64764.html.
[4] International Organization for Standardization, "ISO/IEC 27002:2022 Information security, cybersecurity and privacy protection — Information security controls", 2022. URL: https://www.iso.org/standard/75652.html.
[5] International Organization for Standardization, "ISO/IEC DIS 5259-2 Artificial intelligence — Data quality for analytics and machine learning (ML) — Part 2: Data quality measures", under development. URL: https://www.iso.org/standard/81860.html.
[6] International Organization for Standardization, "ISO/IEC 25010 Systems and software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — System and software quality models", 2011. URL: https://www.iso.org/standard/81860.html.
[7] International Organization for Standardization, "ISO/IEC 25059:2023 Software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Quality model for AI systems", 2023. URL: https://www.iso.org/standard/80655.html.
[8] S. Barocas, M. Hardt, A. Narayanan, Fairness and Machine Learning, 2020, chapter: Classification. URL: https://fairmlbook.org/.
[9] J. Larson, S. Mattu, L. Kirchner, J. Angwin, Compas recidivism dataset, 2016. URL: https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm.
[10] A. Simonetta, M. C. Paoletti, A. Venticinque, The use of maximum completeness to estimate bias in AI-based recommendation systems, CEUR Workshop Proceedings 3360 (2022) pp. 76–84.
[11] D. Steinberg, A. Reid, S. O'Callaghan, F. Lattimore, L. McCalman, T. S. Caetano, Fast fair regression via efficient approximations of mutual information, CoRR abs/2002.06200 (2020). URL: https://arxiv.org/abs/2002.06200.
[12] A. Vetrò, M. Torchiano, M. Mecati, A data quality approach to the identification of discrimination risk in automated decision making systems, Government Information Quarterly 38 (2021) 101619. URL: https://www.sciencedirect.com/science/article/pii/S0740624X21000551. doi:10.1016/j.giq.2021.101619.
[13] A. Simonetta, T. Nakajima, M. C. Paoletti, A. Venticinque, Fairness metrics and maximum completeness for the prediction of discrimination, CEUR Workshop Proceedings 3356 (2022) pp. 13–20.