The SQuaRE Series as a Guarantee of Ethics in the Results of AI systems

Alessandro Simonetta^{1,2,*,†}, Maria Cristina Paoletti^{3,†} and Tsuyoshi Nakajima^{4,†}

1 Department of Enterprise Engineering, University of Rome Tor Vergata, Rome, Italy
2 Italian Space Agency, via del Politecnico snc, Rome, Italy
3 Professional Association of Italian Actuaries, Rome, Italy
4 Department of Computer Science and Engineering, Shibaura Institute of Technology, Tokyo, Japan

IWESQ 2023
* Corresponding author.
† These authors contributed equally.
alessandro.simonetta@gmail.com (A. Simonetta); mariacristina.paoletti@gmail.com (M. C. Paoletti)
ORCID: 0000-0003-2002-9815 (A. Simonetta); 0000-0001-6850-1184 (M. C. Paoletti); 0000-0002-9721-4763 (T. Nakajima)

Abstract
AI is an enabling technology that can be applied in many fields with impressive results. Its adoption, however, carries risk factors that can be mitigated through the adoption of quality standards. It is no coincidence that the new ISO/IEC 25059 includes a specific quality model for AI systems. This article describes a research approach that aims to prevent a lack of quality in training data from propagating into the deductions of an AI system. The approach is based on the concept of completeness from ISO/IEC 25012; it can be related to the ISO/IEC 5259-2 characteristics of diversity, representativeness, and similarity for the evaluation of input datasets, and to ISO/IEC 25059 functional correctness for the evaluation of output results.

Keywords
Fairness, Machine Learning, Completeness, ISO/IEC 25012, Maximum Completeness, Bias, Classification

1. Introduction

The vast availability of data and tools has allowed the construction of predictive and classification models that form the foundation of Automated Decision-Making (ADM) systems. Many business decisions rely on recommendations generated by software systems, and in some cases these decisions are entirely automated. The notion that this promotes decision neutrality, because the decisions are algorithm-based, is quite prevalent. However, since the decision-making path of an AI system is heavily influenced by the data used during the learning phase, biases present in the data can transfer into the choices proposed by the system. The literature has demonstrated that AI systems trained on biased datasets can lead to situations of discrimination [1]. The risk of skewed outcomes stemming primarily from imbalanced datasets has also been studied, and it can be mitigated by the introduction of synthetic data [2]. Learning algorithms construct the model from the training data, so such disproportion can lead to conclusions that deviate from reality [3, 4]. On the other hand, in some situations it is challenging to obtain homogeneous, proportional, and, most importantly, representative data. In these cases, the ISO standards that can help us are [1]:

- ISO 31000:2018 Risk management — Guidelines [2]
- ISO/IEC 25000:2014 Systems and software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Guide to SQuaRE [3]
- ISO/IEC 27002:2022 Information security, cybersecurity and privacy protection — Information security controls [4]
- ISO/IEC DIS 5259-2 Artificial Intelligence — Data Quality for Analytics and Machine Learning (ML) — Part 2: Data Quality Measures [5]

Specifically, ISO 31000 provides risk management principles that allow for the assessment of both the risk of using incomplete data during the learning phase and the risk associated with unfair predictions [1]. Other kinds of risk, such as the ability to protect data from information leakage, are left for further study. ISO/IEC 27002 offers two possible new approaches for proactive security: threat detection and machine learning/artificial intelligence systems.
Initially, the ISO/IEC 25010 software quality model [6] did not encompass quality characteristics of AI systems. However, starting from 2023, the SQuaRE series has been enriched with a quality model for AI systems: the ISO/IEC 25059 standard. Table 1 presents the new sub-characteristics identified by the working group and their scope in relation to the original standard [7].

Table 1
AI sub-characteristics introduced by ISO/IEC 25059:2023* in relation to ISO/IEC 25010:2011

ISO/IEC 25010:2011                    ISO/IEC 25059:2023* AI sub-characteristics
4.2 Characteristics of the software product model
Functional suitability                correctness, adaptability
Usability                             controllability, transparency
Reliability                           robustness
Security                              intervenability
4.1 Characteristics of the quality in use model
Satisfaction                          transparency
Absence and mitigation of risks       ethical/social risk

* in the process of being published

2. Fairness Evaluation in ML Outputs

Evaluating fairness in machine learning models is a very sensitive and important issue. The goal is to ensure that models yield results that are independent of group membership and do not perpetuate or, in some cases, even exacerbate existing societal inequalities.

There are two different approaches: measuring the intensity of output errors or measuring the overall direction of errors. The first approach focuses on assessing disparate or unfair errors among different categories, ethnicities, or groups. The second approach evaluates whether the model tends to make errors in a particular direction or towards a specific group, ethnicity, or other sensitive attribute. Bias or fairness metrics can be used to evaluate this overall direction.

In the case of classification algorithms, the confusion matrix $P$ allows for the calculation of the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):

$$P = \begin{bmatrix} p_{11} & \dots & p_{1n} \\ \vdots & \ddots & \vdots \\ p_{n1} & \dots & p_{nn} \end{bmatrix} \quad (1)$$

$$TP(i) = p_{ii} \quad (2)$$

$$FP(i) = \sum_{k=1, k \neq i}^{n} p_{ik} \quad (3)$$

$$TN(i) = \sum_{k=1, k \neq i}^{n} p_{kk} \quad (4)$$

The concepts of precision, recall, and accuracy are well known in the literature and are reported below for the sake of completeness:

$$Accuracy = \frac{TP + TN}{TP + FP + TN + FN} \quad (5)$$

$$Precision = \frac{TP}{TP + FP} \quad (6)$$

$$Recall = \frac{TP}{TP + FN} \quad (7)$$

Accuracy is a measure of functional correctness according to ISO/IEC 25059 and ISO/IEC TS 4213.
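As an illustration of equations (1)-(7), the short Python sketch below derives per-class TP, FP, and TN counts from a multiclass confusion matrix and then computes accuracy, precision, and recall. It is only a minimal example: the helper name and the toy matrix are invented, and the FN expression is an assumption, since the text does not define it.

```python
import numpy as np

def per_class_counts(P: np.ndarray, i: int):
    """Per-class counts from a confusion matrix P, following eqs. (2)-(4).

    Here p[i, k] is read as "predicted class i, actual class k", which makes
    the row sum in eq. (3) a false-positive count. FN is not defined in the
    text; the column-based expression below is an assumption for symmetry.
    """
    tp = P[i, i]                               # eq. (2): diagonal element
    fp = P[i, :].sum() - P[i, i]               # eq. (3): row i, off-diagonal
    tn = np.trace(P) - P[i, i]                 # eq. (4): diagonal, excluding i
    fn = P[:, i].sum() - P[i, i]               # assumed: column i, off-diagonal
    return tp, fp, tn, fn

# Toy 3-class confusion matrix (invented values, for illustration only)
P = np.array([[50,  5,  2],
              [ 4, 40,  6],
              [ 3,  2, 45]])

tp, fp, tn, fn = per_class_counts(P, i=0)
accuracy  = (tp + tn) / (tp + fp + tn + fn)    # eq. (5)
precision = tp / (tp + fp)                     # eq. (6)
recall    = tp / (tp + fn)                     # eq. (7)
print(f"class 0: accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```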
3. Statistical Evaluation Methods on Output

In a classification or decision scenario, statistical criteria allow us to evaluate discrimination in terms of statistical expressions involving the random variables A (sensitive attribute), Y (target variable), and R (the classifier or score). It is therefore easy to determine whether a criterion is satisfied or not by calculating the joint distribution of these random variables. Starting from the definition of independence introduced in [8], for there to be independence between two values of the sensitive attribute we need to verify that the probability of a positive outcome is the same in both cases $a_i$ and $a_j$:

$$P(R = 1 \mid A = a_i) = P(R = 1 \mid A = a_j) \quad (8)$$

According to this hypothesis, the ideal case of perfect fairness occurs when the probabilities have the same value. As a consequence, a measure of non-independence is obtained by calculating the distance between the two values, which is zero in the ideal case of complete independence:

$$\mathfrak{U}(a_i, a_j) = |P(R = 1 \mid A = a_i) - P(R = 1 \mid A = a_j)| \quad (9)$$

Table 2 shows the calculation of these probabilities for the well-known Compas dataset [9], in which the ML system incorrectly predicted a higher degree of recidivism among African-American detainees.

Table 2
Probability for Sensitive Attribute Race

A = a_i              P(R = 1 | A = a_i)   Centroid
Caucasian            0.33
Hispanic             0.28                 0.26
Other                0.20
Asian                0.23
African-American     0.58                 0.65
Native-American      0.73

In the case of the Compas dataset, the joint probabilities cluster around two centroids, which supports the choice of these two points as representative of the two treatment groups. In fact, if the probability values cluster into subsets, this signifies fair independence within each group and, conversely, inequity between groups. If the distribution of probability values is nearly uniform and it is not possible to identify distinct groups, or if the number of groups is greater than two, the independence measure can be calculated as the average of the pairwise distances:

$$\mathfrak{U}(a_1, \dots, a_m) = \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \mathfrak{U}(a_i, a_j) \quad (10)$$

In the literature there are various clustering algorithms, with k-means and DBSCAN being used in [10]. A different approach to measuring fairness corresponds to the maximum disproportion in the values of the probabilities (range or variability interval): instead of measuring the distances between probabilities belonging to groups, we can calculate the difference between the maximum and minimum values (MinMax algorithm). What has been discussed so far is applicable, without loss of generality, to other fairness measures such as separation, sufficiency, and overall accuracy equality.

In all of these cases, the researcher is interested in identifying the presence of unfairness in a sensitive attribute A and assessing its magnitude based on a value within the range [0, 1]. However, if you calculate a fairness measure for each sensitive attribute A, you may discover that different treatment groups exist in relation to different indices. Since the original problem is to understand whether there are treatment differences in the values of sensitive attributes, rather than calculating a measure for each individual attribute we can compute fairness measures for each value of the sensitive attribute. This way, we can construct a fairness vector whose components are the fairness indices and examine the relationships between different vectors. In [9], a method was used to match treatment groups based on the Pearson correlation index.
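To make the independence criterion and its distance-based summaries concrete, the sketch below estimates P(R = 1 | A = a) for each value of a sensitive attribute and then summarizes the disparity with the pairwise-distance average of eq. (10) and with the max-min range mentioned above. It is a minimal illustration: the function names and the toy predictions and groups are invented, not taken from the paper.

```python
from itertools import combinations
import numpy as np

def group_rates(pred, attr):
    """Estimate P(R = 1 | A = a) for every value a of the sensitive attribute."""
    pred, attr = np.asarray(pred), np.asarray(attr)
    return {a: pred[attr == a].mean() for a in np.unique(attr)}

def avg_pairwise_distance(rates):
    """Average of |P(R=1|A=a_i) - P(R=1|A=a_j)| over all pairs, eq. (10)."""
    pairs = list(combinations(rates.values(), 2))
    return sum(abs(p - q) for p, q in pairs) / len(pairs)

def max_min_range(rates):
    """Maximum disproportion (range) of the per-group probabilities."""
    vals = list(rates.values())
    return max(vals) - min(vals)

# Toy data: binary predictions R and a sensitive attribute A (invented values)
pred = [1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
attr = ["a", "a", "a", "a", "b", "b", "b", "b", "c", "c", "c", "c"]

rates = group_rates(pred, attr)
print(rates)                          # {'a': 0.75, 'b': 0.5, 'c': 0.25}
print(avg_pairwise_distance(rates))   # eq. (10): average pairwise disparity
print(max_min_range(rates))           # range between most and least favoured group
```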
4. Mutual Information

The concept of mutual information allows for the measurement of the relationship between the probabilities mentioned in (9). Indeed, it measures the mutual information between A and R, which is the amount of information one random variable reveals about the other. The condition of independence between the random variables A and R, as indicated in (9), can therefore be expressed in terms of mutual information:

$$I(A, R) = H(A) + H(R) - H(A, R) \quad (11)$$

where H(R) and H(A) are the entropies associated with R and A, respectively:

$$H(R) = -\sum_{i=1}^{n} P(r_i) \log(P(r_i)) \quad (12)$$

$$H(A) = -\sum_{i=1}^{n} P(a_i) \log(P(a_i)) \quad (13)$$

The third term in equation (11) is the joint entropy:

$$H(R, A) = -\sum_{i=1}^{n} \sum_{j=1}^{m} P(r_i \cap a_j) \log(P(r_i \cap a_j)) \quad (14)$$

The other indices can also be expressed through mutual information. In particular, referring to [11] and [10], separation is calculated by:

$$I(R, A \mid Y) = H(R, Y) + H(A, Y) - H(R, Y, A) - H(Y) \quad (15)$$

sufficiency is expressed by the following equation:

$$I(Y, A \mid R) = H(Y, R) + H(A, R) - H(Y, R, A) - H(R) \quad (16)$$

and, finally, the overall accuracy equality (17) is computed by:

$$H(A, R \mid Y = R) = H(A, Y = R) + H(R, Y = R) - H(R = Y, A \mid R = Y) \quad (17)$$
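As a rough illustration of eq. (11), the sketch below estimates I(A, R) from the empirical joint distribution of a sensitive attribute and a binary prediction; a value near zero suggests empirical independence. The helper names and toy data are invented for the example, and natural logarithms are assumed since the text does not fix the base.

```python
from collections import Counter
import math

def entropy(counts):
    """Shannon entropy (natural log) of an empirical distribution given by counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

def mutual_information(a_values, r_values):
    """I(A, R) = H(A) + H(R) - H(A, R), eq. (11), from observed (A, R) pairs."""
    h_a = entropy(Counter(a_values).values())
    h_r = entropy(Counter(r_values).values())
    h_ar = entropy(Counter(zip(a_values, r_values)).values())
    return h_a + h_r - h_ar

# Toy data (invented): sensitive attribute A and model output R
A = ["x", "x", "x", "y", "y", "y", "y", "x"]
R = [1, 0, 1, 0, 0, 1, 0, 1]

print(mutual_information(A, R))   # close to 0 only if R looks independent of A
```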
5. Data Quality Measures for Input

The underlying idea of this research is to find a way to anticipate disparities in the final outcomes of an AI system by evaluating the training sets from the perspective of data quality (ISO/IEC 25012). In particular, it has been observed how the concepts of completeness, heterogeneity (Gini index), diversity (Shannon or Simpson index) and imbalance (imbalance ratio) can be used as predictive markers to highlight the risk that a data defect may propagate within the learning system.

Initially, in [12], Gini indices, imbalance ratios, and Shannon and Simpson indices were used to analyze data quality issues in the learning data. For the fairness measures, independence and separation, the latter consisting of the components True Positive Rate (TPR) and False Positive Rate (FPR), were considered, using the average of distances between probabilities (10) as the criterion for synthesizing values.

The research revealed that the Gini index has good predictive capability for low values of the TPR component of separation. The imbalance ratio indicator has good predictive capability for separation but not for independence. The Shannon index showed an acceptable level of prediction for the independence measure and an excellent one for the separation measure, but was completely ineffective for the FPR component of separation. The Simpson index did not appear to be useful as a predictive bias measure.

The results were quite encouraging, so an attempt was made to improve the approach on two fronts: the calculation method of the fairness measures and the quality index of the input data to the learning system.

Regarding the calculation method for fairness measures, the use of a central tendency index could mask compensated errors, so three different approaches were attempted: using the maximum disparity between probability values (MinMax method [13]), using the distance between groups of similar probabilities (k-means and DBSCAN), and using mutual information.

As for the quality index selected from ISO/IEC 25012, we chose the characteristic of completeness, and in particular the concept of maximum completeness as defined in [10]. The study demonstrated that the use of maximum completeness and the MinMax measurement system provided the best predictive capability for the fairness indices: independence, separation, sufficiency, and overall accuracy equality. Additionally, the MinMax technique showed better sensitivity compared to mutual information and the DBSCAN clustering system, as shown in [13].
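For readers who want to reproduce the input-side indicators named above, the sketch below computes the Gini heterogeneity index, Shannon and Simpson diversity indices, and the imbalance ratio for the value distribution of a single sensitive attribute. The formulas are the common textbook definitions, used here under the assumption that they match the indices in [12]; the function names and the sample column are illustrative only.

```python
import math
from collections import Counter

def distribution(column):
    """Relative frequencies and raw counts of the values of an attribute column."""
    counts = Counter(column)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}, counts

def gini_index(freqs):
    """Gini heterogeneity index: 1 - sum(p_i^2)."""
    return 1.0 - sum(p ** 2 for p in freqs.values())

def shannon_index(freqs):
    """Shannon diversity index: -sum(p_i * ln(p_i))."""
    return -sum(p * math.log(p) for p in freqs.values() if p > 0)

def simpson_index(freqs):
    """Inverse Simpson diversity index: 1 / sum(p_i^2) (one common variant)."""
    return 1.0 / sum(p ** 2 for p in freqs.values())

def imbalance_ratio(counts):
    """Ratio between the most and the least frequent value of the attribute."""
    return max(counts.values()) / min(counts.values())

# Illustrative sensitive-attribute column (invented values)
race = ["a"] * 60 + ["b"] * 25 + ["c"] * 10 + ["d"] * 5

freqs, counts = distribution(race)
print(f"Gini = {gini_index(freqs):.3f}, Shannon = {shannon_index(freqs):.3f}, "
      f"Simpson = {simpson_index(freqs):.3f}, IR = {imbalance_ratio(counts):.1f}")
```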
6. Conclusions

In the realm of AI systems, data governance and data quality are extremely important concepts. Since AI algorithms rely on learning datasets, the quality of the input data can impact the outcomes. In this article, we have seen how completeness can serve as a good predictor of errors in the outputs of an ML system. In this context, it is clear that the definition of guidelines for the application of data governance and data quality in AI systems is crucial. Addressing bias in the data of technological systems is a significant challenge in the digital age, as the decisions made by algorithms can have substantial societal and personal implications, which can be measured according to international ISO/IEC standards.

References

[1] A. Simonetta, A. Vetrò, M. C. Paoletti, M. Torchiano, Integrating SQuaRE data quality model with ISO 31000 risk management to measure and mitigate software bias, CEUR Workshop Proceedings 3114 (2021) pp. 17–22.
[2] International Organization for Standardization, "ISO 31000:2018(en) Risk management — Guidelines", 2018. URL: https://www.iso.org/iso-31000-risk-management.html.
[3] International Organization for Standardization, "ISO/IEC 25000:2014 Systems and software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Guide to SQuaRE", 2014. URL: https://www.iso.org/standard/64764.html.
[4] International Organization for Standardization, "ISO/IEC 27002:2022 Information security, cybersecurity and privacy protection — Information security controls", 2022. URL: https://www.iso.org/standard/75652.html.
[5] International Organization for Standardization, "ISO/IEC DIS 5259-2 Artificial intelligence — Data quality for analytics and machine learning (ML) — Part 2: Data quality measures", under development. URL: https://www.iso.org/standard/81860.html.
[6] International Organization for Standardization, "ISO/IEC 25010 Systems and software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — System and software quality models", 2011. URL: https://www.iso.org/standard/81860.html.
[7] International Organization for Standardization, "ISO/IEC 25059:2023 Software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Quality model for AI systems", 2023. URL: https://www.iso.org/standard/80655.html.
[8] S. Barocas, M. Hardt, A. Narayanan, Fairness and Machine Learning, 2020, chapter: Classification. URL: https://fairmlbook.org/.
[9] J. Larson, S. Mattu, L. Kirchner, J. Angwin, Compas recidivism dataset, 2016. URL: https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm.
[10] A. Simonetta, M. C. Paoletti, A. Venticinque, The use of maximum completeness to estimate bias in AI-based recommendation systems, CEUR Workshop Proceedings 3360 (2022) pp. 76–84.
[11] D. Steinberg, A. Reid, S. O'Callaghan, F. Lattimore, L. McCalman, T. S. Caetano, Fast fair regression via efficient approximations of mutual information, CoRR abs/2002.06200 (2020). URL: https://arxiv.org/abs/2002.06200.
[12] A. Vetrò, M. Torchiano, M. Mecati, A data quality approach to the identification of discrimination risk in automated decision making systems, Government Information Quarterly 38 (2021) 101619. URL: https://www.sciencedirect.com/science/article/pii/S0740624X21000551. doi:10.1016/j.giq.2021.101619.
[13] A. Simonetta, T. Nakajima, M. C. Paoletti, A. Venticinque, Fairness metrics and maximum completeness for the prediction of discrimination, CEUR Workshop Proceedings 3356 (2022) pp. 13–20.