=Paper=
{{Paper
|id=Vol-3114/paper4
|storemode=property
|title=Integrating SQuARE with ISO 31000 risk management to measure and mitigate software bias
|pdfUrl=https://ceur-ws.org/Vol-3114/paper-04.pdf
|volume=Vol-3114
|authors=Alessandro Simonetta,Antonio Vetrò,Maria Cristina Paoletti,Marco Torchiano
|dblpUrl=https://dblp.org/rec/conf/apsec/SimonettaVPT21
}}
==Integrating SQuARE with ISO 31000 risk management to measure and mitigate software bias==
Integrating SQuaRE data quality model with ISO 31000 risk management to measure and mitigate software bias

Alessandro Simonetta, Department of Enterprise Engineering, University of Rome Tor Vergata, Rome, Italy. alessandro.simonetta@gmail.com (ORCID: 0000-0003-2002-9815)
Antonio Vetrò, Dept. of Control and Computer Eng., Politecnico di Torino, Turin, Italy. antonio.vetro@polito.it (ORCID: 0000-0003-2027-3308)
Maria Cristina Paoletti, Rome, Italy. mariacristina.paoletti@gmail.com (ORCID: 0000-0001-6850-1184)
Marco Torchiano, Dept. of Control and Computer Eng., Politecnico di Torino, Turin, Italy. marco.torchiano@polito.it (ORCID: 0000-0001-5328-368X)
Abstract: In the last decades the exponential growth of available information, together with the availability of systems able to learn the knowledge that is present in the data, has pushed towards the complete automation of many decision-making processes in public and private organizations. This circumstance is posing impelling ethical and legal issues, since a large number of studies and journalistic investigations showed that software-based decisions, when based on historical data, perpetuate the same prejudices and bias existing in society, resulting in a systematic and inescapable negative impact for individuals from minorities and disadvantaged groups. The problem is so relevant that the terms data bias and algorithm ethics have become familiar not only to researchers, but also to industry leaders and policy makers. In this context, we believe that the ISO SQuaRE standard, if appropriately integrated with risk management concepts and procedures from ISO 31000, can play an important role in democratizing the innovation of software-generated decisions, by making the development of this type of software systems more socially sustainable and in line with the shared values of our societies. More in detail, we identified two additional measures: one for a quality characteristic already present in the standard (completeness) and one that extends it (balance), with the aim of highlighting information gaps or the presence of bias in the training data. Those measures serve as risk level indicators to be checked against common fairness measures that indicate the level of polarization of the software classifications/predictions. The adoption of additional features with respect to the standard broadens its scope of application, while maintaining consistency and conformity. The proposed methodology aims to find correlations between quality deficiencies and algorithm decisions, thus allowing to verify and mitigate their impact.

Keywords: ISO SQuaRE, ISO 31000, data ethics, data quality, data bias, algorithm fairness, discrimination risk

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
I. INTRODUCTION

Software nowadays replaces most human decisions in many contexts [1]; the rapid pace of innovation suggests that this phenomenon will further increase in the future [2]. This trend has been enabled by the large availability of data and of the technical means to analyze them for building the predictive, classification, and ranking models that are at the core of automated decision making (ADM) systems. The advantages of using ADM systems are evident and concern mainly scalability, efficiency, and removal of decision makers' subjectivity. However, several critical aspects have emerged: lack of accountability and transparency [3], massive use of natural resources and of low-paid or unpaid labor to build extensive training sets [4], the distortion of the public sphere of political discussion [5], and the amplification of existing inequalities in society [6]. This paper focuses on the latter problem, which occurs when automated software decisions "systematically and unfairly discriminate against certain individuals or groups of individuals in favor of others [by denying] an opportunity for a good or [assigning] an undesirable outcome to an individual or groups of individuals on grounds that are unreasonable or inappropriate" [7]. In practice, software systems may perpetuate the same bias of our societies, systematically discriminating against the weakest people and exacerbating existing inequalities [8]. A recurring cause of this phenomenon is the use of incomplete and biased data, because of errors or limitations in the data collection (e.g., under-sampling of a specific population group) or simply because the distributions of the original population are skewed. From a data engineering perspective, this translates into imbalanced data, i.e. a condition with an unequal distribution of data between the classes of a given attribute, which causes highly heterogeneous accuracy across the classifications [9][10]. Imbalanced data has long been known to be problematic in the machine learning domain [11]. In fact, imbalanced datasets may lead to imbalanced results, which in the context of ADM systems means differentiation of products, information and services based on personal characteristics. In applications such as allocation of social benefits, insurance tariffs, job profile matching, etc., such differentiations can lead to unjustified unequal treatment or discrimination.

For this reason, we maintain that imbalanced and incomplete data shall be considered as a risk factor in all the ADM systems that rely on historical data and operate on relevant aspects of the lives of individuals. Our proposal relies on the integration of the measurement principles of the ISO SQuaRE [12] with the risk management process defined in ISO 31000 [13] to assess the potential risk of discriminating software output and to take action for remediation. In the paper, we describe the theoretical foundations and we provide a workflow of activities. We believe that the approach can be useful to a variety of stakeholders for assessing the risk of discrimination, including the creators or commissioners of software systems, researchers, policymakers, regulators, and certification or audit authorities. Assessments should prompt taking appropriate action to prevent adverse effects.
II. METHODOLOGY

Figure 1 gives an overview of the proposed methodology. The process begins with the common subdivision of the original data into training and test data. At this point, it is possible to measure the quality of the training data (balance and completeness) and the fairness of the results obtained on the test data. Data balance measures extend the characteristics of the data quality model (ISO/IEC 25012), while completeness measures complement it. Data quality measures give rise to an indicator of unbalanced or incomplete data for the sensitive characteristics, which implies a risk of biased classifications by the algorithm. In this circumstance, it is necessary to also assess the fairness of the algorithms used, through the measures outlined in this paper. The presence of unfair results with respect to sensitive features, in correspondence with poor quality data, leads to the necessary data enrichment step to try to mitigate the problem. Thus, our proposed methodology is composed of two main blocks:

A. Risk analysis: measuring the risk that a training set could contain unbalanced data, integrating the SQuaRE approach with ISO 31000 risk management principles;

B. Risk evaluation: verifying that a high level of risk corresponds to unfairness and, in the positive case, enriching the original data with synthetic data to mitigate the problem.

Figure 1. The proposed methodology
A. Risk analysis: where SQuaRE and ISO 31000 meet

We integrate the SQuaRE theoretical framework with the ISO 31000 risk management principles to measure the risk that an unbalanced or incomplete training set might cause discriminating software output. Since the primary recipients of this document are the participants of the "3rd International Workshop on Experience with SQuaRE series and its Future Direction" (see footnote 1), we do not describe the standard here; however, we summarize the aspects that are most important for the scope of the paper. Firstly, we recall that SQuaRE includes quality modeling and measurement of software products (footnote 2), data and software services. According to the philosophy and organization of this family of standards, quality is categorized into one or more quantifiable characteristics and sub-characteristics. For example, the standard ISO/IEC 25010:2011 formalizes the product quality model as composed of eight characteristics, which are further subdivided into sub-characteristics. Each (sub-)characteristic relates to static properties of software and dynamic properties of the computer system (footnote 3). The ISO/IEC 25012:2008 standard on data quality has 15 characteristics: 5 of them belong to the "inherent" point of view (i.e., the quality relies only on the characteristics of the data per se), 3 of them are system-dependent (i.e., the quality depends on the characteristics of the system hosting the data and making it available), and the remaining 7 belong to both points of view. Data balance is not recognized as a characteristic of data quality in ISO/IEC 25012:2008: it is proposed here as an additional inherent characteristic. Because of its role in the generation of biased software output, data balance reflects the propagation principle of SQuaRE: the quality of the software product, service and data affects the quality in use. Therefore, evaluating and improving product/service/data quality is one means of improving the system quality in use. A simplification of this concept is the GIGO principle ("garbage in, garbage out"): data that is outdated, inaccurate and incomplete makes the output of the software unreliable. Similarly, unbalanced data will probably cause unbalanced software output, especially in the context of machine learning and AI systems trained with that data. This principle also applies to completeness, which is already an inherent characteristic of data quality in SQuaRE: in this work we propose an additional metric to those defined in ISO/IEC 25024:2015, one that is more suitable for the problem of biased software.

Footnote 1: See http://www.sic.shibaura-it.ac.jp/~tsnaka/iwesq.html
Footnote 2: A software product is a "set of computer programs, procedures, and possibly associated documentation and data" as defined in ISO/IEC 12207:1998. In SQuaRE standards, software quality stands for software product quality.
Footnote 3: A system is the "combination of interacting elements organized to achieve one or more stated purposes" (ISO/IEC 15288:2008), for example an aircraft system. It follows that a computer system is "a system containing one or more components and elements such as computers (hardware), associated software, and data", for example a conference registration system. An ADM system that determines eligibility for aid for drinking water is a software system.

To better address the problem of biased software output, we consider the measures of data balance and completeness not only as extensions of SQuaRE data quality modelling but also as risk factors. Here comes the integration of the SQuaRE theoretical and measurement framework with the ISO 31000:2018 standard for risk management. The standard defines guiding principles and a process of three phases: risk identification, risk analysis and risk evaluation. Here, we briefly describe them and specify their relation with our approach.

Risk identification refers to finding, recognizing and describing risks within a certain context and scope, and with respect to specific criteria defined prior to risk assessment. In this paper, this is implicitly contained in the motivations and in the problem formulation: it is the risk of discriminating against individuals or groups of individuals by operating software systems that automate high-stake decisions for the lives of people.

Risk analysis is the understanding of the characteristics and levels of the risk. This is the phase where measures of data balance and completeness are used as indicators, due to the propagation effect previously described.

Risk evaluation, as the last step, is the process in which the results of the analysis are taken into consideration to decide whether additional action is required. If affirmative, this process would then outline available risk treatment options and the need for conducting additional analyses. In our case, specific thresholds for the measures should be decided based on the specific prediction/classification algorithms used, the social context, the legal requirements of the domain, and other relevant factors for the case at hand. In addition to the technical actions, the process would define other types of required actions (e.g., reorganization of decision processes, communication to the public, etc.) and the actors who must undertake them.

1) Completeness measure

The proposed completeness measure is agnostic with respect to classical ML data classification because, for our purposes, we are interested in evaluating those columns that assume values in finite and discrete intervals, which we will call categorical with respect to the row data. This characteristic will allow us to consider the set of their values as the digits constituting a number in a variable-base numbering system. The idea of the present study is based on the principle that a learning system provides predictions consistent with the data with which it has been trained. Therefore, if it is fed with non-homogeneous data, it will provide unbalanced and discriminatory predictions with respect to reality. For this reason, the methodology we propose starts with the analysis of the reality of interest and of the dataset, an activity that must be carried out even before starting the pre-training phase, in line with previous studies where some of the authors proposed the use of balance measures in automated decision making systems [14][15][16][17]. In particular, during this phase it is necessary to identify all the independent columns that define whether an instance belongs to a class or category. Suppose we have a structured dataset as follows:

DS = { C0, C1, ..., Cn-1 }     (1)

Indicating with the set S the positions of the columns categorising the instances, functionally independent of the other columns in the dataset:

S ⊆ { 0, 1, ..., n-1 }, dim(S) = m, m ≤ n     (2)

we can analyze the new dataset consisting of the columns CS(j) with j ∈ [0, m-1].

Having said that, we can decide to use two different notions of completeness: maximum or minimum. In the first case, the presence in the dataset of a greater number of distinct instances that belong to the same categorising classes constitutes a constraint for all the other instances of the dataset; that is, one must ensure the same number of replicas of each distinct combination of classes. In the second case, instead, it is sufficient to have at least one instance for each possible combination of distinct classes. For simplicity, but without loss of generality of the procedure, we will explore the minimum completeness of the dataset, reducing the dataset to just the columns CS(j) and removing duplicate rows. We will use the Python language to explicate the calculation formulas and make the underlying mathematical logic less abstract. Python has the pandas library, which makes it possible to carry out analysis and data manipulation in a fast, powerful, flexible and easy-to-use manner. Through the DataFrame class it is possible to load a data frame from a simple CSV file:

import pandas as pd
df = pd.read_csv()

The ideal value of minimum completeness for the combinatorial metric is reached when the dataset contains at least one instance for each distinct combination of categories. The absence of some combination would create a lack of information that we do not want to exist. To calculate the total number of distinct combinations, we compute the product of the number of distinct values per single category:

k = ( df['CS0'].unique().size *
      df['CS1'].unique().size * ... *
      df['CSm-1'].unique().size )

On the other hand, the dataset only contains the characterising columns, so we can derive the true number of distinct instances in order to determine how far the data deviates from the ideal case:

len(df.drop_duplicates()) / k

The value for maximum completeness is calculated from the maximum number of duplicates of the same combination of characterising columns. For this reason it is necessary to keep in the dataset, in addition to the columns CS(j), an identification field that discriminates rows with the same values in these columns. Once the maximum number of duplications (M) has been determined, this multiplication factor is extended to all other classes to obtain the potential total:

M = df.groupby(['CS0', ..., 'CSm-1']).size().reset_index(name='counts').counts.max()

len(df) / (M * k)
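To make the combinatorial metric concrete, the following is a minimal self-contained sketch (the column names and values are hypothetical, not taken from a real dataset) that computes both notions of completeness for two categorising columns:

import pandas as pd

# Hypothetical dataset with two categorising (sensitive) columns.
df = pd.DataFrame({
    'gender': ['F', 'F', 'M', 'M', 'M', 'F', 'F', 'F'],
    'age_range': ['<18', '19-35', '19-35', '36-50', '36-50', '36-50', '<18', '19-35'],
})
cat_cols = ['gender', 'age_range']

# k: number of possible distinct combinations of the categorising columns.
k = 1
for col in cat_cols:
    k *= df[col].unique().size

# Minimum completeness: share of the possible combinations actually present.
min_completeness = len(df[cat_cols].drop_duplicates()) / k

# Maximum completeness: M is the frequency of the most replicated combination;
# the ideal dataset would contain M rows for each of the k combinations.
M = df.groupby(cat_cols).size().max()
max_completeness = len(df) / (M * k)

print(k, min_completeness, max_completeness)  # 6, 5/6 ~ 0.83, 8/12 ~ 0.67

A value of 1 on both measures would indicate that every combination of the sensitive attributes is present (minimum completeness) and equally replicated (maximum completeness).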
2) Balance measures

Since imbalance is defined as an unequal distribution between classes [9], we focus on categorical data. In fact, most sensitive attributes are categorical data, such as gender, hometown, marital status, and job. Alternatively, if they are numeric, they are either discrete and within a short range, such as family size, or they are continuous but often re-conducted to distinct categories, such as information on "age", which is often discretized into ranges such as "< 18", "19-35", "36-50", "51-65", etc. We show two examples of measures in Table 1, retrieved from the literature of the social and natural sciences, where imbalance is known in terms of (lack of) heterogeneity and diversity. They are normalized in the range 0-1, where 1 corresponds to maximum balance and 0 to minimum balance, i.e. imbalance. Hence, lower levels of the balance measures mean a higher risk of bias in the software output.

Table 1. Examples of measures of balance

Gini index
  Formula: G = 1 - Σ_{i=1..m} f_i^2
  Normalized formula: G_n = m/(m-1) * (1 - Σ_{i=1..m} f_i^2)
  Notes: m is the number of classes; f_i = n_i / Σ_{i=1..m} n_i is the relative frequency of each class (n_i = absolute frequency). The higher G and G_n, the higher the heterogeneity: categories have similar frequencies. The lower the index, the lower the heterogeneity: a few classes account for the majority of instances.

Simpson index
  Formula: D = 1 / Σ_{i=1..m} f_i^2
  Normalized formula: D_n = 1/(m-1) * (1/Σ_{i=1..m} f_i^2 - 1)
  Notes: for m, f_i and n_i see the Gini index. Higher values of D and D_n indicate higher diversity in terms of the probability of belonging to different classes. The lower the index, the lower the diversity, because frequencies are concentrated in a few classes.
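As an illustration of Table 1, a small sketch of how the normalized Gini and Simpson indices could be computed with pandas for a single categorical column (the example series is made up):

import pandas as pd

def gini_normalized(series: pd.Series) -> float:
    # Relative frequencies f_i of the m classes.
    f = series.value_counts(normalize=True)
    m = f.size
    if m <= 1:
        return 0.0  # a single class means no heterogeneity at all
    return (m / (m - 1)) * (1 - (f ** 2).sum())

def simpson_normalized(series: pd.Series) -> float:
    f = series.value_counts(normalize=True)
    m = f.size
    if m <= 1:
        return 0.0
    return (1 / (m - 1)) * (1 / (f ** 2).sum() - 1)

# Strongly imbalanced sensitive attribute: 9 'F' vs 1 'M'.
col = pd.Series(['F'] * 9 + ['M'])
print(gini_normalized(col), simpson_normalized(col))  # ~0.36 and ~0.22: high risk of bias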
B. Risk evaluation with fairness measures

The majority of fairness measures in the machine learning literature rely on the comparison of accuracy, computed for each population group of interest [18]. For computing the accuracy, two different approaches can be adopted: the first attempts to measure the intensity of the errors, i.e. the deviation between prediction and actual value (precision), while the other measures the general direction of the error. Indicating with e_i the i-th error, and with f_i and d_i respectively the i-th forecast and demand, we have:

e_i = f_i - d_i     (3)

At this point we can add up all the errors with their sign and find the average error:

average error = (1/n) Σ_i e_i     (4)

However, this measure is very crude because error compensation phenomena may be present, so it is generally preferred to use the mean absolute error or the square root of the mean square error:

MAE = (1/n) Σ_i |e_i|     (5)

RMSE = sqrt( (1/n) Σ_i e_i^2 )     (6)

RMSE is sensitive to large errors, while from this point of view MAE is fairer because it considers all errors at the same level. Moreover, if our prediction tends to the median it will get a good value of MAE; vice versa, if it approaches the mean it will get a better result on RMSE.

Under conditions where the median is lower than the mean, for example in processes where there are peaks of demand compared to normal steady-state operation, it will not be convenient to use MAE, which will introduce a bias, while it will be more convenient to use RMSE. Things are reversed if outliers are present in the distribution, as MAE is less sensitive to them than RMSE.
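As a small numeric sketch of equations (3)-(6) (the forecast and demand values are made up):

import numpy as np

forecast = np.array([10.0, 12.0, 9.0, 15.0])  # f_i: predicted values
demand = np.array([11.0, 10.0, 9.0, 20.0])    # d_i: actual values

errors = forecast - demand                    # e_i, eq. (3)
average_error = errors.mean()                 # eq. (4): signed errors may cancel out
mae = np.abs(errors).mean()                   # eq. (5)
rmse = np.sqrt((errors ** 2).mean())          # eq. (6): penalizes large errors more

print(average_error, mae, rmse)               # -1.0, 2.0, ~2.74

Note how the single large error (-5) pushes RMSE above MAE, in line with the sensitivity discussion above.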
To measure model performance, one can choose to measure the error with one or more KPIs.

In the case of classification algorithms, one can instead use the confusion matrix, which allows computing the number of true positives (TPs), true negatives (TNs), false positives (FPs) and false negatives (FNs):

P = [ p_11 ... p_1n ; ... ; p_n1 ... p_nn ]     (7)

The following equations can be used to calculate these values for each class i:

TP(i) = p_ii     (8)

FP(i) = Σ_{k=1..n, k≠i} p_ki     (9)

FN(i) = Σ_{k=1..n, k≠i} p_ik     (10)

TN(i) = Σ_{k=1..n, k≠i} p_kk     (11)
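The per-class counts of equations (8)-(11) can be derived from the confusion matrix with a few NumPy operations; the following sketch uses a hypothetical 3-class matrix, with rows as actual classes and columns as predicted classes:

import numpy as np

# Hypothetical 3-class confusion matrix P: rows = actual class, columns = predicted class.
P = np.array([[50, 3, 2],
              [4, 40, 6],
              [1, 5, 45]])

def per_class_counts(P, i):
    tp = P[i, i]                   # eq. (8)
    fp = P[:, i].sum() - P[i, i]   # eq. (9): column i without the diagonal element
    fn = P[i, :].sum() - P[i, i]   # eq. (10): row i without the diagonal element
    tn = np.trace(P) - P[i, i]     # eq. (11): correct predictions of the other classes
    return tp, fp, fn, tn

for i in range(P.shape[0]):
    print(i, per_class_counts(P, i))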
At this point, these values can be computed for each population subgroup (e.g., "Asian" vs "Caucasian" vs "African-American", or "Male" vs "Female", etc.), and the same applies to the concepts of precision, recall, and accuracy, known from the literature and reported here:

Precision = TP / (TP + FP)     (12)

Recall = TP / (TP + FN)     (13)

Accuracy = (TP + TN) / (TP + TN + FP + FN)     (14)
Fairness measures should then be compared with appropriate thresholds, selected with respect to the social context in which the software application is used. If the unfairness is higher than the maximum allowed thresholds, then the original dataset should be integrated with synthetic data to mitigate the problem. One way to repopulate the dataset without causing distortion in the data is to add replicas of data selected at random from the same set (known as bootstrapping [19]); other rebalancing techniques have been proposed in the literature (e.g. SMOTE [20], ROSE [21]).
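A minimal sketch of the bootstrap-style enrichment mentioned above, assuming a hypothetical training set imbalanced on a single sensitive attribute (SMOTE and ROSE would instead require dedicated libraries):

import pandas as pd

# Hypothetical training set, imbalanced on the 'gender' attribute (2 'F' vs 8 'M').
train = pd.DataFrame({
    'gender': ['F'] * 2 + ['M'] * 8,
    'income': [30, 35, 40, 42, 38, 50, 45, 47, 52, 41],
})

# Resample each under-represented group with replacement (bootstrap)
# until it reaches the size of the largest group.
target_size = train['gender'].value_counts().max()
parts = []
for _, group in train.groupby('gender'):
    if len(group) < target_size:
        replicas = group.sample(n=target_size - len(group), replace=True, random_state=42)
        group = pd.concat([group, replicas])
    parts.append(group)

balanced = pd.concat(parts).reset_index(drop=True)
print(balanced['gender'].value_counts())  # both groups now have 8 rows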
III. RELATION WITH LITERATURE AND OUR PAST STUDIES

An approach similar to ours is the work of Takashi Matsumoto and Arisa Ema [22], who proposed a risk chain model (RCM) for risk reduction in Artificial Intelligence services: the authors consider both data quality and data imbalance as risk factors. Our work can be easily integrated into the RCM framework, because we offer a quantitative way to measure balance and completeness, and because it is natively related to the ISO/IEC standards on data quality requirements and risk management.

Other approaches that can be connected to ours go in the direction of labeling datasets: for example, "The Dataset Nutrition Label Project" (see footnote 4) aims to identify the "key ingredients" of a dataset, such as provenance, populations, and missing data. The label takes the form of an interactive visualization that allows for exploring the previously mentioned aspects and spotting flawed, incomplete, or problematic data. One of the authors of this paper took inspiration from that study in previous works on "Ethically and socially-aware labeling" [16] and on a data annotation and visualization schema based on Bayesian statistical inference [17], always for the purpose of warning about the risk of discriminatory outcomes due to poor quality of datasets. We started from that experience to conduct preliminary case studies on the reliability of the balance measures [14][15]: in this work we continue in that direction by adding a measure of completeness and proposing an explicit workflow of activities for the combination of SQuaRE with ISO 31000.

Footnote 4: It is a joint initiative of the MIT Media Lab and the Berkman Klein Center at Harvard University: https://datanutrition.org/.
IV. CONCLUSION AND FUTURE WORK

We propose a methodology that integrates the SQuaRE measurement framework with the ISO 31000 process, with the goal of evaluating balance and completeness in a dataset as risk factors of discriminatory outputs of software systems. We believe that the methodology can be a useful instrument for all the actors involved in the development and regulation of ADM systems and, from a more general perspective, that it can play an important role in the collective attempt to place democratic control on the development of these systems, which should be more accountable and less harmful than they are now. In fact, the adverse effects of ADM systems are posing a significant danger for human rights and freedoms as our societies increasingly rely on automated decision making. It must be stressed that this work is still at a prototypical stage and further studies are necessary to improve the methodology and to assess the reliability of the proposed measures, for example to find meaningful risk thresholds in relation to the context of use and the severity of the impact on individuals. The current paper is also a way to seek engagement from other researchers in a community effort to test the workflow in real settings, improve it, and build an open registry of additional measures combined with evaluation benchmarks. Finally, we are conscious that technical adjustments are not enough, and that they should be integrated with other types of actions because of the socio-technical nature of the problem.

REFERENCES

[1] F. Chiusi, S. Fischer, N. Kayser-Bril, and M. Spielkamp, "Automating Society Report 2020," Berlin, Oct. 2020. Accessed: Nov. 10, 2020. [Online]. Available: https://automatingsociety.algorithmwatch.org
[2] E. Brynjolfsson and A. McAfee, The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies, Reprint edition. New York; London: W. W. Norton & Company, 2016.
[3] F. Pasquale, The Black Box Society: The Secret Algorithms That Control Money and Information. Cambridge: Harvard University Press, 2015.
[4] K. Crawford, Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence. New Haven: Yale University Press, 2021.
[5] P. N. Howard, Lie Machines: How to Save Democracy from Troll Armies, Deceitful Robots, Junk News Operations, and Political Operatives. New Haven; London: Yale University Press, 2020.
[6] V. Eubanks, Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor. New York, NY: St. Martin's Press, 2018.
[7] B. Friedman and H. Nissenbaum, "Bias in Computer Systems," ACM Trans. Inf. Syst., vol. 14, no. 3, pp. 330–347, Jul. 1996, doi: 10.1145/230538.230561.
[8] C. O'Neil, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy, Reprint edition. New York: Broadway Books, 2017.
[9] H. He and E. A. Garcia, "Learning from Imbalanced Data," IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263–1284, Sep. 2009, doi: 10.1109/TKDE.2008.239.
[10] B. Krawczyk, "Learning from imbalanced data: open challenges and future directions," Prog. Artif. Intell., vol. 5, no. 4, pp. 221–232, Nov. 2016, doi: 10.1007/s13748-016-0094-0.
[11] N. Japkowicz and S. Stephen, "The class imbalance problem: A systematic study," Intell. Data Anal., vol. 6, no. 5, pp. 429–449, Oct. 2002.
[12] International Organization for Standardization, "ISO/IEC 25000:2014 Systems and software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Guide to SQuaRE," ISO, 2014. https://www.iso.org/standard/64764.html (accessed Nov. 10, 2020).
[13] International Organization for Standardization, "ISO 31000:2018 Risk management — Guidelines," ISO, 2018. https://www.iso.org/cms/render/live/en/sites/isoorg/contents/data/standard/06/56/65694.html (accessed Nov. 10, 2020).
[14] M. Mecati, F. E. Cannavò, A. Vetrò, and M. Torchiano, "Identifying Risks in Datasets for Automated Decision-Making," in Electronic Government, Cham, 2020, pp. 332–344, doi: 10.1007/978-3-030-57599-1_25.
[15] A. Vetrò, M. Torchiano, and M. Mecati, "A data quality approach to the identification of discrimination risk in automated decision making systems," Gov. Inf. Q., p. 101619, Sep. 2021, doi: 10.1016/j.giq.2021.101619.
[16] E. Beretta, A. Vetrò, B. Lepri, and J. C. De Martin, "Ethical and Socially-Aware Data Labels," in Information Management and Big Data, Cham, 2019, pp. 320–327.
[17] E. Beretta, A. Vetrò, B. Lepri, and J. C. De Martin, "Detecting discriminatory risk through data annotation based on Bayesian inferences," in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, New York, NY, USA, Mar. 2021, pp. 794–804, doi: 10.1145/3442188.3445940.
[18] S. Barocas, M. Hardt, and A. Narayanan, Fairness and Machine Learning. fairmlbook.org, 2019.
[19] "Handling imbalanced data sets with synthetic boundary data generation using bootstrap re-sampling and AdaBoost techniques," ScienceDirect. https://doi.org/10.1016/j.patrec.2013.04.019 (accessed Sep. 18, 2021).
[20] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," J. Artif. Intell. Res., vol. 16, no. 1, pp. 321–357, Jun. 2002.
[21] G. Menardi and N. Torelli, "Training and assessing classification rules with imbalanced data," Data Min. Knowl. Discov., vol. 28, no. 1, pp. 92–122, Jan. 2014, doi: 10.1007/s10618-012-0295-5.
[22] T. Matsumoto and A. Ema, "RCModel, a Risk Chain Model for Risk Reduction in AI Services," arXiv:2007.03215 [cs], Jul. 2020. [Online]. Available: http://arxiv.org/abs/2007.03215