Integrating SQuaRE data quality model with ISO 31000 risk management to measure and mitigate software bias

Alessandro Simonetta
Department of Enterprise Engineering, University of Rome Tor Vergata, Rome, Italy
alessandro.simonetta@gmail.com, ORCID: 0000-0003-2002-9815

Antonio Vetrò
Dept. of Control and Computer Eng., Politecnico di Torino, Turin, Italy
antonio.vetro@polito.it, ORCID: 0000-0003-2027-3308

Maria Cristina Paoletti
Rome, Italy
mariacristina.paoletti@gmail.com, ORCID: 0000-0001-6850-1184

Marco Torchiano
Dept. of Control and Computer Eng., Politecnico di Torino, Turin, Italy
marco.torchiano@polito.it, ORCID: 0000-0001-5328-368X

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Abstract — In the last decades, the exponential growth of available information, together with the availability of systems able to learn the knowledge present in the data, has pushed towards the complete automation of many decision-making processes in public and private organizations. This circumstance poses pressing ethical and legal issues, since a large number of studies and journalistic investigations have shown that software-based decisions, when based on historical data, perpetuate the same prejudices and biases existing in society, resulting in a systematic and inescapable negative impact on individuals from minorities and disadvantaged groups. The problem is so relevant that the terms data bias and algorithm ethics have become familiar not only to researchers, but also to industry leaders and policy makers. In this context, we believe that the ISO SQuaRE standard, if appropriately integrated with risk management concepts and procedures from ISO 31000, can play an important role in democratizing the innovation of software-generated decisions, by making the development of this type of software systems more socially sustainable and in line with the shared values of our societies. More in detail, we identified two additional measures: one for a quality characteristic already present in the standard (completeness) and another that extends it (balance), with the aim of highlighting information gaps or the presence of bias in the training data. Those measures serve as risk level indicators to be checked against common fairness measures that indicate the level of polarization of the software classifications/predictions. The adoption of additional features with respect to the standard broadens its scope of application, while maintaining consistency and conformity. The proposed methodology aims to find correlations between quality deficiencies and algorithm decisions, thus making it possible to verify and mitigate their impact.

Keywords — ISO SQuaRE, ISO 31000, data ethics, data quality, data bias, algorithm fairness, discrimination risk

I. INTRODUCTION

Software nowadays replaces most human decisions in many contexts [1]; the rapid pace of innovation suggests that this phenomenon will further increase in the future [2]. This trend has been enabled by the large availability of data and of the technical means to analyze them for building the predictive, classification, and ranking models that are at the core of automated decision making (ADM) systems. The advantages of using ADM systems are evident, and they concern mainly scalability, efficiency, and the removal of decision makers' subjectivity. However, several critical aspects have emerged: lack of accountability and transparency [3], massive use of natural resources and of low-paid labor to build extensive training sets [4], the distortion of the public sphere of political discussion [5], and the amplification of existing inequalities in society [6]. This paper focuses on the latter problem, which occurs when automated software decisions "systematically and unfairly discriminate against certain individuals or groups of individuals in favor of others [by denying] an opportunity for a good or [assigning] an undesirable outcome to an individual or groups of individuals on grounds that are unreasonable or inappropriate" [7]. In practice, software systems may perpetuate the same biases of our societies, systematically discriminating against the weakest people and exacerbating existing inequalities [8]. A recurring cause of this phenomenon is the use of incomplete and biased data, because of errors or limitations in the data collection (e.g., under-sampling of a specific population group) or simply because the distributions of the original population are skewed. From a data engineering perspective, this translates into imbalanced data, i.e., a condition with an unequal distribution of data between the classes of a given attribute, which causes highly heterogeneous accuracy across the classifications [9][10]. Imbalanced data has long been known to be problematic in the machine learning domain [11]. In fact, imbalanced datasets may lead to imbalanced results, which in the context of ADM systems means differentiation of products, information and services based on personal characteristics. In applications such as allocation of social benefits, insurance tariffs, job profile matching, etc., such differentiations can lead to unjustified unequal treatment or discrimination.

For this reason, we maintain that imbalanced and incomplete data shall be considered as a risk factor in all ADM systems that rely on historical data and operate in relevant aspects of the lives of individuals. Our proposal relies on the integration of the measurement principles of ISO SQuaRE [12] with the risk management process defined in ISO 31000 [13] to assess the potential risk of discriminating software output and take action for remediation. In the paper, we describe the theoretical foundations and we provide a workflow of activities.
We believe that the approach can be useful to a variety of stakeholders for assessing the risk of discrimination, including the creators or commissioners of software systems, researchers, policymakers, regulators, and certification or audit authorities. Assessments should prompt taking appropriate action to prevent adverse effects.

II. METHODOLOGY

Figure 1 gives an overview of the proposed methodology. The process begins with the common subdivision of the original data into training and test data. At this point, it is possible to measure the quality of the training data (balance and completeness) and the fairness of the results obtained on the test data. Data balance measures extend the characteristics of the data quality model (ISO/IEC 25012), while completeness measures complement it. Data quality measures give rise to an indicator of unbalanced or incomplete data for the sensitive characteristics, which implies a risk of biased classifications by the algorithm. In this circumstance, it is necessary to also assess the fairness of the algorithms used through the measures outlined in this paper. The presence of unfair results from the point of view of sensitive features, in correspondence with poor quality data, leads to the necessary data enrichment step to try to mitigate the problem. Thus, our proposed methodology is composed of two main blocks (a minimal code sketch follows the list):

A. Risk analysis: measuring the risk that a training set could contain unbalanced data, integrating the SQuaRE approach with ISO 31000 risk management principles;

B. Risk evaluation: verifying that a high level of risk corresponds to unfairness and, in the positive case, enriching the original data with synthetic data to mitigate the problem.
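As a minimal illustration of how the two blocks could be chained in code, the following runnable Python sketch uses an invented toy dataset and an illustrative threshold; the individual measures are detailed in the next subsections:

    # Minimal runnable sketch of the two blocks; the dataset, the sensitive
    # attributes and the threshold are invented for illustration only.
    import pandas as pd

    # Toy training set (assume the train/test subdivision was already done).
    train = pd.DataFrame({'gender':    ['F', 'M', 'M', 'M', 'M', 'M'],
                          'age_range': ['<18', '19-35', '19-35',
                                        '36-50', '36-50', '51-65']})

    # Block A (risk analysis): data quality of the sensitive attributes.
    f = train['gender'].value_counts(normalize=True)   # relative class frequencies
    m = f.size
    balance = m / (m - 1) * (1 - (f ** 2).sum())        # normalized Gini, see Table 1

    k = train['gender'].nunique() * train['age_range'].nunique()
    completeness = len(train[['gender', 'age_range']].drop_duplicates()) / k

    # Block B (risk evaluation): low quality raises a risk indicator that
    # triggers the fairness checks of subsection B and, if unfairness is
    # confirmed, the enrichment with synthetic data.
    if min(balance, completeness) < 0.8:                # illustrative threshold
        print('Risk indicator raised: assess fairness and consider data enrichment')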
A. Risk analysis: where SQuaRE and ISO 31000 meet

We integrate the SQuaRE theoretical framework with the ISO 31000 risk management principles to measure the risk that an unbalanced or incomplete training set might cause discriminating software output. Since the primary recipients of this document are the participants of the “3rd International Workshop on Experience with SQuaRE series and its Future Direction”¹, we do not describe the standard here; however, we summarize the aspects that are most important for the scope of the paper. Firstly, we recall that SQuaRE includes quality modeling and measurement of software products², data and software services. According to the philosophy and organization of this family of standards, quality is categorized into one or more quantifiable characteristics and sub-characteristics. For example, the standard ISO/IEC 25010:2011 formalizes the product quality model as composed of eight characteristics, which are further subdivided into sub-characteristics. Each (sub-)characteristic relates to static properties of software and dynamic properties of the computer system³. The ISO/IEC 25012:2008 standard on data quality has 15 characteristics: 5 of them belong to the “inherent” point of view (i.e., the quality relies only on the characteristics of the data per se), 3 of them are system-dependent (i.e., the quality depends on the characteristics of the system hosting the data and making it available), and the remaining 7 belong to both points of view. Data balance is not recognized as a characteristic of data quality in ISO/IEC 25012:2008: it is proposed here as an additional inherent characteristic.

Figure 1. The proposed methodology



¹ See http://www.sic.shibaura-it.ac.jp/~tsnaka/iwesq.html
² A software product is a “set of computer programs, procedures, and possibly associated documentation and data” as defined in ISO/IEC 12207:1998. In SQuaRE standards, software quality stands for software product quality.
³ A system is the “combination of interacting elements organized to achieve one or more stated purposes” (ISO/IEC 15288:2008), for example an aircraft system. It follows that a computer system is “a system containing one or more components and elements such as computers (hardware), associated software, and data”, for example a conference registration system. An ADM system that determines eligibility for aid for drinking water is a software system.
Because of its role in the generation of biased software output, data balance reflects the propagation principle of SQuaRE: the quality of the software product, service and data affects the quality in use. Therefore, evaluating and improving product/service/data quality is one means of improving the system quality in use. A simplification of this concept is the GIGO principle (“garbage in, garbage out”): data that is outdated, inaccurate and incomplete makes the output of the software unreliable. Similarly, unbalanced data will probably cause unbalanced software output, especially in the context of machine learning and AI systems trained with that data. This principle applies also to completeness, which is already an inherent characteristic of data quality in SQuaRE: in this work we propose an additional metric to those proposed in ISO/IEC 25024:2015 that is more suitable for the problem of biased software.

To better address the problem of biased software output, we consider the measures of data balance and completeness not only as extensions of SQuaRE data quality modelling but also as risk factors. Here comes the integration of the SQuaRE theoretical and measurement framework with the ISO 31000:2018 standard for risk management. The standard defines guiding principles and a process of three phases: risk identification, risk analysis and risk evaluation. Here, we briefly describe them and specify their relation with our approach.

Risk identification refers to finding, recognizing and describing risks within a certain context and scope, and with respect to specific criteria defined prior to risk assessment. In this paper, this is implicitly contained in the motivations and in the problem formulation: it is the risk of discriminating individuals or groups of individuals by operating software systems that automate high-stake decisions for the lives of people.

Risk analysis is the understanding of the characteristics and levels of the risk. This is the phase where measures of data balance and completeness are used as indicators, due to the propagation effect previously described.

Risk evaluation, as the last step, is the process in which the results of the analysis are taken into consideration to decide whether additional action is required. If affirmative, this process would then outline the available risk treatment options and the need for conducting additional analyses. In our case, specific thresholds for the measures should be decided for the specific prediction/classification algorithms used, the social context, the legal requirements of the domain, and other relevant factors of the case at hand. In addition to the technical actions, the process would define other types of required actions (e.g., reorganization of decision processes, communication to the public, etc.) and the actors who must undertake them.

1) Completeness measure

The proposed completeness measure is agnostic with respect to classical ML data classification because, for our purposes, we are interested in evaluating those columns that assume values in finite and discrete intervals, which we will call categorical with respect to the row data. This characteristic will allow us to consider the set of their values as the digits constituting a number in a variable-base numbering system. The idea of the present study is based on the principle that a learning system provides predictions consistent with the data with which it has been trained. Therefore, if it is fed with non-homogeneous data, it will provide unbalanced and discriminatory predictions with respect to reality. For this reason, the methodology we propose starts with the analysis of the reality of interest and of the dataset, an activity that must be carried out even before starting the pre-training phase, in line with previous studies where some of the authors proposed the use of balance measures in automated decision-making systems [14][15][16][17].

Table 1. Examples of measures of balance

Gini index
  Formula:            G = 1 − Σ_{i=1}^{m} f_i²
  Normalized formula: G_n = (m / (m − 1)) · (1 − Σ_{i=1}^{m} f_i²)
  Notes: m is the number of classes; f_i = n_i / Σ_{i=1}^{m} n_i is the relative frequency of class i, with n_i its absolute frequency. The higher G and G_n, the higher the heterogeneity: categories have similar frequencies. The lower the index, the lower the heterogeneity: a few classes account for the majority of instances.

Simpson index
  Formula:            D = 1 / Σ_{i=1}^{m} f_i²
  Normalized formula: D_n = (1 / (m − 1)) · (1 / Σ_{i=1}^{m} f_i² − 1)
  Notes: for m, f_i and n_i see Gini. Higher values of D and D_n indicate higher diversity in terms of the probability of belonging to different classes. The lower the index, the lower the diversity, because frequencies are concentrated in a few classes.
In particular, during this phase it is necessary to identify all the independent columns that define whether the instance belongs to a class or category. Suppose we have a structured dataset as follows:

    DS = { C_0, C_1, ..., C_{n−1} }    (1)

Indicating with the set S the positions of the columns categorising the instances, functionally independent of the other columns in the dataset:

    S ⊆ { 0, 1, ..., n−1 },  dim(S) = m,  m ≤ n    (2)

we can analyze the new dataset consisting of the columns C_{S(j)} with j ∈ [0, m−1].

Having said that, we can decide to use two different notions of completeness: maximum or minimum. In the first case, the presence in the dataset of a greater number of distinct instances that belong to the same categorising classes constitutes a constraint for all the other instances of the dataset. That is, one must ensure that one has the same number of replicas of distinct class combinations for distinct instances. In the second case, instead, it is sufficient to have at least one combination of distinct classes among those possible for each instance. For simplicity, but without loss of generality of the procedure, we will explore the minimum completeness of the dataset; we will then reduce the dataset to just the columns C_{S(j)} by removing duplicate rows. We will use the Python language to make the calculation formulas explicit and the underlying mathematical logic less abstract. Python has the pandas library, which makes it possible to carry out analysis and data manipulation in a fast, powerful, flexible and easy-to-use manner. Through the DataFrame class it is possible to load a data frame from a simple csv file:

    import pandas as pd
    df = pd.read_csv('dataset.csv')   # placeholder path: any csv with the columns of interest

The ideal value of minimum completeness for the combinatorial metric is reached when the dataset contains at least one instance for each distinct combination of categories. The absence of some combination could create a lack of information that we do not want to exist. To calculate the total number of distinct combinations, we need to compute the product of the number of distinct values per single category:

    k = ( df['CS0'].unique().size *
          df['CS1'].unique().size * ... *
          df['CSm-1'].unique().size )

On the other hand, in the dataset we only have the characterising columns, so we can derive the true number of distinct instances in order to determine how far the data deviates from the ideal case:

    len(df.drop_duplicates()) / k

The value for maximum completeness is calculated from the maximum number of duplicates of the same combinations of characterising columns. For this reason, it is necessary to keep in the dataset, in addition to the columns C_{S(j)}, an identification field discriminating the rows with the same values in these columns. To determine the potential total, once the maximum number of duplications (M) has been determined, it is necessary to extend this multiplication factor to all the other classes:

    M = ( df.groupby(['CS0', ..., 'CSm-1'])
            .size().reset_index(name='counts')
            .counts.max() )

    len(df) / (M * k)
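To make the calculation concrete, here is a small self-contained example; the dataset, the column names and the values are invented for illustration:

    # Toy example: minimum and maximum completeness on invented data.
    import pandas as pd

    df = pd.DataFrame({'gender':    ['F', 'F', 'M', 'M', 'M', 'M'],
                       'age_range': ['<18', '<18', '19-35', '19-35',
                                     '36-50', '36-50']})

    cs = ['gender', 'age_range']          # categorising columns C_S(j)
    k = 1
    for c in cs:                          # product of distinct values per category
        k *= df[c].unique().size          # here k = 2 * 3 = 6

    # Minimum completeness: fraction of possible combinations actually present.
    min_completeness = len(df[cs].drop_duplicates()) / k        # 3 / 6 = 0.5

    # Maximum completeness: penalizes unequal replication of the combinations.
    M = df.groupby(cs).size().reset_index(name='counts').counts.max()   # 2
    max_completeness = len(df) / (M * k)                        # 6 / (2 * 6) = 0.5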
2) Balance measures

Since imbalance is defined as an unequal distribution between classes [9], we focus on categorical data. In fact, most of the sensitive attributes are categorical, such as gender, hometown, marital status, and job. Alternatively, if they are numeric, they are either discrete and within a short range, such as family size, or they are continuous but often reduced to distinct categories, such as information on “age”, which is often discretized into ranges such as “< 18”, “19-35”, “36-50”, “51-65”, etc. We show two examples of measures in Table 1, retrieved from the literature of the social and natural sciences, where imbalance is known in terms of (lack of) heterogeneity and diversity. They are normalized in the range 0-1, where 1 corresponds to maximum balance and 0 to minimum balance, i.e., imbalance. Hence, a lower level of the balance measures means a higher risk of bias in the software output.
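As an illustration, the normalized indices of Table 1 can be computed with a few lines of pandas; the following functions and the sample attribute are our own sketch, not part of the standard:

    # Normalized Gini and Simpson balance indices for one categorical column.
    import pandas as pd

    def gini_balance(series):
        f = series.value_counts(normalize=True)   # relative frequencies f_i
        m = f.size                                # number of observed classes
        return m / (m - 1) * (1 - (f ** 2).sum())

    def simpson_balance(series):
        f = series.value_counts(normalize=True)
        m = f.size
        return (1 / (f ** 2).sum() - 1) / (m - 1)

    s = pd.Series(['F'] + ['M'] * 9)              # strongly imbalanced attribute
    print(gini_balance(s))                        # ~0.36: far from 1, high risk
    print(simpson_balance(s))                     # ~0.22: far from 1, high risk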
B. Risk evaluation with fairness measures

The majority of fairness measures in the machine learning literature rely on the comparison of accuracy, computed for each population group of interest [18]. For computing the accuracy, two different approaches can be adopted: the first attempts to measure the intensity of errors, i.e., the deviation between prediction and actual value (precision), while the other measures the general direction of the error. Indicating with e_i the i-th error, and with f_i and d_i respectively the i-th forecast and demand, we have:

    e_i = f_i − d_i    (3)

At this point we can add up all the errors with their sign and find the average error:

    average error = (1/n) Σ_i e_i    (4)

However, this measure is very crude because error compensation phenomena may be present, so it is generally preferred to use the mean absolute error or the root mean square error:

    MAE = (1/n) Σ_i |e_i|    (5)

    RMSE = √( (1/n) Σ_i e_i² )    (6)

RMSE is sensitive to large errors, while from this point of view MAE is fairer because it considers all errors at the same level. Moreover, if our prediction tends to the median it will get a good value of MAE; vice versa, if it approaches the mean it will get a better result on RMSE. Under conditions where the median is lower than the mean, for example in processes where there are peaks of demand compared to normal steady-state operation, it will not be convenient to use MAE, which would introduce a bias, and it will be more convenient to use RMSE. Things are reversed if outliers are present in the distribution, as MAE is less sensitive to them than RMSE.
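For illustration, assuming a regression-style output, MAE and RMSE can be computed separately for each population group as follows (the groups and values are invented):

    # MAE and RMSE computed per population group (toy data).
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'group':    ['F', 'F', 'M', 'M', 'M', 'M'],
                       'forecast': [10.0, 12.0, 9.0, 11.0, 10.0, 13.0],
                       'demand':   [11.0, 15.0, 9.0, 10.0, 10.0, 12.0]})
    df['e'] = df['forecast'] - df['demand']          # signed errors e_i (eq. 3)

    per_group = df.groupby('group')['e'].agg(
        MAE=lambda e: e.abs().mean(),                # eq. 5
        RMSE=lambda e: np.sqrt((e ** 2).mean()))     # eq. 6
    print(per_group)   # a large gap between groups signals an unfair error distribution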
To measure model performance, one can choose to measure the error with one or more of these KPIs. In the case of classification algorithms, instead, one can use the confusion matrix, which allows computing the number of true positives (TPs), true negatives (TNs), false positives (FPs) and false negatives (FNs):

    P = ( p_11 … p_1n
          …    …   …      (7)
          p_n1 … p_nn )

The following equations can be used to calculate these values for each class i:

    TP(i) = p_ii    (8)

    FP(i) = Σ_{k=1, k≠i}^{n} p_ki    (9)

    FN(i) = Σ_{k=1, k≠i}^{n} p_ik    (10)

    TN(i) = Σ_{k=1, k≠i}^{n} p_kk    (11)

At this point, the single values can be computed for each population subgroup (e.g., “Asian” vs “Caucasian” vs “African-American”, or “Male” vs “Female”, etc.), and the same applies to the concepts of precision, recall, and accuracy, known from the literature and reported here:

    Precision = TP / (TP + FP)    (12)

    Recall = TP / (TP + FN)    (13)

    Accuracy = (TP + TN) / (TP + TN + FP + FN)    (14)
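A minimal sketch of this per-group comparison, using an invented binary classification example and scikit-learn's confusion_matrix helper:

    # Per-group precision, recall and accuracy from confusion matrices (toy data).
    import pandas as pd
    from sklearn.metrics import confusion_matrix

    df = pd.DataFrame({'group':     ['F', 'F', 'F', 'M', 'M', 'M', 'M', 'M'],
                       'actual':    [1, 0, 1, 1, 0, 1, 0, 1],
                       'predicted': [0, 0, 1, 1, 0, 1, 0, 1]})

    for g, sub in df.groupby('group'):
        # Rows are actual classes, columns are predicted classes (eq. 7).
        tn, fp, fn, tp = confusion_matrix(sub['actual'], sub['predicted'],
                                          labels=[0, 1]).ravel()
        precision = tp / (tp + fp) if tp + fp else float('nan')   # eq. 12
        recall    = tp / (tp + fn) if tp + fn else float('nan')   # eq. 13
        accuracy  = (tp + tn) / (tp + tn + fp + fn)               # eq. 14
        print(g, precision, recall, accuracy)
    # Systematic gaps across groups indicate unfair classifications.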
Fairness measures should then be compared with appropriate thresholds, selected with respect to the social context in which the software application is used. If the unfairness is higher than the maximum allowed thresholds, then the original dataset should be integrated with synthetic data to mitigate the problem. One way to repopulate the dataset without causing distortion in the data is to add replicas of data selected at random from the same set (known as bootstrapping [19]); other rebalancing techniques have been proposed in the literature (e.g., SMOTE [20], ROSE [21]).
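As a sketch of the bootstrapping option (the dataset is invented; SMOTE [20] and ROSE [21] provide more sophisticated synthetic generation), minority groups can be resampled with replacement until group sizes match:

    # Naive bootstrap rebalancing: oversample minority groups with replacement.
    import pandas as pd

    df = pd.DataFrame({'gender':  ['F', 'M', 'M', 'M', 'M', 'M'],
                       'outcome': [0, 1, 1, 0, 1, 1]})

    target = df['gender'].value_counts().max()        # size of the largest group
    parts = []
    for g, sub in df.groupby('gender'):
        extra = sub.sample(n=target - len(sub), replace=True, random_state=0)
        parts.append(pd.concat([sub, extra]))         # original rows + replicas
    balanced = pd.concat(parts, ignore_index=True)
    print(balanced['gender'].value_counts())          # now 5 F and 5 M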
III. RELATION WITH LITERATURE AND OUR PAST STUDIES

An approach similar to ours is the work of Takashi Matsumoto and Arisa Ema [22], who proposed a risk chain model (RCM) for risk reduction in Artificial Intelligence services: the authors consider both data quality and data imbalance as risk factors. Our work can be easily integrated into the RCM framework, because we offer a quantitative way to measure balance and completeness, and because it is natively related to the ISO/IEC standards on data quality requirements and risk management.

Other approaches that can be connected to ours go in the direction of labeling datasets: for example, “The Dataset Nutrition Label Project”⁴ aims to identify the “key ingredients” of a dataset, such as provenance, populations, and missing data. The label takes the form of an interactive visualization that allows for exploring the previously mentioned aspects and spotting flawed, incomplete, or problematic data. One of the authors of this paper took inspiration from this study in previous works for “Ethical and Socially-Aware Data Labels” [16] and for a data annotation and visualization schema based on Bayesian statistical inference [17], always for the purpose of warning about the risk of discriminatory outcomes due to the poor quality of datasets. We started from that experience to conduct preliminary case studies on the reliability of the balance measures [14][15]: in this work we continue in that direction by adding a measure of completeness and proposing an explicit workflow of activities for the combination of SQuaRE with ISO 31000.

⁴ It is a joint initiative of the MIT Media Lab and the Berkman Klein Center at Harvard University: https://datanutrition.org/.

IV. CONCLUSION AND FUTURE WORK

We propose a methodology that integrates the SQuaRE measurement framework with the ISO 31000 process, with the goal of evaluating balance and completeness in a dataset as risk factors of discriminatory outputs of software systems. We believe that the methodology can be a useful instrument for all the actors involved in the development and regulation of ADM systems and, from a more general perspective, that it can play an important role in the collective attempt to place democratic control on the development of these systems, which should be more accountable and less harmful than they are now. In fact, the adverse effects of ADM systems pose a significant danger for human rights and freedoms as our societies increasingly rely on automated decision making. It must be stressed that this work is still at a prototypical stage and further studies are necessary to improve the methodology and to assess the reliability of the proposed measures, for example to find meaningful risk thresholds in relation to the context of use and the severity of the impact on individuals. The current paper is also a way to seek engagement from other researchers in a community effort to test the workflow in real settings, improve it, and build an open registry of additional measures combined with evaluation benchmarks. Finally, we are conscious that technical adjustments are not enough, and they should be integrated with other types of actions because of the socio-technical nature of the problem.

REFERENCES

[1] F. Chiusi, S. Fischer, N. Kayser-Bril, and M. Spielkamp, “Automating Society Report 2020,” Berlin, Oct. 2020. Accessed: Nov. 10, 2020. [Online]. Available: https://automatingsociety.algorithmwatch.org
[2] E. Brynjolfsson and A. McAfee, The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies, Reprint edition. New York; London: W. W. Norton & Company, 2016.
[3] F. Pasquale, The Black Box Society: The Secret Algorithms That Control Money and Information. Cambridge: Harvard Univ Pr, 2015.
[4] K. Crawford, Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence: The Real Worlds of Artificial Intelligence. New Haven: Yale Univ Pr, 2021.
[5] P. N. Howard, Lie Machines: How to Save Democracy from Troll Armies, Deceitful Robots, Junk News Operations, and Political Operatives. New Haven; London: Yale Univ Pr, 2020.
[6] V. Eubanks, Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor. New York, NY: St. Martin’s Press, 2018.
[7] B. Friedman and H. Nissenbaum, “Bias in Computer Systems,” ACM Trans. Inf. Syst., vol. 14, no. 3, pp. 330–347, Jul. 1996, doi: 10.1145/230538.230561.
[8] C. O’Neil, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy, Reprint edition. New York: Broadway Books, 2017.
[9] H. He and E. A. Garcia, “Learning from Imbalanced Data,” IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263–1284, Sep. 2009, doi: 10.1109/TKDE.2008.239.
[10] B. Krawczyk, “Learning from imbalanced data: open challenges and future directions,” Prog. Artif. Intell., vol. 5, no. 4, pp. 221–232, Nov. 2016, doi: 10.1007/s13748-016-0094-0.
[11] N. Japkowicz and S. Stephen, “The class imbalance problem: A systematic study,” Intell. Data Anal., vol. 6, no. 5, pp. 429–449, Oct. 2002.
[12] International Organization for Standardization, “ISO/IEC 25000:2014 Systems and software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Guide to SQuaRE,” 2014. https://www.iso.org/standard/64764.html (accessed Nov. 10, 2020).
[13] International Organization for Standardization, “ISO 31000:2018 Risk management — Guidelines,” 2018. https://www.iso.org/cms/render/live/en/sites/isoorg/contents/data/standard/06/56/65694.html (accessed Nov. 10, 2020).
[14] M. Mecati, F. E. Cannavò, A. Vetrò, and M. Torchiano, “Identifying Risks in Datasets for Automated Decision-Making,” in Electronic Government, Cham, 2020, pp. 332–344, doi: 10.1007/978-3-030-57599-1_25.
[15] A. Vetrò, M. Torchiano, and M. Mecati, “A data quality approach to the identification of discrimination risk in automated decision making systems,” Gov. Inf. Q., p. 101619, Sep. 2021, doi: 10.1016/j.giq.2021.101619.
[16] E. Beretta, A. Vetrò, B. Lepri, and J. C. De Martin, “Ethical and Socially-Aware Data Labels,” in Information Management and Big Data, Cham, 2019, pp. 320–327.
[17] E. Beretta, A. Vetrò, B. Lepri, and J. C. De Martin, “Detecting discriminatory risk through data annotation based on Bayesian inferences,” in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, New York, NY, USA, Mar. 2021, pp. 794–804, doi: 10.1145/3442188.3445940.
[18] S. Barocas, M. Hardt, and A. Narayanan, Fairness and Machine Learning. fairmlbook.org, 2019.
[19] “Handling imbalanced data sets with synthetic boundary data generation using bootstrap re-sampling and AdaBoost techniques.” https://doi.org/10.1016/j.patrec.2013.04.019 (accessed Sep. 18, 2021).
[20] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” J. Artif. Intell. Res., vol. 16, no. 1, pp. 321–357, Jun. 2002.
[21] G. Menardi and N. Torelli, “Training and assessing classification rules with imbalanced data,” Data Min. Knowl. Discov., vol. 28, no. 1, pp. 92–122, Jan. 2014, doi: 10.1007/s10618-012-0295-5.
[22] T. Matsumoto and A. Ema, “RCModel, a Risk Chain Model for Risk Reduction in AI Services,” arXiv:2007.03215 [cs], Jul. 2020. [Online]. Available: http://arxiv.org/abs/2007.03215