<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>How a socio-technical approach to AI auditing can change how we understand and measure fairness in machine learning systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gemma Galdon Clavell</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ariane Aumaitre</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Toon Calders</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>European University Institute</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Most AI accountability approaches are either too general or too specific. Impact assessments, on the one hand, tend to focus on principles and general commitments, with little or no technical input or transparency. AI auditing tools, on the other, focus on model outputs and predictive accuracy metrics, with no visibility over impacts or structural bias dynamics. We have observed how existing approaches may end up validating unfair and inefficient systems. In our work at Eticas.ai, we explore the possibilities of a socio-technical approach to AI auditing. We capture key metrics at different stages of a system and put them in relation to relevant social, demographic and sector-specific data. This allows us to assess how protected and other attributes perform at different times in the decision-making process (from training and pre-processing to post-processing), but also in relation to society. Most crucially, our proposal focuses on impact, which is central to all regulatory efforts around the world, and at the heart of impact assessments. For this specific paper, we show how current approaches to bias measurement are problematic. We use a public dataset, the Adult Census Income Dataset, to show how different bias metrics and approaches lead to different outcomes, and how these may validate unfair AI systems. In the paper we reproduce the criteria used by two well-known bias tools, Fairness 360 and Aequitas, and one regulatory piece, the NYC Bias Audit Law, and compare our results with our own approach, which emphasises representativity (as opposed to accuracy or impact in output data) and bias diagnosis. We show how bias metrics that only capture data from model performance and outputs are incomplete and potentially harmful, as they fail to incorporate data from actual, real-world impacts and contextual dynamics such as structural discrimination and power relations.
Our auditing methodology overcomes many of these shortcomings and builds a robust basis for the development of AI audits as a professional practice. With this paper and our argument for our “E2E-ST” approach, we wish to open a discussion in the AI bias community and among policy-makers as to how and when to measure impact, and what impact means in AI system performance. Our main claim is that AI bias is not a technical problem with technical solutions, but a socio-technical problem that requires socio-technical approaches if it is to be addressed and lead to the effective protection and fair treatment of discriminated groups and outliers.</p>
      </abstract>
      <kwd-group>
        <kwd>Responsible AI</kwd>
        <kwd>Bias audit</kwd>
        <kwd>Fair machine learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The increasing use of personal data to make automated decisions has led to a generalised
concern over bias and, specifically, over how AI systems capture and reproduce bias when
making decisions. Scholars and researchers have explored various dimensions of bias, aiming to
understand its origins, manifestations, and implications across different domains. Contributions
from Timnit Gebru [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Joy Buolamwini [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Cathy O’Neil [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Safiya Umoja Noble [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], Ruha
Benjamin [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] or Latanya Sweeney [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] have raised awareness on this issue and provided specific
examples of AI bias discriminating against women, people of colour, non-English speakers,
people with disabilities, and people from specific geographical areas. Overall, the literature
shows a continued problem with bias in AI systems, and a clear amplification challenge: data
systems may not only be capturing but also amplifying existing, human bias.
      </p>
      <p>
        The bias problem in AI has found echo in current regulatory debates and developments.
Most existing regulation and best practice documents on AI capture the need to conduct risk
assessments that look at bias and inclusivity in AI development. So far, the most specific
achievement of the AI bias community in terms of turning metrics into regulatory requirements
is Law 144 of 2021, also known as the New York City’s Bias Audit Law [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], effective since
January 1, 2023. It prohibits employers from using automated employment decision tools to
evaluate applicants unless several requirements are met before use: the tool must be
audited by an independent third party, notice of its use must be provided, and a
summarized version of the audit(s) must be published. Interestingly, NYC’s
law focuses on a specific bias metric: disparate impact (DI).
      </p>
      <p>
        This issue has also raised interest in the technical field, and conferences such as the
ACM Conference on Fairness, Accountability, and Transparency (FAccT) are a clear example of
the increasing attempts to address this issue and come up with solutions. This attention and
awareness of the bias problem in AI has also prompted multiple actors to produce tools to help
practitioners measure and address bias. These include:
• IBM AI Fairness 360 (AIF360) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], an open-source toolkit that provides a set of algorithms,
metrics, and bias mitigation techniques for assessing and addressing bias in machine
learning models. It includes capabilities for measuring bias in datasets and models, as well
as tools for applying pre-processing and post-processing techniques to enhance fairness.
• Aequitas [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]: another open-source bias audit toolkit that allows users to assess bias
in machine learning models and datasets. It provides statistical and visual analyses to
identify and understand disparities in model predictions across different demographic
groups. Aequitas is designed to be customizable and works with various fairness metrics.
• TensorFlow Model Analysis [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which includes fairness indicators and tools for
computing fairness metrics and visualizing fairness disparities in machine learning models.
The toolkit supports metrics such as disparate impact, equalized odds, and more. It enables
users to monitor and evaluate model fairness during development.
      </p>
      <p>This list is far from comprehensive. The tools available can be counted in the dozens. But
for all this apparent diversity, existing and available tools are focused on measuring fairness in
model performance. As we will show below, this is highly problematic.</p>
      <p>
        We are not the first to point to the need to explore bias beyond model performance. Many
critical voices [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] have pointed to the need to complement existing model-centred approaches
with further checks, with a particular focus on the need to ensure that end-users are involved
in the training and testing of AI systems. In fact, all authors mentioned above have contributed
with their work to shaping the debate on the ethical and social dimensions of AI, emphasizing
the significance of including end users in the design, testing, and deployment of AI systems and
calling for a more inclusive and user-centred approach to AI development.
      </p>
      <p>Our work, however, shows that current approaches to AI bias are fundamentally flawed. As
technical and non-technical approaches to bias have developed independently of each other,
current socio-technical approaches resemble a sort of methodological Frankenstein where
quantitative and qualitative approaches coexist without dialogue or interaction, and the overall
picture on how systems impact on people and societies is lost.</p>
      <p>In our work, we look at bias from a truly socio-technical perspective and a focus on what we
believe (and propose) must be the main entry point for any attempt to make AI accountable:
impact. Indeed, all bias metrics we have seen limit their analysis to model data. In the best
cases, training data may be taken into account. Therefore, what they all measure is model
outputs, not actual societal impacts. This opens a dangerous door to auditing and accountability
exercises that validate systems with unfair, problematic or otherwise undesirable impacts. What
is the contribution of checking fairness in the limited world of an AI model, when there is no
relation between model fairness metrics and actual impact on the world where the AI model
will be run? A focus on model metrics and output is also blind to the impact of structural
discrimination, power relations and barriers to access, which should be at the core of any AI
inspection mechanism.</p>
      <p>Drawing on the NYC bias law mentioned above, we found ourselves with a crucial problem:
as AI auditors, can we validate a system that complies with DI metrics, where an employer
is deemed to be fair when employing 50 of each 100 male candidates and 1 of each 2 female
candidates? Ignoring structural discrimination and barriers leads to systems where perpetuating
bias is not only seen as acceptable, but also “fair”.</p>
      <p>To address these shortcomings and contribute to the development of a robust AI
auditing ecosystem, what we propose is an auditing process that captures key model metrics
(in-processing), as well as training and labelling data (with a focus on protected attributes,
pre-processing). Additionally, having worked extensively in Europe, where AI decisions require
a “human in the loop”, we have also come across an additional problem: what if we can assess
the predictive accuracy of a system, but have no access to final, real-life decisions? Could we
validate a system based on outputs knowing that a series of “humans in the loop” may have
altered the initial data?</p>
      <p>Our approach is heavily informed by how other sectors have addressed similar challenges.
In the medical field, for instance, proving that a solution works in a lab environment is only
a stepping stone towards market approval, and many medical products never see the light of
day as they fail to demonstrate their benefits and side effects (impacts) in clinical trials. The
scalability of impacts is so relevant that clinical trials are organized in phases to ensure that
developers adhere to a precautionary principle [12].</p>
      <p>Our work on AI auditing at Eticas.ai addresses these questions. As we develop AI auditing
software that is socio-technical, we are determined to avoid the methodological Frankenstein
and look at systems from the perspective of impact and by using demographic and other relevant
data, and not just model data, as our threshold and reference.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Existing Fairness Criteria</title>
      <p>Existing tools for evaluating model fairness are based on criteria that can be computed from
a dataset containing the sensitive attributes of the instances, a ground truth label, and the
decisions of the model being assessed. In our approach we extend these fairness criteria based
on additional demographic and contextual (real-world impact) data to get a more complete
picture of the moments and sources of bias in the creation of the model. But, before we dive
into the details of our approach, we first revisit the existing criteria as they are used by tools
such as AIF360 and Aequitas, as this helps us establish common ground.</p>
      <p>Barocas et al. [13] classify these fairness criteria into three groups, based on Independence,
Separation, or Sufficiency:
• Criteria that assess disparity in outcomes. These criteria compare the rate of favourable
outcomes for the protected group to the rate of favourable outcomes for the unprotected
group. Disparate Impact Ratio (DIR) for instance computes the ratio between them. A
DIR value close to 1 indicates fairness, while values significantly different from 1 suggest
disparate impact. Similar in nature, Statistical Parity Difference (SPD) measures the
difference; an SPD close to 0 indicates fairness, while positive or negative values suggest
disparities. This criterion is also known as Demographic Disparity (DD).
• Criteria based on disparity in prediction errors, comparing true positive rates (TPR) and
false positive rates (FPR) between protected and unprotected groups. A large difference
in TPR indicates that the group with the higher TPR is advantaged, as more members
that should receive the positive label do so, while a higher FPR indicates that a higher
number of members of the group with the higher FPR that should not receive the positive
label do receive it nevertheless. The most well-known representative of this category,
combining both quantities, is Equalized Odds (EO), which compares the true positive rate
and false positive rate between different groups with the following formula:
<inline-formula><tex-math>\mathrm{EO} = \max\left(\left|\mathrm{TPR}_{\text{protected}} - \mathrm{TPR}_{\text{unprotected}}\right|,\ \left|\mathrm{FPR}_{\text{protected}} - \mathrm{FPR}_{\text{unprotected}}\right|\right)</tex-math></inline-formula>, where EO should be
close to 0 for fairness; a higher value indicates disparities in error rates. The Theil index
also falls into this category of criteria that are based on disparities in errors; based on
the type of error, an individual benefit for the receiver of the label is determined, and the
Theil index measures how much the distribution of the benefits differs from that in a
situation of perfect equality.
• Calibration, which assesses disparities in the interpretation of the predicted labels. For
instance, if 70% of instances receiving a positive prediction indeed have the positive label,
the interpretation of a positive prediction becomes: 70% chance to be positive. Similarly,
the interpretation of the negative label could be: 20% chance to be positive.
Calibration-based fairness criteria require these quantities to be the same for the protected and
unprotected group. Hence, the interpretation of what it means in terms of true probability
of having a positive label given the label assigned by the classifier should not depend on
the protected or unprotected group one is in.</p>
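      <p>As an illustration of the outcome-based and error-based criteria above, the following minimal sketch computes DIR, SPD and EO for a binary classifier. It is not the implementation used by AIF360 or Aequitas; the function names and the toy data are assumptions made for this example, and the protected group is placed in the numerator of DIR.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def rate(values):
    """Share of 1s in a list of 0/1 values (0.0 for an empty list)."""
    return sum(values) / len(values) if values else 0.0


def group_metrics(y_true, y_pred, protected):
    """Compute DIR, SPD and EO for binary labels and a binary group flag.

    y_true, y_pred: lists of 0/1, where 1 is the favourable label.
    protected: list of booleans, True for the protected group.
    """
    stats = {}
    for flag in (True, False):
        idx = [i for i, p in enumerate(protected) if p == flag]
        preds = [y_pred[i] for i in idx]
        # Predictions for members whose ground truth is positive / negative.
        pos = [y_pred[i] for i in idx if y_true[i] == 1]
        neg = [y_pred[i] for i in idx if y_true[i] == 0]
        stats[flag] = {"sel": rate(preds), "tpr": rate(pos), "fpr": rate(neg)}

    prot, unprot = stats[True], stats[False]
    return {
        # Ratio of favourable-outcome rates (1 indicates parity).
        "DIR": prot["sel"] / unprot["sel"] if unprot["sel"] else float("nan"),
        # Difference of favourable-outcome rates (0 indicates parity).
        "SPD": prot["sel"] - unprot["sel"],
        # Largest absolute gap in TPR or FPR (0 indicates parity).
        "EO": max(abs(prot["tpr"] - unprot["tpr"]),
                  abs(prot["fpr"] - unprot["fpr"])),
    }


# Toy data (hypothetical): four protected and four unprotected instances.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
protected = [True, True, True, True, False, False, False, False]
metrics = group_metrics(y_true, y_pred, protected)
print(metrics)
```
]]></preformat>
      <p>Note that all three quantities are computed from model outputs and labels alone; nothing in them refers to the population in which the model is deployed.</p>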
      <p>Besides these criteria, there are also relaxations that do not require them to hold in general,
but only in subgroups of the data. If for instance the protected group overall has fewer financial
means than the unprotected group, demanding demographic parity in a loan application may
be too strict of a condition. However, within subgroups of people with similar financial means,
we would expect similar outcomes and hence demographic parity to hold. Conditional fairness
criteria such as Conditional Demographic Disparity (CDD) take this relaxation into account by
for instance averaging demographic parity differences over these conditional subgroups.</p>
      <p>There are also individual fairness criteria that quantify to what extent an instance in a
protected group was treated fairly, by comparing the treatment of that instance with similar
instances in reference groups. These individual fairness measures could be used to explore
specific cases but are hard to generalize to a fairness audit aimed at assessing a model or
automated decision process.</p>
    </sec>
    <sec id="sec-3">
      <title>3. The Eticas.ai Approach: from output to impact</title>
      <p>We propose an end-to-end, socio-technical approach (“E2E-ST”) that considers not only the
model, but also the different phases in the model construction and deployment. Whereas other
tools typically only consider a single dataset with ground-truth labels and model predictions,
we consider the training data that was used, reference data, and operational data. In this way
a more complete picture can be made, analysing not only the model’s predictions, but also
how it was constructed while putting it into the more global context with reference data of the
community in which the model is deployed. This approach allows us to produce a rich screen
of the model under scrutiny, containing a wide variety of probes and signals.</p>
      <p>The architecture of our E2E-ST approach is depicted in Figure 1. Starting from three data
sources connected respectively to context, training, and deployment of the model under scrutiny,
we first compute statistics called probes. The probes are roughly comparable to measures such
as demographic parity difference listed in Section 2. These measures are then logically grouped
into signals, that is, higher-level indicators of potential issues. For instance, a signal could
combine several probes measuring similar properties of the datasets. Based on the signals,
a final diagnosis including potential moments of bias can be made. We will now detail the
different steps.</p>
      <sec id="sec-3-1">
        <title>3.1. Data used by Eticas.ai’s E2E-ST approach</title>
        <p>Our approach relies on several datasets provided by the client/system owner:
• The data used to train the model, called training data. This dataset includes the features
on which the model bases its decisions, the sensitive attributes indicating membership
of protected groups, the ground-truth label, and the predicted label. We assume the
predicted label was generated using a methodology that avoids data leakage, such as
cross-validation. The predicted label and the ground-truth label hence allow us to estimate
the accuracy of the model.
• During deployment, data is collected in what we call the “operational dataset”. This
dataset contains the features on which the model bases its decisions, and the predicted
label. We assume that it is uncommon to collect the sensitive data during the operational
phase. Also, the ground truth label is not available in the operational dataset.
• Next to training and operational data, we also consider “reference data” that reflects the
distribution of the protected groups in the whole population. This dataset can be collected
from reliable data sources such as a central bureau of statistics.</p>
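        <p>We denote the non-sensitive attributes as <inline-formula><tex-math>X_1, \ldots, X_n</tex-math></inline-formula>, the sensitive attributes as <inline-formula><tex-math>S_1, \ldots, S_m</tex-math></inline-formula>,
and the ground truth label as <inline-formula><tex-math>Y</tex-math></inline-formula>. A model <inline-formula><tex-math>M</tex-math></inline-formula> is trained and produces a decision <inline-formula><tex-math>\hat{Y}</tex-math></inline-formula> based on <inline-formula><tex-math>X_1, \ldots, X_n</tex-math></inline-formula>.</p>
        <p>• The training dataset contains <inline-formula><tex-math>X_1, \ldots, X_n</tex-math></inline-formula>, <inline-formula><tex-math>S_1, \ldots, S_m</tex-math></inline-formula>, <inline-formula><tex-math>Y</tex-math></inline-formula>, <inline-formula><tex-math>\hat{Y}</tex-math></inline-formula>.
• The operational dataset contains attributes <inline-formula><tex-math>X_1, \ldots, X_n</tex-math></inline-formula>, <inline-formula><tex-math>\hat{Y}</tex-math></inline-formula>.
• The reference dataset contains attributes <inline-formula><tex-math>S_1, \ldots, S_m</tex-math></inline-formula>.</p>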
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Probes and Signals reported in our method</title>
        <p>Based on the available data, Eticas.ai tests for different signals. The signals are tested using
several probes, that is, statistics computed from the data. By combining different signals, we
build a coherent account of the data that pinpoints the types of bias that may have affected it.</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. Signals for data inconsistencies</title>
          <p>The first set of signals tests for inconsistencies between the three data sources. Such inconsistencies
could be indicative of several types of bias, including measurement bias, sample bias, and
access bias.</p>
          <p>Signals:
• Training distribution shift: the distribution of the training data differs from the distribution
of the reference data.
• Operational distribution shift: the distribution of the operational data differs from the
distribution of the reference data.
• Training-operation inconsistency: the distribution of the training data differs from that
of the operational data.</p>
          <p>These signals are based upon statistical tests of differences between distributions. For the
third one, a non-parametric test is used based on the performance of a classifier separating
training and operational data.</p>
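          <p>A training distribution shift probe can be sketched as follows. The text does not specify which statistical test is used, so the chi-square goodness-of-fit statistic below is an assumption made for illustration, as are the function name and the counts.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def chi_square_gof(observed_counts, reference_shares):
    """Chi-square goodness-of-fit statistic of observed group counts
    against reference population shares (both keyed by group name)."""
    total = sum(observed_counts.values())
    stat = 0.0
    for group, share in reference_shares.items():
        expected = total * share
        observed = observed_counts.get(group, 0)
        stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical numbers for illustration only.
training_counts = {"female": 330, "other": 670}
reference_shares = {"female": 0.5, "other": 0.5}

stat = chi_square_gof(training_counts, reference_shares)
# 3.841 is the 95% critical value of the chi-square distribution
# with one degree of freedom (two groups).
CRITICAL_VALUE_DF1 = 3.841
training_shift_signal = stat > CRITICAL_VALUE_DF1
print(round(stat, 2), training_shift_signal)
```
]]></preformat>
          <p>The same function could be applied to the operational data wherever the sensitive attribute is available there.</p>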
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Signals for suspicious data conditions</title>
          <p>The second class of signals tests for specific conditions in the data that give rise to potential
model problems. For instance, correlations between sensitive data and the ground-truth label
may give rise to models picking up or even amplifying historical bias. Another signal is the
existence of proxies, or the sensitive attributes having predictive power.
• Existence of proxies: there are attributes that can serve as a proxy for the sensitive
attributes.
• Training label disparity: the distribution of the given label in the training data differs
between sensitive groups.
• Informative sensitive data: the sensitive data contains information that helps to predict
the label.</p>
          <p>These signals are based upon probes computed on the training data.</p>
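          <p>Two such probes can be sketched on toy data. The choice of Cramér’s V as a measure of proxy strength is an assumption made for this example and is not stated in the text; the attribute names and values are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V between two categorical lists (0 = none, 1 = perfect)."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    x_counts, y_counts = Counter(xs), Counter(ys)
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n
            chi2 += (joint.get((x, y), 0) - expected) ** 2 / expected
    k = min(len(x_counts), len(y_counts)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


def positive_rate_by_group(groups, labels):
    """Share of positive (1) labels within each sensitive group."""
    totals, positives = Counter(groups), Counter()
    for g, y in zip(groups, labels):
        positives[g] += y
    return {g: positives[g] / totals[g] for g in totals}


# Hypothetical training data: sensitive attribute, candidate proxy, label.
sex = ["F", "F", "F", "F", "M", "M", "M", "M"]
occupation = ["a", "a", "a", "b", "b", "b", "b", "a"]
label = [0, 0, 0, 1, 1, 1, 0, 1]

proxy_strength = cramers_v(sex, occupation)      # existence of proxies
label_rates = positive_rate_by_group(sex, label)  # training label disparity
print(proxy_strength, label_rates)
```
]]></preformat>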
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.3. Signals for suspicious model conditions</title>
          <p>The third class of signals is directly related to model performance.</p>
          <p>• Poor performance: the predictive performance is low.
• Prediction disparity: the distribution of the predicted label differs between the sensitive
groups.
• Predictive error disparity: the distribution of the errors differs between the sensitive
groups.
• Non-calibration: the meaning of the label differs for the different sensitive groups.
These signals roughly correspond to the most used existing fairness measures. As we will
argue, however, the interpretation of these signals in absence of the other signals is not possible.
Without context these signals are meaningless.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. From Signals to Diagnostics</title>
        <p>The third and last step of our E2E-ST audit is the interpretation to come to a diagnosis. Consider
the following, entirely fictional example:
Example 1: When screening potential job candidates, a recruiter considers the appearance of
a candidate (professional attire) when deciding. The appraisal of the appearance, however, is
not recorded. Later, based on historical data generated by the hiring process, a model is trained
to automate the candidate selection process. Suppose now that the candidates’ attire was more
likely to positively influence the decision for female candidates. In the presence of proxies for
gender, a model will likely pick that up and exploit correlations between the proxies and the
label, leading to a biased model because of an omitted variable bias.</p>
        <p>In this case, there will be a combination of signals that together can be interpreted to come
to the right diagnosis. For instance, there will be the signals: existence of proxies, training label
disparity, prediction disparity, and likely predictive error disparity.</p>
        <p>Example 2: The same setting, but now the recruiter does not consider attire. The company,
however, has a bad reputation of not hiring females. Because of this reputation, only strong
female candidates apply. Furthermore, being aware of its bad reputation, the company starts
monitoring the acceptance rates, actively encouraging equal acceptance rates between females
and non-females. Also here, based on historical data generated by the hiring process, a model
is trained to automate the candidate selection process. This model will pick up the higher
requirements for females, but this will not result in a prediction disparity signal, because of the
combination of a self-selection bias and a historical bias.</p>
        <p>In this case, the end-to-end, socio-technical approach considering different perspectives,
including impact, will raise an alert pointing to training distribution shift and the existence of
proxies, for instance.</p>
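        <p>The step from signals to a candidate diagnosis can be sketched as a simple pattern match. The two patterns below are hypothetical and only restate the signal combinations named in Examples 1 and 2; they are not a description of the actual diagnostic logic.</p>
        <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical signal patterns, following Examples 1 and 2 in the text.
DIAGNOSTIC_PATTERNS = {
    "omitted-variable bias exploited through proxies": {
        "existence of proxies",
        "training label disparity",
        "prediction disparity",
    },
    "self-selection combined with historical bias": {
        "training distribution shift",
        "existence of proxies",
    },
}


def diagnose(raised_signals):
    """Return every candidate diagnosis whose pattern is fully raised."""
    raised = set(raised_signals)
    return [name for name, pattern in DIAGNOSTIC_PATTERNS.items()
            if pattern.issubset(raised)]


example_1 = ["existence of proxies", "training label disparity",
             "prediction disparity", "predictive error disparity"]
example_2 = ["training distribution shift", "existence of proxies"]
print(diagnose(example_1))
print(diagnose(example_2))
```
]]></preformat>
        <p>In Example 2 no prediction disparity is raised, so a check limited to model outputs would report nothing, while the combined pattern still yields a candidate diagnosis.</p>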
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Comparative overview</title>
        <p>We believe that the main contributions of our approach are, on the one hand, taking context and
impact into account, including demographic reference information and operational data. On
the other hand, we move beyond identifying probes and signals to diagnosing where the source
for bias may be located. The following table summarizes the types of data used by different
approaches, showing how our end-to-end, socio-technical auditing approach has visibility over
crucial, impact data and identifies bias sources:</p>
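        <table-wrap>
          <table>
            <thead>
              <tr>
                <th>Approach</th>
                <th>Based on training data</th>
                <th>Based on operational data collected during deployment</th>
                <th>Using impact, demographic data</th>
                <th>Combining different measures in a diagnostic phase to identify the moments of bias</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>Eticas.ai</td>
                <td>×</td>
                <td>×</td>
                <td>×</td>
                <td>×</td>
              </tr>
              <tr>
                <td>AI Fairness 360</td>
                <td>×</td>
                <td></td>
                <td></td>
                <td></td>
              </tr>
              <tr>
                <td>Aequitas</td>
                <td>×</td>
                <td></td>
                <td></td>
                <td></td>
              </tr>
              <tr>
                <td>NYC methodology</td>
                <td></td>
                <td>×</td>
                <td></td>
                <td></td>
              </tr>
            </tbody>
          </table>
        </table-wrap>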
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Eticas.ai vs other approaches: a comparative case study</title>
      <p>To show how our end-to-end, socio-technical approach compares to other proposals, we have
compared bias detection results in the Adult Census Income dataset using four distinct fairness
methodologies: Fairness 360, Aequitas, the NYC methodology, and our E2E-ST method. Our
claim is that current tools do not identify bias in impact, nor provide useful workflows for
auditing. Fairness 360 and Aequitas are toolboxes that can be used flexibly, but they do not include
guidelines or clear workflows for auditing. The closest to workflows for auditing are the demos
that are available for Fairness 360 [14] and Aequitas [15]. Therefore, we opted to include the
results produced by these demos in our analysis. The NYC methodology is named after the
description of what an audit needs to include in the NYC local law 144 [16]. This legislation
makes audits mandatory for automated employment decision tools. The law states that such a
bias audit “shall include but not be limited to the testing of an automated employment decision
tool to assess the tool’s disparate impact on persons of any component 1 category [...]”. In our
comparison we hence report disparate impact as per the “NYC methodology”.</p>
      <p>Our analysis aims to evaluate how each methodology identifies and measures bias across two
primary sensitive attributes: sex and ethnicity. Furthermore, we adopt an intersectional lens,
applying these methodologies to the intersection of sex and ethnicity wherever possible. This
approach not only aligns with recent advances in understanding discrimination and bias but
also shifts the focus from mere output analysis to assessing real-world impacts. Following the
scenario set forth in Section 3.1, we assume that two datasets are available: the training dataset
and the operational dataset, and we assume to have reference data at our disposal. As the tools
of Aequitas and Fairness 360 both require the model’s decisions and the ground-truth label,
they can only be applied on the training dataset. On the other hand, the NYC methodology
concerns the impact, post-hoc, of existing automated decision support systems, making the
operational dataset the most logical one to apply the methodology on.</p>
      <sec id="sec-4-1">
        <title>4.1. The data: the Adult Census Income Dataset</title>
        <p>To illustrate our approach, we generated a realistic scenario utilizing data from the Adult Census
Income dataset, a prominent dataset employed in machine learning for the purpose of income
prediction. This dataset originates from the 1994 US Census database and is esteemed for its
comprehensive demographic and employment information. It includes a variety of attributes
such as age, education, race, sex, income, and native country, making it an indispensable
resource for analytical studies in machine learning, particularly in the evaluation of fairness
and bias. The principal predictive target within this dataset is income, which is bifurcated into
two categories: individuals earning more than $50,000 annually and those earning less.</p>
        <p>For the purposes of simplification and to facilitate our analysis, several data transformations
were applied:
• The race variable was recoded into two categories: White and Non-White.
• The sex variable was recoded into a binary category: “Female” and “Other”.<sup>1</sup>
• Education levels were consolidated into three categories: Low, Medium, and High, to
provide a more generalized view of educational attainment.</p>
        <p>Decision tree models were then trained on this dataset using 10-fold cross-validation. In this
way, the prediction for each data instance comes from a model trained on all folds except the
one containing that instance. This labeled dataset was then split into two parts, one to represent
the training data, and the other the operational data. The reference dataset for demographic
comparisons was collected from the 2020 Census of the U.S. Census Bureau and uses national
averages for gender and ethnicity.</p>
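        <p>The out-of-fold labelling scheme described above can be sketched in a few lines. To keep the example self-contained, a majority-class predictor stands in for the decision tree, and the alternating split into training and operational halves is an assumption; the text does not state how the split was made.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def out_of_fold_predictions(features, labels, fit, predict, n_folds=10):
    """Predict every instance with a model that never saw it in training."""
    n = len(labels)
    predictions = [None] * n
    for fold in range(n_folds):
        # Instance i belongs to fold i % n_folds.
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        model = fit([features[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        for i in test_idx:
            predictions[i] = predict(model, features[i])
    return predictions


# Stand-in learner (NOT a decision tree): always predicts the
# majority label seen in its training folds.
def fit_majority(train_features, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, feature_row):
    return model


# Hypothetical toy data.
features = [[i] for i in range(20)]
labels = [1] * 14 + [0] * 6
oof = out_of_fold_predictions(features, labels, fit_majority, predict_majority)

# Split the labelled data into a "training" and an "operational" part.
rows = list(zip(features, labels, oof))
training_part, operational_part = rows[::2], rows[1::2]
print(len(training_part), len(operational_part))
```
]]></preformat>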
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Methods</title>
        <p>To ensure a robust and fair comparison, our analysis maintains consistency in both the dataset
and the model used across all methodologies. Recognizing that the potential for bias detection
may vary significantly depending on the model applied, we employ the same predictive model
across all methodologies.</p>
        <p>As a methodological note, it should be mentioned that for certain probes where the dataset is
compared against real-world data, the analysis compares demographic data from 2022, while the
census data is from 1994. We do not believe that this mismatch invalidates our results, but it
must be mentioned. We are working to further test our approach
and will continue to share results and lessons learned with the AI auditing community.
<fn id="fn1"><label>1</label><p>We acknowledge that gender is a spectrum, not a binary construct. The recoding of the sex variable into “Female”
and “Other” categories in our dataset transformations is a simplification made for analytical purposes. This approach
does not fully capture the diversity of gender identities. Our use of binary classification is constrained by the
limitations of the original dataset and is not intended to overlook or diminish the significance of non-binary and
transgender identities.</p></fn></p>
        <p>[Table: Aequitas fairness criteria (Equal Parity, Proportional Parity, False Positive Rate Parity, False Discovery Rate Parity, False Negative Rate Parity, False Omission Rate Parity) with a “Result” column; values missing from the source.]</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Comparative analysis</title>
        <sec id="sec-4-3-1">
          <title>4.3.1. Aequitas—gender</title>
        </sec>
        <sec id="sec-4-3-2">
          <title>4.3.2. Aequitas—ethnicity</title>
          <p>All criteria computed by Aequitas require the sensitive attribute. In our setting, we assume
this information to only be available in the training data.</p>
        </sec>
        <sec id="sec-4-3-3">
          <title>4.3.3. NYC Methodology</title>
        </sec>
        <sec id="sec-4-3-4">
          <title>1) GENDER (reference: male)</title>
          <p>Categories: Male, non-white; Female, white; Female, non-white. Result: values missing.</p>
          <p>The NYC methodology requires the sensitive attribute to be recorded in the operational data
but could also be executed on the training data. Although we assume the sensitive attributes
are not recorded in the operational data, we made an exception for the NYC methodology as it
explicitly requires this.</p>
          <p>4.3.4. AIF 360</p>
        </sec>
        <sec id="sec-4-3-5">
          <title>1) gender</title>
        </sec>
        <sec id="sec-4-3-6">
          <title>2) ethnicity</title>
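          <table-wrap>
            <table>
              <thead>
                <tr><th>Metric</th><th>Result</th><th>Bias</th></tr>
              </thead>
              <tbody>
                <tr><td>Statistical Parity Difference</td><td>-0.158</td><td>Bias</td></tr>
                <tr><td>Equal Opportunity Difference</td><td>-0.0522</td><td>No bias</td></tr>
                <tr><td>Average Odds Difference</td><td>0.058</td><td>No bias</td></tr>
                <tr><td>Disparate Impact</td><td>2.85</td><td>Bias</td></tr>
                <tr><td>Theil Index</td><td>0.342</td><td>No threshold</td></tr>
              </tbody>
            </table>
          </table-wrap>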
          <p>Metrics: Statistical Parity Difference, Equal Opportunity Difference, Average Odds Difference,
Disparate Impact, Theil Index. Result: N/A, N/A, N/A, N/A, N/A.</p>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Discussion</title>
        <p>Looking at the statistics computed for the model by the IBM AI Fairness 360 (AIF 360) demo, we observe that
the disparate impact (DI) measure for race shows deviations that could indicate a lack of fairness
in the model. The Statistical Parity Difference (SPD), however, does not raise a flag, although
both measures look at the acceptance rates of the ethnic groups and compare
them using either a ratio (DI) or a difference (SPD). The difference is around -7%, which means that
one ethnic group is 7 percentage points less likely to receive the positive label than the other. The
ratio is 1.56, which means that this 7-point gap puts one group’s acceptance rate at 156% of the other’s. The
error-based measures show no significant difference in the errors made by the model
between the ethnic groups. The conclusions that can be drawn from these measures, however,
are very limited. It is unclear to what extent the disparity in the predictions reflects an existing
disparity in the population, a historical bias being amplified by the model, or the
result of a selection bias where members of one of the groups were more self-selective, leading
to a higher acceptance rate.</p>
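        <p>As an illustrative sketch of why a difference-based and a ratio-based measure can disagree on the
same pair of acceptance rates, the following Python snippet computes both. The two rates are hypothetical
values chosen only so that the gap is 0.07 and the ratio 1.56; the flagging limits (0.1 for the
difference, a band of 0.8 to 1.25 for the ratio) are assumptions modelled on commonly used conventions,
and the function names are ours, not part of any toolkit.</p>
        <preformat>
```python
# Hypothetical acceptance rates, chosen so the gap is 0.07 and the ratio 1.56.
RATE_GROUP_A = 0.125  # share of group A receiving the positive label
RATE_GROUP_B = 0.195  # share of group B receiving the positive label

# Assumed flagging conventions (not taken from any tool's source code).
GAP_LIMIT = 0.10       # flag when the absolute difference exceeds this
RATIO_LOW = 0.80       # flag when the ratio falls below this
RATIO_HIGH = 1.25      # flag when the ratio rises above this


def parity_difference(rate_a, rate_b):
    """Difference in acceptance rates (group A minus group B)."""
    return rate_a - rate_b


def parity_ratio(rate_a, rate_b):
    """Ratio of acceptance rates (group B over group A)."""
    return rate_b / rate_a


def flags(rate_a, rate_b):
    """Return both measures and whether each one raises a flag."""
    gap = parity_difference(rate_a, rate_b)
    ratio = parity_ratio(rate_a, rate_b)
    gap_flag = abs(gap) > GAP_LIMIT
    ratio_flag = ratio > RATIO_HIGH or RATIO_LOW > ratio
    return gap, ratio, gap_flag, ratio_flag


if __name__ == "__main__":
    gap, ratio, gap_flag, ratio_flag = flags(RATE_GROUP_A, RATE_GROUP_B)
    print(f"difference = {gap:+.3f}, flagged: {gap_flag}")
    print(f"ratio      = {ratio:.2f}, flagged: {ratio_flag}")
```
        </preformat>
        <p>Under these assumed limits, the difference stays below its limit while the ratio leaves its band,
because a small absolute gap becomes a large relative one when both acceptance rates are low.</p>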
        <p>The Aequitas toolkit computes roughly the same statistics as the IBM AIF 360 demo,
although it further splits the error-based measures into all four categories. The conclusions
here are similar to those for AIF 360, showing a disparity in the impact of the model.
In the Aequitas tool, the negative result for the proportional parity test is deemed alarming,
with the tool stating: “If your desired outcome is to intervene proportionally on people from all races, then
you care about this criteria.” The AIF 360 tool makes no claim about whether
there is bias in the model or whether we should care about it, but lets the user choose between different
mitigation techniques, which is, at the very least, an implicit indication that something needs
to be corrected.</p>
        <p>There are issues with these error measures, however:
• Suppose that the data has different base rates for different protected communities. Then,
no matter what model is presented to the tool, either the statistical disparity of the model
will be too high, or there will be a disparity in one of the error-based measures. Indeed,
in the original data there is a difference of 19% between males receiving the positive label
and females receiving the positive label. The only way to close this gap
is for predictions to diverge from the training labels differently for males
and females: either many more females than males with a negative label receive a
positive prediction, or many more males than females with a positive label
receive a negative prediction. Hence, either the FPR or the TPR, respectively, for females
is higher than for males, triggering the other alert. For the FPR criterion, Aequitas states
that it “is important [to meet the criterion] in cases where your intervention is punitive
and has a risk of adverse outcomes for individuals.” So, there is no winner here: every
model will be deemed unfair.
• The measures are computed on the given data without questioning the data itself. Is the
data representative of the population? Or was there selection bias? Are the labels in the
training data correct? Are all attributes correctly recorded?</p>
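        <p>The base-rate argument in the first point can be checked with a small worked example. The sketch
below uses the identity that a group’s predicted positive rate equals base rate × TPR plus
(1 − base rate) × FPR. The base rates (0.30 and 0.11, giving a gap of 0.19) and the shared TPR and FPR
values are hypothetical and chosen for illustration only; the function names are ours.</p>
        <preformat>
```python
# Hypothetical base rates (share of positive labels), chosen so the gap is 0.19.
BASE_RATE_MALE = 0.30
BASE_RATE_FEMALE = 0.11


def positive_rate(base_rate, tpr, fpr):
    """Share of a group predicted positive, given its base rate, TPR and FPR."""
    return base_rate * tpr + (1.0 - base_rate) * fpr


def fpr_needed(base_rate, tpr, target_rate):
    """FPR a group would need, at a fixed TPR, to reach target_rate."""
    return (target_rate - base_rate * tpr) / (1.0 - base_rate)


if __name__ == "__main__":
    tpr, fpr = 0.60, 0.10  # identical error profile for both groups (assumed)
    male = positive_rate(BASE_RATE_MALE, tpr, fpr)
    female = positive_rate(BASE_RATE_FEMALE, tpr, fpr)
    print(f"equal TPR/FPR: male={male:.3f}, female={female:.3f}")

    # Closing the gap at equal TPR forces a different FPR for females.
    female_fpr = fpr_needed(BASE_RATE_FEMALE, tpr, male)
    print(f"FPR needed for females to match males: {female_fpr:.3f} vs {fpr:.3f}")
```
        </preformat>
        <p>With these assumed numbers, equal error rates leave a gap in predicted positive rates (0.25 against
0.155), and closing that gap at an equal TPR roughly doubles the FPR required for females (about 0.21
against 0.10), which is exactly the trade-off described above.</p>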
        <p>Tools like those described here suffer from their own techno-solutionism bias: specifically,
the belief that the problem can be solved solely by a technological intervention in the form of a
set of measures and a pre-programmed mitigation technique that makes models fair by design.</p>
        <p>We propose an alternative, end-to-end, socio-technical approach that considers not only
training data and model outcomes on this training data, but also reference distributions of the
relevant population, and operational data. This allows for a much broader analysis of the quality
of not only the model, but also the training data and operational data. Following our medical
analogy, just as a doctor does not look at one measure or a set of measures to assess a patient’s
health condition, but instead builds up a diagnosis combining different measurements to paint a
complete picture, bias assessment and auditing need to follow a similar approach.</p>
        <p>For this case study, the additional tests conducted by the team at Eticas.ai reveal the following
issues, not reported by the other tools:
• The data is not representative of the general population. Whites and males are
significantly over-represented in the training data, raising the question of potential sample bias
in the data.
• The training data labels themselves show a high disparity, even higher than the disparity
of the model. This raises concerns regarding label bias or historical bias: is the data even
suitable for learning a model, or for evaluating one?
• The sensitive attribute gender does not contribute much to the predictions of the model.
Given that there is a significant correlation between gender and the label (SPD is a form
of correlation between sensitive attributes and labels), this implies that other attributes
capture the same information with respect to the label as gender does. Any historical
gender bias in the data could hence be picked up by a model, even if it does not use
gender in its predictions.</p>
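        <p>The first two probes can be sketched as simple comparisons, shown below in Python. This is an
illustrative sketch, not the implementation used in the audit: all counts and population shares are
hypothetical, and the function names are ours. A representation ratio above 1 means a group is
over-represented relative to the reference population; the label rate difference applies the same idea as
SPD to the training labels rather than to model predictions.</p>
        <preformat>
```python
# Hypothetical counts and shares, for illustration only.
TRAINING_COUNTS = {"male": 670, "female": 330}     # records per group
REFERENCE_SHARES = {"male": 0.49, "female": 0.51}  # assumed population shares
POSITIVE_LABELS = {"male": 201, "female": 36}      # positive labels per group


def representation_ratios(counts, reference_shares):
    """Share of each group in the data divided by its share in the population."""
    total = sum(counts.values())
    return {g: (n / total) / reference_shares[g] for g, n in counts.items()}


def label_rate_difference(positives, counts, group, reference):
    """Positive-label rate of `group` minus that of `reference`."""
    return positives[group] / counts[group] - positives[reference] / counts[reference]


if __name__ == "__main__":
    ratios = representation_ratios(TRAINING_COUNTS, REFERENCE_SHARES)
    for group, ratio in ratios.items():
        print(f"{group}: representation ratio {ratio:.2f}")
    gap = label_rate_difference(POSITIVE_LABELS, TRAINING_COUNTS, "female", "male")
    print(f"label rate difference (female minus male): {gap:+.3f}")
```
        </preformat>
        <p>Comparing against a reference distribution is the step that output-only metrics skip: the same
training file can look balanced or skewed depending on which population it is meant to represent.</p>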
        <p>The combination of all these factors paints quite a worrisome picture: the data does not
correctly represent the general population, potentially leading to lower-quality models for
underrepresented groups. Additionally, the training data shows a significant label disparity
stemming from either label bias (e.g. using a biased proxy for the label) or historical bias.
Without further justification explaining the disparities, the training data should
be disqualified. Furthermore, the probes indicate a real danger of the models picking up label
bias, because gender can be predicted with high accuracy given the other attributes. This is
confirmed by both the statistical parity difference in the model’s predictions and the fact that
the sensitive attribute does not carry additional information on the label.</p>
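        <p>A proxy probe of the kind mentioned above can be illustrated with a deliberately simple
predictor: guess the sensitive attribute from one other attribute by majority vote, and compare the
result with always guessing the most common value. The sketch below is not the probe used in the audit;
the toy records, the attribute values and the function names are hypothetical, and accuracy is measured
on the same records used to build the predictor, which a real probe would avoid by holding data out.</p>
        <preformat>
```python
from collections import Counter, defaultdict

# Hypothetical toy records: (value of another attribute, sensitive attribute).
TOY_RECORDS = (
    [("Husband", "Male")] * 40
    + [("Wife", "Female")] * 15
    + [("Not-in-family", "Male")] * 25
    + [("Not-in-family", "Female")] * 20
)


def majority_baseline_accuracy(records):
    """Accuracy of always guessing the most common sensitive value."""
    counts = Counter(target for _, target in records)
    return counts.most_common(1)[0][1] / len(records)


def proxy_accuracy(records):
    """Accuracy of guessing the most common sensitive value per attribute value."""
    by_value = defaultdict(Counter)
    for value, target in records:
        by_value[value][target] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return correct / len(records)


if __name__ == "__main__":
    print(f"baseline accuracy: {majority_baseline_accuracy(TOY_RECORDS):.2f}")
    print(f"proxy accuracy:    {proxy_accuracy(TOY_RECORDS):.2f}")
```
        </preformat>
        <p>The further the proxy accuracy rises above the baseline, the more the other attribute encodes
the sensitive one, and the less protection is gained by simply dropping the sensitive column.</p>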
        <p>Overall, we show that most of what has been said about AI fairness, along with the existing tools
and approaches to bias identification and auditing, is very problematic. If we continue to see
AI bias as a technical problem with technical solutions, we will validate models that treat people
differently because of their race, gender or other protected attributes.</p>
        <p>What we suggest is to mobilise a socio-technical approach to AI inspection that incorporates
methods and insights from other fields, and specifically quantitative social sciences. By focusing
on impact, we shift the emphasis from the technical to the social, and make AI models and tools
accountable not only for their inner workings, but also for their effects on the societies they draw their
data from.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. End-to-end, socio-technical methods and the future of AI auditing</title>
      <p>This is the first part of a long-term project on how to conduct end-to-end, socio-technical AI
audits in practice. This initial work on probe comparisons in recommender/ranking systems
will be followed by further testing in different verticals. We are building libraries that allow us to
quickly establish which context data is relevant for testing and for setting thresholds. While
in this paper we have used general demographic data, we are also testing
recommender/ranking systems that may be relevant only to population subsets (patients with specific medical
pathologies, for instance). We are also working on adding new probes that add complexity to
AI auditing. We aim to further develop our auditing software to test for bias in ways that are
robust but also contextual. Our goal is to show that AI auditing can be automated without
reducing complexity and context.</p>
      <p>This first product has a very specific objective: to open a debate in the AI bias community on
how to capture impacts, and not just outputs. Our work and experience show that the AI bias
community can benefit from moving away from model-centric approaches and incorporating
context data into the metrics and methodologies developed for bias assessment. As we put
it earlier, the current focus on model metrics and output leaves out the impact of structural
discrimination, power relations and barriers to access, which should be at the core of any AI
inspection mechanism if AI auditing is to help developers, policy-makers and society at large
better understand and measure how AI systems perform and impact on society.</p>
    </sec>
  </body>
  <back>
  </back>
</article>