<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>European Workshop on Algorithmic Fairness</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Exploration of Potential New Benchmark for Fairness Evaluation in Europe</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Magali Legast</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lisa Koutsoviti-Koumeri</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yasaman Yousefi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Axel Legay</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>0</volume>
      <fpage>1</fpage>
      <lpage>03</lpage>
      <abstract>
        <p>With the increasing use of artificial intelligence systems and the associated concerns regarding automated discrimination, research in the field of fairness has grown in recent years. To evaluate their work in fair machine learning, researchers have often been using the same three datasets (Adult, COMPAS, and German credit) as benchmarks. However, those datasets each present serious limitations. In this work, we first explore what other datasets could potentially be used as replacements, specifically in a European context. We then use an experimental approach to compare Adult and COMPAS with a new candidate, Student Performance (a.k.a. Student Alcohol Consumption). Our early results highlight the scarcity of easily accessible European datasets suitable as benchmarks for fairness evaluation of problems with a positive or negative outcome, as well as the high influence dataset selection can have on experimental results.</p>
      </abstract>
      <kwd-group>
        <kwd>Fairness</kwd>
        <kwd>Datasets</kwd>
        <kwd>Machine learning</kwd>
        <kwd>Bias mitigation</kwd>
        <kwd>Classification</kwd>
        <kwd>Fair Classification</kwd>
        <kwd>Benchmark</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        With the increasing use of artificial intelligence systems, legitimate ethical and legal concerns have
been growing, including the risk that some people may be treated more negatively than others,
thus resulting in discrimination [
        <xref ref-type="bibr" rid="ref18 ref21 ref22">18, 21, 22</xref>
        ]. This problem is prevalent in machine learning,
where the training data is of paramount importance yet usually retains historical and
social biases that are then learned by the prediction models [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
      <p>
        Research in the field of fairness has been constantly growing in the past few years, with
the problem of fair classification receiving the most attention [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Many fairness metrics and
bias mitigation methods have been developed in that sub-field, but less attention has been
paid to quality benchmark datasets, especially regarding European data. A subset of only
three datasets has surpassed all others in terms of popularity [
        <xref ref-type="bibr" rid="ref12 ref17">12, 17</xref>
        ], namely Adult [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
COMPAS [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and German credit [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Those popular datasets have nevertheless been shown
to present serious limitations [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], such as, but not limited to, old age (Adult and German credit),
noisy data [
        <xref ref-type="bibr" rid="ref14 ref2 ref9">2, 9, 14</xref>
        ], label bias (COMPAS [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]), or even coding mistakes preventing retrieval of
the sensitive attribute (German credit [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]).
      </p>
      <p>
        Given these observations, the emerging consensus is that their use as benchmarks should
be avoided unless duly justified. Efforts have been made to provide alternatives (for instance a
dataset suite to replace Adult [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]), to disseminate good practices (like in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]), or to facilitate
access to other datasets (such as a large collection of fairness datasets [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] along with its search
engine [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]). Building on these works, we explore potential alternatives to the three popular yet
flawed historical benchmarks, while focusing on the European context. We also compare, with
an experimental approach, results stemming from the use of Adult and COMPAS as benchmarks
with a potential replacement, Student Performance [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. More specifically, we compare the
results of a bias mitigation method and several popular fairness metrics for models trained with
each of those three datasets.
      </p>
      <p>Our results highlight the scarcity of open access and easily accessible datasets for fairness
evaluation in a European context. Further, the analysis of those three examples confirms that
different datasets may lead to different results when evaluating bias mitigation methods and
fairness metrics. This stresses the importance of dataset selection in experiments. We also
aim to expand this work with a more thorough search for datasets and further experiments
encompassing more bias mitigation methods, fairness metrics and sensitive attributes.</p>
    </sec>
    <sec id="sec-2">
      <title>2. European Datasets</title>
      <p>As mentioned in the introduction, the datasets Adult, COMPAS and German credit have been
widely used as benchmarks in fair classification studies, even though they are not as suitable for
fair machine learning research as previously thought. All three of them contain real-life tabular
data about individuals with one or more attribute(s) recognized as sensitive/protected and a
target label that is deemed positive or negative for the individual.</p>
      <p>
        Those key characteristics for fair classification should be shared by potential replacement
datasets. Being open access also increases their appeal. We thus aim to find a potential new
benchmark that is also easily accessible, especially considering the fast-paced environment of
computer science conferences where there is often little time dedicated to the selection of
datasets. Lastly, we focus on the European context. Indeed, the worldwide influence of the
United States and differences in data privacy culture and legislation across countries have made
datasets from the USA dominant in the field, while data from other areas, including Europe,
remains less accessible. This can cause further bias given that models and results from a certain
place aren’t necessarily applicable to other geographical locations [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>Merging the positive characteristics of the popular datasets with our context of interest, we
formulate our dataset selection criteria as follows: An open access dataset with tabular data
about European subjects that is no more than 25 years old and is adapted to the problem of fair
classification leading to either a positive or negative outcome for the subject.</p>
      <p>
        To find such a dataset, we used the search engine for fairness datasets presented in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
as it is, to the best of our knowledge, the most complete collection of datasets for fair machine
learning. Its database, searchable at http://fairnessdata.dei.unipd.it/, comprises over 200
datasets for diverse domains and fairness tasks.
Filtering on tabular data and fair classification with a positive or negative outcome, we are
left with 22 datasets (as of January 29 and February 2, 2024). Out of those, fourteen datasets
contain data from the United States, two from elsewhere in the Americas, three have no mention
of localization in their description, three are European, and only one is from Asia. There are
thus no datasets referenced for Africa or Oceania. Out of the three European datasets, only
two contain data collected in this century, Dutch Census [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and Student Performance [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
(a.k.a "Student" and "Student Alcohol Consumption"). Since Dutch Census is part of the IPUMS
International collection 2 and requires approval to be accessed, only Student Performance fully
ifts our criteria. Presented in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], this dataset contains social, gender and study information
about students in two Portuguese schools for the core classes of Mathematics and Portuguese in
secondary education (high school). It has been used in a few fairness studies and is referenced
in the dataset survey [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
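      <p>As a quick illustration, the snippet below sketches how Student Performance can be loaded
and its candidate sensitive attributes and target binarized. This is a minimal sketch, assuming the
student-por.csv file from the UCI archive and the pandas library; the pass threshold of 10 on the
final grade G3 is a common binarization, not one prescribed by the dataset documentation.</p>
      <preformat>
# Minimal sketch: loading the Portuguese-subject file of Student Performance
# (https://archive.ics.uci.edu/dataset/320) and deriving binary attributes.
import pandas as pd

# The UCI archive ships student-por.csv with ';' as the field separator.
df = pd.read_csv("student-por.csv", sep=";")

# Candidate sensitive attributes: sex, and age with 18+ as the protected group.
df["is_female"] = (df["sex"] == "F").astype(int)
df["is_adult"] = (df["age"] >= 18).astype(int)

# Assumed positive outcome: passing the final grade G3 (scale 0-20, pass at 10).
df["passed"] = (df["G3"] >= 10).astype(int)

print(df[["is_female", "is_adult", "passed"]].mean())  # group and label rates
      </preformat>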
    </sec>
    <sec id="sec-3">
      <title>3. Experiment</title>
      <p>
        With this experiment (full code available at
https://github.com/Magalii/AIF360/tree/EWAF2024), we compare different models trained
with bias mitigation on Adult, COMPAS, and Student Performance. For Student Performance,
we use the Portuguese-subject version of the dataset, since it has the most instances (649).
We consider
the sensitive attributes sex and age, with students who are 18 or older as the protected group,
as in [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. We take the usual sensitive attributes sex and race for Adult and COMPAS. We did
not include German credit, as it is impossible to retrieve its protected attribute (sex), which
would make the interpretation of results misleading.
      </p>
      <p>
        For each of the considered datasets, we train several classifiers using the
in-processing bias mitigation meta-algorithm presented in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This algorithm considers a
constrained optimization problem that is approximately solved with provable guarantees. The
constraint enforces a minimal value, set by the fairness penalty parameter τ, for a chosen fairness
metric. We use the AIF360 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] implementation with Statistical Parity ratio as the fairness
constraint and gradient descent. We train different models with τ values ranging from 0 (no bias
mitigation) to 1 (constraint of perfect statistical parity). We then evaluate the performance and
fairness of the resulting models using different metrics to assess their evolution across different
constraint levels. The fairness metrics we use are Statistical Rate difference (SR) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], the most
used metric, which is based on predictions only, Equality of Opportunity (Eq. Opp) and Equalized
Odds (Eq. Odds) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], the most used metrics based both on predictions and ground truth, and
Consistency [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], the most used metric based on similarity [
        <xref ref-type="bibr" rid="ref17 ref23">23, 17</xref>
        ]. We evaluate each metric for
each classifier, then report the average over 10 folds and the corresponding confidence interval.
      </p>
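      <p>As an illustration of this procedure, the sketch below trains MetaFairClassifier models over
a grid of τ values and reports group fairness metrics. It is a minimal sketch, assuming AIF360's
AdultDataset loader (which expects the raw UCI files to be available locally) and a single
train/test split; the 10-fold averaging and confidence intervals of the experiment are omitted
for brevity.</p>
      <preformat>
# Minimal sketch: the meta-algorithm of [5] via AIF360's MetaFairClassifier
# with the statistical rate ('sr') constraint, evaluated on Adult.
import numpy as np
from aif360.datasets import AdultDataset
from aif360.algorithms.inprocessing import MetaFairClassifier
from aif360.metrics import ClassificationMetric

data = AdultDataset()                       # protected attributes: 'sex', 'race'
train, test = data.split([0.7], shuffle=True, seed=0)
privileged, unprivileged = [{"sex": 1}], [{"sex": 0}]

for tau in np.arange(0.0, 1.1, 0.1):        # tau = 0 means no bias mitigation
    clf = MetaFairClassifier(tau=tau, sensitive_attr="sex", type="sr")
    pred = clf.fit(train).predict(test)
    m = ClassificationMetric(test, pred,
                             unprivileged_groups=unprivileged,
                             privileged_groups=privileged)
    print(f"tau={tau:.1f}  SR={m.statistical_parity_difference():+.3f}  "
          f"EqOpp={m.equal_opportunity_difference():+.3f}  "
          f"EqOdds={m.average_odds_difference():+.3f}  "
          f"acc={m.accuracy():.3f}")
      </preformat>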
      <p>Figure 1 shows the results for the different models, each represented by the fairness
penalty parameter τ it was trained with. Results for Adult and COMPAS with sensitive attribute
race are close to those with sex and are not presented here due to space restrictions.</p>
      <p>Let us first note that Consistency, an individual fairness metric, is not impacted by the
Statistical Parity mitigation, which is based on group fairness. We thus focus only on the three
other metrics, related to group fairness, for the remainder of this section.</p>
      <p>[Figure 1: Performance and fairness metrics of the models trained with each value of the
fairness penalty parameter τ. (a) Adult dataset with sensitive attribute sex (Women protected,
Men privileged); (b) COMPAS dataset with sensitive attribute sex (Men protected, Women
privileged); (c) Student dataset with sensitive attribute sex (Girls protected, Boys privileged);
(d) Student dataset with sensitive attribute age (≥ 18 protected, &lt; 18 privileged).]</p>
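      <p>For reference, the group fairness differences discussed below can be written as follows.
This is a standard textbook formulation added here in our own notation (not reproduced from
the original figure), where Ŷ is the prediction, Y the ground truth, and prot/priv denote the
protected and privileged groups; a value of 0 indicates parity.</p>
      <preformat>
\mathrm{SR\ diff.} = P(\hat{Y}=1 \mid \mathrm{prot}) - P(\hat{Y}=1 \mid \mathrm{priv})
\mathrm{Eq.\ Opp\ diff.} = P(\hat{Y}=1 \mid Y=1, \mathrm{prot}) - P(\hat{Y}=1 \mid Y=1, \mathrm{priv})
\mathrm{Eq.\ Odds:} \text{ the analogous differences taken over both } Y=1 \text{ and } Y=0
      </preformat>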
      <p>In Figure 1a, we see that the model trained on Adult without bias mitigation (τ = 0) is
already very close to fairness according to all metrics. Bias mitigation still induces a general
improvement for all group fairness metrics. Eq. Opp indicates that the protected group (Women)
becomes the one with a slightly higher true positive rate when the fairness constraint is greater
than or equal to 0.7. We also note a significant drop in accuracy and increase of the F1-score as
soon as bias mitigation is introduced, but only marginal changes after that.</p>
      <p>In Figure 1b, for models trained on COMPAS, there is first an overall decrease in fairness for
all group metrics from τ = 0.1 to τ = 0.7. Fairness then increases again for higher values of τ,
which coincides with a significant drop in accuracy, while the F1-score remains steady.</p>
      <p>In Figure 1c, the unconstrained model trained on Student Performance with sex as protected
attribute is near perfect fairness according to SR and Eq. Opp. The constrained models never
surpass these values. Eq. Odds shows more bias than both SR and Eq. Opp for all models but
one. This metric is mitigated less than for Adult, even though its initial value was higher.</p>
      <p>In Figure 1d, the model trained on Student Performance with age as sensitive attribute and
no bias mitigation shows significant bias according to SR. This bias is efficiently mitigated,
even with the lowest level of fairness constraint. Bias reported by Eq. Opp and Eq. Odds is
very low even before bias mitigation. Except for a brief increase when τ is 0.1, their bias level is
reduced, with Eq. Opp extremely close to equality.</p>
      <p>Overall, even though the same algorithm has been used for the training and bias mitigation
of all of these models, the results vary significantly across the different datasets studied.
Additionally, for Adult and COMPAS the overall tendencies are similar for the two sensitive
attributes considered (sex and race), but we see a very significant difference when considering
sex or age for Student Performance.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>
        On the one hand, the search for new potential benchmarks highlights the scarcity of European
datasets for use in fair classification with a positive or negative outcome. Indeed, out of the over
200 datasets referenced in the search engine used [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], only Student Performance [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] fits our
selection criteria. This dataset shows very little to no bias with regard to the sensitive attribute
sex, at least regarding the most common fairness definitions, strongly reducing its interest when
studying fairness-related problems. The other most common sensitive attribute considered for
this dataset is age. However, being an older student is most often a direct result of past failures,
which is itself usually deemed an appropriate criterion to predict future exam results. It is thus
questionable whether this attribute should be considered protected or not. Other attributes
we have not explored here could also be of interest. For example, attributes related to alcohol
consumption are studied in [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], which also extends the discussion to label bias.
      </p>
      <p>So, despite the existing efforts to mitigate the collective data documentation debt and offer
new alternatives to Adult, COMPAS and German credit, there is still a need to bring forward
new European datasets, as well as data from other underrepresented continents. This may
include improving the visibility and centralization of existing datasets or collecting new data
adapted to fairness-related questions.</p>
      <p>
        On the other hand, our study illustrates that the same procedure applied to different datasets
may lead to significantly different results, which is congruent with the results in [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. This
showcases the importance the choice of dataset can have on fairness evaluation and on the
results presented. We thus recommend using several different datasets since they may lead to
varying results, looking beyond open access data if needed, and avoiding broad claims
based on few examples, which echoes some of the recommendations in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. We also
encourage researchers to consider the selection of data not as a minor step, but as a meaningful
part of the research, and to justify the choices made in that regard.
      </p>
      <p>Beyond this discussion, we aim to expand this work to study more datasets, including
non-open-access data, as well as more bias mitigation methods, fairness metrics and sensitive
attributes.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Michelle</given-names>
            <surname>Bao</surname>
          </string-name>
          , Angela Zhou, Samantha Zottola, Brian Brubach, Sarah Desmarais, Aaron Horowitz, Kristian Lum, and
          <string-name>
            <given-names>Suresh</given-names>
            <surname>Venkatasubramanian</surname>
          </string-name>
          .
          <article-title>It's COMPASlicated: The messy relationship between RAI datasets and algorithmic fairness benchmarks</article-title>
          .
          <source>ArXiv</source>
          ,
          <year>2021</year>
          . URL https://arxiv.org/abs/2106.05498.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Matias</given-names>
            <surname>Barenstein</surname>
          </string-name>
          .
          <article-title>Propublica's compas data revisited</article-title>
          .
          <source>ArXiv</source>
          ,
          <year>2019</year>
          . URL http://arxiv.org/abs/1906.04711.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Barry</given-names>
            <surname>Becker</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ronny</given-names>
            <surname>Kohavi</surname>
          </string-name>
          .
          <source>Adult. UCI Machine Learning Repository</source>
          ,
          <year>1996</year>
          . URL https://doi.org/10.24432/C5XW20.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Rachel</surname>
            <given-names>K. E.</given-names>
          </string-name>
          <string-name>
            <surname>Bellamy</surname>
            , Kuntal Dey, Michael Hind, Samuel C. Hofman, Stephanie Houde, Kalapriya Kannan, Pranay Lohia, Jacquelyn Martino, Sameep Mehta, Aleksandra Mojsilovic, Seema Nagar, Karthikeyan Natesan Ramamurthy, John Richards, Diptikalyan Saha, Prasanna Sattigeri, Moninder Singh,
            <given-names>Kush R.</given-names>
          </string-name>
          <string-name>
            <surname>Varshney</surname>
          </string-name>
          , and Yunfeng Zhang.
          <article-title>AI Fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias</article-title>
          .
          <source>ArXiv</source>
          ,
          <year>2018</year>
          . URL https://arxiv.org/abs/1810.01943.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Elisa</surname>
          </string-name>
          <string-name>
            <surname>Celis</surname>
          </string-name>
          , Lingxiao Huang,
          <string-name>
            <given-names>Vijay</given-names>
            <surname>Keswani</surname>
          </string-name>
          , and Nisheeth K. Vishnoi.
          <article-title>Classification with fairness constraints: A meta-algorithm with provable guarantees</article-title>
          .
          <source>In Proceedings of the conference on fairness, accountability, and transparency</source>
          , pages
          <fpage>319</fpage>
          -
          <lpage>328</lpage>
          ,
          <year>2019</year>
          . URL https://doi.org/10.1145/3287560.3287586.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Minnesota</given-names>
            <surname>Population Center</surname>
          </string-name>
          .
          <source>Integrated Public Use Microdata Series, International: Version 6.4 [2001 Dutch census]</source>
          . Minneapolis: University of Minnesota,
          <year>2015</year>
          . URL http://doi.org/10.18128/D020.V6.4.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Paulo</given-names>
            <surname>Cortez</surname>
          </string-name>
          .
          <article-title>Student performance</article-title>
          .
          <source>UCI Machine Learning Repository</source>
          ,
          <year>2008</year>
          . URL https://archive.ics.uci.edu/dataset/320.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Paulo</given-names>
            <surname>Cortez</surname>
          </string-name>
          and Alice Maria Gonçalves Silva.
          <article-title>Using data mining to predict secondary school student performance</article-title>
          .
          <source>In Proceedings of 5th Annual Future Business Technology Conference, Porto</source>
          , pages
          <fpage>5</fpage>
          -
          <lpage>12</lpage>
          . EUROSIS-ETI,
          <year>2008</year>
          . URL https://repositorium.sdum.uminho.pt/handle/1822/8024.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Frances</given-names>
            <surname>Ding</surname>
          </string-name>
          , Moritz Hardt, John Miller, and
          <string-name>
            <given-names>Ludwig</given-names>
            <surname>Schmidt</surname>
          </string-name>
          .
          <article-title>Retiring adult: New datasets for fair machine learning</article-title>
          .
          <source>ArXiv</source>
          ,
          <year>2022</year>
          . URL http://arxiv.org/abs/2108.04884.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Julia</given-names>
            <surname>Dressel</surname>
          </string-name>
          and
          <string-name>
            <given-names>Hany</given-names>
            <surname>Farid</surname>
          </string-name>
          .
          <article-title>The accuracy, fairness, and limits of predicting recidivism</article-title>
          .
          <source>Science Advances</source>
          ,
          <volume>4</volume>
          ,
          <year>2018</year>
          . URL https://www.science.org/doi/10.1126/sciadv.aao5580.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Cynthia</surname>
            <given-names>Dwork</given-names>
          </string-name>
          , Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel.
          <article-title>Fairness through awareness</article-title>
          .
          <source>In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, ITCS '12</source>
          , pages
          <fpage>214</fpage>
          -
          <lpage>226</lpage>
          . Association for Computing Machinery,
          <year>2012</year>
          . URL https://doi.org/10.1145/2090236.2090255.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Alessandro</surname>
            <given-names>Fabris</given-names>
          </string-name>
          , Stefano Messina, Gianmaria Silvello, and Gian Antonio Susto.
          <article-title>Algorithmic fairness datasets: the story so far</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          ,
          <volume>36</volume>
          (
          <issue>6</issue>
          ):
          <fpage>2074</fpage>
          -
          <lpage>2152</lpage>
          ,
          <year>2022</year>
          . URL https://link.springer.com/10.1007/s10618-022-00854-z.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Alessandro</surname>
            <given-names>Fabris</given-names>
          </string-name>
          , Fabio Giachelle, Alberto Piva, Gianmaria Silvello, and Gian Antonio Susto.
          <article-title>A search engine for algorithmic fairness datasets</article-title>
          .
          <source>In Proceedings of the 2nd European Workshop on Algorithmic Fairness. CEUR Workshop Proceedings</source>
          ,
          <year>2023</year>
          . URL https://ceur-ws.org/Vol-3442/paper-08.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Ulrike</given-names>
            <surname>Grömping</surname>
          </string-name>
          .
          <article-title>South german credit data: Correcting a widely used data set</article-title>
          .
          <source>Technical report</source>
          , Department II (Mathematics, Physics and Chemistry), Beuth University of Applied Sciences, Berlin, Germany,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Moritz</surname>
            <given-names>Hardt</given-names>
          </string-name>
          , Eric Price, and
          <string-name>
            <given-names>Nathan</given-names>
            <surname>Srebro</surname>
          </string-name>
          .
          <article-title>Equality of opportunity in supervised learning</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , volume
          <volume>29</volume>
          ,
          <year>2016</year>
          . URL https://proceedings.neurips.cc/paper_files/paper/2016/file/9d2682367c3935defcb1f9e247a97c0d-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Hans</given-names>
            <surname>Hofmann</surname>
          </string-name>
          .
          <article-title>Statlog (German Credit Data)</article-title>
          .
          <source>UCI Machine Learning Repository</source>
          ,
          <year>1994</year>
          . URL https://doi.org/10.24432/C5NC77.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Max</surname>
            <given-names>Hort</given-names>
          </string-name>
          , Zhenpeng Chen,
          <string-name>
            <surname>Jie</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
            , Mark Harman, and
            <given-names>Federica</given-names>
          </string-name>
          <string-name>
            <surname>Sarro</surname>
          </string-name>
          .
          <article-title>Bias mitigation for machine learning classifiers: A comprehensive survey</article-title>
          .
          <volume>1</volume>
          (
          <issue>2</issue>
          ):
          <fpage>11:1</fpage>
          -
          <lpage>11:52</lpage>
          . URL https://doi.org/10.1145/3631326.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Jon</surname>
            <given-names>Kleinberg</given-names>
          </string-name>
          , Jens Ludwig, Sendhil Mullainathan, and
          <string-name>
            <given-names>Ashesh</given-names>
            <surname>Rambachan</surname>
          </string-name>
          .
          <article-title>Algorithmic fairness</article-title>
          .
          <source>In AEA Papers and Proceedings</source>
          , volume
          <volume>108</volume>
          , pages
          <fpage>22</fpage>
          -
          <lpage>27</lpage>
          ,
          <year>2018</year>
          . URL https://pubs.aeaweb.org/doi/10.1257/pandp.20181018.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Tai</given-names>
            <surname>Le</surname>
          </string-name>
          <string-name>
            <surname>Quy</surname>
          </string-name>
          , Arjun Roy, Vasileios Iosifidis, Wenbin Zhang, and
          <string-name>
            <given-names>Eirini</given-names>
            <surname>Ntoutsi</surname>
          </string-name>
          .
          <article-title>A survey on datasets for fairness-aware machine learning</article-title>
          .
          <source>WIREs Data Mining and Knowledge Discovery</source>
          ,
          <volume>12</volume>
          (
          <issue>3</issue>
          ):e1452,
          <year>2022</year>
          . URL https://doi.org/10.1002/widm.1452.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Daphne</given-names>
            <surname>Lenders</surname>
          </string-name>
          and
          <string-name>
            <given-names>Toon</given-names>
            <surname>Calders</surname>
          </string-name>
          .
          <article-title>Real-life performance of fairness interventions - introducing a new benchmarking dataset for fair ML</article-title>
          .
          <source>In Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing</source>
          , pages
          <fpage>350</fpage>
          -
          <lpage>357</lpage>
          . ACM,
          <year>2023</year>
          . URL https://dl.acm.org/doi/10.1145/3555776.3577634.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Ninareh</surname>
            <given-names>Mehrabi</given-names>
          </string-name>
          , Fred Morstatter,
          <string-name>
            <given-names>Nripsuta</given-names>
            <surname>Saxena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Lerman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Aram</given-names>
            <surname>Galstyan</surname>
          </string-name>
          .
          <article-title>A survey on bias and fairness in machine learning</article-title>
          .
          <source>ACM Computing Surveys</source>
          ,
          <volume>54</volume>
          (
          <issue>6</issue>
          ):
          <fpage>115:1</fpage>
          -
          <lpage>115:35</lpage>
          ,
          <year>2021</year>
          . URL https://doi.org/10.1145/3457607.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22] Shira Mitchell, Eric Potash, Solon Barocas, Alexander D'Amour, and
          <string-name>
            <given-names>Kristian</given-names>
            <surname>Lum</surname>
          </string-name>
          .
          <article-title>Algorithmic fairness: Choices, assumptions, and definitions</article-title>
          .
          <source>Annual Review of Statistics and Its Application</source>
          ,
          <volume>8</volume>
          (
          <issue>1</issue>
          ):
          <fpage>141</fpage>
          -
          <lpage>163</lpage>
          ,
          <year>2021</year>
          . URL https://doi.org/10.1146/annurev-statistics-042720-125902.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Sahil</given-names>
            <surname>Verma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Julia</given-names>
            <surname>Rubin</surname>
          </string-name>
          .
          <article-title>Fairness definitions explained</article-title>
          .
          <source>In Proceedings of the International Workshop on Software Fairness</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . IEEE, ACM,
          <year>2018</year>
          . URL https://dl.acm.org/doi/10.1145/3194770.3194776.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Richard</surname>
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Zemel</surname>
            , Ledell Yu Wu, Kevin Swersky, Toniann Pitassi, and
            <given-names>Cynthia</given-names>
          </string-name>
          <string-name>
            <surname>Dwork</surname>
          </string-name>
          .
          <article-title>Learning fair representations</article-title>
          .
          <source>In International Conference on Machine Learning. PMLR</source>
          ,
          <year>2013</year>
          . URL https://proceedings.mlr.press/v28/zemel13.html.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>