=Paper=
{{Paper
|id=Vol-3908/paper_31
|storemode=property
|title=Unveiling the Blindspots: Examining Availability and Usage of Protected Attributes in Fairness Datasets
|pdfUrl=https://ceur-ws.org/Vol-3908/paper_31.pdf
|volume=Vol-3908
|authors=Jan Simson,Alessandro Fabris,Christoph Kern
|dblpUrl=https://dblp.org/rec/conf/ewaf/SimsonF024
}}
==Unveiling the Blindspots: Examining Availability and Usage of Protected Attributes in Fairness Datasets==
Jan Simson¹,², Alessandro Fabris³ and Christoph Kern¹,²
¹ LMU Munich, Ludwigstr. 33, 80809 München, Germany
² Munich Center for Machine Learning (MCML), Oettingenstraße 67, 80538 München, Germany
³ Max Planck Institute for Security and Privacy, Universitätsstraße 140, 44799 Bochum, Germany
Abstract
This work examines the representation of protected attributes across tabular datasets used in algorithmic
fairness research. Drawing from international human rights and anti-discrimination laws, we compile
a set of protected attributes and investigate both their availability and usage in the literature. Our
analysis reveals a significant underrepresentation of certain attributes in datasets that is exacerbated
by a strong focus on race and sex in dataset usage. We identify a geographical bias towards the
Global North, particularly North America, potentially limiting the applicability of fairness detection
and mitigation strategies in less-represented regions. The study exposes critical blindspots in fairness
research, highlighting the need for a more inclusive and representative approach to data collection and
usage in the field. We propose a shift away from a narrow focus on a small number of datasets and
advocate for initiatives aimed at sourcing more diverse and representative data.
Keywords
critical data studies, dataset usage, protected groups, generalization
1. Introduction
Algorithmic fairness has become a significant area of research in recent years, with a growing
body of work aimed at addressing bias and discrimination in machine learning systems. Identifying and mitigating harmful practices against vulnerable individuals and groups in prediction algorithms lies at the core of this field, and studying these issues requires adequate and nuanced data sources.
In this work, we examine datasets and how they are used within the fairness literature. We
present an overview of attributes which are protected by anti-discrimination legislation across
multiple continents and study their availability in datasets and usage in fairness research. We
identify issues regarding the diversity of protected attributes represented in datasets and their
geographic representativeness, highlighting how populations are neglected in the literature.
EWAF’24: European Workshop on Algorithmic Fairness, July 01–03, 2024, Mainz, Germany
⋆ This is an extended abstract of [1], published at FAccT 2024.
jan.simson@lmu.de (J. Simson); alessandro.fabris@mpi-sp.org (A. Fabris); christoph.kern@lmu.de (C. Kern)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
2. Methodology
For this work, we collected and manually annotated the usage of tabular datasets in fair classification tasks. We built on top of a comprehensive survey of fairness datasets by Fabris et al. [2], leveraging the same inclusion criteria, and focus on tabular datasets used for fair classification due to their important role in the fairness literature [2, 3]. We study the use of tabular datasets (N = 36) across 142 articles. Since datasets appear in multiple publications and most publications use multiple datasets, the total number of dataset and publication combinations examined was N = 280, with n = 233 instances providing sufficient information to reconstruct (or reasonably guess) protected attribute usage.
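To illustrate how such an annotation table translates into the counts reported here, the following is a minimal sketch in Python using pandas. It assumes a hypothetical table with one row per dataset-publication pair and an `attributes_used` column; the column names and toy records are our own illustration, not the authors' actual annotation schema.

```python
# Minimal, hypothetical sketch of how annotated dataset-publication pairs
# could be aggregated; column names and records are illustrative only.
import pandas as pd

# One row per (dataset, publication) pair; `attributes_used` lists the
# protected attributes the publication used from that dataset
# (None = usage not reported clearly enough to reconstruct).
annotations = pd.DataFrame([
    {"dataset": "Adult", "publication": "paper_001", "attributes_used": ["sex", "race"]},
    {"dataset": "COMPAS", "publication": "paper_001", "attributes_used": ["race"]},
    {"dataset": "German Credit", "publication": "paper_002", "attributes_used": ["age", "sex"]},
    {"dataset": "COMPAS", "publication": "paper_003", "attributes_used": None},
])

n_combinations = len(annotations)                                 # analogous to N = 280
n_reconstructable = annotations["attributes_used"].notna().sum()  # analogous to n = 233

# Usage counts per protected attribute, as in the right panel of Figure 1.
usage_counts = (
    annotations.dropna(subset=["attributes_used"])
    .explode("attributes_used")["attributes_used"]
    .value_counts()
)
print(n_combinations, n_reconstructable)
print(usage_counts)
```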
To define protected attributes, we draw from domain-specific legislation and human rights
law. We define as protected all attributes which are explicitly mentioned as prohibited drivers of
discrimination and inequality. For example, Article 21 of the Charter of Fundamental Rights of
the European Union states “Any discrimination based on any ground such as sex, race, colour,
ethnic or social origin, genetic features, language, religion or belief, political or any other
opinion, membership of a national minority, property, birth, disability, age or sexual orientation
shall be prohibited” [4].
We try to address the Global North and especially U.S. focus in AI ethics and fairness research [5, 6, 7] by including, besides works from North America (the American Declaration of the Rights and Duties of Man [13], US fair lending legislation [14]), works from the European Union (Charter of Fundamental Rights of the European Union [4], EU legislation on fair hiring [8]), from other continents (the African Charter on Human and Peoples' Rights [9], the Arab Charter on Human Rights [10], the ASEAN Declaration of Human Rights [11]) and global works (the Universal Declaration of Human Rights [12]). However, we acknowledge that we cannot mitigate the Global North bias completely, given the strong presence of these regions in research.
Drawing from this literature, we provide a shallow categorization of protected attributes,
identifying seven main categories (Table 1). It is worth noting that this is not a complete
categorization of all protected attributes around the globe and across sectors. Rather, our
categorization aims to guide an inclusive discussion of algorithmic fairness research through
the lens of protected attributes.
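To make the categorization concrete, the following is a minimal, purely illustrative sketch (in Python) of the seven categories and the attributes they group, mirroring Table 1; the mapping and the helper function are our own illustration and not part of the paper's released material.

```python
# Illustrative mapping of the seven categories of protected attributes to the
# attributes they group, mirroring Table 1 (not the authors' released code).
from typing import Optional

PROTECTED_ATTRIBUTES: dict[str, list[str]] = {
    "Gender and Sexual Identity": ["sex", "sexual orientation", "gender"],
    "Racial and Ethnic Origin": [
        "race", "color", "ethnic origin", "national origin",
        "language", "national minority",
    ],
    "Socioeconomic Status": ["social origin", "property", "recipient of public assistance"],
    "Religion, Belief and Opinion": ["religion", "political opinion", "other opinion"],
    "Family": ["birth", "familial status", "marital status"],
    "Disability and Health Conditions": ["disability", "genetic features"],
    "Age": ["age"],
}

def category_of(attribute: str) -> Optional[str]:
    """Return the category an attribute belongs to, or None if it is not listed."""
    for category, attributes in PROTECTED_ATTRIBUTES.items():
        if attribute.lower() in attributes:
            return category
    return None
```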
3. Results
The geographical provenance of datasets used in the examined literature is clearly skewed.
Among datasets from a single continent, 21 come from North America, 5 from Europe, 2 from
Asia, and 2 from South America, confirming a Global North and especially North American
dominance in AI ethics research [5, 6, 7]. Since fairness is highly contextual, there is a risk
that the fairness detection and mitigation strategies built by this research community will not
transfer and, therefore, underserve neglected geographical areas [15].
We further notice a highly uneven distribution of both the availability and usage of protected
attributes. The left bar chart in Figure 1 depicts protected attributes available in fairness datasets
and the right chart their usage in the examined literature. There is a particular focus in both
availability (n=17) and usage (n=167) on race as a protected attribute.
Table 1
Protected attributes under global anti-discrimination legislation. Attributes considered protected
under international human rights works and anti-discrimination law. We report a tick (✓) when the
literal phrasing (in the original law or official clarifications) matches the row header and report a tilde
(∼) if a similar concept is present, but wording is different.
Columns, in order: UN Charter [12], African Charter [9], Arab Charter [10], ASEAN Declaration [11], American Declaration [13], US Fair Lending [14], EU Charter [4], EU Fair Hiring [8].
Gender and Sexual Identity
Sex ✓ ✓ ✓ ✓ ✓ ✓ ✓
Sexual orientation ✓ ✓ ✓
Gender ✓ ∼ ∼
Racial and Ethnic Origin
Race ✓ ✓ ✓ ✓ ✓ ✓ ✓ ∼
Color ✓ ✓ ✓ ✓ ✓
Ethnic origin ∼ ∼ ✓ ✓
National origin ✓ ✓ ✓ ✓ ∼ ∼
Language ✓ ✓ ✓ ✓ ✓ ✓
National minority ✓
Socioeconomic Status
Social origin ✓ ✓ ✓ ✓ ✓
Property ✓ ∼ ∼ ∼ ✓
Recipient of public assistance ✓
Religion, Belief and Opinion
Religion ✓ ✓ ∼ ✓ ∼ ✓ ∼ ∼
Political opinion ✓ ✓ ✓ ✓
Other opinion ✓ ✓ ∼ ✓ ✓
Family
Birth ∼ ∼ ✓ ✓ ✓
Familial status ✓
Marital status ✓
Disability and Health Conditions
Disability ✓ ✓ ✓ ✓ ✓
Genetic features ✓
Age
Age ✓ ✓ ✓ ✓
On the other hand, attributes about religion, belief and opinion are entirely missing on both sides. Information on
disability and health conditions is also infrequently available (𝑛 = 3) and never used in the
surveyed literature. Socioeconomic status descriptors are more commonly available yet often
neglected. This threatens the applicability of research findings across contexts, as information
on race, for example, is hardly available in EU data [16].¹

¹ There was also a small number of protected attributes used in the literature but not referenced in legislation, such as employment status, alcohol consumption, neighborhood, body-mass index, and profession.

Figure 1: There is a stark difference between attributes considered protected under international legislation and their availability and usage in datasets. Bar charts display the availability in datasets (left) and the usage in the literature (right) of protected attributes for all categories of protected attributes in Table 1. The values labeled in the figure are:

Attribute                           Availability   Usage
Sex                                 14             96
Sexual orientation                  –              –
Gender                              8              27
Race                                17             167
Color                               –              –
Ethnic origin                       8              4
National origin                     4              4
Language                            3              1
National minority                   –              –
Social origin                       –              –
Property                            8              2
Recipient of public assistance      3              –
Religion                            –              –
Political opinion                   –              –
Other opinion                       –              –
Birth status                        1              –
Familial status                     8              –
Marital status                      7              8
Disability                          3              –
Genetic features                    –              –
Age                                 21             33
(– indicates that no value is labeled in the figure.)

4. Discussion
We unveil blindspots in fairness research, demonstrating a neglect of vulnerable subpopulations in the literature. At the conference, we will further present additional results from our data collection, indicating other troubling practices in the field, such as a lack of sufficient reporting
of dataset usage, which impacts reproducibility, and potentially harmful practices in the processing of protected attributes, which lead to a neglect of minorities.
While valid reasons exist against the collection of protected data [17], motivating, e.g., the line of work on fairness under unawareness [14, 18], we believe they are not sufficient to explain the observed lack of usage of particular attributes. We observe a clear trend towards certain protected attributes being more readily available in datasets, which is further amplified by a strong tendency of papers to (1) repeatedly focus on the same small number of datasets and (2) rely especially on race and sex as protected attributes. It is worth noting that this trend extends to fairness research more broadly, including qualitative studies. These practices also tend to self-reinforce, increasing the likelihood that future research conforms. Recent articles published at fairness conferences, such as FAccT (the ACM Conference on Fairness, Accountability, and Transparency) and AIES (the AAAI/ACM Conference on Artificial Intelligence, Ethics and Society), for example, mention race and gender an order of magnitude more frequently than religion, disability, socioeconomic status, and sexual orientation [19].
We argue for a research roadmap to tackle these issues within the complex social, legal and technical landscape in which they reside (as advocated, for example, by Guo et al. [20]). In particular, we propose a move away from focusing exclusively on a small number of datasets [2], such as Adult, German Credit or COMPAS. Instead, we suggest an increased focus on using a diverse set of datasets and on sourcing more representative data to fill the gaps in what is currently available. We call for dedicated initiatives, for example data donation campaigns and citizen science projects, capable of filling this gap and of responsibly handling the collected data. We refer readers to the full paper [1] for a more nuanced discussion. A list of datasets and their protected attributes, as well as further analyses, are available on GitHub.
References
[1] J. Simson, A. Fabris, C. Kern, Lazy data practices harm fairness research, in: FAccT ’24:
2024 ACM Conference on Fairness, Accountability, and Transparency, Rio de Janeiro,
Brazil, June 3-6, 2024, ACM, 2024. URL: https://doi.org/10.1145/3630106.3658931.
doi:10.1145/3630106.3658931.
[2] A. Fabris, S. Messina, G. Silvello, G. A. Susto, Algorithmic fairness datasets: the story
so far, Data Min. Knowl. Discov. 36 (2022) 2074–2152. URL: https://doi.org/10.1007/
s10618-022-00854-z. doi:10.1007/S10618-022-00854-Z.
[3] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, A. Galstyan, A survey on bias and
fairness in machine learning, ACM Comput. Surv. 54 (2022) 115:1–115:35. URL: https:
//doi.org/10.1145/3457607. doi:10.1145/3457607.
[4] European Union, Charter of fundamental rights of the European Union, C 364/01, 2000.
https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32000X1218%2801%29.
[5] C. T. Okolo, N. Dell, A. Vashistha, Making AI explainable in the Global South: A systematic
review, in: ACM SIGCAS/SIGCHI Conf. on Computing and Sustainable Societies
(COMPASS), 2022, pp. 439–452.
[6] C. Roche, D. Lewis, P. Wall, Artificial intelligence ethics: An inclusive global discourse?,
arXiv preprint arXiv:2108.09959 (2021).
[7] A. A. Septiandri, M. Constantinides, M. Tahaei, D. Quercia, WEIRD FAccTs: How western,
educated, industrialized, rich, and democratic is FAccT?, in: Proceedings of the 2023 ACM
Conference on Fairness, Accountability, and Transparency, FAccT 2023, Chicago, IL, USA,
June 12-15, 2023, ACM, 2023, pp. 160–171. URL: https://doi.org/10.1145/3593013.3593985.
doi:10.1145/3593013.3593985.
[8] A. Fabris, N. Baranowska, M. J. Dennis, D. Graus, P. Hacker, J. Saldivar, F. Z. Borgesius,
A. J. Biega, Fairness and bias in algorithmic hiring: a multidisciplinary survey (2024).
[9] Organisation of African Unity, African charter on human and peoples’ rights,
1981. https://au.int/sites/default/files/treaties/36390-treaty-0011_-_african_charter_on_
human_and_peoples_rights_e.pdf.
[10] Council of the League of Arab States, Arab charter on human rights, 2004.
[11] Association of Southeast Asian Nations, ASEAN declaration of human rights, 2012. https:
//asean.org/asean-human-rights-declaration/.
[12] United Nations, Universal declaration of human rights, 1948. https://www.un.org/en/
about-us/universal-declaration-of-human-rights.
[13] Organization of American States, American declaration of the rights
and duties of man, 1948. https://www.oas.org/en/iachr/mandate/Basics/
american-declaration-rights-duties-of-man.pdf.
[14] J. Chen, N. Kallus, X. Mao, G. Svacha, M. Udell, Fairness under unawareness: Assessing
disparity when protected class is unobserved, in: danah boyd, J. H. Morgenstern (Eds.),
Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* 2019,
Atlanta, GA, USA, January 29-31, 2019, ACM, 2019, pp. 339–348. URL: https://doi.org/10.
1145/3287560.3287594. doi:10.1145/3287560.3287594.
[15] N. Sambasivan, E. Arnesen, B. Hutchinson, T. Doshi, V. Prabhakaran, Re-imagining
algorithmic fairness in India and beyond, in: M. C. Elish, W. Isaac, R. S. Zemel (Eds.), FAccT
’21: 2021 ACM Conference on Fairness, Accountability, and Transparency, Virtual Event /
Toronto, Canada, March 3-10, 2021, ACM, 2021, pp. 315–328. URL: https://doi.org/10.1145/
3442188.3445896. doi:10.1145/3442188.3445896.
[16] S. Jaime, C. Kern, Ethnic classifications in algorithmic fairness: Concepts, measures and
implications in practice, in: The 2024 ACM Conference on Fairness, Accountability, and
Transparency, FAccT ’24, Association for Computing Machinery, New York, NY, USA,
2024, p. 237–253. URL: https://doi.org/10.1145/3630106.3658902. doi:10.1145/3630106.
3658902.
[17] M. Andrus, E. Spitzer, J. Brown, A. Xiang, What we can’t measure, we can’t understand:
Challenges to demographic data procurement in the pursuit of fairness, in: M. C. Elish,
W. Isaac, R. S. Zemel (Eds.), FAccT ’21: 2021 ACM Conference on Fairness, Accountability,
and Transparency, Virtual Event / Toronto, Canada, March 3-10, 2021, ACM, 2021, pp. 249–
260. URL: https://doi.org/10.1145/3442188.3445888. doi:10.1145/3442188.3445888.
[18] A. Fabris, A. Esuli, A. Moreo, F. Sebastiani, Measuring fairness under unawareness of
sensitive attributes: A quantification-based approach, J. Artif. Intell. Res. 76 (2023) 1117–
1180. URL: https://doi.org/10.1613/jair.1.14033. doi:10.1613/JAIR.1.14033.
[19] A. Birhane, E. Ruane, T. Laurent, M. S. Brown, J. Flowers, A. Ventresque, C. L. Dancy,
The forgotten margins of AI ethics, in: FAccT ’22: 2022 ACM Conference on Fairness,
Accountability, and Transparency, Seoul, Republic of Korea, June 21 - 24, 2022, ACM,
2022, pp. 948–958. URL: https://doi.org/10.1145/3531146.3533157. doi:10.1145/3531146.
3533157.
[20] A. Guo, E. Kamar, J. W. Vaughan, H. M. Wallach, M. R. Morris, Toward fairness in AI for
people with disabilities: a research roadmap, ACM SIGACCESS Access. Comput. 125
(2020) 2. URL: https://doi.org/10.1145/3386296.3386298. doi:10.1145/3386296.3386298.