The Many Facets of Data Equity
                                                                       Extended Abstract
                  H. V. Jagadish                                        Julia Stoyanovich                                  Bill Howe
             University of Michigan                                     New York University                       University of Washington
                jag@umich.edu                                          stoyanovich@nyu.edu                           billhowe@uw.edu

ABSTRACT                                                                               of administrative systems in creating and reinforcing discrimi-
Data-driven systems can be unfair, in many different ways. All                         nation. Spade argues that administrative systems facilitate state
too often, as data scientists, we focus narrowly on one technical                      violence encoded in laws, policies, and schemes that arrange and
aspect of fairness. In this paper, we attempt to address equity                        define people by categories of indigeneity, race, gender, ability,
broadly, and identify the many different ways in which it is                           and national origin [44]. Hoffman considered how these effects
manifest in data-driven systems.                                                       are amplified through data technologies and their purveyors [21].
                                                                                       Decision systems, regardless of consideration of equity, mecha-
                                                                                       nize existing structures, such that any effort to define and address
1    INTRODUCTION                                                                      data equity issues are at risk of becoming mere technological
There is concern about fairness today, whenever data-driven sys-                       “happy talk.” To combat these outcomes, we emphasize the need
tems are used. It is no longer believed that data are impartial and                    to think about equity broadly, and to own the outcomes realized.
neutral. Nevertheless, the scope of fairness considered is often                       Ideally, we have a primacy of equity in the design: the goal is not
narrow. Computer scientists are trained to develop algorithms                          just to automate and correct for equity, but to design systems that
that can solve cleanly stated formal problems. If fairness could                       exist to further equity. For example, a machine learning system to
be reduced to a mathematical constraint, it would have been                            help submit insurance claims to maximize payment is designed
addressed by now. The difficulty, of course, is that fairness is                       to counteract the discrimination effected by corporate models to
more complicated than that. Technical solutions can help, but are                      minimize payments. However, we know that no everyone data-
not enough in themselves to address the real problems. Our goal                        driven system can have equity as its purpose. So we also must
is to get past these limitations and address data equity broadly                       develop a framework to recognize and remedy the many different
defined.                                                                               ways in which a data-driven system may introduce inequities.
    To reach our goal, we begin with a discussion of data equity in                        Data and administrative systems construct the very identities
Section 2. Based on this understanding, in Section 3, we examine                       and categories presented to us as "natural,” both inventing and
multiple facets of data equity that must all be addressed.                             producing meaning for the categories they administer [45](:pp.
                                                                                       31–32). Administrative systems facilitate state violence encoded
2    WHAT IS DATA EQUITY                                                               in laws, policies, and schemes that arrange and define people by
Equity as a social concept promotes fairness by treating people                        categories of indigeneity, race, gender, ability, and national origin,
differently depending on their endowments and needs (focused                           which Spade calls "administrative violence" [45](:pp. 20–21).
on equality of outcome), whereas equality aims to achieve fair-                            Similarly, transportation apps like Ghettotracker and SafeR-
ness through equal treatment regardless of need (focused on                            oute are designed to help users navigate around “dangerous” or
equality of opportunity) [15, 16, 31, 35, 36]. Equity is not a legal                   unsafe areas. In practice, they often target neighborhoods popu-
framework per se, yet underpins civil rights laws in the U.S. that                     lated by people of color by encoding racist articulations of what
restrict preferences based on protected classes, for example in                        constitutes danger [19].
housing or employment [50–52]. It has recently also been oper-                             That social inequity is reinforced and amplified by data-intensive
ationalized in computer science scholarship, primarily through                         systems is not new. We know from other domains that advances
fairness in machine learning research [5]. However, equity is a                        in data science and AI can be undermined by similar problems:
much richer concept than a simple mathematical criterion that                          automated decisions based on biased data can operationalize, en-
can be captured in a fairness constraint.                                              trench, and legitimize new forms of discrimination. For example,
   Even in the best of circumstances, underlying structural in-                        a defendant’s immediate social network may reveal many convic-
equities in access to health care, employment, and housing exhibit                     tions, but that information must be interpreted through the lens
themselves in the data record and are propagated through deci-                         of socioeconomic conditions and prior structural discrimination
sion systems, automated or otherwise, to become reinforced by                          in the criminal justice system before concluding that an individ-
policy. A key thing that’s missing is a treatment of how decision                      ual is at a higher risk of recidivism or bail violation. Similarly,
systems, regardless of consideration of equity, reinforce existing                     standardized test scores are sufficiently impacted by prepara-
structures. Therefore, any effort to define and improve "Data                          tion courses that the score itself says more about socioeconomic
equity" may consist largely of "happy talk" [6], which involves a                      conditions than an individual’s academic potential.
willingness to acknowledge and even revel in cultural difference                           In summary, the manner in which data systems are built and
without seriously challenging ongoing structural inequality.                           used can compound and exacerbate inequities we have in society.
   We consider data equity in the context of automated decision                        It can also introduce inequities where there previously were none.
systems, while recognizing a broader literature around the role                        Avoiding these harms results in data equity, and is accomplished
                                                                                       through constructing socio-technical systems that we call data
© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceed-   equity systems.
ings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus)
on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0)
3     FACETS OF DATA EQUITY                                              women as unsuitable for hiring, a problem that exacerbated his-
We have examined dozens of examples of inequities in data sys-           torical difficulties. Amazon had to cancel this project even before
tems, such as those cited in the preceding section. Based on our         it launched.
empirical study, we have identified four distinct facets of data             Representation issues typically, but not exclusively, occur in
equity [24], which we present here as a rough taxonomy of the            data about people. But there are many exceptions, which can still
issues to be considered in the construction of data equity systems.      have inequitable impacts on people. The city of Boston released
                                                                         an app, called StreetBump, to report potholes in its streets. The
                                                                         app was downloaded and installed by many citizens with smart-
                                                                         phones, and reported many potholes to the city. The difficulty
3.1    Representation equity                                             was that smartphones were more frequently owned by the better
There often are material deviations between the data record and          off residents of the city, and these were also more likely to take
the world the data is meant to represent, often with respect to          the effort to install the app because of their history-driven belief
historically disadvantaged groups [10]. Perhaps the best-known           in government. The consequence would have been a data record
case in this regard has to do with crime records used for predictive     with inadequate representation of streets in poor neighborhoods:
policing. Many offenses are recorded only when there is police           a problem that was proactively corrected by the city, through
presence. While citizens may call the police in for some types           sending out its own pothole recording crews to use the app in
of crimes, both major (such as a murder) and minor (such as              poorer neighborhoods. Similarly, richer countries have many
a noisy party), it would be unusual for the police to be called          more weather stations measuring conditions in the atmosphere
in because of a report of jaywalking or minor drug possession.           and in the ocean. The disparity of representation in the data
Rather, these offenses are only entered into the record when             record can lead to weather predictions being less accurate for
police happen to observe them, and choose not to ignore them.            poor countries.
Therefore, crimes are more likely to be observed in areas with               Data representation issues, and the harms they cause, may
greater police presence, and among these observed, crimes are            first appear in the input, output, or at any intermediate data
more likely to be recorded where the police officer chooses not          processing step, but the majority of research in AI bias and fair
to give the offender a pass, a choice that has historically been         ML pertains only to learning. We must develop techniques to
racially biased. In other words, the data record reflects, and can       introspect and intervene at any stage of the data pipeline. It is
enshrine, historical injustices. The use of this record for future       not enough to hope that we will mitigate the propagation of data
police deployments can lead to a vicious cycle of victimizing            representation issues during a final learning step.
communities that have suffered in the past.                                  Our solution is to adopt database repair [38] as the guid-
   Representation issues can arise even when there is no histor-         ing principle. We have developed techniques to detect under-
ical record involved. For example, confirmed COVID-19 cases              representation efficiently for a high number of small-domain
require testing, and there can be racial disparities in both the         discrete-valued attributes, such as those that result from joining
availability of testing and in the desire of individuals to be tested,   multiple tables in a relational schema [4, 25, 28]. Once representa-
leading to systematic biases in collected data. These disparities        tion gaps are detected, we consider cases where they can be filled
are found in contemporary data, even if they are rooted in his-          by collecting more data. We have shown how to satisfy multiple
torical discrimination. For example, there may be fewer test sites       gaps at the same time efficiently [4, 25]. We have linked causal
located in minority neighborhoods, or poor people lacking insur-         models to the conditional independence relationships used in
ance may worry about the cost of testing and this may reflect in         the database repair literature, suggesting a new algorithm for
racial statistics. Similarly, a long history of being unfairly treated   causal database repair such that any reasonable classifier trained
by the medical profession may make African-Americans natu-               on the repaired data set will satisfy interventional fairness and
rally wary of such interactions and hence induce reluctance in           empirically perform well on other definitions [37]. We have de-
testing. Whatever be the reasons, the point is that contemporary         veloped [40], a design and evaluation framework for fairness-
data may under-represent racial minorities, particularly African-        enhancing interventions in data-intensive pipelines that treats
Americans, and hence potentially lead to under-estimating the            data equity as a first-class citizen and supports sound experimen-
prevalence of COVID-19 in these communities.                             tation [39, 42].
   Representation inequities in the data can lead to systemic
biases in the decisions based on the data. But it can also lead          3.2    Feature equity
to greater errors for under-represented groups. Consider facial
                                                                         All the features required for a particular analysis, or to represent
recognition as an example. It has been extensively documented,
                                                                         members of some group adequately, may not be available in
across numerous current systems, that these systems are con-
                                                                         a dataset. Feature equity refers to the availability of variables
siderably more accurate with white males than with women or
                                                                         needed to represent members of every group in the data, and to
people of color. Higher error rates for a community is also a
                                                                         perform desired analyses, particularly those to study inequity. For
harm, in this case caused by a lack of representation. These error
                                                                         example, if attributes such as race and income are not recorded
rates may not only be higher, but they could additionally also
                                                                         along with other data, it becomes hard to discover systematic
be biased. For example, Amazon developed software to screen
                                                                         biases that may exist, let alone correct for them.
candidates for employment and trained this software on data
                                                                            In the recent COVID-19 pandemic, significant racial disparities
from the employees it already had. Since its employees were
                                                                         have been reported in the United States on both infection rates
mostly male, women were under-represented in the data record.
                                                                         and mortality rates. Since race is not typically recorded as part
Worse still, because of historical discrimination, the few women
                                                                         of medical care in many jusrisdictions, it has been challenging
previously in the company had done poorly compared to their
                                                                         for policymakers and analysts to explore these racial differences
potential. A model trained on this data set began classifying most
                                                                         as deeply as they would like, and to devise suitable remedies.
    Similarly, eviction data does not typically include race and        very large, and furthermore the customer may not have access to
gender information, and this makes it hard to assess equity.            sophisticated tools to predict company actions. In other words,
    Intuitively, it is not unreasonable to think about representation   data-driven systems create, and exacerbate, asymmetries, with
equity as being concerned with rows in the data-table and feature       power going to the entity with more information.
equity as being concerned with the columns. However, feature               Access equity refers to equitable access to data and models,
equity includes the full scope of modeling choices made, of which       across domains and levels of expertise, and across roles: data
attribute choice is is only one component, albeit a very important      subjects, aggregators, and analysts.
one. Another manifestation of feature equity has to do with                Fundamental asymmetries in information access are difficult
choice for the domain of attribute values. If a gender attribute        to address. Some amelioration is possible through regulation, or
is defined to permit exactly two values, male and female, this is       voluntary transparency. Privacy policies are a tiny step in this
a modeling choice that explicitly does not accommodate other,           direction, though they are far from enough in themselves, and
more complex, gender expressions. Similarly, if age has been            leave a great deal to be desired the way they are currently imple-
recorded in age ranges ( <20, 20-30, 30-40, 40-50, 50-60, and >60),     mented in most cases. Right to access information about oneself,
it is not possible to distinguish between toddlers and teenagers,       as provided through GDPR is Europe, is a more substantial step.
or between a 61 year old still able to work a full day and a 95            Access to data is a challenge not just for data owned by private
year old no longer able to do so. If these distinctions are not         companies. We sometimes see similar issues in other domains as
important for the desired analyses, the chosen age range values         well. Researchers may hoard their data for competitive advantage
are reasonable. However, many analyses may care, and may find           in their research: if they put in the effort to collect the data in
these value choices very restricting.                                   the first place, they want to analyze the data and publish their
    When a desired attribute is not recorded at all, or has been        findings before releasing the collected data. Government agencies
recorded in a limited way, we may seek to impute its value. Ideally,    may also act similarly, driven by parochial thinking, local politics,
we will be able to do this by linkage across datasets. For example,     or other such reasons.
it may be possible to determine race based on census data joined           One major impediment to making data public is the need to
on geography and statistical patterns in first and last names [53].     respect the privacy of the data subjects. A classic example is
    Where values for missing attributes cannot be determined            medical records: there is great potential value in making these
through direct linkage, they may sometimes still be estimated           available for analysis: surely many new patterns will be found
through the use of auxiliary data sets. Choices among competing         that improve health and save lives. Yet, most people are very
sources may introduce other issues; income recorded to deter-           sensitive about sharing medical information and it has proved
mine eligibility for housing services will have different biases        all too easy to re-identify anonymized data, with enough effort
than income estimated from buying history. Furthermore, inte-           and ingenuity And this is even before one considers regulatory
gration among datasets involves schema mapping decisions that           constraints on such sharing. Similarly, as citizens, we all desire
can change the result.                                                  open government, and would like government agencies to make
    Finally, imputation of missing attribute values may involve         their data public. But, as subjects, we also may be sensitive about
an algorithm that depends on some model, which may itself be            some of our information with the government, and not want it
biased. For instance, zip code can be used to "determine" race.         made public. This is a difficult balance, which has to be managed
Obviously, this cannot work at the individual level, because not        in each instance. Technical solutions can be helpful. For instance,
everyone in a zip code is of the same race. Furthermore, even           differential privacy may permit privacy preserving release of
in the aggregate, we cannot always assume that the proportion           some information aggregates.
of entries in our data with a particular value for race is equal to        Even when actual access to data is not restricted, the opacity
the proportion who live in that zip code. For instance, there have      of data systems, as perceived by different groups, can also be an
been several COVID-19 outbreaks in prisons, where the racial            access equity violation. Researchers’ reluctance to release data
composition of prisoners is likely quite different from that of the     they have invested to collect contributes to the reproducibility
surrounding community.                                                  crisis. Private companies’ tight control of their data impedes
    Using a novel concept of EquiTensors, we have demonstrated          external equity audits. Inadequate data release can promote mis-
that pre-integrated, fairness-adjusted features from arbitrary in-      interpretation and therefore misinformation and misuse. Data
puts can help avoid propagating discrimination from biased data,        access must be accompanied by sufficient metadata to permit
while still delivering prediction accuracy comparable to oracle         correct interpretation and to determine fitness for use.
networks trained with hand-selected data sets [54, 55].                    A typical data science pipeline will have a sequence of data ma-
                                                                        nipulations, with multiple intermediate data sets created, shared,
                                                                        and manipulated. Often, these data sets will be from disparate
3.3    Access equity
                                                                        sources, and much of the processing may be conducted at remote
So far, we have looked at what is in a data set. Now we look            sites. When using a remote data source, it is important to under-
at who has access to it. Typically, data sets are owned by big          stand not just what the various fields are, but also how certain
companies, which spend substantial resources to construct the           values were computed and whether the dataset could be used
data set, and want to obtain competitive advantages by keeping it       for the desired purpose. Provenance descriptions can contain all
proprietary. On the other hand, customers may not have access to        this information, but is usually far too much detailed information
this data, and hence be at a disadvantage in any interaction with       for a user to be able to make use of. Additionally, proprietary
the company, even with regard to their own information. Worse           concerns and privacy limits may limit what can be disclosed. The
still, the company has knowledge of multiple customers, which           idea of a nutritional label has been proposed by us, and indepen-
it can exploit. In contrast, the customer has access to only their      dently by others, as a way to capture succinctly a small amount
own actions with the company. The customer may interact with            of critical information required to determine fitness for use. The
multiple companies, but the number of companies is usually not
challenge is that the information that must be captured depends        and these reports are aggregated into a credit score, which you
on the intended use.                                                   can see. You have some sense of what goes into building a good
   We have developed RankingFacts [58], the first prototype of an      score, even though the specific details may not be known. More
automatically computed nutritional label that helps users inter-       importantly, you can see what has been reported about you by
pret algorithmic rankers. The work on a user-facing nutritional        your creditors and there is a process to challenge errors. The
label prototype motivated a deeper inquiry into fairness and di-       system is far from perfect, but most data-driven systems today
versity in set selection and ranking [48, 56, 57], and on designing    are much worse in so many respects, including in particular their
fair and stable ranking schemes [1, 2]. We also continued this         mechanisms for providing accountability and recourse.
work to compute properties to characterize data sets [49], and to
succinctly capture correlation between attributes [32].                4    CONCLUSION
   Finally, most individuals affected by data-driven systems likely
                                                                       Data equity issues are pervasive but subtle, requiring holistic
have many other things going on in their lives. So, they may have
                                                                       consideration of the socio-technical systems that induce them
limited time and attention that they wish to devote to data de-
                                                                       (as opposed to narrowly focusing on the technical components
tails. This makes it important that results and data be presented
                                                                       and tasks alone), and of the contexts in which such systems
fairly, in a manner that leads to good understanding. Otherwise,
                                                                       operate. The richness of issues surrounding equity cannot be
inequity in attention availability can lead to errors and misunder-
                                                                       addressed by framing it as a narrow, situational facet of “final
standing. To address such questions, we have initiated a stream
                                                                       mile” learning systems. We need a socio-technical framing that
of work in cherry-picking [3].
                                                                       shifts equity considerations upstream to the data infrastructure,
                                                                       combines technical and societal perspectives, and allows us to
3.4    Outcome equity                                                  reason about the proper role for technology in promoting equity
Controlling for inequity during processing does not guarantee          while linking to emergent social and legal contexts. This type
improvements in quality of life and societal outcomes, either in       of approach is rapidly gaining traction in global technology pol-
aggregate or at the individual level, due to, for example, emer-       icy [12]. From a technology perspective, we must appreciate that
gent bias [18]. It is, therefore, important to monitor and mitigate    multiple data sets are processed in a complex workflow, with
unintended consequences for any groups affected by a system            numerous design and deployment choices enroute [23]. Addi-
after deployment, directly or indirectly.                              tionally, our socio-technical framing mandates engagement with
   Outcome equity refers to downstream unanticipated conse-            stakeholders before, during, and after any technology develop-
quences outside the direct control of the system — evaluation of       ment, affords operationalization of socio-technical equity, as it
these consequences pertains directly to the socio-political notion     emphasizing their lived experience as design expertise. It there-
of equity focusing on equality of outcome. For example, families       fore centers intersectionality, a framework that focuses on how
rejected Boston’s optimized bus route system due to disruption         the interlocking systems of social identity (race, class, gender,
of their schedules, despite the system’s improvement in both           sexuality, disability) combine into experiences of privilege and
resource management and equity.                                        oppression [11, 13, 14, 20]. This framing expands data sciences’
   It take time, effort, and expense to build a model. In conse-       existing interpretation of intersectionality from external classifi-
quence, models developed in one context are often used in an-          cation, often a political act [8, 9], to active involvement of those
other. Such model transfer has to be done with care. We have           who are classified.
used 3D CNNs to generalize predictions in the urban domain [54].          In this extended abstract, we have identified four facets of data
We have shown that fairness adjustments applied to integrated          equity, each of which must be addressed by data equity systems.
representations (via adversarial models that attempt to learn the      For our ongoing work in this direction, please visit our project
protected attribute [30]) outperform methods that apply fairness       website at https://midas.umich.edu/FIDES.
adjustments on the individual data sets [55].
   The equity of a data-intensive system can be difficult to main-     ACKNOWLEDGMENT
tain over time [27, 34], due to distribution shifts [7, 22, 29, 41]    This work was supported in part by the US National Science
that can reduce performance, force periodic retraining, and gen-       Foundation, under grants 1934405, 1934464, and 1934565.
erally undermine trust. Techniques similar to the transferability
methods, described in the preceding paragraph, can help.               REFERENCES
   To minimize outcome inequity, data-driven systems must be            [1] Abolfazl Asudeh, H.V. Jagadish, Julia Stoyanovich, and Gautam Das. 2019.
accountable. Accountability requires public disclosure. For ex-             Designing Fair Ranking Schemes. In ACM SIGMOD.
ample, a job seeker must be informed which qualifications or            [2] Abolfazl Asudeh, H. V. Jagadish, Gerome Miklau, and Julia Stoyanovich. 2018.
                                                                            On Obtaining Stable Rankings. PVLDB 12, 3 (2018), 237–250. http://www.
characteristics were used by the tool, and why these are consid-            vldb.org/pvldb/vol12/p237-asudeh.pdf
ered job-relevant [46, 47].                                             [3] Abolfazl Asudeh, H. V. Jagadish, You Wu, and Cong Yu. 2020. On detecting
   But accountability is not enough in itself: the data subject also        cherry-picked trendlines. Proceedings of the VLDB Endowment 13, 6 (2020),
                                                                            939–952.
should have recourse. We seek contestability by design [33], an         [4] Abolfazl Asudeh, Zhongjun Jin, and H. V. Jagadish. 2019. Assessing and
active principle that goes beyond explanation and focuses on                remedying coverage for a given dataset. In IEEE International Conference on
                                                                            Data Engineering. IEEE, 554–565.
user engagement, fostering user understanding of models and             [5] Solon Barocas and Andrew Selbst. 2016. Big Data’s Disparate Impact. Califor-
outputs, and collaboration in systems design [17, 26, 43]. Our              nia Law Review 104, 3 (2016), 671–732.
goal is to empower users to question algorithmic results, and           [6] Ruha Benjamin. 2019. Race after technology: Abolitionist tools for the new
                                                                            jim code. Social Forces (2019).
thereby to correct output inequities where possible.                    [7] Steffen Bickel, Michael Brückner, and Tobias Scheffer. 2009. Discriminative
   As a starting point, consider credit scores: a simple tool that          Learning Under Covariate Shift. J. Mach. Learn. Res. 10 (2009), 2137–2155.
has existed for years in the US and in many other countries. A              https://dl.acm.org/citation.cfm?id=1755858
                                                                        [8] Geoffery C. Bowker and Susan Leigh Star. 2000. Sorting Things out: Classifica-
myriad of data sources report on your paying what you owe,                  tion and Its Consequences. MIT Press, Cambridge, MA, USA.
 [9] Joy Buolamwini and Timnit Gebru. 2018. Gender Shades: Intersectional               [35] John Rawls. 1971. A theory of justice. Harvard University Press.
     Accuracy Disparities in Commercial Gender Classification. In Conference on         [36] John E Roemer and Alain Trannoy. 2015. Equality of opportunity. In Handbook
     Fairness, Accountability and Transparency, FAT 2018, 23-24 February 2018, New           of income distribution. Vol. 2. Elsevier, 217–300.
     York, NY, USA. 77–91. http://proceedings.mlr.press/v81/buolamwini18a.html          [37] Babak Salimi, Luke Rodriguez, Bill Howe, and Dan Suciu. 2019. Capuchin:
[10] Irene Chen, Fredrik D Johansson, and David Sontag. 2018. Why is my classifier           Causal Database Repair for Algorithmic Fairness. CoRR abs/1902.08283 (2019).
     discriminatory?. In Advances in Neural Information Processing Systems. 3539–            arXiv:1902.08283 http://arxiv.org/abs/1902.08283
     3550.                                                                              [38] Babak Salimi, Luke Rodriguez, Bill Howe, and Dan Suciu. 2019. Interventional
[11] P. H. Collins. 2000. Black Feminist Thought: Knowledge, Consciousness, and the          fairness: Causal database repair for algorithmic fairness. In Proceedings of the
     Politics of Empowerment. Routledge, New York, NY.                                       2019 International Conference on Management of Data. 793–810.
[12] Council of Europe. 2020.                Ad Hoc Committee on Artificial             [39] Sebastian Schelter, Felix Biessmann, Tim Januschowski, David Salinas, Stephan
     Intelligence (CAHAI), Feasibility Study.                     https://rm.coe.int/        Seufert, Gyuri Szarvas, Manasi Vartak, Samuel Madden, Hui Miao, Amol Desh-
     cahai-2020-23-final-eng-feasibility-study-/1680a0c6da.                                  pande, et al. 2018. On Challenges in Machine Learning Model Management.
[13] Kimberle Crenshaw. 1989. Demarginalizing the intersection of race and sex:              IEEE Data Eng. Bull. 41, 4 (2018), 5–15.
     A black feminist critique of antidiscrimination doctrine, feminist theory and      [40] Sebastian Schelter, Yuxuan He, Jatin Khilnani, and Julia Stoyanovich. 2020.
     antiracist politics. University of Chicago Legal Forum 1 (1989), 139–167.               FairPrep: Promoting Data to a First-Class Citizen in Studies on Fairness-
[14] Kimberle Crenshaw. 1991. Mapping the Margins: Intersectionality, Identity               Enhancing Interventions. In EDBT, Angela Bonifati, Yongluan Zhou, Marcos
     Politics, and Violence against Women of Color. Stanford Law Review 43, 6                Antonio Vaz Salles, Alexander Böhm, Dan Olteanu, George H. L. Fletcher,
     (1991), 1241–1299. http://www.jstor.org/stable/1229039                                  Arijit Khan, and Bin Yang (Eds.). OpenProceedings.org, 395–398. https:
[15] Ronald Dworkin. 1981. What is Equality? Part 1: Equality of Welfare. Philoso-           //doi.org/10.5441/002/edbt.2020.41
     phy and Public Affairs 10, 4 (1981), 185–246.                                      [41] Sebastian Schelter, Tammo Rukat, and Felix Bießmann. 2020. Learning
[16] Ronald Dworkin. 1981. What is Equality? Part 2: Equality of Resources.                  to Validate the Predictions of Black Box Classifiers on Unseen Data. In
     Philosophy and Public Affairs 10, 4 (1981), 283–345.                                    Proceedings of the 2020 International Conference on Management of Data,
[17] Motahhare Eslami. 2017. Understanding and Designing around Users’ In-                   SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-
     teraction with Hidden Algorithms in Sociotechnical Systems. In Proceed-                 19, 2020, David Maier, Rachel Pottinger, AnHai Doan, Wang-Chiew Tan,
     ings of the 2017 ACM Conference on Computer Supported Cooperative Work                  Abdussalam Alawini, and Hung Q. Ngo (Eds.). ACM, 1289–1299. https:
     and Social Computing, CSCW 2017, Portland, OR, USA, February 25 - March                 //doi.org/10.1145/3318464.3380604
     1, 2017, Companion Volume, Charlotte P. Lee, Steven E. Poltrock, Louise            [42] David Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips,
     Barkhuus, Marcos Borges, and Wendy A. Kellogg (Eds.). ACM, 57–60. http:                 Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and
     //dl.acm.org/citation.cfm?id=3024947                                                    Dan Dennison. 2015. Hidden technical debt in machine learning systems. In
[18] Batya Friedman and Helen Nissenbaum. 1996. Bias in Computer Systems. ACM                Advances in neural information processing systems. 2503–2511.
     Trans. Inf. Syst. 14, 3 (1996), 330–347. https://doi.org/10.1145/230538.230561     [43] Judith Simon. 2015. Distributed Epistemic Responsibility in a Hyperconnected
[19] Dominique DuBois Gilliard. 2018. Rethinking incarceration: Advocating for               Era. In The Online Manifesto: Being Human in a Hyperconnected Era. 145–159.
     justice that restores. InterVarsity Press.                                              https://doi.org/10.1007/978-3-319-04093-6_17
[20] P.L. Hammack. 2018. The Oxford Handbook of Social Psychology and So-               [44] Dean Spade. 2015. Normal Life: Administrative Violence, Critical Trans Politics,
     cial Justice. Oxford University Press. https://books.google.com/books?id=               and the Limits of Law. Duke University Press. http://www.jstor.org/stable/j.
     ZY9HDwAAQBAJ                                                                            ctv123x7qx
[21] Anna Lauren Hoffmann. 0. Terms of inclusion: Data, discourse, violence.            [45] Dean Spade. 2015. Normal life: Administrative violence, critical trans politics,
     New Media & Society 0, 0 (0), 1461444820958725. https://doi.org/10.1177/                and the limits of law. Duke University Press.
     1461444820958725 arXiv:https://doi.org/10.1177/1461444820958725                    [46] Julia Stoyanovich. 2020. Testimony of Julia Stoyanovich before New York City
[22] Jiayuan Huang, Alexander J. Smola, Arthur Gretton, Karsten M. Borgwardt,                Council Committee on Technology regarding Int 1894-2020, Sale of automated
     and Bernhard Schölkopf. 2006. Correcting Sample Selection Bias by Unlabeled             employment decision tools. https://dataresponsibly.github.io/documents/
     Data. In Advances in Neural Information Processing Systems 19, Proceedings              Stoyanovich_Int1894Testimony.pdf.
     of the Twentieth Annual Conference on Neural Information Processing Systems,       [47] Julia Stoyanovich, Bill Howe, and H.V. Jagadish. 2020. Responsible Data
     Vancouver, British Columbia, Canada, December 4-7, 2006, Bernhard Schölkopf,            Management. PVLDB 13, 12 (2020), 3474–3489. https://doi.org/10.14778/
     John C. Platt, and Thomas Hofmann (Eds.). MIT Press, 601–608. http://papers.            3415478.3415570
     nips.cc/paper/3075-correcting-sample-selection-bias-by-unlabeled-data              [48] Julia Stoyanovich, Ke Yang, and H. V. Jagadish. 2018. Online Set Selection with
[23] H. V. Jagadish, Johannes Gehrke, Alexandros Labrinidis, Yannis Papakonstanti-           Fairness and Diversity Constraints. In Proceedings of the 21th International
     nou, Jignesh M Patel, Raghu Ramakrishnan, and Cyrus Shahabi. 2014. Big                  Conference on Extending Database Technology, EDBT 2018, Vienna, Austria,
     data and its technical challenges. Commun. ACM 57, 7 (2014), 86–94.                     March 26-29, 2018. 241–252. https://doi.org/10.5441/002/edbt.2018.22
[24] H. V. Jagadish, Julia Stoyanovich, and Bill Howe. 2021. COVID-19 Brings Data       [49] Chenkai Sun, Abolfazl Asudeh, HV Jagadish, Bill Howe, and Julia Stoyanovich.
     Equity Challenges to the Fore. ACM Digital Government: Research and Practice            2019. Mithralabel: Flexible dataset nutritional labels for responsible data
     2, 2 (2021).                                                                            science. In Proceedings of the ACM International Conference on Information and
[25] Zhongjun Jin, Mengjing Xu, Chenkai Sun, Abolfazl Asudeh, and H. V. Jagadish.            Knowledge Management. ACM, 2893–2896.
     2020. MithraCoverage: A System for Investigating Population Bias for Inter-        [50] Supreme Court of the United States. 1995. ADARAND CONSTRUCTORS, Inc.
     sectional Fairness. In Proceedings of the ACM SIGMOD International Conference           v. PEÑA, 515 U.S. 200 (1995), No.93-1841. https://supreme.justia.com/cases/
     on Management of Data. ACM, 2721–2724.                                                  federal/us/515/200/#tab-opinion-1959723.
[26] Daniel Kluttz et al. 2020. Shaping Our Tools: Contestability as a Means to         [51] Supreme Court of the United States. 1996. UNITED STATES v. VIRGINIA, 518
     Promote Responsible Algorithmic Decision Making in the Professions. In                  U.S. 515 (1996), No.94-1941. https://supreme.justia.com/cases/federal/us/515/
     After the Digital Tornado: Networks, Algorithms, Humanity.                              200/#tab-opinion-1959723.
[27] Arun Kumar, Robert McCann, Jeffrey F. Naughton, and Jignesh M. Patel.              [52] Supreme Court of the United States. 2009. Ricci v. DeStefano (Nos. 07-1428
     2015. Model Selection Management Systems: The Next Frontier of Advanced                 and 08-328), 530 F. 3d 87, reversed and remanded. https://www.law.cornell.
     Analytics. SIGMOD Rec. 44, 4 (2015), 17–22. https://doi.org/10.1145/2935694.            edu/supct/html/07-1428.ZO.html.
     2935698                                                                            [53] Ian Kennedy Timothy A. Thomas, Ott Toomet and Alex Ramiller. [n.d.]. The
[28] Yin Lin, Yifan Guan, Abolfazl Asudeh, and H. V. Jagadish. 2020. Identifying             State of Evictions: Results from the University of Washington Evictions Project.
     insufficient data coverage in databases with multiple relations. Proceedings of         https://evictions.study/. accessed 25-Apr-2019.
     the VLDB Endowment 13, 12 (2020), 2229–2242.                                       [54] An Yan and Bill Howe. 2019. FairST: Equitable Spatial and Temporal Demand
[29] Zachary C. Lipton, Yu-Xiang Wang, and Alexander J. Smola. 2018. Detect-                 Prediction for New Mobility Systems. In Proceedings of the 27th ACM SIGSPA-
     ing and Correcting for Label Shift with Black Box Predictors. In Proceedings            TIAL International Conference on Advances in Geographic Information Systems.
     of the 35th International Conference on Machine Learning, ICML 2018, Stock-             552–555.
     holmsmässan, Stockholm, Sweden, July 10-15, 2018 (Proceedings of Machine           [55] An Yan and Bill Howe. 2021. EquiTensors: Learning Fair Integrations of
     Learning Research), Jennifer G. Dy and Andreas Krause (Eds.), Vol. 80. PMLR,            Heterogeneous Urban Data. In ACM SIGMOD.
     3128–3136. http://proceedings.mlr.press/v80/lipton18a.html                         [56] Ke Yang, Joshua R. Loftus, and Julia Stoyanovich. 2020. Causal intersectionality
[30] David Madras, Elliot Creager, Toniann Pitassi, and Richard Zemel. 2018.                 for fair ranking. CoRR abs/2006.08688 (2020). arXiv:2006.08688 https://arxiv.
     Learning adversarially fair and transferable representations. arXiv preprint            org/abs/2006.08688
     arXiv:1802.06309 (2018).                                                           [57] Ke Yang and Julia Stoyanovich. 2017. Measuring Fairness in Ranked Outputs.
[31] Charles W Mills. 2014. The racial contract. Cornell University Press.                   In Proceedings of the 29th International Conference on Scientific and Statistical
[32] Yuval Moskovitch and H. V. Jagadish. 2020. COUNTATA: dataset labeling using             Database Management, Chicago, IL, USA, June 27-29, 2017. 22:1–22:6. https:
     pattern counts. Proceedings of the VLDB Endowment 13, 12 (2020), 2829–2832.             //doi.org/10.1145/3085504.3085526
[33] Deirdre K. Mulligan and Kenneth A. Bamberger. 2019. Procurement as Policy:         [58] Ke Yang, Julia Stoyanovich, Abolfazl Asudeh, Bill Howe, H. V. Jagadish, and
     Administrative Process for Machine Learning. Berkeley Technology Law Journal            Gerome Miklau. 2018. A Nutritional Label for Rankings. In Proceedings of the
     34, 3 (2019), 773–852.                                                                  2018 International Conference on Management of Data, SIGMOD Conference
[34] Neoklis Polyzotis, Sudip Roy, Steven Euijong Whang, and Martin Zinkevich.               2018, Houston, TX, USA, June 10-15, 2018. 1773–1776. https://doi.org/10.1145/
     2018. Data Lifecycle Challenges in Production Machine Learning: A Survey.               3183713.3193568
     SIGMOD Rec. 47, 2 (2018), 17–28. https://doi.org/10.1145/3299887.3299891