The Many Facets of Data Equity Extended Abstract H. V. Jagadish Julia Stoyanovich Bill Howe University of Michigan New York University University of Washington jag@umich.edu stoyanovich@nyu.edu billhowe@uw.edu ABSTRACT of administrative systems in creating and reinforcing discrimi- Data-driven systems can be unfair, in many different ways. All nation. Spade argues that administrative systems facilitate state too often, as data scientists, we focus narrowly on one technical violence encoded in laws, policies, and schemes that arrange and aspect of fairness. In this paper, we attempt to address equity define people by categories of indigeneity, race, gender, ability, broadly, and identify the many different ways in which it is and national origin [44]. Hoffman considered how these effects manifest in data-driven systems. are amplified through data technologies and their purveyors [21]. Decision systems, regardless of consideration of equity, mecha- nize existing structures, such that any effort to define and address 1 INTRODUCTION data equity issues are at risk of becoming mere technological There is concern about fairness today, whenever data-driven sys- “happy talk.” To combat these outcomes, we emphasize the need tems are used. It is no longer believed that data are impartial and to think about equity broadly, and to own the outcomes realized. neutral. Nevertheless, the scope of fairness considered is often Ideally, we have a primacy of equity in the design: the goal is not narrow. Computer scientists are trained to develop algorithms just to automate and correct for equity, but to design systems that that can solve cleanly stated formal problems. If fairness could exist to further equity. For example, a machine learning system to be reduced to a mathematical constraint, it would have been help submit insurance claims to maximize payment is designed addressed by now. The difficulty, of course, is that fairness is to counteract the discrimination effected by corporate models to more complicated than that. Technical solutions can help, but are minimize payments. However, we know that no everyone data- not enough in themselves to address the real problems. Our goal driven system can have equity as its purpose. So we also must is to get past these limitations and address data equity broadly develop a framework to recognize and remedy the many different defined. ways in which a data-driven system may introduce inequities. To reach our goal, we begin with a discussion of data equity in Data and administrative systems construct the very identities Section 2. Based on this understanding, in Section 3, we examine and categories presented to us as "natural,” both inventing and multiple facets of data equity that must all be addressed. producing meaning for the categories they administer [45](:pp. 31–32). Administrative systems facilitate state violence encoded 2 WHAT IS DATA EQUITY in laws, policies, and schemes that arrange and define people by Equity as a social concept promotes fairness by treating people categories of indigeneity, race, gender, ability, and national origin, differently depending on their endowments and needs (focused which Spade calls "administrative violence" [45](:pp. 20–21). on equality of outcome), whereas equality aims to achieve fair- Similarly, transportation apps like Ghettotracker and SafeR- ness through equal treatment regardless of need (focused on oute are designed to help users navigate around “dangerous” or equality of opportunity) [15, 16, 31, 35, 36]. Equity is not a legal unsafe areas. In practice, they often target neighborhoods popu- framework per se, yet underpins civil rights laws in the U.S. that lated by people of color by encoding racist articulations of what restrict preferences based on protected classes, for example in constitutes danger [19]. housing or employment [50–52]. It has recently also been oper- That social inequity is reinforced and amplified by data-intensive ationalized in computer science scholarship, primarily through systems is not new. We know from other domains that advances fairness in machine learning research [5]. However, equity is a in data science and AI can be undermined by similar problems: much richer concept than a simple mathematical criterion that automated decisions based on biased data can operationalize, en- can be captured in a fairness constraint. trench, and legitimize new forms of discrimination. For example, Even in the best of circumstances, underlying structural in- a defendant’s immediate social network may reveal many convic- equities in access to health care, employment, and housing exhibit tions, but that information must be interpreted through the lens themselves in the data record and are propagated through deci- of socioeconomic conditions and prior structural discrimination sion systems, automated or otherwise, to become reinforced by in the criminal justice system before concluding that an individ- policy. A key thing that’s missing is a treatment of how decision ual is at a higher risk of recidivism or bail violation. Similarly, systems, regardless of consideration of equity, reinforce existing standardized test scores are sufficiently impacted by prepara- structures. Therefore, any effort to define and improve "Data tion courses that the score itself says more about socioeconomic equity" may consist largely of "happy talk" [6], which involves a conditions than an individual’s academic potential. willingness to acknowledge and even revel in cultural difference In summary, the manner in which data systems are built and without seriously challenging ongoing structural inequality. used can compound and exacerbate inequities we have in society. We consider data equity in the context of automated decision It can also introduce inequities where there previously were none. systems, while recognizing a broader literature around the role Avoiding these harms results in data equity, and is accomplished through constructing socio-technical systems that we call data © 2021 Copyright for this paper by its author(s). Published in the Workshop Proceed- equity systems. ings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0) 3 FACETS OF DATA EQUITY women as unsuitable for hiring, a problem that exacerbated his- We have examined dozens of examples of inequities in data sys- torical difficulties. Amazon had to cancel this project even before tems, such as those cited in the preceding section. Based on our it launched. empirical study, we have identified four distinct facets of data Representation issues typically, but not exclusively, occur in equity [24], which we present here as a rough taxonomy of the data about people. But there are many exceptions, which can still issues to be considered in the construction of data equity systems. have inequitable impacts on people. The city of Boston released an app, called StreetBump, to report potholes in its streets. The app was downloaded and installed by many citizens with smart- phones, and reported many potholes to the city. The difficulty 3.1 Representation equity was that smartphones were more frequently owned by the better There often are material deviations between the data record and off residents of the city, and these were also more likely to take the world the data is meant to represent, often with respect to the effort to install the app because of their history-driven belief historically disadvantaged groups [10]. Perhaps the best-known in government. The consequence would have been a data record case in this regard has to do with crime records used for predictive with inadequate representation of streets in poor neighborhoods: policing. Many offenses are recorded only when there is police a problem that was proactively corrected by the city, through presence. While citizens may call the police in for some types sending out its own pothole recording crews to use the app in of crimes, both major (such as a murder) and minor (such as poorer neighborhoods. Similarly, richer countries have many a noisy party), it would be unusual for the police to be called more weather stations measuring conditions in the atmosphere in because of a report of jaywalking or minor drug possession. and in the ocean. The disparity of representation in the data Rather, these offenses are only entered into the record when record can lead to weather predictions being less accurate for police happen to observe them, and choose not to ignore them. poor countries. Therefore, crimes are more likely to be observed in areas with Data representation issues, and the harms they cause, may greater police presence, and among these observed, crimes are first appear in the input, output, or at any intermediate data more likely to be recorded where the police officer chooses not processing step, but the majority of research in AI bias and fair to give the offender a pass, a choice that has historically been ML pertains only to learning. We must develop techniques to racially biased. In other words, the data record reflects, and can introspect and intervene at any stage of the data pipeline. It is enshrine, historical injustices. The use of this record for future not enough to hope that we will mitigate the propagation of data police deployments can lead to a vicious cycle of victimizing representation issues during a final learning step. communities that have suffered in the past. Our solution is to adopt database repair [38] as the guid- Representation issues can arise even when there is no histor- ing principle. We have developed techniques to detect under- ical record involved. For example, confirmed COVID-19 cases representation efficiently for a high number of small-domain require testing, and there can be racial disparities in both the discrete-valued attributes, such as those that result from joining availability of testing and in the desire of individuals to be tested, multiple tables in a relational schema [4, 25, 28]. Once representa- leading to systematic biases in collected data. These disparities tion gaps are detected, we consider cases where they can be filled are found in contemporary data, even if they are rooted in his- by collecting more data. We have shown how to satisfy multiple torical discrimination. For example, there may be fewer test sites gaps at the same time efficiently [4, 25]. We have linked causal located in minority neighborhoods, or poor people lacking insur- models to the conditional independence relationships used in ance may worry about the cost of testing and this may reflect in the database repair literature, suggesting a new algorithm for racial statistics. Similarly, a long history of being unfairly treated causal database repair such that any reasonable classifier trained by the medical profession may make African-Americans natu- on the repaired data set will satisfy interventional fairness and rally wary of such interactions and hence induce reluctance in empirically perform well on other definitions [37]. We have de- testing. Whatever be the reasons, the point is that contemporary veloped [40], a design and evaluation framework for fairness- data may under-represent racial minorities, particularly African- enhancing interventions in data-intensive pipelines that treats Americans, and hence potentially lead to under-estimating the data equity as a first-class citizen and supports sound experimen- prevalence of COVID-19 in these communities. tation [39, 42]. Representation inequities in the data can lead to systemic biases in the decisions based on the data. But it can also lead 3.2 Feature equity to greater errors for under-represented groups. Consider facial All the features required for a particular analysis, or to represent recognition as an example. It has been extensively documented, members of some group adequately, may not be available in across numerous current systems, that these systems are con- a dataset. Feature equity refers to the availability of variables siderably more accurate with white males than with women or needed to represent members of every group in the data, and to people of color. Higher error rates for a community is also a perform desired analyses, particularly those to study inequity. For harm, in this case caused by a lack of representation. These error example, if attributes such as race and income are not recorded rates may not only be higher, but they could additionally also along with other data, it becomes hard to discover systematic be biased. For example, Amazon developed software to screen biases that may exist, let alone correct for them. candidates for employment and trained this software on data In the recent COVID-19 pandemic, significant racial disparities from the employees it already had. Since its employees were have been reported in the United States on both infection rates mostly male, women were under-represented in the data record. and mortality rates. Since race is not typically recorded as part Worse still, because of historical discrimination, the few women of medical care in many jusrisdictions, it has been challenging previously in the company had done poorly compared to their for policymakers and analysts to explore these racial differences potential. A model trained on this data set began classifying most as deeply as they would like, and to devise suitable remedies. Similarly, eviction data does not typically include race and very large, and furthermore the customer may not have access to gender information, and this makes it hard to assess equity. sophisticated tools to predict company actions. In other words, Intuitively, it is not unreasonable to think about representation data-driven systems create, and exacerbate, asymmetries, with equity as being concerned with rows in the data-table and feature power going to the entity with more information. equity as being concerned with the columns. However, feature Access equity refers to equitable access to data and models, equity includes the full scope of modeling choices made, of which across domains and levels of expertise, and across roles: data attribute choice is is only one component, albeit a very important subjects, aggregators, and analysts. one. Another manifestation of feature equity has to do with Fundamental asymmetries in information access are difficult choice for the domain of attribute values. If a gender attribute to address. Some amelioration is possible through regulation, or is defined to permit exactly two values, male and female, this is voluntary transparency. Privacy policies are a tiny step in this a modeling choice that explicitly does not accommodate other, direction, though they are far from enough in themselves, and more complex, gender expressions. Similarly, if age has been leave a great deal to be desired the way they are currently imple- recorded in age ranges ( <20, 20-30, 30-40, 40-50, 50-60, and >60), mented in most cases. Right to access information about oneself, it is not possible to distinguish between toddlers and teenagers, as provided through GDPR is Europe, is a more substantial step. or between a 61 year old still able to work a full day and a 95 Access to data is a challenge not just for data owned by private year old no longer able to do so. If these distinctions are not companies. We sometimes see similar issues in other domains as important for the desired analyses, the chosen age range values well. Researchers may hoard their data for competitive advantage are reasonable. However, many analyses may care, and may find in their research: if they put in the effort to collect the data in these value choices very restricting. the first place, they want to analyze the data and publish their When a desired attribute is not recorded at all, or has been findings before releasing the collected data. Government agencies recorded in a limited way, we may seek to impute its value. Ideally, may also act similarly, driven by parochial thinking, local politics, we will be able to do this by linkage across datasets. For example, or other such reasons. it may be possible to determine race based on census data joined One major impediment to making data public is the need to on geography and statistical patterns in first and last names [53]. respect the privacy of the data subjects. A classic example is Where values for missing attributes cannot be determined medical records: there is great potential value in making these through direct linkage, they may sometimes still be estimated available for analysis: surely many new patterns will be found through the use of auxiliary data sets. Choices among competing that improve health and save lives. Yet, most people are very sources may introduce other issues; income recorded to deter- sensitive about sharing medical information and it has proved mine eligibility for housing services will have different biases all too easy to re-identify anonymized data, with enough effort than income estimated from buying history. Furthermore, inte- and ingenuity And this is even before one considers regulatory gration among datasets involves schema mapping decisions that constraints on such sharing. Similarly, as citizens, we all desire can change the result. open government, and would like government agencies to make Finally, imputation of missing attribute values may involve their data public. But, as subjects, we also may be sensitive about an algorithm that depends on some model, which may itself be some of our information with the government, and not want it biased. For instance, zip code can be used to "determine" race. made public. This is a difficult balance, which has to be managed Obviously, this cannot work at the individual level, because not in each instance. Technical solutions can be helpful. For instance, everyone in a zip code is of the same race. Furthermore, even differential privacy may permit privacy preserving release of in the aggregate, we cannot always assume that the proportion some information aggregates. of entries in our data with a particular value for race is equal to Even when actual access to data is not restricted, the opacity the proportion who live in that zip code. For instance, there have of data systems, as perceived by different groups, can also be an been several COVID-19 outbreaks in prisons, where the racial access equity violation. Researchers’ reluctance to release data composition of prisoners is likely quite different from that of the they have invested to collect contributes to the reproducibility surrounding community. crisis. Private companies’ tight control of their data impedes Using a novel concept of EquiTensors, we have demonstrated external equity audits. Inadequate data release can promote mis- that pre-integrated, fairness-adjusted features from arbitrary in- interpretation and therefore misinformation and misuse. Data puts can help avoid propagating discrimination from biased data, access must be accompanied by sufficient metadata to permit while still delivering prediction accuracy comparable to oracle correct interpretation and to determine fitness for use. networks trained with hand-selected data sets [54, 55]. A typical data science pipeline will have a sequence of data ma- nipulations, with multiple intermediate data sets created, shared, and manipulated. Often, these data sets will be from disparate 3.3 Access equity sources, and much of the processing may be conducted at remote So far, we have looked at what is in a data set. Now we look sites. When using a remote data source, it is important to under- at who has access to it. Typically, data sets are owned by big stand not just what the various fields are, but also how certain companies, which spend substantial resources to construct the values were computed and whether the dataset could be used data set, and want to obtain competitive advantages by keeping it for the desired purpose. Provenance descriptions can contain all proprietary. On the other hand, customers may not have access to this information, but is usually far too much detailed information this data, and hence be at a disadvantage in any interaction with for a user to be able to make use of. Additionally, proprietary the company, even with regard to their own information. Worse concerns and privacy limits may limit what can be disclosed. The still, the company has knowledge of multiple customers, which idea of a nutritional label has been proposed by us, and indepen- it can exploit. In contrast, the customer has access to only their dently by others, as a way to capture succinctly a small amount own actions with the company. The customer may interact with of critical information required to determine fitness for use. The multiple companies, but the number of companies is usually not challenge is that the information that must be captured depends and these reports are aggregated into a credit score, which you on the intended use. can see. You have some sense of what goes into building a good We have developed RankingFacts [58], the first prototype of an score, even though the specific details may not be known. More automatically computed nutritional label that helps users inter- importantly, you can see what has been reported about you by pret algorithmic rankers. The work on a user-facing nutritional your creditors and there is a process to challenge errors. The label prototype motivated a deeper inquiry into fairness and di- system is far from perfect, but most data-driven systems today versity in set selection and ranking [48, 56, 57], and on designing are much worse in so many respects, including in particular their fair and stable ranking schemes [1, 2]. We also continued this mechanisms for providing accountability and recourse. work to compute properties to characterize data sets [49], and to succinctly capture correlation between attributes [32]. 4 CONCLUSION Finally, most individuals affected by data-driven systems likely Data equity issues are pervasive but subtle, requiring holistic have many other things going on in their lives. So, they may have consideration of the socio-technical systems that induce them limited time and attention that they wish to devote to data de- (as opposed to narrowly focusing on the technical components tails. This makes it important that results and data be presented and tasks alone), and of the contexts in which such systems fairly, in a manner that leads to good understanding. Otherwise, operate. The richness of issues surrounding equity cannot be inequity in attention availability can lead to errors and misunder- addressed by framing it as a narrow, situational facet of “final standing. To address such questions, we have initiated a stream mile” learning systems. We need a socio-technical framing that of work in cherry-picking [3]. shifts equity considerations upstream to the data infrastructure, combines technical and societal perspectives, and allows us to 3.4 Outcome equity reason about the proper role for technology in promoting equity Controlling for inequity during processing does not guarantee while linking to emergent social and legal contexts. This type improvements in quality of life and societal outcomes, either in of approach is rapidly gaining traction in global technology pol- aggregate or at the individual level, due to, for example, emer- icy [12]. From a technology perspective, we must appreciate that gent bias [18]. It is, therefore, important to monitor and mitigate multiple data sets are processed in a complex workflow, with unintended consequences for any groups affected by a system numerous design and deployment choices enroute [23]. Addi- after deployment, directly or indirectly. tionally, our socio-technical framing mandates engagement with Outcome equity refers to downstream unanticipated conse- stakeholders before, during, and after any technology develop- quences outside the direct control of the system — evaluation of ment, affords operationalization of socio-technical equity, as it these consequences pertains directly to the socio-political notion emphasizing their lived experience as design expertise. It there- of equity focusing on equality of outcome. For example, families fore centers intersectionality, a framework that focuses on how rejected Boston’s optimized bus route system due to disruption the interlocking systems of social identity (race, class, gender, of their schedules, despite the system’s improvement in both sexuality, disability) combine into experiences of privilege and resource management and equity. oppression [11, 13, 14, 20]. This framing expands data sciences’ It take time, effort, and expense to build a model. In conse- existing interpretation of intersectionality from external classifi- quence, models developed in one context are often used in an- cation, often a political act [8, 9], to active involvement of those other. Such model transfer has to be done with care. We have who are classified. used 3D CNNs to generalize predictions in the urban domain [54]. In this extended abstract, we have identified four facets of data We have shown that fairness adjustments applied to integrated equity, each of which must be addressed by data equity systems. representations (via adversarial models that attempt to learn the For our ongoing work in this direction, please visit our project protected attribute [30]) outperform methods that apply fairness website at https://midas.umich.edu/FIDES. adjustments on the individual data sets [55]. The equity of a data-intensive system can be difficult to main- ACKNOWLEDGMENT tain over time [27, 34], due to distribution shifts [7, 22, 29, 41] This work was supported in part by the US National Science that can reduce performance, force periodic retraining, and gen- Foundation, under grants 1934405, 1934464, and 1934565. erally undermine trust. Techniques similar to the transferability methods, described in the preceding paragraph, can help. REFERENCES To minimize outcome inequity, data-driven systems must be [1] Abolfazl Asudeh, H.V. Jagadish, Julia Stoyanovich, and Gautam Das. 2019. accountable. Accountability requires public disclosure. For ex- Designing Fair Ranking Schemes. In ACM SIGMOD. ample, a job seeker must be informed which qualifications or [2] Abolfazl Asudeh, H. V. Jagadish, Gerome Miklau, and Julia Stoyanovich. 2018. On Obtaining Stable Rankings. PVLDB 12, 3 (2018), 237–250. http://www. characteristics were used by the tool, and why these are consid- vldb.org/pvldb/vol12/p237-asudeh.pdf ered job-relevant [46, 47]. [3] Abolfazl Asudeh, H. V. Jagadish, You Wu, and Cong Yu. 2020. On detecting But accountability is not enough in itself: the data subject also cherry-picked trendlines. Proceedings of the VLDB Endowment 13, 6 (2020), 939–952. should have recourse. We seek contestability by design [33], an [4] Abolfazl Asudeh, Zhongjun Jin, and H. V. Jagadish. 2019. Assessing and active principle that goes beyond explanation and focuses on remedying coverage for a given dataset. In IEEE International Conference on Data Engineering. IEEE, 554–565. user engagement, fostering user understanding of models and [5] Solon Barocas and Andrew Selbst. 2016. Big Data’s Disparate Impact. Califor- outputs, and collaboration in systems design [17, 26, 43]. Our nia Law Review 104, 3 (2016), 671–732. goal is to empower users to question algorithmic results, and [6] Ruha Benjamin. 2019. Race after technology: Abolitionist tools for the new jim code. Social Forces (2019). thereby to correct output inequities where possible. [7] Steffen Bickel, Michael Brückner, and Tobias Scheffer. 2009. Discriminative As a starting point, consider credit scores: a simple tool that Learning Under Covariate Shift. J. Mach. Learn. Res. 10 (2009), 2137–2155. has existed for years in the US and in many other countries. A https://dl.acm.org/citation.cfm?id=1755858 [8] Geoffery C. Bowker and Susan Leigh Star. 2000. Sorting Things out: Classifica- myriad of data sources report on your paying what you owe, tion and Its Consequences. MIT Press, Cambridge, MA, USA. [9] Joy Buolamwini and Timnit Gebru. 2018. Gender Shades: Intersectional [35] John Rawls. 1971. A theory of justice. Harvard University Press. Accuracy Disparities in Commercial Gender Classification. In Conference on [36] John E Roemer and Alain Trannoy. 2015. Equality of opportunity. In Handbook Fairness, Accountability and Transparency, FAT 2018, 23-24 February 2018, New of income distribution. Vol. 2. Elsevier, 217–300. York, NY, USA. 77–91. http://proceedings.mlr.press/v81/buolamwini18a.html [37] Babak Salimi, Luke Rodriguez, Bill Howe, and Dan Suciu. 2019. Capuchin: [10] Irene Chen, Fredrik D Johansson, and David Sontag. 2018. Why is my classifier Causal Database Repair for Algorithmic Fairness. CoRR abs/1902.08283 (2019). discriminatory?. In Advances in Neural Information Processing Systems. 3539– arXiv:1902.08283 http://arxiv.org/abs/1902.08283 3550. [38] Babak Salimi, Luke Rodriguez, Bill Howe, and Dan Suciu. 2019. Interventional [11] P. H. Collins. 2000. Black Feminist Thought: Knowledge, Consciousness, and the fairness: Causal database repair for algorithmic fairness. In Proceedings of the Politics of Empowerment. Routledge, New York, NY. 2019 International Conference on Management of Data. 793–810. [12] Council of Europe. 2020. Ad Hoc Committee on Artificial [39] Sebastian Schelter, Felix Biessmann, Tim Januschowski, David Salinas, Stephan Intelligence (CAHAI), Feasibility Study. https://rm.coe.int/ Seufert, Gyuri Szarvas, Manasi Vartak, Samuel Madden, Hui Miao, Amol Desh- cahai-2020-23-final-eng-feasibility-study-/1680a0c6da. pande, et al. 2018. On Challenges in Machine Learning Model Management. [13] Kimberle Crenshaw. 1989. Demarginalizing the intersection of race and sex: IEEE Data Eng. Bull. 41, 4 (2018), 5–15. A black feminist critique of antidiscrimination doctrine, feminist theory and [40] Sebastian Schelter, Yuxuan He, Jatin Khilnani, and Julia Stoyanovich. 2020. antiracist politics. University of Chicago Legal Forum 1 (1989), 139–167. FairPrep: Promoting Data to a First-Class Citizen in Studies on Fairness- [14] Kimberle Crenshaw. 1991. Mapping the Margins: Intersectionality, Identity Enhancing Interventions. In EDBT, Angela Bonifati, Yongluan Zhou, Marcos Politics, and Violence against Women of Color. Stanford Law Review 43, 6 Antonio Vaz Salles, Alexander Böhm, Dan Olteanu, George H. L. Fletcher, (1991), 1241–1299. http://www.jstor.org/stable/1229039 Arijit Khan, and Bin Yang (Eds.). OpenProceedings.org, 395–398. https: [15] Ronald Dworkin. 1981. What is Equality? Part 1: Equality of Welfare. Philoso- //doi.org/10.5441/002/edbt.2020.41 phy and Public Affairs 10, 4 (1981), 185–246. [41] Sebastian Schelter, Tammo Rukat, and Felix Bießmann. 2020. Learning [16] Ronald Dworkin. 1981. What is Equality? Part 2: Equality of Resources. to Validate the Predictions of Black Box Classifiers on Unseen Data. In Philosophy and Public Affairs 10, 4 (1981), 283–345. Proceedings of the 2020 International Conference on Management of Data, [17] Motahhare Eslami. 2017. Understanding and Designing around Users’ In- SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14- teraction with Hidden Algorithms in Sociotechnical Systems. In Proceed- 19, 2020, David Maier, Rachel Pottinger, AnHai Doan, Wang-Chiew Tan, ings of the 2017 ACM Conference on Computer Supported Cooperative Work Abdussalam Alawini, and Hung Q. Ngo (Eds.). ACM, 1289–1299. https: and Social Computing, CSCW 2017, Portland, OR, USA, February 25 - March //doi.org/10.1145/3318464.3380604 1, 2017, Companion Volume, Charlotte P. Lee, Steven E. Poltrock, Louise [42] David Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Barkhuus, Marcos Borges, and Wendy A. Kellogg (Eds.). ACM, 57–60. http: Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and //dl.acm.org/citation.cfm?id=3024947 Dan Dennison. 2015. Hidden technical debt in machine learning systems. In [18] Batya Friedman and Helen Nissenbaum. 1996. Bias in Computer Systems. ACM Advances in neural information processing systems. 2503–2511. Trans. Inf. Syst. 14, 3 (1996), 330–347. https://doi.org/10.1145/230538.230561 [43] Judith Simon. 2015. Distributed Epistemic Responsibility in a Hyperconnected [19] Dominique DuBois Gilliard. 2018. Rethinking incarceration: Advocating for Era. In The Online Manifesto: Being Human in a Hyperconnected Era. 145–159. justice that restores. InterVarsity Press. https://doi.org/10.1007/978-3-319-04093-6_17 [20] P.L. Hammack. 2018. The Oxford Handbook of Social Psychology and So- [44] Dean Spade. 2015. Normal Life: Administrative Violence, Critical Trans Politics, cial Justice. Oxford University Press. https://books.google.com/books?id= and the Limits of Law. Duke University Press. http://www.jstor.org/stable/j. ZY9HDwAAQBAJ ctv123x7qx [21] Anna Lauren Hoffmann. 0. Terms of inclusion: Data, discourse, violence. [45] Dean Spade. 2015. Normal life: Administrative violence, critical trans politics, New Media & Society 0, 0 (0), 1461444820958725. https://doi.org/10.1177/ and the limits of law. Duke University Press. 1461444820958725 arXiv:https://doi.org/10.1177/1461444820958725 [46] Julia Stoyanovich. 2020. Testimony of Julia Stoyanovich before New York City [22] Jiayuan Huang, Alexander J. Smola, Arthur Gretton, Karsten M. Borgwardt, Council Committee on Technology regarding Int 1894-2020, Sale of automated and Bernhard Schölkopf. 2006. Correcting Sample Selection Bias by Unlabeled employment decision tools. https://dataresponsibly.github.io/documents/ Data. In Advances in Neural Information Processing Systems 19, Proceedings Stoyanovich_Int1894Testimony.pdf. of the Twentieth Annual Conference on Neural Information Processing Systems, [47] Julia Stoyanovich, Bill Howe, and H.V. Jagadish. 2020. Responsible Data Vancouver, British Columbia, Canada, December 4-7, 2006, Bernhard Schölkopf, Management. PVLDB 13, 12 (2020), 3474–3489. https://doi.org/10.14778/ John C. Platt, and Thomas Hofmann (Eds.). MIT Press, 601–608. http://papers. 3415478.3415570 nips.cc/paper/3075-correcting-sample-selection-bias-by-unlabeled-data [48] Julia Stoyanovich, Ke Yang, and H. V. Jagadish. 2018. Online Set Selection with [23] H. V. Jagadish, Johannes Gehrke, Alexandros Labrinidis, Yannis Papakonstanti- Fairness and Diversity Constraints. In Proceedings of the 21th International nou, Jignesh M Patel, Raghu Ramakrishnan, and Cyrus Shahabi. 2014. Big Conference on Extending Database Technology, EDBT 2018, Vienna, Austria, data and its technical challenges. Commun. ACM 57, 7 (2014), 86–94. March 26-29, 2018. 241–252. https://doi.org/10.5441/002/edbt.2018.22 [24] H. V. Jagadish, Julia Stoyanovich, and Bill Howe. 2021. COVID-19 Brings Data [49] Chenkai Sun, Abolfazl Asudeh, HV Jagadish, Bill Howe, and Julia Stoyanovich. Equity Challenges to the Fore. ACM Digital Government: Research and Practice 2019. Mithralabel: Flexible dataset nutritional labels for responsible data 2, 2 (2021). science. In Proceedings of the ACM International Conference on Information and [25] Zhongjun Jin, Mengjing Xu, Chenkai Sun, Abolfazl Asudeh, and H. V. Jagadish. Knowledge Management. ACM, 2893–2896. 2020. MithraCoverage: A System for Investigating Population Bias for Inter- [50] Supreme Court of the United States. 1995. ADARAND CONSTRUCTORS, Inc. sectional Fairness. In Proceedings of the ACM SIGMOD International Conference v. PEÑA, 515 U.S. 200 (1995), No.93-1841. https://supreme.justia.com/cases/ on Management of Data. ACM, 2721–2724. federal/us/515/200/#tab-opinion-1959723. [26] Daniel Kluttz et al. 2020. Shaping Our Tools: Contestability as a Means to [51] Supreme Court of the United States. 1996. UNITED STATES v. VIRGINIA, 518 Promote Responsible Algorithmic Decision Making in the Professions. In U.S. 515 (1996), No.94-1941. https://supreme.justia.com/cases/federal/us/515/ After the Digital Tornado: Networks, Algorithms, Humanity. 200/#tab-opinion-1959723. [27] Arun Kumar, Robert McCann, Jeffrey F. Naughton, and Jignesh M. Patel. [52] Supreme Court of the United States. 2009. Ricci v. DeStefano (Nos. 07-1428 2015. Model Selection Management Systems: The Next Frontier of Advanced and 08-328), 530 F. 3d 87, reversed and remanded. https://www.law.cornell. Analytics. SIGMOD Rec. 44, 4 (2015), 17–22. https://doi.org/10.1145/2935694. edu/supct/html/07-1428.ZO.html. 2935698 [53] Ian Kennedy Timothy A. Thomas, Ott Toomet and Alex Ramiller. [n.d.]. The [28] Yin Lin, Yifan Guan, Abolfazl Asudeh, and H. V. Jagadish. 2020. Identifying State of Evictions: Results from the University of Washington Evictions Project. insufficient data coverage in databases with multiple relations. Proceedings of https://evictions.study/. accessed 25-Apr-2019. the VLDB Endowment 13, 12 (2020), 2229–2242. [54] An Yan and Bill Howe. 2019. FairST: Equitable Spatial and Temporal Demand [29] Zachary C. Lipton, Yu-Xiang Wang, and Alexander J. Smola. 2018. Detect- Prediction for New Mobility Systems. In Proceedings of the 27th ACM SIGSPA- ing and Correcting for Label Shift with Black Box Predictors. In Proceedings TIAL International Conference on Advances in Geographic Information Systems. of the 35th International Conference on Machine Learning, ICML 2018, Stock- 552–555. holmsmässan, Stockholm, Sweden, July 10-15, 2018 (Proceedings of Machine [55] An Yan and Bill Howe. 2021. EquiTensors: Learning Fair Integrations of Learning Research), Jennifer G. Dy and Andreas Krause (Eds.), Vol. 80. PMLR, Heterogeneous Urban Data. In ACM SIGMOD. 3128–3136. http://proceedings.mlr.press/v80/lipton18a.html [56] Ke Yang, Joshua R. Loftus, and Julia Stoyanovich. 2020. Causal intersectionality [30] David Madras, Elliot Creager, Toniann Pitassi, and Richard Zemel. 2018. for fair ranking. CoRR abs/2006.08688 (2020). arXiv:2006.08688 https://arxiv. Learning adversarially fair and transferable representations. arXiv preprint org/abs/2006.08688 arXiv:1802.06309 (2018). [57] Ke Yang and Julia Stoyanovich. 2017. Measuring Fairness in Ranked Outputs. [31] Charles W Mills. 2014. The racial contract. Cornell University Press. In Proceedings of the 29th International Conference on Scientific and Statistical [32] Yuval Moskovitch and H. V. Jagadish. 2020. COUNTATA: dataset labeling using Database Management, Chicago, IL, USA, June 27-29, 2017. 22:1–22:6. https: pattern counts. Proceedings of the VLDB Endowment 13, 12 (2020), 2829–2832. //doi.org/10.1145/3085504.3085526 [33] Deirdre K. Mulligan and Kenneth A. Bamberger. 2019. Procurement as Policy: [58] Ke Yang, Julia Stoyanovich, Abolfazl Asudeh, Bill Howe, H. V. Jagadish, and Administrative Process for Machine Learning. Berkeley Technology Law Journal Gerome Miklau. 2018. A Nutritional Label for Rankings. In Proceedings of the 34, 3 (2019), 773–852. 2018 International Conference on Management of Data, SIGMOD Conference [34] Neoklis Polyzotis, Sudip Roy, Steven Euijong Whang, and Martin Zinkevich. 2018, Houston, TX, USA, June 10-15, 2018. 1773–1776. https://doi.org/10.1145/ 2018. Data Lifecycle Challenges in Production Machine Learning: A Survey. 3183713.3193568 SIGMOD Rec. 47, 2 (2018), 17–28. https://doi.org/10.1145/3299887.3299891