
Discrimination-aware Data Transformations

Chiara Accinelli

Advised by: Prof. Barbara Catania, University of Genoa, Italy

An extensive use of people-related data in automated decision processes might amplify inequities already implicit in real-world data. The development of technological solutions satisfying nondiscrimination requirements is therefore one of the main challenges for the data management and data analytics communities. Nondiscrimination can be characterized in terms of different properties, like fairness, diversity, and coverage, and many approaches have been proposed so far for guaranteeing nondiscrimination through the satisfaction of such properties during specific steps of the data processing pipeline. In this PhD project, we are interested in investigating the impact of coverage-based constraints on data transformations. Coverage aims at guaranteeing that the input dataset includes enough examples for each (protected) category of interest, thus increasing diversity with the aim of limiting the introduction of bias during the next analytical steps. We propose coverage-based queries as a means to achieve coverage constraint satisfaction on the result of data transformations defined in terms of selection-based queries. Both precise and approximate algorithms are designed to guarantee a good compromise between efficiency and accuracy. The applicability of the approach is evaluated by integrating it in a data processing Python toolkit.

Keywords: nondiscrimination, data transformation, coverage, rewriting

1. Introduction

Nowadays, we are surrounded by data that are increasingly exploited to make decisions that might impact people’s lives. It is therefore very important to understand the nature of that impact at the social level and take responsibility for it. The design of data-driven decision-support systems ensuring a responsible and ethical use of data is therefore a must, and it has been recognized that both the data management and data analytics communities should contribute [1, 21]. Such systems should ensure, on one hand, transparency and interpretability, making the process and the decisions easy to understand, and, on the other, nondiscrimination with respect to all the reference groups of individuals, usually defined in terms of sensitive attributes such as gender.

Nondiscrimination can be characterized in terms of different properties like fairness, i.e., lack of bias [16], diversity, i.e., the degree to which different kinds of objects are represented in a dataset [11], and coverage [7], guaranteeing a sufficient representation of any category of interest in a dataset. As first pointed out in [1] and remarked in, e.g., [11, 21], such properties should be achieved through a holistic approach, incrementally enforcing nondiscrimination constraints along all the stages of the data processing life-cycle, through individually independent choices rather than as a constraint on the final result: the sooner a problem is spotted, the fewer problems arise in the last analytical steps of the chain (see, e.g., the Google gorilla classification incident [20]).

In this PhD project, we are interested in investigating the impact of coverage-based constraints on the, possibly intermediate, datasets generated through data preparation, with a special focus on data transformations. This topic is relevant since any data preparation step that transforms the input datasets might lead to a violation of the coverage of protected categories, affecting subsequent analytical tasks. Notice that the input dataset can correspond to either raw data that have not been transformed yet (in this case, solutions like those proposed in [7, 8] can be used to determine how to modify the input dataset and collect new data) or the result of, potentially many, data transformation queries. We are interested in this second case.

As an example, suppose you are interested in analyzing data of the well-known Adult dataset¹ (e.g., predicting through classification which individuals make over 50k a year), after filtering it according to specific criteria (e.g., only senior job positions should be considered, qualified in terms of selection conditions over age, weekly working hours, and education level). Suppose you would like to guarantee nondiscrimination with respect to gender by training a model whose accuracy does not deeply depend on this attribute. It has already been recognized that the quality of the classifier might depend, among other factors, on the number of instances, i.e., the coverage, of each group in the dataset [18]. Thus, if the result of the selection query returning senior job positions includes few female individuals, the result of the classifier can be biased.

¹ https://archive.ics.uci.edu/ml/datasets/Adult

In order to solve this problem without going back to the data collection step, additional female individuals could be added to the dataset generated through query execution.

We tackled this issue by defining and processing coverage-based queries, i.e., selection-based queries that, given a set of coverage constraints, always return a result satisfying the input constraints while staying close to the original request. In order to avoid disparate treatment discrimination [9] and guarantee transparency, the initial query is rewritten into a new one, satisfying the coverage constraints while staying close to the original request.

The main research questions addressed by the PhD project can be summarized as follows:

(RQ1) How can coverage-based queries be defined and characterized?

(RQ2) How can coverage-based queries be efficiently processed?

(RQ3) How can coverage-based queries be integrated in data processing environments?

The remainder of the paper is organized as follows. Section 2 compares our work with other existing approaches. Coverage-based queries are defined in Section 3 (RQ1) and solutions developed so far for their processing are described in Section 4 (RQ2) (some preliminary results on (RQ1) and (RQ2) can be found in [4, 5, 6]). Details about a Python data processing toolkit integrating the proposed techniques in Pandas² (RQ3) can be found in [3] but are not presented in the paper for space constraints. Finally, Section 5 concludes and presents some directions for further developments.

² https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/03_subset_data.html

2. Related work

Discrimination-aware approaches have been proposed both with reference to data analysis (e.g., OLAP queries [15], set selection [22], ranking [23]) and data preparation (e.g., dataset repair during data acquisition [7, 16], with a special focus on coverage in [7, 8, 12], data cleaning [17], and data integration [14]).

Similarly to [7, 8, 12], we consider coverage as a means to limit discrimination. However, rather than checking coverage over raw datasets and repairing them through new data acquisitions when coverage is not satisfied, we guarantee coverage satisfaction along data transformation chains defined in terms of selection-based queries.

Coverage-based queries, presented in this paper, change the result of an input query through rewriting rather than through the usage of ad-hoc query execution algorithms. This avoids disparate treatment discrimination during selection-based query execution, since a well-defined criterion, i.e., a new selection condition, is used to retrieve the new result, guaranteeing at the same time transparency. In this respect, coverage-based queries differ from other similarity-based query approaches, like fuzzy queries [13]. Other rewriting-based approaches have been proposed so far to tackle discrimination issues defined in terms of other properties and queries. Rewriting has been used for OLAP queries and causal fairness in [15] and, more recently, for range queries and fairness in [19]. As far as we know and according to [18], no other solutions addressing coverage-based rewriting in the context of selection-based queries have been proposed so far.

3. Coverage-based queries

Preliminaries. We consider data stored in tabular datasets (e.g., relations in a relational database, data frames in the Pandas environment). We assume that some discrete-valued attributes $S = \{S_1, \dots, S_n\}$ of the input dataset (e.g., gender and race) identify protected groups and are called sensitive attributes. We focus on selection-based data transformations (or queries) over stored or computed (e.g., joined or aggregated) datasets, in analytical processes, that might alter the representation (i.e., the coverage) of specific groups of interest, defined in terms of sensitive attribute values (e.g., SQL selections over relational data, data slicing operations in Pandas,² ColumnTransformers in Scikit-Learn).³

³ https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html

We consider boolean combinations of atomic selection conditions $sel_i \equiv A_i \, \theta \, v_i$, with $v_i \in D_{A_i}$, $\theta \in \{=, <, \le, \ge, >\}$, $A_i$ a numeric attribute, and $A_i \neq A_j$ for $i \neq j$, $i, j = 1, \dots, d$, that do not refer to, as usually assumed, sensitive attributes, i.e., $A_i \notin S$. A selection-based query is thus denoted by $Q\langle v_1, \dots, v_d \rangle$ or $Q\langle v \rangle$, with $v \equiv (v_1, \dots, v_d)$, and $v$ is called the selection vector.

A coverage constraint has the form $\downarrow^{S_{i_1}, \dots, S_{i_h}}_{s_{i_1}, \dots, s_{i_h}} \ge k$ and specifies that the minimum number of instances with sensitive attribute $S_{i_j}$ equal to $s_{i_j}$, $j = 1, \dots, h$, in a query result has to be $k$. As an example, $\downarrow^{gender}_{female} \ge 10$ specifies that the result should include at least 10 female individuals. The group referred to by a coverage constraint is called the protected group.

Definition of coverage-based queries. Let $CC$ be a set of coverage constraints over a set of sensitive attributes $S$ and $Q\langle v \rangle$ a selection-based query. A coverage-based query for $CC$ and $Q\langle v \rangle$ is a selection-based query $\overline{Q}$ that, given a dataset $I$, stretches the result $Q\langle v \rangle(I)$ as little as possible so that the result satisfies the constraints in $CC$. More precisely, when considering a dataset $I$: (i) $\overline{Q}$ returns the result of a query $Q\langle \overline{v} \rangle$ over $I$; $Q\langle \overline{v} \rangle$ is obtained from $Q\langle v \rangle$ by only changing the selection constants, which might depend on $I$; (ii) $\forall I$, $Q\langle v \rangle(I) \subseteq \overline{Q}(I)$; (iii) all coverage constraints are satisfied by $\overline{Q}(I)$; (iv) $Q\langle \overline{v} \rangle$ is minimal. Minimality means that any other query $Q'$ satisfying conditions (i)–(iii) is such that either $card(Q'(I)) > card(Q\langle \overline{v} \rangle(I))$, or $card(Q'(I)) = card(Q\langle \overline{v} \rangle(I))$ and $Q\langle \overline{v} \rangle$ is syntactically closer than $Q'$ to $Q\langle v \rangle$, according to the Euclidean distance (defined in a unit space) between selection vectors.

Properties. A coverage-based query $\overline{Q}$ satisfies the following properties:

(P1) It can be represented in a canonical form in which each selection condition has the form $A_i \le v_i$ or $A_i < v_i$; $Q\langle v \rangle$ is then represented as the point $v$ in the $d$-dimensional space defined by the selection attributes (see Figure 1(a)).

(P2) Let $\overline{Q}(I) \equiv Q\langle \overline{v} \rangle(I)$. We proved that $\overline{v}$ coincides with the upper right vertex of the minimum bounding box of at most $d$ distinct points in $I$, $v$, and the origin of the space (see the green triangle in Figure 1(a)) [2]. Such vertices are called induced points and the set of all induced points corresponds to the search space for coverage-based queries.

(P3) There is a relationship between $\overline{v}$ and the skyline of the induced points corresponding to queries that, when executed over $I$, satisfy $CC$; the dominance relation, needed for the skyline computation, is defined over the selection attributes, assuming the lower the better (see Figure 1(b)). It can be proved that $\overline{v}$ coincides with the skyline point corresponding to the query with the minimal cardinality at the lowest distance from $v$. Thus, $\overline{v}$ can be identified by combining skyline and top-1 computations (possibly mixed, as pointed out in [10]).

Figure 1: Coverage-based query properties. (a) Input and induced queries. (b) Skyline points.

4. Coverage-based query processing

Properties P1, P2, and P3 suggest a naïve but inefficient approach for processing coverage-based queries, due to the size of the search space and the cost of the skyline computation. We therefore improved this basic strategy in two directions, briefly described in the following. The designed algorithms, for each dataset $I$, return one minimal solution⁴ and do not rely on any index data structure, so that both stored and computed datasets can be considered.

⁴ The proposed algorithms can be easily customized to return all minimal solutions or a specific one, according to some further optimality criteria.

A grid-based approximate approach. The first approach is approximate because, for each dataset $I$, it relies on a discretized search space and a sample-based approach for cardinality estimation, needed for constraint and minimality checking (property P3). It can be applied over any dataset for which a sample is available or can be easily computed on the fly.

The discretized search space is generated by considering the intersection points of a grid obtained by discretizing each axis (one for each selection attribute in the query, from the corresponding selection value in the query to the maximum value in the dataset), using standard binning approaches (e.g., equi-width and equi-depth). Each point on the grid corresponds to a selection-based query obtained from the input one by changing only the selection constants, thus satisfying conditions (i) and (ii) of the reference problem. $Q\langle \overline{v} \rangle$ is then determined by visiting the discretized search space starting from $v$, one point after the other, at increasing distance from $v$ (baseline algorithm). The properties of the discretized search space and the canonical form are considered for pruning the space (pruning variant), possibly increasing the number of points to be visited at different iterations (iterative variants). Details on all the algorithm versions and an exhaustive experimental evaluation, on both synthetic and real-world datasets, have been presented in [5]. The obtained results depend on the density of the search space and show that: (i) equi-depth guarantees better performance over non-uniformly distributed data; (ii) a multi-level processing approach greatly helps in reducing the curse of dimensionality when the query contains a high number of selection conditions; (iii) the processing performance linearly depends on the number of coverage constraints; (iv) a good level of accuracy can be obtained with relatively small samples; (v) coverage constraint satisfaction has an obvious impact on the rate of different groups of protected instances, i.e., on fairness.

An iteration-based precise approach. More recently, we started from the naïve approach, derived from P1, P2, and P3, to design a family of algorithms for the precise computation of coverage-based queries. The designed algorithms rely on the following considerations: (i) the induced query space can be computed in up to $d$ iterations; the computation of new points at iteration $i$ can be pruned by considering only points obtained at iteration $i - 1$ that do not satisfy $CC$; (ii) the iterated computation of induced points and skyline dominance checks can be interleaved so that the considered space at each iteration is further reduced; (iii) minimality can be checked either during the skyline computation, to reduce the number of dominance comparisons, or after the skyline has been computed, limiting in this way the number of cardinality estimations; (iv) the grid-based approximate approach can be used as a filtering step, for further reducing the space before applying one precise algorithm. The proposed algorithms are currently under evaluation, on both synthetic and real datasets.
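As an illustration of the grid-based idea, the following minimal Python sketch exhaustively visits an equi-width grid and keeps the point whose result has minimal cardinality, breaking ties by Euclidean distance in a unit space. It should not be read as the algorithms of [5]: it applies no pruning, it counts cardinalities exactly instead of estimating them from a sample, and it assumes that the query is already in canonical form (every condition of the form attribute ≤ value). The toy rows, the function names (select, satisfies, grid_rewrite), and the bins parameter are assumptions introduced only for this example; plain dictionaries stand in for the rows of a Pandas data frame.

```python
from itertools import product
from math import dist

# Hypothetical toy data: each dict plays the role of one row of a data frame.
ROWS = [
    {"age": 30, "hours": 35, "gender": "female"},
    {"age": 35, "hours": 40, "gender": "male"},
    {"age": 38, "hours": 45, "gender": "female"},
    {"age": 42, "hours": 38, "gender": "male"},
    {"age": 45, "hours": 50, "gender": "female"},
    {"age": 50, "hours": 42, "gender": "male"},
    {"age": 55, "hours": 60, "gender": "female"},
]


def select(rows, attrs, point):
    """Canonical-form selection: keep rows with row[a] <= v for every (a, v)."""
    return [r for r in rows if all(r[a] <= v for a, v in zip(attrs, point))]


def satisfies(result, constraints):
    """Each constraint (attribute, value, k) asks for at least k matching rows."""
    return all(sum(r[a] == val for r in result) >= k for a, val, k in constraints)


def grid_rewrite(rows, attrs, query, constraints, bins=4):
    """Exhaustive search over an equi-width grid (no pruning, exact counts)."""
    # Upper end of each axis: the dataset maximum (never below the query value).
    maxima = [max(v, max(r[a] for r in rows)) for a, v in zip(attrs, query)]
    # One axis per selection attribute, from the query value to the maximum.
    axes = [[v + i * (m - v) / bins for i in range(bins + 1)]
            for v, m in zip(query, maxima)]
    best = None
    for point in product(*axes):
        result = select(rows, attrs, point)
        if not satisfies(result, constraints):
            continue
        # Rescale each axis to [0, 1] so that distances are taken in a unit space.
        unit = [(p - v) / (m - v) if m > v else 0.0
                for p, v, m in zip(point, query, maxima)]
        key = (len(result), dist(unit, [0.0] * len(unit)))
        if best is None or key < best[0]:
            best = (key, point)
    return None if best is None else best[1]
```

On the toy data, the input query age ≤ 35 AND hours ≤ 40 returns a single female individual. Calling grid_rewrite(ROWS, ("age", "hours"), (35, 40), [("gender", "female", 2)]) relaxes it to age ≤ 40 AND hours ≤ 45, whose result contains three rows, two of them female.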

5. Conclusions and further developments

In this PhD project, we investigate the impact of coverage constraints on data transformations, as a means for limiting bias in the next analytical steps. After defining coverage-based queries, we designed and experimentally evaluated both approximate and precise algorithms for their processing. The proposed solutions rely on query rewriting, a key approach for enforcing specific nondiscrimination constraints while guaranteeing transparency and avoiding disparate treatment discrimination.

Future work includes the integration of the proposed queries in a relational DBMS and the extension of the proposed solutions to consider further nondiscrimination constraints. To this aim, an interesting direction is to rely on constraint-based optimization for specifying different types of constraints, possibly inherently different, such as coverage and fairness [19], and determining the best data transformation rewriting.

References

[1] Abiteboul et al. Research directions for principles of data management. Dagstuhl Manifestos, 7(1):1-29, 2018.

[2] Accinelli. Discrimination-aware data transformations (doctoral dissertation, in preparation). University of Genoa, Italy, 2023.

[3] Accinelli, Catania, G. Guerrini, and Minisi. covRew: a Python toolkit for pre-processing pipeline rewriting ensuring coverage constraint satisfaction. In Proc. EDBT, pages 698-701, 2021.

[4] Accinelli, Catania, G. Guerrini, and Minisi. The impact of rewriting on coverage constraint satisfaction. In Proc. EDBT/ICDT Workshops, 2021.

[5] Accinelli, Catania, G. Guerrini, and Minisi. A coverage-based approach to nondiscrimination-aware data transformation. ACM J. Data Inf. Qual., 2022.

[6] Accinelli, Minisi, and Catania. Coverage-based rewriting for data preparation. In Proc. EDBT/ICDT Workshops, 2020.

[7] Asudeh, Jin, and H. V. Jagadish. Assessing and remedying coverage for a given dataset. In Proc. ICDE, pages 554-565, 2019.

[8] Asudeh, Shahbazi, Jin, and H. V. Jagadish. Identifying insufficient data coverage for ordinal continuous-valued attributes. In Proc. SIGMOD, pages 129-141, 2021.

[9] Barocas and A. D. Selbst. Big data's disparate impact. Calif. L. Rev., 104:671, 2016.

[10] Börzsönyi, Kossmann, and Stocker. The skyline operator. In Proc. ICDE, pages 421-430, 2001.

[11] Drosou, H. V. Jagadish, E. Pitoura, and Stoyanovich. Diversity in big data: A review. Big Data, 5(2):73-84, 2017.

[12] Lin, Guan, Asudeh, and H. V. Jagadish. Identifying insufficient data coverage in databases with multiple relations. Proc. VLDB Endow., 13(11):2229-2242, 2020.

[13] Z. M. Ma and Yan. A literature overview of fuzzy database models. J. Inf. Sci. Eng., 24(1):189-202, 2008.

[14] Mazilu, N. W. Paton, Konstantinou, and A. A. A. Fernandes. Fairness in data wrangling. In Proc. of the Int. Conf. on Information Reuse and Integration for Data Science, IRI 2020, 2020.

[15] Salimi, Gehrke, and Suciu. Bias in OLAP queries: Detection, explanation, and removal. In Proc. SIGMOD, pages 1021-1035, 2018.

[16] Salimi, Howe, and Suciu. Database repair meets algorithmic fairness. SIGMOD Rec., 49(1):34-41, 2020.

[17] Schelter, He, Khilnani, and Stoyanovich. Fairprep: Promoting data to a first-class citizen in studies on fairness-enhancing interventions. In Proc. of the Int. Conf. on Extending Database Technology, EDBT 2020, pages 395-398, 2020.

[18] Shahbazi, Lin, Asudeh, and H. V. Jagadish. A survey on techniques for identifying and resolving representation bias in data. CoRR, abs/2203.11852, 2022.

[19] Shetiya, I. P. Swift, Asudeh, and Das. Fairness-aware range queries for selecting unbiased data. In Proc. ICDE, 2022.

[20] Simonite. When it comes to gorillas, Google photos remains blind. Wired, Jan. 2018.

[21] Stoyanovich, Howe, and H. V. Jagadish. Responsible data management. Proc. VLDB Endow., 13(12):3474-3488, 2020.

[22] Stoyanovich, Yang, and H. V. Jagadish. Online set selection with fairness and diversity constraints. In Proc. EDBT, pages 241-252, 2018.

[23] Zehlike, Yang, and Stoyanovich. Fairness in ranking, part I: score-based ranking. ACM Comput. Surv., 55(6):118:1-118:36, 2023.